cli

clean_dataset

 clean_dataset (dataset_name:str, column_name:str,
                benchmark_configs_path:str, output_path:str,
                dataset_config_name:str=None, data_dir:str=None,
                dataset_split:str='train', save_json:bool=False)

Clean a dataset using a benchmark configuration file.

Parameter               Type  Default  Details
dataset_name            str            Name of the dataset to clean
column_name             str            Name of the column to clean
benchmark_configs_path  str            Path to the benchmark configuration file
output_path             str            Path to where the cleaned dataset will be saved
dataset_config_name     str   None     Name of the dataset configuration to use
data_dir                str   None     Path to the data files to use
dataset_split           str   train    Name of the dataset split to clean
save_json               bool  False    Whether to save the cleaned dataset as a JSON file

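# `temp` is not defined in this snippet; it is assumed to be a temporary file,
# created earlier, that holds the benchmark configuration.
clean_dataset(
    dataset_name="bigcode/the-stack-smol",
    column_name="content",
    benchmark_configs_path=temp.name,
    output_path="/tmp/test.jsonl",
    data_dir="data/python",
    dataset_split="train",
    save_json=True,  # save the cleaned dataset as a JSON file
)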
Checking for false positives...: 100%|██████████| 8780/8780 [00:33<00:00, 260.81it/s]
Checking for false positives...: 100%|██████████| 8674/8674 [01:07<00:00, 129.20it/s]
[11/06/22 10:39:25] INFO     Data Number                   : 10000                                      core.py:210
                    INFO     Duplicate Number              : 4612                                       core.py:211
                    INFO     Duplicate Rate                : 46.12%                                     core.py:212
                    INFO     Total Time                    : 104.49 seconds                             core.py:213
Creating json from Arrow format: 100%|██████████| 1/1 [00:00<00:00,  1.77ba/s]
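
The figures in the log above can be checked by hand, and the saved file can be inspected afterwards. The following is a minimal sketch, not part of the library: `duplicate_rate` and `read_jsonl` are hypothetical helpers introduced here for illustration. It assumes that "Duplicate Number" counts the rows flagged for removal, and that the file at `output_path` contains one JSON object per line.

```python
import json
from pathlib import Path


def duplicate_rate(total: int, duplicates: int) -> str:
    """Format the share of flagged rows as a percentage with two decimals."""
    return f"{duplicates / total:.2%}"


def read_jsonl(path):
    """Load a JSON Lines file: one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


# Counts copied from the log output above.
total, duplicates = 10000, 4612
print(duplicate_rate(total, duplicates))  # 46.12%
print(total - duplicates)  # 5388 rows left, if every flagged row is dropped

# Assumption: the cleaned dataset was written as JSON Lines to this path.
output = Path("/tmp/test.jsonl")
if output.exists():
    rows = read_jsonl(output)
    print(len(rows))
```

If the assumptions hold, the number of rows read back should match the remaining-row count computed from the log.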