cli


clean_dataset

 clean_dataset (dataset_name:str, column_name:str,
                benchmark_configs_path:str, output_path:str,
                dataset_config_name:str=None, data_dir:str=None,
                dataset_split:str='train', save_json:bool=False)

Clean a dataset using a benchmark configuration file.

Parameter	Type	Default	Details
dataset_name	str		Name of the dataset to clean
column_name	str		Name of the column to clean
benchmark_configs_path	str		Path to the benchmark configuration file
output_path	str		Path to where the cleaned dataset will be saved
dataset_config_name	str	None	Name of the dataset configuration to use
data_dir	str	None	Path to the data files to use
dataset_split	str	train	Name of the dataset split to clean
save_json	bool	False	Whether to save the cleaned dataset as a JSON file

clean_dataset(
    dataset_name="bigcode/the-stack-smol",
    column_name="content",
    benchmark_configs_path=temp.name,  # temp: presumably a temporary benchmark config file created beforehand (not shown)
    output_path="/tmp/test.jsonl",
    data_dir="data/python",
    dataset_split="train",
    save_json=True,
)

Checking for false positives...: 100%|██████████| 8780/8780 [00:33<00:00, 260.81it/s]
Checking for false positives...: 100%|██████████| 8674/8674 [01:07<00:00, 129.20it/s]

[11/06/22 10:39:25] INFO     Data Number                   : 10000                                      core.py:210
                    INFO     Duplicate Number              : 4612                                       core.py:211
                    INFO     Duplicate Rate                : 46.12%                                     core.py:212
                    INFO     Total Time                    : 104.49 seconds                             core.py:213
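
The logged figures are consistent with the duplicate rate being the duplicate count divided by the total number of records. A minimal sketch of that arithmetic, assuming this definition; the helper name `duplicate_rate` is hypothetical and not part of the library:

```python
def duplicate_rate(duplicates: int, total: int) -> float:
    """Return the share of duplicate records as a percentage.

    Assumption: rate = duplicates / total, which matches the logged values.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * duplicates / total


# Values taken from the log output above.
rate = duplicate_rate(4612, 10000)
print(f"Duplicate Rate: {rate:.2f}%")  # prints "Duplicate Rate: 46.12%"
```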

Creating json from Arrow format: 100%|██████████| 1/1 [00:00<00:00,  1.77ba/s]
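
With `save_json=True`, the cleaned dataset is written to `output_path`. The sketch below shows one way to read such a file back, assuming it is in JSON Lines format (one JSON object per line, as the `.jsonl` extension suggests) and that each record keeps the `content` column. The helpers `parse_jsonl` and `read_jsonl` are hypothetical names for illustration, not part of the library:

```python
import json
from typing import Iterable, Iterator


def parse_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed record per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines
            yield json.loads(line)


def read_jsonl(path: str) -> list:
    """Read a whole JSON Lines file into a list of records."""
    with open(path, encoding="utf-8") as f:
        return list(parse_jsonl(f))


# Self-contained demonstration on in-memory lines (made-up sample records).
sample = ['{"content": "print(1)"}\n', '\n', '{"content": "x = 2"}\n']
records = list(parse_jsonl(sample))
print(len(records), records[0]["content"])
```

For the example above, `read_jsonl("/tmp/test.jsonl")` would load the cleaned records, provided the file follows the assumed format.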