clean_dataset
clean_dataset (dataset_name:str, column_name:str,
benchmark_configs_path:str, output_path:str,
dataset_config_name:str=None, data_dir:str=None,
dataset_split:str='train', save_json:bool=False)
Clean a dataset using a benchmark configuration file.
| Parameter | Type | Default | Details |
|---|---|---|---|
| dataset_name | str | | Name of the dataset to clean |
| column_name | str | | Name of the column to clean |
| benchmark_configs_path | str | | Path to the benchmark configuration file |
| output_path | str | | Path to where the cleaned dataset will be saved |
| dataset_config_name | str | None | Name of the dataset configuration to use |
| data_dir | str | None | Path to the data files to use |
| dataset_split | str | train | Name of the dataset split to clean |
| save_json | bool | False | Whether to save the cleaned dataset as a JSON file |
clean_dataset(
    dataset_name="bigcode/the-stack-smol",
    column_name="content",
    # `temp` is defined outside this snippet; `temp.name` is the path to the benchmark configuration file
    benchmark_configs_path=temp.name,
    output_path="/tmp/test.jsonl",
    data_dir="data/python",
    dataset_split="train",
    save_json=True,
)
Checking for false positives...: 100%|██████████| 8780/8780 [00:33<00:00, 260.81it/s]
Checking for false positives...: 100%|██████████| 8674/8674 [01:07<00:00, 129.20it/s]
[11/06/22 10:39:25] INFO Data Number : 10000 core.py:210
Creating json from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 1.77ba/s]
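To spot-check the result of a call like the one above, the saved file can be read back with the standard library. The following is a minimal sketch, assuming the file written with `save_json=True` is in JSON Lines form (one JSON object per line), as the `.jsonl` extension in the example suggests. `read_jsonl` is a hypothetical helper written for illustration and is not part of this library.

```python
import json
from typing import Any, Dict, List, Optional


def read_jsonl(path: str, limit: Optional[int] = None) -> List[Dict[str, Any]]:
    """Read records from a JSON Lines file (hypothetical helper, not a library function).

    Returns at most `limit` records, or every record when `limit` is None.
    Assumes one JSON object per line; blank lines are skipped.
    """
    records: List[Dict[str, Any]] = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            # Stop as soon as the requested number of records has been collected.
            if limit is not None and len(records) >= limit:
                break
            line = line.strip()
            if not line:
                continue
            records.append(json.loads(line))
    return records
```

For the example above, `read_jsonl("/tmp/test.jsonl", limit=3)` would return the first three cleaned records, and each record should contain the `content` column if the assumption about the output format holds.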