result = hash_content(0, "Hello world!", num_perm=128)
assert result["__id__"] == 0
assert result["__signature__"].shape == (128,)
assert result["__signature__"].dtype == np.dtype('uint64')
core
hash_content
hash_content (idx:int, content:str, num_perm:int)
Hash the content of a record using MinHash. This function is intended to be used with multiprocessing, and it scales well with the number of cores.
| | Type | Details |
|---|---|---|
| idx | int | index of the document |
| content | str | content of the document |
| num_perm | int | number of permutations used for the MinHash signature |
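As a rough illustration of what a MinHash signature is, the sketch below builds one with the standard library only. It is not the datasketch-based implementation behind `hash_content`: the tokenizer, the seeded BLAKE2b hashing, the plain-list signature, and the names `minhash_signature` and `hash_content_sketch` are all assumptions made for this example.

```python
import hashlib
import re
from typing import Dict, List, Union

# Assumption: tokens are runs of letters, digits, and underscores.
NON_ALPHA = re.compile(r"[^A-Za-z0-9_]+")


def _hash64(seed: int, token: str) -> int:
    """Hash a token to a 64-bit integer, salted with the slot number."""
    digest = hashlib.blake2b(f"{seed}:{token}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(content: str, num_perm: int = 128) -> List[int]:
    """Return num_perm minimum hash values over the tokens of content."""
    tokens = {t for t in NON_ALPHA.split(content) if t}
    # Each seed stands in for one random permutation of the token universe.
    # Empty content falls back to the largest 64-bit value in every slot.
    return [
        min((_hash64(seed, t) for t in tokens), default=2**64 - 1)
        for seed in range(num_perm)
    ]


def hash_content_sketch(
    idx: int, content: str, num_perm: int
) -> Dict[str, Union[int, List[int]]]:
    """Mirror the documented return keys, with a list instead of an ndarray."""
    return {"__id__": idx, "__signature__": minhash_signature(content, num_perm)}
```

Because each call depends only on its own arguments, a function of this shape can be mapped over `(idx, content, num_perm)` tuples with a process pool, for example through `multiprocessing.Pool.starmap`.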
query_content
query_content (idx:int, signature:numpy.ndarray, index:datasketch.lsh.MinHashLSH)
Query the MinHashLSH index for the record. This function can be used with multiprocessing as long as the index is shared across processes.
| | Type | Details |
|---|---|---|
| idx | int | index of the document |
| signature | ndarray | MinHash signature of the document |
| index | MinHashLSH | the MinHashLSH index to query |
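To show the idea behind querying an LSH index with a signature, here is a toy banded index written from scratch. It is only a sketch and not datasketch's `MinHashLSH`: the class `BandedIndex`, the hand-picked `bands` parameter, the function `query_content_sketch`, and the `__neighbors__` key in its return value are hypothetical names chosen for this example.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Sequence, Set, Tuple


class BandedIndex:
    """Toy LSH index: signatures that agree on any full band are candidates."""

    def __init__(self, num_perm: int = 128, bands: int = 32) -> None:
        if num_perm % bands != 0:
            raise ValueError("num_perm must be divisible by bands")
        self.bands = bands
        self.rows = num_perm // bands
        # One hash table per band, mapping a band's values to the stored keys.
        self.tables: List[Dict[Tuple[int, ...], Set[str]]] = [
            defaultdict(set) for _ in range(bands)
        ]

    def _bands_of(self, signature: Sequence[int]) -> Iterator[Tuple[int, Tuple[int, ...]]]:
        for b in range(self.bands):
            yield b, tuple(signature[b * self.rows:(b + 1) * self.rows])

    def insert(self, key: str, signature: Sequence[int]) -> None:
        for b, band in self._bands_of(signature):
            self.tables[b][band].add(key)

    def query(self, signature: Sequence[int]) -> Set[str]:
        found: Set[str] = set()
        for b, band in self._bands_of(signature):
            # .get avoids creating empty buckets while querying.
            found |= self.tables[b].get(band, set())
        return found


def query_content_sketch(idx: int, signature: Sequence[int], index: BandedIndex) -> dict:
    """Look up candidate near-duplicates for one record (hypothetical return shape)."""
    return {"__id__": idx, "__neighbors__": sorted(index.query(signature))}
```

The query only reads from the tables, which is why a lookup of this kind can run in several worker processes at once provided they all see the same populated index.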
jaccard_similarity
jaccard_similarity (s1:str, s2:str)
Calculate the Jaccard similarity between two code snippets.
| | Type | Details |
|---|---|---|
| s1 | str | The first string to compare. |
| s2 | str | The second string to compare. |
| Returns | float | The Jaccard similarity between the two strings. |
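The Jaccard similarity of two token sets is the size of their intersection divided by the size of their union. The sketch below is one way to compute it; the tokenization rule (splitting on anything that is not a letter, digit, or underscore), the handling of two empty inputs, and the name `jaccard_sketch` are assumptions, since this section does not show how `jaccard_similarity` tokenizes its inputs.

```python
import re

# Assumption: tokens are runs of letters, digits, and underscores.
NON_ALPHA = re.compile(r"[^A-Za-z0-9_]+")


def jaccard_sketch(s1: str, s2: str) -> float:
    """Return |A ∩ B| / |A ∪ B| for the token sets of the two strings."""
    tokens1 = {t for t in NON_ALPHA.split(s1) if t}
    tokens2 = {t for t in NON_ALPHA.split(s2) if t}
    union = tokens1 | tokens2
    if not union:
        # Convention chosen for this sketch: two token-free strings score 0.0.
        return 0.0
    return len(tokens1 & tokens2) / len(union)
```

Under this tokenization, `"a = 1"` and `"a = 2"` yield the sets `{a, 1}` and `{a, 2}`, which share one token out of three distinct tokens.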
assert jaccard_similarity("a = 1", "a = 2") == 0.3333333333333333
assert jaccard_similarity("a = 1", "a = 1") == 1.0
convert_list_to_dict
convert_list_to_dict (list)
config_lists
config_lists (name)
os.environ["HF_ACCESS_TOKEN"] = "<TOKEN>"
ds_dict = config_lists("amazon_reviews_multi")
ds_dict
{'all_languages': ['validation', 'test']}
process_ds_config
process_ds_config (name, ds_dict)
ds = next(process_ds_config("amazon_reviews_multi", ds_dict))
ds.features
BenchmarkCleaner
BenchmarkCleaner (benchmark_names:list, threshold:float=0.5, num_perm:int=128)
A class to clean a dataset of samples that overlap with the given benchmarks.
| | Type | Default | Details |
|---|---|---|---|
| benchmark_names | list | | The list of benchmark names to clean. |
| threshold | float | 0.5 | The threshold to use for the MinHashLSH index. |
| num_perm | int | 128 | The number of permutations to use for the MinHashLSH index. |
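To give a feel for how a similarity threshold interacts with the number of permutations in a banded LSH scheme, the sketch below evaluates the standard candidate-probability curve. The band and row counts are picked by hand for illustration, and the helper names `candidate_probability` and `approximate_threshold` are hypothetical; how the index derives its internal parameters from `threshold` and `num_perm` is not shown in this section.

```python
def candidate_probability(similarity: float, bands: int, rows: int) -> float:
    """Probability that two items with the given Jaccard similarity
    agree on at least one of `bands` bands of `rows` MinHash values each."""
    return 1.0 - (1.0 - similarity ** rows) ** bands


def approximate_threshold(bands: int, rows: int) -> float:
    """Similarity around which the candidate probability rises most steeply."""
    return (1.0 / bands) ** (1.0 / rows)


if __name__ == "__main__":
    # 128 permutations split into 32 bands of 4 rows (an arbitrary example split).
    for s in (0.1, 0.3, 0.5, 0.7, 0.9):
        print(f"similarity={s:.1f} -> P(candidate)={candidate_probability(s, 32, 4):.3f}")
```

With a fixed number of permutations, more bands of fewer rows lower the effective threshold and flag more pairs as candidates, while fewer bands of more rows raise it.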
benchmark_names = ["openai_humaneval", "mbpp"]
ds = load_dataset("bigcode/the-stack-smol", data_dir="data/python", split="train")
bench_cleaner = BenchmarkCleaner(benchmark_names, threshold=0.1, num_perm=128)
ds = bench_cleaner.clean(ds, "content")
[01/12/23 05:06:31] INFO Data Number : 10000 3890709806.py:127
INFO Duplicate Number : 4611 3890709806.py:128
INFO Duplicate Rate : 46.11% 3890709806.py:129
INFO Total Time : 151.11 seconds 3890709806.py:130