# core

Utilities for MinHash-based deduplication of datasets against evaluation benchmarks.

## hash_content

`hash_content (idx:int, content:str, num_perm:int)`

Hash the content of a record using MinHash. This function is meant to be used with multiprocessing and scales well with the number of cores.

|  | Type | Details |
|---|---|---|
| idx | int | index of the document |
| content | str | content of the document |
| num_perm | int | number of permutations to use for the MinHash signature |

```python
import numpy as np

result = hash_content(0, "Hello world!", num_perm=128)
assert result["__id__"] == 0
assert result["__signature__"].shape == (128,)
assert result["__signature__"].dtype == np.dtype('uint64')
```
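Since the description above calls out multiprocessing, here is a minimal sketch of fanning `hash_content` out over a worker pool; the `docs` list, pool size, and the final dict are illustrative, not part of the library.

```python
# A minimal sketch, assuming hash_content is importable at module level
# (required so worker processes can unpickle it).
import multiprocessing as mp

docs = ["Hello world!", "print('hi')", "def f(): pass"]  # illustrative corpus

if __name__ == "__main__":
    with mp.Pool(mp.cpu_count()) as pool:
        # One (idx, content, num_perm) argument tuple per document.
        records = pool.starmap(hash_content, [(i, d, 128) for i, d in enumerate(docs)])
    signatures = {r["__id__"]: r["__signature__"] for r in records}
```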
## query_content

`query_content (idx:int, signature:numpy.ndarray, index:datasketch.lsh.MinHashLSH)`

Query the MinHashLSH index for the record. This function can be used with multiprocessing as long as the index is shared across processes.

|  | Type | Details |
|---|---|---|
| idx | int | index of the document |
| signature | ndarray | MinHash signature of the document |
| index | MinHashLSH | the MinHashLSH index to query |
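A hedged sketch of building a shared index for this function with the `datasketch` API; the string key format, and reconstructing a `MinHash` from the stored hash values, are assumptions based on the signature above.

```python
# A minimal sketch, assuming the index maps record ids to MinHashes
# rebuilt from the uint64 signatures produced by hash_content.
from datasketch import MinHash, MinHashLSH

index = MinHashLSH(threshold=0.5, num_perm=128)
for i, text in enumerate(["Hello world!", "Goodbye world!"]):  # illustrative corpus
    rec = hash_content(i, text, num_perm=128)
    index.insert(str(rec["__id__"]), MinHash(num_perm=128, hashvalues=rec["__signature__"]))

# Rebuild a signature for a new record and probe the shared index.
probe = hash_content(99, "Hello world!", num_perm=128)
matches = query_content(99, probe["__signature__"], index)
```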
## jaccard_similarity

`jaccard_similarity (s1:str, s2:str)`

Calculate the Jaccard similarity between two code snippets.

|  | Type | Details |
|---|---|---|
| s1 | str | The first string to compare. |
| s2 | str | The second string to compare. |
| Returns | float | The Jaccard similarity between the two strings. |

```python
assert jaccard_similarity("a = 1", "a = 2") == 0.3333333333333333
assert jaccard_similarity("a = 1", "a = 1") == 1.0
```
## convert_list_to_dict

`convert_list_to_dict (list)`
## config_lists

`config_lists (name)`

```python
import os

os.environ["HF_ACCESS_TOKEN"] = "<TOKEN>"
ds_dict = config_lists("amazon_reviews_multi")
ds_dict
```

```
{'all_languages': ['validation', 'test']}
```
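Judging from the output, `config_lists` maps each config of a Hub dataset to its evaluation splits. A hedged sketch using the `datasets` inspection helpers; the non-`train` filter is an assumption read off the output above.

```python
# A hedged sketch; the real function may select configs and splits differently.
from datasets import get_dataset_config_names, get_dataset_split_names

def config_lists_sketch(name):
    return {
        config: [s for s in get_dataset_split_names(name, config) if s != "train"]
        for config in get_dataset_config_names(name)
    }
```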
## process_ds_config

`process_ds_config (name, ds_dict)`

```python
ds = next(process_ds_config("amazon_reviews_multi", ds_dict))
ds.features
```
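Because `next()` is called on its return value, `process_ds_config` is evidently a generator. A hedged sketch of one plausible shape, assuming it yields one loaded split per config/split pair in `ds_dict`:

```python
# A hedged sketch; what the real generator yields is not shown on this page.
from datasets import load_dataset

def process_ds_config_sketch(name, ds_dict):
    for config, splits in ds_dict.items():
        for split in splits:
            yield load_dataset(name, config, split=split)
```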
## BenchmarkCleaner

`BenchmarkCleaner (benchmark_names:list, threshold:float=0.5, num_perm:int=128)`

A class to clean a dataset against a list of benchmark datasets.

|  | Type | Default | Details |
|---|---|---|---|
| benchmark_names | list |  | The list of benchmark names to clean. |
| threshold | float | 0.5 | The threshold to use for the MinHashLSH index. |
| num_perm | int | 128 | The number of permutations to use for the MinHashLSH index. |

```python
from datasets import load_dataset

benchmark_names = ["openai_humaneval", "mbpp"]
ds = load_dataset("bigcode/the-stack-smol", data_dir="data/python", split="train")
bench_cleaner = BenchmarkCleaner(benchmark_names, threshold=0.1, num_perm=128)
ds = bench_cleaner.clean(ds, "content")
```
```
[01/12/23 05:06:31] INFO Data Number : 10000             3890709806.py:127
                    INFO Duplicate Number : 4611         3890709806.py:128
                    INFO Duplicate Rate : 46.11%         3890709806.py:129
                    INFO Total Time : 151.11 seconds     3890709806.py:130
```
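Putting the pieces together, here is a hedged end-to-end sketch of the kind of pipeline `BenchmarkCleaner.clean` appears to run: index the benchmark records with MinHashLSH, then filter out rows whose signatures hit the index. The helper name and the assumption that `query_content` returns the matching keys are mine, not the library's.

```python
# A hedged end-to-end sketch, not the library's implementation.
from datasketch import MinHash, MinHashLSH

def clean_sketch(dataset, benchmark_texts, column, threshold=0.5, num_perm=128):
    # Index every benchmark record by its MinHash signature.
    index = MinHashLSH(threshold=threshold, num_perm=num_perm)
    for i, text in enumerate(benchmark_texts):
        rec = hash_content(i, text, num_perm=num_perm)
        index.insert(str(rec["__id__"]), MinHash(num_perm=num_perm, hashvalues=rec["__signature__"]))

    # Keep only rows whose signature finds no match in the benchmark index
    # (assumes query_content returns a truthy collection on a hit).
    def _keep(row, idx):
        sig = hash_content(idx, row[column], num_perm=num_perm)["__signature__"]
        return not query_content(idx, sig, index)

    return dataset.filter(_keep, with_indices=True)
```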