core

hash_content

 hash_content (idx:int, content:str, num_perm:int)

Hash the content of a record using MinHash. This function is intended to be run with multiprocessing, where it scales well with the number of cores.

	Type	Details
idx	int	index of the document
content	str	content of the document
num_perm	int	number of permutations used for the MinHash signature

import numpy as np

# The signature holds one 64-bit hash value per permutation.
result = hash_content(0, "Hello world!", num_perm=128)
assert result["__id__"] == 0
assert result["__signature__"].shape == (128,)
assert result["__signature__"].dtype == np.dtype('uint64')
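
To make the idea behind the signature concrete, here is a minimal pure-Python sketch of MinHash. It is not the library's implementation: the tokenizer, the salted `blake2b` hashing, and the names `tokenize`, `minhash_signature`, and `estimate_jaccard` are all assumptions chosen for illustration. It shows why a signature has `num_perm` entries and how two signatures can be compared.

```python
import hashlib
import re

# Assumed tokenizer: split on anything that is not a letter, digit, or underscore.
NON_ALPHA = re.compile(r"[^A-Za-z_0-9]")


def tokenize(text: str) -> set:
    """Return the set of non-empty alphanumeric tokens in `text`."""
    return {t for t in NON_ALPHA.split(text) if t}


def _hash64(token: str, seed: int) -> int:
    """Hash a token to a 64-bit integer, salted by `seed` (hypothetical helper)."""
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(text: str, num_perm: int = 128) -> list:
    """Illustrative MinHash: keep the minimum salted hash for each of num_perm seeds."""
    tokens = tokenize(text)
    # An empty token set falls back to the largest 64-bit value in every slot.
    return [
        min((_hash64(t, seed) for t in tokens), default=2**64 - 1)
        for seed in range(num_perm)
    ]


def estimate_jaccard(sig1: list, sig2: list) -> float:
    """Estimate Jaccard similarity as the fraction of matching signature slots."""
    matches = sum(a == b for a, b in zip(sig1, sig2))
    return matches / len(sig1)


sig_a = minhash_signature("def add(a, b): return a + b")
sig_b = minhash_signature("def add(x, y): return x + y")
print(len(sig_a))                      # one value per permutation
print(estimate_jaccard(sig_a, sig_a))  # identical signatures always agree
print(estimate_jaccard(sig_a, sig_b))  # an estimate between 0 and 1
```

Each record is hashed independently of every other record, which is what makes this step easy to spread across worker processes.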

query_content

 query_content (idx:int, signature:numpy.ndarray,
                index:datasketch.lsh.MinHashLSH)

Query the MinHashLSH index for the record. This function can be used with multiprocessing as long as the index is shared across processes.

	Type	Details
idx	int	index of the document
signature	ndarray	MinHash signature of the document
index	MinHashLSH	MinHashLSH index to query
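
As a rough picture of what querying an LSH index involves, the sketch below implements a toy banded index in pure Python. It is a stand-in for illustration only: the class `BandedLSH`, its `bands` parameter, and the exact-match bucketing are assumptions, and the internals of `datasketch.lsh.MinHashLSH` may differ.

```python
from collections import defaultdict


class BandedLSH:
    """Toy LSH index: signatures that share any full band become candidates."""

    def __init__(self, num_perm: int = 128, bands: int = 32):
        if num_perm % bands != 0:
            raise ValueError("num_perm must be divisible by bands")
        self.bands = bands
        self.rows = num_perm // bands
        # One bucket table per band: band contents -> set of stored keys.
        self.tables = [defaultdict(set) for _ in range(bands)]

    def _band_keys(self, signature):
        """Yield (band number, band contents) for each slice of the signature."""
        for b in range(self.bands):
            start = b * self.rows
            yield b, tuple(signature[start:start + self.rows])

    def insert(self, key, signature):
        """Store `key` in the bucket of every band of its signature."""
        for b, band in self._band_keys(signature):
            self.tables[b][band].add(key)

    def query(self, signature):
        """Return keys whose signature matches the query in at least one band."""
        candidates = set()
        for b, band in self._band_keys(signature):
            candidates |= self.tables[b].get(band, set())
        return candidates


index = BandedLSH(num_perm=4, bands=2)
index.insert("a", [1, 2, 3, 4])
index.insert("c", [7, 7, 7, 7])
print(index.query([1, 2, 9, 9]))  # shares the first band with "a"
print(index.query([0, 0, 0, 0]))  # shares no band with any stored key
```

Queries only read the bucket tables, which is consistent with the note above that the index must be shared across processes when querying in parallel.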

jaccard_similarity

 jaccard_similarity (s1:str, s2:str)

Calculate the Jaccard similarity between two code snippets.

	Type	Details
s1	str	The first string to compare.
s2	str	The second string to compare.
Returns	float	The Jaccard similarity between the two strings.

assert jaccard_similarity("a = 1", "a = 2") == 0.3333333333333333
assert jaccard_similarity("a = 1", "a = 1") == 1.0
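
The value 0.333... in the first assertion is consistent with comparing sets of alphanumeric tokens: `{"a", "1"}` and `{"a", "2"}` share one token out of three distinct ones. The sketch below reproduces that arithmetic. The tokenizer regex and the name `jaccard_sketch` are assumptions made for illustration, not the documented implementation.

```python
import re

# Assumed tokenizer: split on anything that is not a letter, digit, or underscore.
NON_ALPHA = re.compile(r"[^A-Za-z_0-9]")


def jaccard_sketch(s1: str, s2: str) -> float:
    """Jaccard similarity of the token sets of two strings: |A & B| / |A | B|."""
    tokens1 = {t for t in NON_ALPHA.split(s1) if t.strip()}
    tokens2 = {t for t in NON_ALPHA.split(s2) if t.strip()}
    union = tokens1 | tokens2
    if not union:
        # Assumption: two strings without any tokens are treated as dissimilar.
        return 0.0
    return len(tokens1 & tokens2) / len(union)


print(jaccard_sketch("a = 1", "a = 2"))  # 1 shared token out of 3 distinct tokens
print(jaccard_sketch("a = 1", "a = 1"))  # identical token sets
```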

convert_list_to_dict

 convert_list_to_dict (list)

config_lists

 config_lists (name)

import os

os.environ["HF_ACCESS_TOKEN"] = "<TOKEN>"
ds_dict = config_lists("amazon_reviews_multi")
ds_dict

{'all_languages': ['validation', 'test']}

process_ds_config

 process_ds_config (name, ds_dict)

ds = next(process_ds_config("amazon_reviews_multi", ds_dict))
ds.features

BenchmarkCleaner

 BenchmarkCleaner (benchmark_names:list, threshold:float=0.5,
                   num_perm:int=128)

A class to clean a dataset of records that duplicate benchmark data.

	Type	Default	Details
benchmark_names	list		The list of benchmark names to clean.
threshold	float	0.5	The Jaccard similarity threshold to use for the MinHashLSH index.
num_perm	int	128	The number of permutations to use for the MinHashLSH index.

from datasets import load_dataset

benchmark_names = ["openai_humaneval", "mbpp"]
ds = load_dataset("bigcode/the-stack-smol", data_dir="data/python", split="train")
bench_cleaner = BenchmarkCleaner(benchmark_names, threshold=0.1, num_perm=128)
ds = bench_cleaner.clean(ds, "content")

[01/12/23 05:06:31] INFO     Data Number                   : 10000                                3890709806.py:127
                    INFO     Duplicate Number              : 4611                                 3890709806.py:128
                    INFO     Duplicate Rate                : 46.11%                               3890709806.py:129
                    INFO     Total Time                    : 151.11 seconds                       3890709806.py:130
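
To illustrate what a similarity threshold does during cleaning, the sketch below filters records by brute force, using exact Jaccard similarity on token sets instead of a MinHashLSH index. It is a simplification under stated assumptions: the function `filter_contaminated`, the tokenizer, and the `>=` comparison are hypothetical and are not taken from `BenchmarkCleaner`.

```python
import re

# Assumed tokenizer: split on anything that is not a letter, digit, or underscore.
NON_ALPHA = re.compile(r"[^A-Za-z_0-9]")


def token_set(text: str) -> set:
    """Return the set of non-empty alphanumeric tokens in `text`."""
    return {t for t in NON_ALPHA.split(text) if t}


def jaccard(a: set, b: set) -> float:
    """Exact Jaccard similarity of two token sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def filter_contaminated(records, benchmark_samples, threshold=0.5):
    """Split records into (kept, dropped) by similarity to any benchmark sample."""
    bench_tokens = [token_set(s) for s in benchmark_samples]
    kept, dropped = [], []
    for record in records:
        tokens = token_set(record)
        # Assumption: a record is a duplicate once similarity reaches the threshold.
        is_duplicate = any(jaccard(tokens, b) >= threshold for b in bench_tokens)
        (dropped if is_duplicate else kept).append(record)
    return kept, dropped


records = ["def add(a, b): return a + b", "print('hello')"]
benchmarks = ["def add(a, b): return a + b"]
kept, dropped = filter_contaminated(records, benchmarks, threshold=0.5)
duplicate_rate = len(dropped) / len(records)
print(kept, dropped, f"{duplicate_rate:.2%}")
```

Lowering the threshold, as in the `threshold=0.1` example above, flags more records as duplicates. The duplicate rate in the log follows the same arithmetic: 4611 duplicates out of 10000 records is 46.11%.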