core

MinHash-based helpers for hashing records, querying a MinHashLSH index, and cleaning benchmark samples out of a dataset.

hash_content

 hash_content (idx:int, content:str, num_perm:int)

Hash the content of a record using MinHash. Each record is hashed independently, so the function is meant to be run with multiprocessing and scales well with the number of cores.

Type Details
idx int index of the document
content str content of the document
num_perm int number of permutations, which sets the length of the MinHash signature
import numpy as np

result = hash_content(0, "Hello world!", num_perm=128)
assert result["__id__"] == 0
assert result["__signature__"].shape == (128,)
assert result["__signature__"].dtype == np.dtype('uint64')
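To illustrate what a MinHash signature is, here is a minimal pure-Python sketch. It is not the implementation behind `hash_content`: the tokenizer, the hash function, the seed, and the names `minhash_signature`, `hash_content_sketch`, and `estimate_jaccard` are all assumptions made for illustration, and the signature is a plain list rather than a `uint64` NumPy array.

```python
import hashlib
import random
import re

_MERSENNE_PRIME = (1 << 61) - 1  # modulus for the illustrative hash family
_MAX_HASH = (1 << 32) - 1        # signature values are kept to 32 bits


def minhash_signature(content, num_perm=128, seed=1):
    """Return a list of num_perm minimum hash values for the tokens of content."""
    rng = random.Random(seed)
    # One (a, b) pair per permutation: h -> (a * h + b) mod p.
    params = [
        (rng.randrange(1, _MERSENNE_PRIME), rng.randrange(0, _MERSENNE_PRIME))
        for _ in range(num_perm)
    ]
    # Assumed tokenizer: split on anything that is not a letter, digit, or underscore.
    tokens = {t for t in re.split(r"[^A-Za-z_0-9]", content) if t}
    signature = [_MAX_HASH] * num_perm
    for token in tokens:
        digest = hashlib.sha1(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:4], "little")
        for i, (a, b) in enumerate(params):
            value = ((a * h + b) % _MERSENNE_PRIME) & _MAX_HASH
            if value < signature[i]:
                signature[i] = value
    return signature


def hash_content_sketch(idx, content, num_perm=128):
    """Hypothetical stand-in that mirrors the keys asserted above."""
    return {"__id__": idx, "__signature__": minhash_signature(content, num_perm)}


def estimate_jaccard(sig1, sig2):
    """Estimate Jaccard similarity as the fraction of matching signature slots."""
    return sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)
```

Because the signature depends only on the set of tokens, two strings with the same tokens in a different order produce identical signatures under this sketch.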

query_content

 query_content (idx:int, signature:numpy.ndarray,
                index:datasketch.lsh.MinHashLSH)

Query the MinHashLSH index for the record. This function can be used with multiprocessing as long as the index is shared across processes.

Type Details
idx int index of the document
signature ndarray MinHash signature of the document
index MinHashLSH MinHashLSH index to query
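The idea behind an LSH lookup can be sketched without any dependencies. The toy index below splits each signature into bands and treats two records as candidates when any band matches exactly. It is not the `datasketch` `MinHashLSH` class: `BandedLSHIndex`, `query_content_sketch`, the fixed band count, and the `__neighbors__` key are hypothetical names chosen for illustration.

```python
from collections import defaultdict


class BandedLSHIndex:
    """Toy LSH index that buckets signatures by fixed-size bands."""

    def __init__(self, num_perm=128, bands=32):
        if num_perm % bands != 0:
            raise ValueError("num_perm must be divisible by bands")
        self.bands = bands
        self.rows = num_perm // bands
        # One hash table per band: band tuple -> set of record keys.
        self.tables = [defaultdict(set) for _ in range(bands)]

    def _band_keys(self, signature):
        for i in range(self.bands):
            yield i, tuple(signature[i * self.rows:(i + 1) * self.rows])

    def insert(self, key, signature):
        for i, band in self._band_keys(signature):
            self.tables[i][band].add(key)

    def query(self, signature):
        """Return the sorted keys that share at least one band with signature."""
        found = set()
        for i, band in self._band_keys(signature):
            found |= self.tables[i].get(band, set())
        return sorted(found)


def query_content_sketch(idx, signature, index):
    """Hypothetical stand-in: look up candidate neighbors for one record."""
    return {"__id__": idx, "__neighbors__": index.query(signature)}


index = BandedLSHIndex(num_perm=8, bands=4)
index.insert("a", [1, 2, 3, 4, 5, 6, 7, 8])
index.insert("c", [9, 9, 9, 9, 9, 9, 9, 9])
hit = query_content_sketch(0, [1, 2, 0, 0, 0, 0, 7, 8], index)
miss = query_content_sketch(1, [0] * 8, index)
```

Here `hit` lists `"a"` as a neighbor because the first and last bands match, while `miss` has no neighbors. Querying only reads the tables, which is why a lookup like this can run in several processes once the index is shared.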

jaccard_similarity

 jaccard_similarity (s1:str, s2:str)

Calculate the Jaccard similarity between two code snippets.

Type Details
s1 str The first string to compare.
s2 str The second string to compare.
Returns float The Jaccard similarity between the two strings.
assert jaccard_similarity("a = 1", "a = 2") == 0.3333333333333333
assert jaccard_similarity("a = 1", "a = 1") == 1.0
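The asserted values are consistent with comparing sets of alphanumeric tokens: "a = 1" yields {a, 1} and "a = 2" yields {a, 2}, so the intersection has one token and the union has three, giving 1/3. A minimal sketch under that assumption follows. The tokenizer is inferred from the asserted values rather than confirmed by the source, and `tokenize` and `jaccard_sketch` are hypothetical names.

```python
import re

# Assumed tokenizer: split on anything that is not a letter, digit, or underscore.
NON_ALPHANUMERIC = re.compile(r"[^A-Za-z_0-9]")


def tokenize(s):
    """Return the set of non-empty alphanumeric tokens in s."""
    return {t for t in NON_ALPHANUMERIC.split(s) if t}


def jaccard_sketch(s1, s2):
    """Jaccard similarity |A & B| / |A | B| over the token sets of s1 and s2."""
    tokens1, tokens2 = tokenize(s1), tokenize(s2)
    union = tokens1 | tokens2
    if not union:
        # Assumption: two token-free strings are treated as having similarity 0.0.
        return 0.0
    return len(tokens1 & tokens2) / len(union)
```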

convert_list_to_dict

 convert_list_to_dict (list)

config_lists

 config_lists (name)
import os

os.environ["HF_ACCESS_TOKEN"] = "<TOKEN>"
ds_dict = config_lists("amazon_reviews_multi")
ds_dict
{'all_languages': ['validation', 'test']}

process_ds_config

 process_ds_config (name, ds_dict)
ds = next(process_ds_config("amazon_reviews_multi", ds_dict))
ds.features

BenchmarkCleaner

 BenchmarkCleaner (benchmark_names:list, threshold:float=0.5,
                   num_perm:int=128)

A class that cleans a dataset by removing samples that are near-duplicates of samples in the given benchmarks.

Type Default Details
benchmark_names list The list of benchmark names to clean.
threshold float 0.5 The Jaccard similarity threshold to use for the MinHashLSH index.
num_perm int 128 The number of permutations to use for the MinHashLSH index.
from datasets import load_dataset

benchmark_names = ["openai_humaneval", "mbpp"]
ds = load_dataset("bigcode/the-stack-smol", data_dir="data/python", split="train")
bench_cleaner = BenchmarkCleaner(benchmark_names, threshold=0.1, num_perm=128)
ds = bench_cleaner.clean(ds, "content")
[01/12/23 05:06:31] INFO     Data Number                   : 10000                                3890709806.py:127
                    INFO     Duplicate Number              : 4611                                 3890709806.py:128
                    INFO     Duplicate Rate                : 46.11%                               3890709806.py:129
                    INFO     Total Time                    : 151.11 seconds                       3890709806.py:130