code_tokenizers

Aligning BPE and AST

This library is built on top of the awesome transformers and tree-sitter libraries. It provides a simple interface for aligning the tokens produced by a BPE tokenizer with the AST nodes produced by a tree-sitter parser.
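To make the idea of alignment concrete, here is a minimal sketch in plain Python. It is not how code_tokenizers is implemented; the function name `align_tokens_to_nodes`, the token spans, and the leaf-node spans are all hand-written assumptions for illustration. Each sub-word token is labelled with the type of the AST leaf node whose character span fully contains it, and tokens that fall outside every node (such as whitespace) get "N/A".

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def align_tokens_to_nodes(
    token_spans: List[Span],
    leaf_nodes: List[Tuple[Span, str]],
) -> List[str]:
    """Label each token with the type of the leaf node containing it, or "N/A"."""
    aligned = []
    for tok_start, tok_end in token_spans:
        label = "N/A"
        for (node_start, node_end), node_type in leaf_nodes:
            # A token belongs to a node if its span lies entirely inside the node's span.
            if node_start <= tok_start and tok_end <= node_end:
                label = node_type
                break
        aligned.append(label)
    return aligned


code = 'f( "hi")'

# Hypothetical sub-word spans: 'f', '(', ' ', '"h', 'i"', ')'
token_spans = [(0, 1), (1, 2), (2, 3), (3, 5), (5, 7), (7, 8)]

# Hypothetical AST leaf nodes with their character spans
leaf_nodes = [
    ((0, 1), "identifier"),
    ((1, 2), "("),
    ((3, 7), "string"),
    ((7, 8), ")"),
]

aligned = align_tokens_to_nodes(token_spans, leaf_nodes)
for (start, end), label in zip(token_spans, aligned):
    print(repr(code[start:end]), label)
```

Note that the string literal is split into two sub-word tokens, and both receive the same node type. The same pattern appears in the real output further down, where several consecutive tokens share the `string` type.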

Install

pip install code_tokenizers

How to use

The main interface of code_tokenizers is the CodeTokenizer class. It combines a pretrained BPE tokenizer from the popular transformers library with a parser from the tree-sitter library.

To specify a CodeTokenizer using the gpt2 BPE tokenizer and the python tree-sitter parser, you can do:

from code_tokenizers.core import CodeTokenizer

py_tokenizer = CodeTokenizer.from_pretrained("gpt2", "python")


You can specify any pretrained BPE tokenizer, either from the Hugging Face Hub or from a local directory, together with the language whose AST should be parsed.

Now, we can tokenize some code:

from pprint import pprint

code = """
def foo():
    print("Hello world!")
"""

encoding = py_tokenizer(code)
pprint(encoding, depth=1)

{'ast_ids': [...],
 'attention_mask': [...],
 'input_ids': [...],
 'is_builtins': [...],
 'is_internal_methods': [...],
 'merged_ast': [...],
 'offset_mapping': [...],
 'parent_ast_ids': [...]}

And we can print out the associated AST types:

Note: In the output below, N/A marks tokens that are not part of the AST, such as spaces and newline characters. Their IDs are set to -1.

for ast_id, parent_ast_id in zip(encoding["ast_ids"], encoding["parent_ast_ids"]):
    if ast_id != -1:
        print(py_tokenizer.node_types[parent_ast_id], py_tokenizer.node_types[ast_id])
    else:
        print("N/A")

N/A
function_definition def
function_definition identifier
parameters (
N/A
N/A
N/A
N/A
call identifier
argument_list (
argument_list string
argument_list string
argument_list string
argument_list )
N/A
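Several consecutive rows in this output repeat the same pair of types, because one AST node can be split into multiple BPE tokens. As a rough way to see this, the sketch below collapses consecutive identical rows into runs. The `rows` list is copied by hand from the output above (with `None` standing in for N/A) so the snippet runs without the library. Treat the grouping as a heuristic: two adjacent tokens with the same types could in principle belong to different nodes.

```python
from itertools import groupby

# (parent type, node type) rows copied from the output above; None stands for "N/A"
rows = [
    None,
    ("function_definition", "def"),
    ("function_definition", "identifier"),
    ("parameters", "("),
    None,
    None,
    None,
    None,
    ("call", "identifier"),
    ("argument_list", "("),
    ("argument_list", "string"),
    ("argument_list", "string"),
    ("argument_list", "string"),
    ("argument_list", ")"),
    None,
]

# Collapse consecutive identical rows into (row, number_of_tokens) runs
runs = [(key, len(list(group))) for key, group in groupby(rows)]

for key, count in runs:
    label = "N/A" if key is None else " ".join(key)
    print(f"{label}: {count} token(s)")
```

Here the three `argument_list string` rows collapse into a single run of three tokens, which corresponds to the string literal "Hello world!" being split into three BPE tokens.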