This library is built on top of the awesome transformers and tree-sitter libraries. It provides a simple interface to align the tokens produced by a BPE tokenizer with the tokens produced by a tree-sitter parser.
Install
pip install code_tokenizers
How to use
The main interface of code_tokenizers is the CodeTokenizer class. You can use a pretrained BPE tokenizer from the popular transformers library, and a tree-sitter parser from the tree-sitter library.
To specify a CodeTokenizer using the gpt2 BPE tokenizer and the python tree-sitter parser, you can do:
from code_tokenizers.core import CodeTokenizerpy_tokenizer = CodeTokenizer.from_pretrained("gpt2", "python")
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
You can specify any pretrained BPE tokenizer from the huggingface hub or a local directory and the language to parse the AST for.