TikTokenBaseTokenizer
- class torchtune.modules.tokenizers.TikTokenBaseTokenizer(path: str, name: str, pattern: str, bos_id: int, eos_id: int, special_tokens: Dict[str, int])[source]
A lightweight wrapper around tiktoken Encoding. This class additionally handles breaking up the input text into substrings of a maximum length and splitting up long repetitions to improve encoding speed.
- Parameters:
  - path (str) – path to the pretrained tiktoken tokenizer file
  - name (str) – name of the tokenizer, passed to tiktoken for identification
  - pattern (str) – regex pattern used to split input text into chunks before byte-pair encoding
  - bos_id (int) – beginning-of-sequence token id
  - eos_id (int) – end-of-sequence token id
  - special_tokens (Dict[str, int]) – mapping of special token strings to their token ids
Example
>>> tokenizer = TikTokenBaseTokenizer("/path/to/tt_model")
>>> tokenized_text = tokenizer.encode("Hello world!", add_bos=True, add_eos=True)
>>> print(tokenized_text)
[1, 31587, 29644, 102, 2]
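The example above passes only the path. A minimal sketch of constructing the tokenizer with the full argument list from the signature is shown below; the tokenizer name, regex split pattern, and special token ids are illustrative assumptions and should be taken from the target model's tokenizer definition in practice.

>>> from torchtune.modules.tokenizers import TikTokenBaseTokenizer
>>> # Assumed GPT-4-style split pattern; use the pattern that matches your model.
>>> pattern = r"""(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+"""
>>> # Assumed special token ids for illustration only.
>>> special_tokens = {"<|begin_of_text|>": 128000, "<|end_of_text|>": 128001}
>>> tokenizer = TikTokenBaseTokenizer(
...     path="/path/to/tt_model",
...     name="example_tokenizer",
...     pattern=pattern,
...     bos_id=special_tokens["<|begin_of_text|>"],
...     eos_id=special_tokens["<|end_of_text|>"],
...     special_tokens=special_tokens,
... )
>>> token_ids = tokenizer.encode("Hello world!", add_bos=True, add_eos=True)
>>> text = tokenizer.decode(token_ids)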