EduNLP.Tokenizer
- class EduNLP.Tokenizer.SpaceTokenizer(stop_words='punctuations', **kwargs)
Tokenize text by splitting on spaces, e.g. "题目 内容" -> ["题目", "内容"].
- Parameters
stop_words (str, optional) – stop words to skip, by default "punctuations"
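The idea of splitting on spaces and skipping stop words can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the function `space_tokenize` and the `PUNCTUATION` set are hypothetical names introduced here, and the exact contents of the "punctuations" stop-word set are an assumption.

```python
import string


def space_tokenize(text, stop_words=None):
    """Split text on whitespace and drop any token found in stop_words."""
    stop_words = set(stop_words or ())
    return [tok for tok in text.split() if tok not in stop_words]


# Hypothetical stop-word set: ASCII punctuation plus a few CJK marks.
# The real "punctuations" set in EduNLP may differ.
PUNCTUATION = set(string.punctuation) | {"，", "。", "？"}

tokens = space_tokenize("题目 内容 。", stop_words=PUNCTUATION)
print(tokens)
```

Because `str.split()` with no argument collapses runs of whitespace, repeated spaces do not produce empty tokens.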
- EduNLP.Tokenizer.get_tokenizer(name, *args, **kwargs)
A unified interface for obtaining different tokenizers by name.
- Parameters
name (str) – the name of the tokenizer, e.g. text, pure_text
args – positional parameters passed to the tokenizer
kwargs – keyword parameters passed to the tokenizer
- Returns
tokenizer
- Return type
Examples
>>> items = ["已知集合$A=\\left\\{x \\mid x^{2}-3 x-4<0\\right\\}, \\quad B=\\{-4,1,3,5\\}, \\quad$ 则 $A \\cap B=$"]
>>> tokenizer = get_tokenizer("pure_text")
>>> tokens = tokenizer(items)
>>> next(tokens)
['已知', '集合', 'A', '=', '\\left', '\\{', 'x', '\\mid', 'x', '^', '{', '2', '}', '-', '3', 'x', '-', '4', '<', '0', '\\right', '\\}', ',', '\\quad', 'B', '=', '\\{', '-', '4', ',', '1', ',', '3', ',', '5', '\\}', ',', '\\quad', 'A', '\\cap', 'B', '=']
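A name-based factory like this is commonly built on a registry that maps names to tokenizer classes. The sketch below shows that general pattern only; it is an assumption about the design, not EduNLP's code. `REGISTRY`, `make_tokenizer`, `WhitespaceTokenizer`, `CharTokenizer`, and the registered names are all hypothetical. As in the example above, each tokenizer is called on a list of items and yields one token list per item.

```python
class WhitespaceTokenizer:
    """Hypothetical tokenizer: splits each item on whitespace."""

    def __init__(self, lowercase=False):
        self.lowercase = lowercase

    def __call__(self, items):
        for item in items:
            text = item.lower() if self.lowercase else item
            yield text.split()


class CharTokenizer:
    """Hypothetical tokenizer: one token per non-space character."""

    def __call__(self, items):
        for item in items:
            yield [ch for ch in item if not ch.isspace()]


# Hypothetical registry mapping a name to a tokenizer class.
REGISTRY = {"whitespace": WhitespaceTokenizer, "char": CharTokenizer}


def make_tokenizer(name, *args, **kwargs):
    """Look up a tokenizer class by name and forward args/kwargs to it."""
    if name not in REGISTRY:
        raise KeyError(f"Unknown tokenizer {name!r}; choose from {sorted(REGISTRY)}")
    return REGISTRY[name](*args, **kwargs)


tokenizer = make_tokenizer("whitespace", lowercase=True)
first = next(tokenizer(["Hello World"]))
print(first)
```

Forwarding `*args` and `**kwargs` unchanged lets each tokenizer define its own options without the factory needing to know about them.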