EduNLP.Tokenizer

class EduNLP.Tokenizer.AstFormulaTokenizer(symbol='gmas', figures=None, **kwargs)[source]
class EduNLP.Tokenizer.CharTokenizer(stop_words='punctuations', **kwargs)[source]
class EduNLP.Tokenizer.CustomTokenizer(symbol='gmas', figures=None, **kwargs)[source]
class EduNLP.Tokenizer.PureTextTokenizer(handle_figure_formula='skip', **kwargs)[source]
class EduNLP.Tokenizer.SpaceTokenizer(stop_words='punctuations', **kwargs)[source]

Tokenize text by splitting on spaces, e.g. “题目 内容” -> [“题目”, “内容”]

Parameters

stop_words (str, optional) – stop words to skip, by default “punctuations”
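
The space-splitting behaviour described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the library's implementation: `space_tokenize` and `PUNCTUATION` are hypothetical names, and the stop-word set used here is a stand-in that may differ from what “punctuations” selects in EduNLP.

```python
import string

# Hypothetical stand-in for a punctuation stop-word set (assumption:
# the library's own "punctuations" list may contain different symbols).
PUNCTUATION = set(string.punctuation) | {"，", "。", "？", "！"}


def space_tokenize(text, stop_words=PUNCTUATION):
    """Split text on whitespace and drop tokens found in stop_words."""
    return [token for token in text.split() if token not in stop_words]


print(space_tokenize("题目 内容"))      # ['题目', '内容']
print(space_tokenize("题目 ， 内容"))   # ['题目', '内容']
```

Passing an empty set as `stop_words` in this sketch keeps every whitespace-separated token, punctuation included.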

class EduNLP.Tokenizer.Tokenizer[source]

Iterator generator for tokenization

EduNLP.Tokenizer.get_tokenizer(name, *args, **kwargs)[source]

A unified interface for using different tokenizers.

Parameters

name (str) – the name of the tokenizer, e.g. text, pure_text

args – positional arguments passed to the tokenizer

kwargs – keyword arguments passed to the tokenizer

Returns

tokenizer

Return type

Tokenizer

Examples

>>> items = ["已知集合$A=\\left\\{x \\mid x^{2}-3 x-4<0\\right\\}, \\quad B=\\{-4,1,3,5\\}, \\quad$ 则 $A \\cap B=$"]
>>> tokenizer = get_tokenizer("pure_text")
>>> tokens = tokenizer(items)
>>> next(tokens)  
['已知', '集合', 'A', '=', '\\left', '\\{', 'x', '\\mid', 'x', '^', '{', '2', '}', '-', '3', 'x', '-', '4', '<',
'0', '\\right', '\\}', ',', '\\quad', 'B', '=', '\\{', '-', '4', ',', '1', ',', '3', ',', '5', '\\}', ',',
'\\quad', 'A', '\\cap', 'B', '=']
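
One common way to build a name-based interface like get_tokenizer is a registry that maps names to classes and forwards the remaining arguments to the constructor. The sketch below illustrates that pattern only; it is not EduNLP's code. `REGISTRY`, `make_tokenizer`, `SpaceSplitter` and `CharSplitter` are hypothetical names, and raising `KeyError` for an unknown name is an assumption of this sketch.

```python
class SpaceSplitter:
    """Hypothetical tokenizer: yields one whitespace-split token list per item."""

    def __init__(self, lowercase=False):
        self.lowercase = lowercase

    def __call__(self, items):
        for item in items:
            text = item.lower() if self.lowercase else item
            yield text.split()


class CharSplitter:
    """Hypothetical tokenizer: yields the non-space characters of each item."""

    def __call__(self, items):
        for item in items:
            yield [ch for ch in item if not ch.isspace()]


# Hypothetical name-to-class registry (not the library's actual table).
REGISTRY = {"space": SpaceSplitter, "char": CharSplitter}


def make_tokenizer(name, *args, **kwargs):
    """Look up a tokenizer class by name and build it with the given arguments."""
    if name not in REGISTRY:
        raise KeyError(f"Unknown tokenizer {name!r}; choose from {sorted(REGISTRY)}")
    return REGISTRY[name](*args, **kwargs)


tokens = make_tokenizer("space", lowercase=True)(["Hello World"])
print(next(tokens))  # ['hello', 'world']
```

As in the example above, calling the tokenizer returns a lazy iterator, so each item is tokenized only when `next` (or a loop) asks for it.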