EduNLP¶
SIF¶
- EduNLP.SIF.sif.is_sif(item, check_formula=True, return_parser=False)[source]¶
Check whether the input item is in SIF format.
- Parameters
item (str) – a raw item representing the question stem
check_formula (bool) –
whether to check the formulas when parsing the item.
True: check the validity of the formulas in the item. False: skip the validity check, which is faster.
return_parser (bool) –
whether to include the parsed item in the return value.
When True, the return value is (bool, Parser); when False, the return value is bool.
- Returns
Raises ValueError when the item cannot be parsed correctly; returns True (and the Parser of the item) when the item is already in standard format; returns False (and the Parser of the item) when the item is not in standard format.
- Return type
bool
Examples
>>> text = '若$x,y$满足约束条件' \
... '$\\left\\{\\begin{array}{c}2 x+y-2 \\leq 0 \\\\ x-y-1 \\geq 0 \\\\ y+1 \\geq 0\\end{array}\\right.$,' \
... '则$z=x+7 y$的最大值$\\SIFUnderline$'
>>> is_sif(text)
True
>>> text = '某校一个课外学习小组为研究某作物的发芽率y和温度x(单位...'
>>> ret = is_sif(text, return_parser=True)
>>> ret
(False, <EduNLP.SIF.parser.parser.Parser object...>)
- EduNLP.SIF.sif.to_sif(item, check_formula=True, parser: Optional[Parser] = None)[source]¶
Convert an item to SIF format.
- Parameters
item (str) – a raw item representing the question stem
check_formula (bool) – whether to check the formulas when parsing the item (only takes effect when parser=None).
parser (Parser) – the parser of item returned from is_sif.
- Returns
item – the item converted to SIF format
- Return type
str
Examples
>>> text = '某校一个课外学习小组为研究某作物的发芽率y和温度x(单位...'
>>> siftext = to_sif(text)
>>> siftext
'某校一个课外学习小组为研究某作物的发芽率$y$和温度$x$(单位...'
>>> ret = is_sif(text, return_parser=True)
>>> ret
(False, <EduNLP.SIF.parser.parser.Parser object...>)
>>> to_sif(text, parser=ret[1])
'某校一个课外学习小组为研究某作物的发芽率$y$和温度$x$(单位...'
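The two functions are meant to be chained: is_sif can hand its Parser to to_sif so the item is parsed only once. The sketch below illustrates that pattern. `ensure_sif` is a hypothetical helper (not part of EduNLP), and the two functions are passed in as arguments so that the snippet also runs with the stand-ins defined here.

```python
def ensure_sif(item, is_sif, to_sif, check_formula=True):
    """Return `item` in SIF format, parsing it only once.

    `is_sif` and `to_sif` are expected to behave like the functions
    documented above; they are injected so the helper is easy to test.
    """
    # With return_parser=True the documented return value is (bool, Parser).
    ok, parser = is_sif(item, check_formula=check_formula, return_parser=True)
    if ok:
        return item
    # Reuse the parser; check_formula only matters when parser is None.
    return to_sif(item, parser=parser)


# Stand-ins that mimic the documented call signatures (illustration only).
def _fake_is_sif(item, check_formula=True, return_parser=False):
    ok = "$" in item
    return (ok, object()) if return_parser else ok


def _fake_to_sif(item, check_formula=True, parser=None):
    return item.replace("x", "$x$")


already_sif = ensure_sif("$x$ + 1", _fake_is_sif, _fake_to_sif)
converted = ensure_sif("x + 1", _fake_is_sif, _fake_to_sif)
```

With EduNLP installed, the real functions from EduNLP.SIF.sif would take the place of the two stand-ins.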
- EduNLP.SIF.sif.sif4sci(item: str, figures: (<class 'dict'>, <class 'bool'>) = None, mode: int = 2, symbol: str = None, tokenization=True, tokenization_params=None, errors='raise')[source]¶
Uses the linear tokenizer by default; specify tokenization_params to change the tokenizer.
- Parameters
item (str) – a raw item representing the question stem
figures (dict or bool) – when it is a dict, it maps figure ids to figure instances for figures in ‘FormFigureID{…}’ format; when it is a bool, it indicates whether to instantiate figures in ‘FormFigureBase64{…}’ format
mode (int) – when mode = 2, use is_sif and check the formulas in the item; when mode = 1, use is_sif but don’t check the formulas in the item; when mode = 0, don’t use is_sif and don’t check anything in the item
symbol (str) –
- select the methods to symbolize:
"t": text, "f": formula, "g": figure, "m": question mark, "a": tag, "s": sep
tokenization (bool) – whether to tokenize item after segmentation
tokenization_params (dict) –
the dict of text_params, formula_params and figure_params used in tokenization
- For formula_params:
method: which tokenizer to use, "linear" or "ast"
- Parameters only used by "linear":
skip_figure_formula: whether to skip formulas in figure format
symbolize_figure_formula: whether to symbolize formulas in figure format
- Parameters only used by "ast":
ord2token: whether to convert variables (mathord) and constants (textord) to special tokens
var_numbering: whether to use a numeric suffix to distinguish different variables
return_type: ‘list’ or ‘ast’
More parameters can be found in the definitions in SIF.tokenization.formula
- For figure_params:
figure_instance: whether to return figure instances in tokens
- For text_params (see the definitions in SIF.tokenization.text):
granularity: word or char
stopwords: default, None, or a list
errors – warn, raise, coerce, strict, ignore
- Returns
When tokenization is False, return SegmentList; When tokenization is True, return TokenList
- Return type
list
Examples
>>> test_item = r"如图所示,则$\bigtriangleup ABC$的面积是$\SIFBlank$。$\FigureID{1}$"
>>> tl = sif4sci(test_item)
>>> tl
['如图所示', '\\bigtriangleup', 'ABC', '面积', '\\SIFBlank', \FigureID{1}]
>>> tl.describe()
{'t': 2, 'f': 2, 'g': 1, 'm': 1}
>>> with tl.filter('fgm'):
...     tl
['如图所示', '面积']
>>> with tl.filter(keep='t'):
...     tl
['如图所示', '面积']
>>> with tl.filter():
...     tl
['如图所示', '\\bigtriangleup', 'ABC', '面积', '\\SIFBlank', \FigureID{1}]
>>> tl.text_tokens
['如图所示', '面积']
>>> tl.formula_tokens
['\\bigtriangleup', 'ABC']
>>> tl.figure_tokens
[\FigureID{1}]
>>> tl.ques_mark_tokens
['\\SIFBlank']
>>> sif4sci(test_item, symbol="gm", tokenization_params={"formula_params": {"method": "ast"}})
['如图所示', <Formula: \bigtriangleup ABC>, '面积', '[MARK]', '[FIGURE]']
>>> sif4sci(test_item, symbol="tfgm")
['[TEXT]', '[FORMULA]', '[TEXT]', '[MARK]', '[TEXT]', '[FIGURE]']
>>> sif4sci(test_item, symbol="gm",
...         tokenization_params={"formula_params": {"method": "ast", "return_type": "list"}})
['如图所示', '\\bigtriangleup', 'A', 'B', 'C', '面积', '[MARK]', '[FIGURE]']
>>> test_item_1 = {
...     "stem": r"若$x=2$, $y=\sqrt{x}$,则下列说法正确的是$\SIFChoice$",
...     "options": [r"$x < y$", r"$y = x$", r"$y < x$"]
... }
>>> tls = [
...     sif4sci(e, symbol="gm",
...             tokenization_params={
...                 "formula_params": {
...                     "method": "ast", "return_type": "list", "ord2token": True, "var_numbering": True,
...                     "link_variable": False}
...             })
...     for e in ([test_item_1["stem"]] + test_item_1["options"])
... ]
>>> tls[1:]
[['mathord_0', '<', 'mathord_1'], ['mathord_0', '=', 'mathord_1'], ['mathord_0', '<', 'mathord_1']]
>>> link_formulas(*tls)
>>> tls[1:]
[['mathord_0', '<', 'mathord_1'], ['mathord_1', '=', 'mathord_0'], ['mathord_1', '<', 'mathord_0']]
>>> from EduNLP.utils import dict2str4sif
>>> test_item_1_str = dict2str4sif(test_item_1, tag_mode="head", add_list_no_tag=False)
>>> test_item_1_str
'$\\SIFTag{stem}$...则下列说法正确的是$\\SIFChoice$$\\SIFTag{options}$$x < y$$\\SIFSep$$y = x$$\\SIFSep$$y < x$'
>>> tl1 = sif4sci(test_item_1_str, symbol="gm",
...               tokenization_params={"formula_params": {"method": "ast", "return_type": "list", "ord2token": True}})
>>> tl1.get_segments()[0]
['\\SIFTag{stem}']
>>> tl1.get_segments()[1:3]
[['[TEXT_BEGIN]', '[TEXT_END]'], ['[FORMULA_BEGIN]', 'mathord', '=', 'textord', '[FORMULA_END]']]
>>> tl1.get_segments(add_seg_type=False)[0:3]
[['\\SIFTag{stem}'], ['mathord', '=', 'textord'], ['mathord', '=', 'mathord', '{ }', '\\sqrt']]
>>> test_item_2 = {"options": [r"$x < y$", r"$y = x$", r"$y < x$"]}
>>> test_item_2
{'options': ['$x < y$', '$y = x$', '$y < x$']}
>>> test_item_2_str = dict2str4sif(test_item_2, tag_mode="head", add_list_no_tag=False)
>>> test_item_2_str
'$\\SIFTag{options}$$x < y$$\\SIFSep$$y = x$$\\SIFSep$$y < x$'
>>> tl2 = sif4sci(test_item_2_str, symbol="gms",
...               tokenization_params={"formula_params": {"method": "ast", "return_type": "list"}})
>>> tl2
['\\SIFTag{options}', 'x', '<', 'y', '[SEP]', 'y', '=', 'x', '[SEP]', 'y', '<', 'x']
>>> tl2.get_segments(add_seg_type=False)
[['\\SIFTag{options}'], ['x', '<', 'y'], ['[SEP]'], ['y', '=', 'x'], ['[SEP]'], ['y', '<', 'x']]
>>> tl2.get_segments(add_seg_type=False, drop="s")
[['\\SIFTag{options}'], ['x', '<', 'y'], ['y', '=', 'x'], ['y', '<', 'x']]
>>> tl3 = sif4sci(test_item_1["stem"], symbol="gs")
>>> tl3.text_segments
[['说法', '正确']]
>>> tl3.formula_segments
[['x', '=', '2'], ['y', '=', '\\sqrt', '{', 'x', '}']]
>>> tl3.figure_segments
[]
>>> tl3.ques_mark_segments
[['\\SIFChoice']]
>>> test_item_3 = r"已知$y=x$,则以下说法中$\textf{正确,b}$的是"
>>> tl4 = sif4sci(test_item_3)
Warning: there is some chinese characters in formula!
>>> tl4.text_segments
[['已知'], ['说法', '中', '正确']]
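Because tokenization_params is a nested dict whose formula options depend on the chosen method, it can help to build it with a small helper. The sketch below is hypothetical and not part of EduNLP: the option names are copied from the parameter description of sif4sci, while the helper name and its strictness about mismatched options are assumptions.

```python
# Option names are taken from the parameter description of sif4sci;
# the helper itself is a hypothetical convenience, not EduNLP code.
LINEAR_ONLY = {"skip_figure_formula", "symbolize_figure_formula"}
AST_ONLY = {"ord2token", "var_numbering", "return_type"}


def make_tokenization_params(method="linear", text_params=None,
                             figure_params=None, **formula_options):
    """Build a tokenization_params dict, rejecting mismatched formula options."""
    if method not in ("linear", "ast"):
        raise ValueError(f"unknown formula tokenizer: {method!r}")
    # Options documented for the other tokenizer are treated as mistakes;
    # any other option is passed through unchanged.
    wrong = (AST_ONLY if method == "linear" else LINEAR_ONLY) & set(formula_options)
    if wrong:
        raise ValueError(f"{sorted(wrong)} not used by the {method!r} tokenizer")
    params = {"formula_params": {"method": method, **formula_options}}
    if text_params is not None:
        params["text_params"] = text_params
    if figure_params is not None:
        params["figure_params"] = figure_params
    return params


ast_params = make_tokenization_params("ast", return_type="list", ord2token=True)
```

The resulting dict has the same shape as the ones used in the examples for sif4sci and could be passed as `tokenization_params=ast_params`.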
EduNLP.Formula¶
- EduNLP.Formula.ast.get_edges(forest)[source]¶
Construct the edge set.
- Parameters
forest (List[Dict]) – the forest
- Returns
edges – the edge set
- Return type
list of tuple(src,dst,type)
- EduNLP.Formula.ast.ast(formula: (<class 'str'>, typing.List[typing.Dict]), index=0, forest_begin=0, father_tree=None, is_str=False)[source]¶
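The original code was written by https://github.com/hxwujinze
- Parameters
formula (str or List[Dict]) – the formula string, or the structure obtained by parsing it with katex
index (int) – the position of this subtree in the tree
forest_begin (int) – the starting position of this tree in the forest
father_tree (List[Dict]) – the parent tree
is_str (bool) –
- Returns
tree (List[Dict]) – the feature tree built by re-parsing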
todo (finish all types)
Notes
Some functions are not supported in katex, e.g.,
- tag
\begin{equation} \tag{tagName} F=ma \end{equation}
\begin{align} \tag{1} y=x+z \end{align}
\tag*{hi} x+y^{2x}
- dddot
\frac{ \dddot y }{ x }
For more information, refer to katex support table
EduNLP.I2V¶
- class EduNLP.I2V.i2v.I2V(tokenizer, t2v, *args, tokenizer_kwargs: Optional[dict] = None, pretrained_t2v=False, model_dir='/home/docs/.EduNLP/model', **kwargs)[source]¶
It is only a base API, so you shouldn’t use it directly. To get vectors from items, use a concrete model such as D2V or W2V.
- Parameters
tokenizer (str) – the name of tokenizer. eg. bert, pure_text, …
t2v (str) – the name of token2vector model
args – the parameters passed to t2v
tokenizer_kwargs (dict) – the parameters passed to tokenizer
pretrained_t2v (bool) –
True: use pretrained t2v model
False: use your own t2v model
model_dir (str) – local directory for saving pretrained models downloaded online; only works when pretrained_t2v=True
kwargs – the parameters passed to t2v
Examples
>>> item = {"如图来自古希腊数学家希波克拉底所研究的几何图形.此图由三个半圆构成,三个半圆的直径分别为直角三角形$ABC$的斜边$BC$,
... 直角边$AB$, $AC$.$\bigtriangleup ABC$的三边所围成的区域记为$I$,黑色部分记为$II$, 其余部分记为$III$.在整个图形中随机取一点,
... 此点取自$I,II,III$的概率分别记为$p_1,p_2,p_3$,则$\SIFChoice$$\FigureID{1}$"}
>>> model_dir = "examples/test_model/d2v"
>>> url, model_name, *args = get_pretrained_model_info('d2v_test_256')
>>> (); path = get_data(url, model_dir); ()
(...)
>>> path = path_append(path, os.path.basename(path) + '.bin', to_str=True)
>>> i2v = D2V("pure_text", "d2v", filepath=path, pretrained_t2v=False)
>>> i2v(item)
([array([ ...dtype=float32)], None)
- Returns
i2v model
- Return type
- property vector_size¶
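As the examples show, calling an I2V model returns a pair of item vectors and token vectors (the D2V example yields `([array(...)], None)`). The following toy sketch illustrates how a tokenizer and a token-to-vector step could be composed behind such a call. `TinyI2V`, `whitespace_tokenizer` and `length_t2v` are invented for illustration and do not reflect EduNLP's internal implementation.

```python
class TinyI2V:
    """Toy stand-in for a tokenizer -> t2v pipeline (not EduNLP code)."""

    def __init__(self, tokenizer, t2v):
        self.tokenizer = tokenizer  # callable: item -> list of tokens
        self.t2v = t2v              # callable: tokens -> (item_vec, token_vecs)

    def __call__(self, items):
        item_vectors, token_vectors = [], []
        for item in items:
            tokens = self.tokenizer(item)
            item_vec, token_vecs = self.t2v(tokens)
            item_vectors.append(item_vec)
            token_vectors.append(token_vecs)
        # Same shape as the documented call result: (item vectors, token vectors).
        return item_vectors, token_vectors


def whitespace_tokenizer(item):
    return item.split()


def length_t2v(tokens):
    # Toy "embedding": each token becomes a 1-d vector holding its length,
    # and the item vector is the mean of its token vectors.
    token_vecs = [[float(len(tok))] for tok in tokens]
    mean = sum(v[0] for v in token_vecs) / len(token_vecs)
    return [mean], token_vecs


i2v = TinyI2V(whitespace_tokenizer, length_t2v)
item_vectors, token_vectors = i2v(["ab cdef", "xyz"])
```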
- class EduNLP.I2V.i2v.D2V(tokenizer, t2v, *args, tokenizer_kwargs: Optional[dict] = None, pretrained_t2v=False, model_dir='/home/docs/.EduNLP/model', **kwargs)[source]¶
The model aims to convert an item to a vector directly.
Bases¶
I2V
- Parameters
tokenizer (str) – the tokenizer name
t2v (str) – the name of token2vector model
args – the parameters passed to t2v
tokenizer_kwargs (dict) – the parameters passed to tokenizer
pretrained_t2v (bool) – True: use pretrained t2v model; False: use your own t2v model
kwargs – the parameters passed to t2v
Examples
>>> item = {"如图来自古希腊数学家希波克拉底所研究的几何图形.此图由三个半圆构成,三个半圆的直径分别为直角三角形$ABC$的斜边$BC$,
... 直角边$AB$, $AC$.$\bigtriangleup ABC$的三边所围成的区域记为$I$,黑色部分记为$II$, 其余部分记为$III$.在整个图形中随机取一点,
... 此点取自$I,II,III$的概率分别记为$p_1,p_2,p_3$,则$\SIFChoice$$\FigureID{1}$"}
>>> model_dir = "examples/test_model/d2v"
>>> url, model_name, *args = get_pretrained_model_info('d2v_test_256')
>>> (); path = get_data(url, model_dir); ()
(...)
>>> path = path_append(path, os.path.basename(path) + '.bin', to_str=True)
>>> i2v = D2V("pure_text", "d2v", filepath=path, pretrained_t2v=False)
>>> i2v(item)
([array([ ...dtype=float32)], None)
- Returns
i2v model
- Return type
I2V
- infer_vector(items, tokenize=True, key=<function D2V.<lambda>>, *args, **kwargs) tuple[source]¶
Convert items to vectors. The model must be loaded before this function is called.
- Parameters
items (str) – the text of question
tokenize (bool) – True: tokenize the item
key (function) – determine how to get the text of each item
args – the parameters passed to t2v
kwargs – the parameters passed to t2v
- Returns
vector
- Return type
list
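The key parameter lets the same call accept plain strings or richer records. A minimal sketch of the idea follows. `collect_texts` is a hypothetical helper, and the identity default is an assumption based on the "by default lambda x: x" wording used for QuesNet elsewhere in this reference.

```python
def collect_texts(items, key=lambda x: x):
    """Apply `key` to each item to obtain the text that will be vectorized."""
    return [key(item) for item in items]


# Plain strings: the default identity key is enough.
plain = collect_texts(["1 + 1 = ?", "2 * 3 = ?"])

# Dict records: the key picks the field that holds the question text.
from_dicts = collect_texts(
    [{"ques_content": "1 + 1 = ?"}, {"ques_content": "2 * 3 = ?"}],
    key=lambda x: x["ques_content"],
)
```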
- class EduNLP.I2V.i2v.W2V(tokenizer, t2v, *args, tokenizer_kwargs: Optional[dict] = None, pretrained_t2v=False, model_dir='/home/docs/.EduNLP/model', **kwargs)[source]¶
The model aims to convert tokens to vectors.
Bases¶
I2V
- Parameters
tokenizer (str) – the tokenizer name
t2v (str) – the name of token2vector model
args – the parameters passed to t2v
tokenizer_kwargs (dict) – the parameters passed to tokenizer
pretrained_t2v (bool) – True: use pretrained t2v model; False: use your own t2v model
kwargs – the parameters passed to t2v
Examples
>>> (); i2v = get_pretrained_i2v("w2v_test_256", "examples/test_model/w2v"); ()
(...)
>>> item_vector, token_vector = i2v(["有学者认为:‘学习’,必须适应实际"])
>>> item_vector
[array([...], dtype=float32)]
- Returns
i2v model
- Return type
W2V
- infer_vector(items, tokenize=True, key=<function W2V.<lambda>>, *args, **kwargs) tuple[source]¶
Convert items to vectors. The model must be loaded before this function is called.
- Parameters
items (str) – the text of question
tokenize (bool) – True: tokenize the item
key (function) – determine how to get the text of each item
args – the parameters passed to t2v
kwargs – the parameters passed to t2v
- Returns
vector
- Return type
list
- class EduNLP.I2V.i2v.Elmo(tokenizer, t2v, *args, tokenizer_kwargs: Optional[dict] = None, pretrained_t2v=False, model_dir='/home/docs/.EduNLP/model', **kwargs)[source]¶
The model aims to convert items and tokens to vectors with Elmo.
Bases¶
I2V
- Parameters
tokenizer (str) – the tokenizer name
t2v (str) – the name of token2vector model
args – the parameters passed to t2v
tokenizer_kwargs (dict) – the parameters passed to tokenizer
pretrained_t2v (bool) – True: use pretrained t2v model; False: use your own t2v model
kwargs – the parameters passed to t2v
- Returns
i2v model
- Return type
Elmo
- infer_vector(items: ~typing.Tuple[~typing.List[str], ~typing.List[dict], str, dict], *args, key=<function Elmo.<lambda>>, **kwargs) tuple[source]¶
Convert items to vectors. The model must be loaded before this function is called.
- Parameters
items (str or dict or list) – the item of question, or question list
return_tensors (str) – tensor type used in tokenizer
args – the parameters passed to t2v
kwargs – the parameters passed to t2v
- Returns
vector
- Return type
list
- class EduNLP.I2V.i2v.Bert(tokenizer, t2v, *args, tokenizer_kwargs: Optional[dict] = None, pretrained_t2v=False, model_dir='/home/docs/.EduNLP/model', **kwargs)[source]¶
The model aims to convert items and tokens to vectors with Bert.
Bases¶
I2V
- Parameters
tokenizer (str) – the tokenizer name
t2v (str) – the name of token2vector model
args – the parameters passed to t2v
tokenizer_kwargs (dict) – the parameters passed to tokenizer
pretrained_t2v (bool) – True: use pretrained t2v model; False: use your own t2v model
kwargs – the parameters passed to t2v
- Returns
i2v model
- Return type
Bert
- infer_vector(items: ~typing.Tuple[~typing.List[str], ~typing.List[dict], str, dict], *args, key=<function Bert.<lambda>>, return_tensors='pt', **kwargs) tuple[source]¶
Convert items to vectors. The model must be loaded before this function is called.
- Parameters
items (str or dict or list) – the item of question, or question list
return_tensors (str) – tensor type used in tokenizer
args – the parameters passed to t2v
kwargs – the parameters passed to t2v
- Returns
vector
- Return type
list
- class EduNLP.I2V.i2v.DisenQ(tokenizer, t2v, *args, tokenizer_kwargs: Optional[dict] = None, pretrained_t2v=False, model_dir='/home/docs/.EduNLP/model', **kwargs)[source]¶
The model aims to convert items and tokens to vectors with DisenQ.
Bases
I2V
- Parameters
tokenizer (str) – the tokenizer name
t2v (str) – the name of token2vector model
args – the parameters passed to t2v
tokenizer_kwargs (dict) – the parameters passed to tokenizer
pretrained_t2v – True: use pretrained t2v model; False: use your own t2v model
kwargs – the parameters passed to t2v
- Returns
i2v model
- Return type
- infer_vector(items: ~typing.Tuple[~typing.List[str], ~typing.List[dict], str, dict], *args, key=<function DisenQ.<lambda>>, vector_type=None, **kwargs) tuple[source]¶
Convert items to vectors. The model must be loaded before this function is called.
- Parameters
items (str or dict or list) – the item of question, or question list
key (function) – determine how to get the text of each item
args – the parameters passed to t2v
kwargs – the parameters passed to t2v
- Returns
vector
- Return type
list
- class EduNLP.I2V.i2v.QuesNet(tokenizer, t2v, *args, tokenizer_kwargs: Optional[dict] = None, pretrained_t2v=False, model_dir='/home/docs/.EduNLP/model', **kwargs)[source]¶
The model aims to convert items and tokens to vectors with QuesNet.
Bases
I2V
- infer_vector(items: ~typing.Tuple[~typing.List[str], ~typing.List[dict], str, dict], *args, key=<function QuesNet.<lambda>>, meta=['know_name'], **kwargs)[source]¶
Convert items to vectors. The model must be loaded before this function is called.
- Parameters
items (str or dict or list) – the item of question, or question list
tokenize (bool, optional) – True: tokenize the item
key (function, optional) – determine how to get the text of each item, by default lambda x: x
meta (list, optional) – meta information, by default [‘know_name’]
args – the parameters passed to t2v
kwargs – the parameters passed to t2v
- Returns
token embeddings
question embedding
- EduNLP.I2V.i2v.get_pretrained_i2v(name, model_dir='/home/docs/.EduNLP/model')[source]¶
A convenient entry point if you want to convert items to vectors easily with a pretrained model.
- Parameters
name (str) – the name of item2vector model, e.g.: d2v_math_300, w2v_math_300, elmo_math_2048, bert_math_768, bert_taledu_768, disenq_math_256, quesnet_math_512
model_dir (str) – the path of model, default: MODEL_DIR = ‘~/.EduNLP/model’
- Returns
i2v model
- Return type
Examples
>>> item = {"如图来自古希腊数学家希波克拉底所研究的几何图形.此图由三个半圆构成,三个半圆的直径分别为直角三角形$ABC$的斜边$BC$,
... 直角边$AB$, $AC$.$\bigtriangleup ABC$的三边所围成的区域记为$I$,黑色部分记为$II$, 其余部分记为$III$.在整个图形中随机取一点,
... 此点取自$I,II,III$的概率分别记为$p_1,p_2,p_3$,则$\SIFChoice$$\FigureID{1}$"}
>>> (); i2v = get_pretrained_i2v("d2v_test_256", "examples/test_model/d2v"); ()
(...)
>>> print(i2v(item))
([array([ ...dtype=float32)], None)
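The model names listed for get_pretrained_i2v all appear to follow a `<model>_<corpus>_<size>` pattern. The sketch below splits a name along that pattern. Note that the pattern is only inferred from the listed names and is not a documented contract, and that `parse_model_name` and `ModelName` are hypothetical names introduced here.

```python
from typing import NamedTuple


class ModelName(NamedTuple):
    model: str   # e.g. "d2v", "bert"
    corpus: str  # e.g. "math", "taledu"
    size: int    # trailing number, presumably the vector size


def parse_model_name(name):
    """Split a name such as 'bert_taledu_768' into its three parts.

    Assumes the '<model>_<corpus>_<size>' pattern seen in the listed names.
    """
    parts = name.split("_")
    if len(parts) != 3 or not parts[2].isdigit():
        raise ValueError(f"unexpected model name: {name!r}")
    return ModelName(parts[0], parts[1], int(parts[2]))


parsed = parse_model_name("bert_taledu_768")
```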
EduNLP.Pretrain¶
- EduNLP.Pretrain.train_vector(items, w2v_prefix, embedding_dim=None, method='sg', binary=None, train_params=None)[source]¶
- Parameters
items (str) – the text of the question
w2v_prefix –
embedding_dim (int) – vector_size
method (str) – the method of training, e.g.: sg, cbow, fasttext, d2v, bow, tfidf
binary (model format) – True:bin; False:kv
train_params (dict) – the training parameters passed to model
- Returns
tokenizer
- Return type
Examples
>>> tokenizer = GensimSegTokenizer(symbol="gms", depth=None)
>>> token_item = tokenizer("有公式$\FormFigureID{1}$,如图$\FigureID{088f15ea-xxx}$,
... 若$x,y$满足约束条件公式$\FormFigureBase64{2}$,$\SIFSep$,则$z=x+7 y$的最大值为$\SIFBlank$")
>>> print(token_item[:10])
[['公式'], [\FormFigureID{1}], ['如图'], ['[FIGURE]'],...['最大值'], ['[MARK]']]
>>> train_vector(token_item[:10], "examples/test_model/w2v/gensim_luna_stem_t_", 100)
'examples/test_model/w2v/gensim_luna_stem_t_sg_100.kv'
- class EduNLP.Pretrain.GensimWordTokenizer(symbol='gm', general=False)[source]¶
- Parameters
symbol (str) –
- select the methods to symbolize:
"t": text, "f": formula, "g": figure, "m": question mark, "a": tag, "s": sep,
e.g.: gm, fgm, gmas, fgmas
general (bool) –
True: when the item isn’t in standard format and you want to tokenize formulas (except formulas in figures) linearly.
False: when you want to use the ‘ast’ method to tokenize formulas instead of ‘linear’.
- Returns
tokenizer
- Return type
Examples
>>> tokenizer = GensimWordTokenizer(symbol="gmas", general=True)
>>> token_item = tokenizer("有公式$\FormFigureID{1}$,如图$\FigureID{088f15ea-xxx}$,
... 若$x,y$满足约束条件公式$\FormFigureBase64{2}$,$\SIFSep$,则$z=x+7 y$的最大值为$\SIFBlank$")
>>> print(token_item.tokens[:10])
['公式', '[FORMULA]', '如图', '[FIGURE]', 'x', ',', 'y', '约束条件', '公式', '[FORMULA]']
>>> tokenizer = GensimWordTokenizer(symbol="fgmas", general=False)
>>> token_item = tokenizer("有公式$\FormFigureID{1}$,如图$\FigureID{088f15ea-xxx}$,
... 若$x,y$满足约束条件公式$\FormFigureBase64{2}$,$\SIFSep$,则$z=x+7 y$的最大值为$\SIFBlank$")
>>> print(token_item.tokens[:10])
['公式', '[FORMULA]', '如图', '[FIGURE]', '[FORMULA]', '约束条件', '公式', '[FORMULA]', '[SEP]', '[FORMULA]']
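The symbol argument is a compact string in which each letter selects one kind of segment to symbolize. A small sketch that expands such a string into readable names is given below. The letter-to-kind mapping is taken from the parameter description above, while `describe_symbol` is a hypothetical helper and its rejection of unknown letters is an assumption about reasonable behaviour, not a statement about EduNLP.

```python
# Mapping copied from the description of the `symbol` parameter.
SYMBOL_KINDS = {
    "t": "text",
    "f": "formula",
    "g": "figure",
    "m": "question mark",
    "a": "tag",
    "s": "sep",
}


def describe_symbol(symbol):
    """Expand a symbol string such as 'gmas' into the kinds it selects."""
    unknown = sorted(set(symbol) - set(SYMBOL_KINDS))
    if unknown:
        raise ValueError(f"unknown symbol letters: {unknown}")
    return [SYMBOL_KINDS[ch] for ch in symbol]


kinds = describe_symbol("gmas")
```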
- class EduNLP.Pretrain.GensimSegTokenizer(symbol='gms', depth=None, flatten=False, **kwargs)[source]¶
- Parameters
symbol (str) –
- select the methods to symbolize:
"t": text, "f": formula, "g": figure, "m": question mark, "a": tag, "s": sep,
e.g. gms, fgm
depth (int or None) – 0: only separate at SIFSep ; 1: only separate at SIFTag ; 2: separate at SIFTag and SIFSep ; otherwise, separate all segments ;
- Returns
tokenizer
- Return type
Examples
>>> tokenizer = GensimSegTokenizer(symbol="gms", depth=None)
>>> token_item = tokenizer("有公式$\FormFigureID{1}$,如图$\FigureID{088f15ea-xxx}$,
... 若$x,y$满足约束条件公式$\FormFigureBase64{2}$,$\SIFSep$,则$z=x+7 y$的最大值为$\SIFBlank$")
>>> print(token_item[:10])
[['公式'], [\FormFigureID{1}], ['如图'], ['[FIGURE]'],...['最大值'], ['[MARK]']]
>>> tokenizer = GensimSegTokenizer(symbol="fgm", depth=None)
>>> token_item = tokenizer("有公式$\FormFigureID{1}$,如图$\FigureID{088f15ea-xxx}$,
... 若$x,y$满足约束条件公式$\FormFigureBase64{2}$,$\SIFSep$,则$z=x+7 y$的最大值为$\SIFBlank$")
>>> print(token_item[:10])
[['公式'], ['[FORMULA]'], ['如图'], ['[FIGURE]'], ['[FORMULA]'],...['[FORMULA]'], ['最大值'], ['[MARK]']]
- class EduNLP.Pretrain.QuesNetTokenizer(vocab_path=None, meta_vocab_dir=None, img_dir: Optional[str] = None, max_length=250, tokenize_method='custom', symbol='mas', add_specials: Optional[list] = None, meta: Optional[List[str]] = None, img_token='<img>', unk_token='<unk>', pad_token='<pad>', **kwargs)[source]¶
Examples
>>> tokenizer = QuesNetTokenizer(meta=['knowledge'])
>>> test_items = [{"ques_content": "$\triangle A B C$ 的内角为 $A, \quad B, $\FigureID{test_id}$",
...     "knowledge": "['*', '-', '/']"}, {"ques_content": "$\triangle A B C$ 的内角为 $A, \quad B",
...     "knowledge": "['*', '-', '/']"}]
>>> tokenizer.set_vocab(test_items,
...     trim_min_count=1, key=lambda x: x["ques_content"], silent=True)
>>> tokenizer.set_meta_vocab(test_items, silent=True)
>>> token_items = [tokenizer(i, key=lambda x: x["ques_content"]) for i in test_items]
>>> print(token_items[0].keys())
dict_keys(['seq_idx', 'meta_idx'])
>>> token_items = tokenizer(test_items, key=lambda x: x["ques_content"])
>>> print(len(token_items["seq_idx"]))
2
- set_vocab(items: list, key=<function QuesNetTokenizer.<lambda>>, lower: bool = False, trim_min_count: int = 1, do_tokenize: bool = True, silent=True)[source]¶
- Parameters
items (list) – can be the list of str, or list of dict
key (function) – determine how to get the text of each item
trim_min_count (int, optional) – the lower bound number for adding a word into vocabulary, by default 1
silent –
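The trim_min_count parameter is described as the lower bound on how often a word must occur to enter the vocabulary. The sketch below shows one way such a cutoff can work on already-tokenized items. It is an illustration only: `build_vocab`, the ordering of entries and the placement of the special tokens are assumptions, and only the `<pad>` and `<unk>` strings come from the QuesNetTokenizer signature.

```python
from collections import Counter


def build_vocab(tokenized_items, trim_min_count=1, specials=("<pad>", "<unk>")):
    """Map tokens to indices, keeping tokens seen at least trim_min_count times."""
    counts = Counter(tok for tokens in tokenized_items for tok in tokens)
    vocab = list(specials)
    # Sort by descending count, then alphabetically, for a deterministic order.
    for tok, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if n >= trim_min_count and tok not in vocab:
            vocab.append(tok)
    return {tok: idx for idx, tok in enumerate(vocab)}


# "a" occurs three times; "b" and "c" occur once and are trimmed.
vocab = build_vocab([["a", "b", "a"], ["a", "c"]], trim_min_count=2)
```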
- classmethod from_pretrained(tokenizer_config_dir, img_dir=None, **kwargs)[source]¶
- Parameters
tokenizer_config_dir (str) – must contain tokenizer_config.json, vocab.txt and meta_{meta_name}.txt
img_dir (str) – the path of the image directory, default None
- EduNLP.Pretrain.pretrain_quesnet(path, output_dir, img_dir=None, save_embs=False, train_params=None)[source]¶
Pretrain QuesNet.
- Parameters
path (str) – path of question file
output_dir (str) – output path
tokenizer (QuesNetTokenizer) – quesnet tokenizer
save_embs (bool, optional) – whether to save pretrained word/image/meta embeddings separately
train_params (dict, optional) –
the training parameters and model parameters, by default None
- "n_epochs": int, default = 1
train param, number of epochs
- "batch_size": int, default = 6
train param, batch size
- "lr": float, default = 1e-3
train param, learning rate
- "save_every": int, default = 0
train param, save steps interval
- "log_steps": int, default = 10
train param, log steps interval
- "device": str, default = ‘cpu’
train param, ‘cpu’ or ‘cuda’
- "max_steps": int, default = 0
train param, stop training when reach max steps
- "emb_size": int, default = 256
model param, the embedding size of word, figure, meta info
- "feat_size": int, default = 256
model param, the size of question infer vector
Examples
>>> tokenizer = QuesNetTokenizer(meta=['know_name'])
>>> items = [{"ques_content": "若复数$z=1+2 i+i^{3}$,则$|z|=$,$\FigureID{000004d6-0479-11ec-829b-797d5eb43535}$",
...     "ques_id": "726cdbec-33a9-11ec-909c-98fa9b625adb",
...     "know_name": "['代数', '集合', '集合的相等']"
... }]
>>> tokenizer.set_vocab(items, key=lambda x: x['ques_content'], trim_min_count=1, silent=True)
>>> pretrain_quesnet('./data/standard_luna_data.json', './testQuesNet', tokenizer)
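Since train_params is optional and every key has a documented default, a caller usually overrides only a few entries. The sketch below merges user overrides with the defaults listed for pretrain_quesnet. `resolve_train_params` is a hypothetical helper, and rejecting unknown keys is a design choice made here, not something the documentation states about EduNLP.

```python
# Defaults copied from the train_params description of pretrain_quesnet.
QUESNET_DEFAULTS = {
    "n_epochs": 1,
    "batch_size": 6,
    "lr": 1e-3,
    "save_every": 0,
    "log_steps": 10,
    "device": "cpu",
    "max_steps": 0,
    "emb_size": 256,
    "feat_size": 256,
}


def resolve_train_params(train_params=None):
    """Return the documented defaults updated with the user's overrides."""
    overrides = dict(train_params or {})
    unknown = sorted(set(overrides) - set(QUESNET_DEFAULTS))
    if unknown:
        # Catch typos such as "n_epoch" early instead of silently ignoring them.
        raise KeyError(f"unknown train_params keys: {unknown}")
    return {**QUESNET_DEFAULTS, **overrides}


params = resolve_train_params({"n_epochs": 3, "device": "cuda"})
```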
- class EduNLP.Pretrain.Question(id, content, answer, false_options, labels)¶
- property answer¶
Alias for field number 2
- property content¶
Alias for field number 1
- property false_options¶
Alias for field number 3
- property id¶
Alias for field number 0
- property labels¶
Alias for field number 4
- class EduNLP.Pretrain.DisenQTokenizer(vocab_path=None, max_length=250, tokenize_method='pure_text', add_specials: Optional[list] = None, num_token='[NUM]', **kwargs)[source]¶
Examples
>>> tokenizer = DisenQTokenizer()
>>> test_items = [{
...     "content": "甲 数 除以 乙 数 的 商 是 1.5 , 如果 甲 数 增加 20 , 则 甲 数 是 乙 的 4 倍 . 原来 甲 数 = .",
...     "knowledge": ["*", "-", "/"], "difficulty": 0.2, "length": 7}]
>>> tokenizer.set_vocab(test_items,
...     key=lambda x: x["content"], trim_min_count=1)
[['甲', '数', '除以', '乙', '数', '商', '[NUM]', '甲', '数', '增加', '[NUM]', '甲', '数', '乙', '倍', '甲', '数']]
>>> token_items = [tokenizer(i, key=lambda x: x["content"]) for i in test_items]
>>> print(token_items[0].keys())
dict_keys(['seq_idx', 'seq_len'])
- class EduNLP.Pretrain.AutoTokenizer[source]¶
This is a generic tokenizer class that will be instantiated as one of the tokenizer classes of the library when created with the [AutoTokenizer.from_pretrained] class method.
This class cannot be instantiated directly using __init__() (throws an error).
- classmethod from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)[source]¶
Instantiate one of the tokenizer classes of the library from a pretrained model vocabulary.
The tokenizer class to instantiate is selected based on the model_type property of the config object (either passed as an argument or loaded from pretrained_model_name_or_path if possible), or when it’s missing, by falling back to using pattern matching on pretrained_model_name_or_path:
albert – [AlbertTokenizerFast] (ALBERT model)
bart – [BartTokenizer] or [BartTokenizerFast] (BART model)
barthez – [BarthezTokenizerFast] (BARThez model)
bartpho – [BartphoTokenizer] (BARTpho model)
bert – [BertTokenizer] or [BertTokenizerFast] (BERT model)
bert-generation – (Bert Generation model)
bert-japanese – [BertJapaneseTokenizer] (BertJapanese model)
bertweet – [BertweetTokenizer] (BERTweet model)
big_bird – [BigBirdTokenizerFast] (BigBird model)
bigbird_pegasus – [PegasusTokenizer] or [PegasusTokenizerFast] (BigBird-Pegasus model)
blenderbot – [BlenderbotTokenizer] or [BlenderbotTokenizerFast] (Blenderbot model)
blenderbot-small – [BlenderbotSmallTokenizer] (BlenderbotSmall model)
bloom – [BloomTokenizerFast] (BLOOM model)
byt5 – [ByT5Tokenizer] (ByT5 model)
camembert – [CamembertTokenizerFast] (CamemBERT model)
canine – [CanineTokenizer] (CANINE model)
clip – [CLIPTokenizer] or [CLIPTokenizerFast] (CLIP model)
codegen – [CodeGenTokenizer] or [CodeGenTokenizerFast] (CodeGen model)
convbert – [ConvBertTokenizer] or [ConvBertTokenizerFast] (ConvBERT model)
cpm – [CpmTokenizerFast] (CPM model)
ctrl – [CTRLTokenizer] (CTRL model)
data2vec-text – [RobertaTokenizer] or [RobertaTokenizerFast] (Data2VecText model)
deberta – [DebertaTokenizer] or [DebertaTokenizerFast] (DeBERTa model)
deberta-v2 – [DebertaV2TokenizerFast] (DeBERTa-v2 model)
distilbert – [DistilBertTokenizer] or [DistilBertTokenizerFast] (DistilBERT model)
dpr – [DPRQuestionEncoderTokenizer] or [DPRQuestionEncoderTokenizerFast] (DPR model)
electra – [ElectraTokenizer] or [ElectraTokenizerFast] (ELECTRA model)
ernie – [BertTokenizer] or [BertTokenizerFast] (ERNIE model)
esm – [EsmTokenizer] (ESM model)
flaubert – [FlaubertTokenizer] (FlauBERT model)
fnet – [FNetTokenizer] or [FNetTokenizerFast] (FNet model)
fsmt – [FSMTTokenizer] (FairSeq Machine-Translation model)
funnel – [FunnelTokenizer] or [FunnelTokenizerFast] (Funnel Transformer model)
gpt2 – [GPT2Tokenizer] or [GPT2TokenizerFast] (OpenAI GPT-2 model)
gpt_neo – [GPT2Tokenizer] or [GPT2TokenizerFast] (GPT Neo model)
gpt_neox – [GPTNeoXTokenizerFast] (GPT NeoX model)
gpt_neox_japanese – [GPTNeoXJapaneseTokenizer] (GPT NeoX Japanese model)
gptj – [GPT2Tokenizer] or [GPT2TokenizerFast] (GPT-J model)
groupvit – [CLIPTokenizer] or [CLIPTokenizerFast] (GroupViT model)
herbert – [HerbertTokenizer] or [HerbertTokenizerFast] (HerBERT model)
hubert – [Wav2Vec2CTCTokenizer] (Hubert model)
ibert – [RobertaTokenizer] or [RobertaTokenizerFast] (I-BERT model)
layoutlm – [LayoutLMTokenizer] or [LayoutLMTokenizerFast] (LayoutLM model)
layoutlmv2 – [LayoutLMv2Tokenizer] or [LayoutLMv2TokenizerFast] (LayoutLMv2 model)
layoutlmv3 – [LayoutLMv3Tokenizer] or [LayoutLMv3TokenizerFast] (LayoutLMv3 model)
layoutxlm – [LayoutXLMTokenizer] or [LayoutXLMTokenizerFast] (LayoutXLM model)
led – [LEDTokenizer] or [LEDTokenizerFast] (LED model)
lilt – [LayoutLMv3Tokenizer] or [LayoutLMv3TokenizerFast] (LiLT model)
longformer – [LongformerTokenizer] or [LongformerTokenizerFast] (Longformer model)
longt5 – [T5TokenizerFast] (LongT5 model)
luke – [LukeTokenizer] (LUKE model)
lxmert – [LxmertTokenizer] or [LxmertTokenizerFast] (LXMERT model)
m2m_100 – (M2M100 model)
marian – (Marian model)
mbart – [MBartTokenizerFast] (mBART model)
mbart50 – [MBart50TokenizerFast] (mBART-50 model)
megatron-bert – [BertTokenizer] or [BertTokenizerFast] (Megatron-BERT model)
mluke – (mLUKE model)
mobilebert – [MobileBertTokenizer] or [MobileBertTokenizerFast] (MobileBERT model)
mpnet – [MPNetTokenizer] or [MPNetTokenizerFast] (MPNet model)
mt5 – [MT5TokenizerFast] (MT5 model)
mvp – [MvpTokenizer] or [MvpTokenizerFast] (MVP model)
nezha – [BertTokenizer] or [BertTokenizerFast] (Nezha model)
nllb – [NllbTokenizerFast] (NLLB model)
nystromformer – [AlbertTokenizerFast] (Nyströmformer model)
openai-gpt – [OpenAIGPTTokenizer] or [OpenAIGPTTokenizerFast] (OpenAI GPT model)
opt – [GPT2Tokenizer] (OPT model)
owlvit – [CLIPTokenizer] or [CLIPTokenizerFast] (OWL-ViT model)
pegasus – [PegasusTokenizerFast] (Pegasus model)
pegasus_x – [PegasusTokenizerFast] (PEGASUS-X model)
perceiver – [PerceiverTokenizer] (Perceiver model)
phobert – [PhobertTokenizer] (PhoBERT model)
plbart – (PLBart model)
prophetnet – [ProphetNetTokenizer] (ProphetNet model)
qdqbert – [BertTokenizer] or [BertTokenizerFast] (QDQBert model)
rag – [RagTokenizer] (RAG model)
realm – [RealmTokenizer] or [RealmTokenizerFast] (REALM model)
reformer – [ReformerTokenizerFast] (Reformer model)
rembert – [RemBertTokenizerFast] (RemBERT model)
retribert – [RetriBertTokenizer] or [RetriBertTokenizerFast] (RetriBERT model)
roberta – [RobertaTokenizer] or [RobertaTokenizerFast] (RoBERTa model)
roformer – [RoFormerTokenizer] or [RoFormerTokenizerFast] (RoFormer model)
speech_to_text – (Speech2Text model)
speech_to_text_2 – [Speech2Text2Tokenizer] (Speech2Text2 model)
splinter – [SplinterTokenizer] or [SplinterTokenizerFast] (Splinter model)
squeezebert – [SqueezeBertTokenizer] or [SqueezeBertTokenizerFast] (SqueezeBERT model)
t5 – [T5TokenizerFast] (T5 model)
tapas – [TapasTokenizer] (TAPAS model)
tapex – [TapexTokenizer] (TAPEX model)
transfo-xl – [TransfoXLTokenizer] (Transformer-XL model)
vilt – [BertTokenizer] or [BertTokenizerFast] (ViLT model)
visual_bert – [BertTokenizer] or [BertTokenizerFast] (VisualBERT model)
wav2vec2 – [Wav2Vec2CTCTokenizer] (Wav2Vec2 model)
wav2vec2-conformer – [Wav2Vec2CTCTokenizer] (Wav2Vec2-Conformer model)
wav2vec2_phoneme – [Wav2Vec2PhonemeCTCTokenizer] (Wav2Vec2Phoneme model)
whisper – (Whisper model)
xclip – [CLIPTokenizer] or [CLIPTokenizerFast] (X-CLIP model)
xglm – [XGLMTokenizerFast] (XGLM model)
xlm – [XLMTokenizer] (XLM model)
xlm-prophetnet – (XLM-ProphetNet model)
xlm-roberta – [XLMRobertaTokenizerFast] (XLM-RoBERTa model)
xlm-roberta-xl – [XLMRobertaTokenizerFast] (XLM-RoBERTa-XL model)
xlnet – [XLNetTokenizerFast] (XLNet model)
yoso – [AlbertTokenizerFast] (YOSO model)
- Params:
- pretrained_model_name_or_path (str or os.PathLike):
Can be either:
A string, the model id of a predefined tokenizer hosted inside a model repo on huggingface.co. Valid model ids can be located at the root-level, like bert-base-uncased, or namespaced under a user or organization name, like dbmdz/bert-base-german-cased.
A path to a directory containing vocabulary files required by the tokenizer, for instance saved using the [~PreTrainedTokenizer.save_pretrained] method, e.g., ./my_model_directory/.
A path or url to a single saved vocabulary file if and only if the tokenizer only requires a single vocabulary file (like Bert or XLNet), e.g.: ./my_model_directory/vocab.txt. (Not applicable to all derived classes)
- inputs (additional positional arguments, optional):
Will be passed along to the Tokenizer __init__() method.
- config ([PretrainedConfig], optional)
The configuration object used to determine the tokenizer class to instantiate.
- cache_dir (str or os.PathLike, optional):
Path to a directory in which a downloaded pretrained model configuration should be cached if the standard cache should not be used.
- force_download (bool, optional, defaults to False):
Whether or not to force the (re-)download of the model weights and configuration files, overriding the cached versions if they exist.
- resume_download (bool, optional, defaults to False):
Whether or not to delete incompletely received files. Will attempt to resume the download if such a file exists.
- proxies (Dict[str, str], optional):
A dictionary of proxy servers to use by protocol or endpoint, e.g., {‘http’: ‘foo.bar:3128’, ‘http://hostname’: ‘foo.bar:4012’}. The proxies are used on each request.
- revision (str, optional, defaults to “main”):
The specific model version to use. It can be a branch name, a tag name, or a commit id, since we use a git-based system for storing models and other artifacts on huggingface.co, so revision can be any identifier allowed by git.
- subfolder (str, optional):
In case the relevant files are located inside a subfolder of the model repo on huggingface.co (e.g. for facebook/rag-token-base), specify it here.
- use_fast (bool, optional, defaults to True):
Whether or not to try to load the fast version of the tokenizer.
- tokenizer_type (str, optional):
Tokenizer type to be loaded.
- trust_remote_code (bool, optional, defaults to False):
Whether or not to allow for custom models defined on the Hub in their own modeling files. This option should only be set to True for repositories you trust and in which you have read the code, as it will execute code present on the Hub on your local machine.
- kwargs (additional keyword arguments, optional):
Will be passed to the Tokenizer __init__() method. Can be used to set special tokens like bos_token, eos_token, unk_token, sep_token, pad_token, cls_token, mask_token, additional_special_tokens. See parameters in the __init__() for more details.
Examples:
```python
>>> from transformers import AutoTokenizer

>>> # Download vocabulary from huggingface.co and cache.
>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

>>> # Download vocabulary from huggingface.co (user-uploaded) and cache.
>>> tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-german-cased")

>>> # If vocabulary files are in a directory (e.g. tokenizer was saved using *save_pretrained('./test/saved_model/')*)
>>> tokenizer = AutoTokenizer.from_pretrained("./test/bert_saved_model/")

>>> # Download vocabulary from huggingface.co and define model-specific arguments
>>> tokenizer = AutoTokenizer.from_pretrained("roberta-base", add_prefix_space=True)
```
- register(slow_tokenizer_class=None, fast_tokenizer_class=None)[source]¶
Register a new tokenizer in this mapping.
- Parameters
config_class ([PretrainedConfig]) – The configuration corresponding to the model to register.
slow_tokenizer_class ([PreTrainedTokenizer], optional) – The slow tokenizer to register.
fast_tokenizer_class ([PreTrainedTokenizerFast], optional) – The fast tokenizer to register.
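To illustrate what registering a tokenizer in a mapping amounts to, here is a minimal, library-independent sketch. `TokenizerRegistry`, `MyConfig`, `MySlowTokenizer` and `MyFastTokenizer` are hypothetical names invented for this example; it is not the transformers implementation, only the general pattern of mapping a configuration class to a (slow, fast) tokenizer pair.

```python
from typing import Dict, Optional, Tuple


class TokenizerRegistry:
    """Hypothetical mapping from a config class to (slow, fast) tokenizer classes."""

    def __init__(self) -> None:
        self._mapping: Dict[type, Tuple[Optional[type], Optional[type]]] = {}

    def register(self, config_class, slow_tokenizer_class=None, fast_tokenizer_class=None):
        # An entry without any tokenizer class would be useless, so reject it.
        if slow_tokenizer_class is None and fast_tokenizer_class is None:
            raise ValueError("provide a slow and/or a fast tokenizer class")
        self._mapping[config_class] = (slow_tokenizer_class, fast_tokenizer_class)

    def resolve(self, config_class, use_fast=True):
        slow, fast = self._mapping[config_class]
        # Prefer the fast class when requested and available, else fall back.
        if use_fast and fast is not None:
            return fast
        return slow if slow is not None else fast


# Placeholder classes, used only to demonstrate the registry.
class MyConfig:
    pass


class MySlowTokenizer:
    pass


class MyFastTokenizer:
    pass


registry = TokenizerRegistry()
registry.register(MyConfig, slow_tokenizer_class=MySlowTokenizer,
                  fast_tokenizer_class=MyFastTokenizer)
fast_choice = registry.resolve(MyConfig)
slow_choice = registry.resolve(MyConfig, use_fast=False)
```

The `use_fast` switch in the sketch mirrors the use_fast parameter described above: the fast class is chosen when it exists, and the slow class otherwise.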
- class EduNLP.Pretrain.BertDataset(tokenizer, ds_disk_path: Optional[Dataset] = None, items: Optional[Union[List[dict], List[str]]] = None, stem_key: str = 'text', label_key: Optional[str] = None, feature_keys: Optional[List[str]] = None, num_processor: Optional[int] = None, **kwargs)[source]¶
- class EduNLP.Pretrain.BertTokenizer(pretrained_model='bert-base-chinese', max_length=512, tokenize_method: str = 'pure_text', add_specials: Union[List[str], bool] = False, **kwargs)[source]¶
Examples
>>> tokenizer = BertTokenizer(add_special_tokens=True)
>>> item = "有公式$\FormFigureID{wrong1?}$,如图$\FigureID{088f15ea-xxx}$,\
... 若$x,y$满足约束条件公式$\FormFigureBase64{wrong2?}$,$\SIFSep$,则$z=x+7 y$的最大值为$\SIFBlank$"
>>> token_item = tokenizer(item)
>>> print(token_item.input_ids)
tensor([[ 101, 1062, 2466, 1963, 1745, 138, 100, 140, 166, 117, 167, 5276, 3338, 3340, 816, 1062, 2466, 102, 168, 134, 166, 116, 128, 167, 3297, 1920, 966, 138, 100, 140, 102]])
>>> print(tokenizer.tokenize(item)[:10])
['公', '式', '如', '图', '[', '[UNK]', ']', 'x', ',', 'y']
>>> items = [item, item]
>>> token_items = tokenizer(items, return_tensors='pt')
>>> print(token_items.input_ids.shape)
torch.Size([2, 31])
>>> print(len(tokenizer.tokenize(items)))
2
>>> tokenizer.save_pretrained('test_dir')
>>> tokenizer = BertTokenizer.from_pretrained('test_dir')
- class EduNLP.Pretrain.EduDataset(tokenizer, ds_disk_path: Optional[Dataset] = None, items: Optional[Union[List[dict], List[str]]] = None, stem_key: str = 'text', label_key: Optional[str] = None, feature_keys: Optional[List[str]] = None, num_processor: Optional[int] = None, **kwargs)[source]¶
This base class implements a Dataset that wraps datasets.Dataset and adds conveniences such as parallel preprocessing and offline loading.
- Parameters
tokenizer – PretrainedEduTokenizer or model-specific Pretrained Tokenizer
ds_disk_path (HFDataset, optional) – the path on disk where the dataset used by datasets.Dataset is saved, by default None
items (Union[List[dict], List[str]], optional) – input items to process, by default None
stem_key (str, optional) – the key of the item content to process, by default “text”
label_key (Optional[str], optional) – the key of the item labels to process, by default None
feature_keys (Optional[List[str]], optional) – the keys of additional item features to keep, by default None
num_processor (int, optional) – the number of CPUs to use for parallel speedup, by default None
- ds¶
Note: map will break down for very large data (greater than 4GB).
- class EduNLP.Pretrain.EduVocab(vocab_path: Optional[str] = None, corpus_items: Optional[List[str]] = None, bos_token: str = '[BOS]', eos_token: str = '[EOS]', pad_token: str = '[PAD]', unk_token: str = '[UNK]', specials: Optional[List[str]] = None, lower: bool = False, trim_min_count: int = 1, **kwargs)[source]¶
The vocabulary container for a corpus.
- Parameters
vocab_path (str, optional) – vocabulary path to initialize this container, by default None
corpus_items (List[str], optional) – corpus items to update this vocabulary, by default None
bos_token (str, optional) – token representing the start of a sentence, by default “[BOS]”
eos_token (str, optional) – token representing the end of a sentence, by default “[EOS]”
pad_token (str, optional) – token used for padding, by default “[PAD]”
unk_token (str, optional) – token representing an unknown word, by default “[UNK]”
specials (List[str], optional) – special tokens in the vocabulary, by default None
lower (bool, optional) – whether to lowercase the corpus items, by default False
trim_min_count (int, optional) – the minimum count a word needs to be added to the vocabulary, by default 1
- property vocab_size¶
- property special_tokens¶
- property tokens¶
- convert_sequence_to_idx(tokens, bos=False, eos=False)[source]¶
Convert a sequence of tokens to a sequence of indices.
- set_vocab(corpus_items: List[str], lower: bool = False, trim_min_count: int = 1, silent=True)[source]¶
Update the vocabulary with the tokens in corpus items
- Parameters
corpus_items (List[str]) – corpus items to update this vocabulary
lower (bool, optional) – whether to lowercase the corpus items, by default False
trim_min_count (int, optional) – the minimum count a word needs to be added to the vocabulary, by default 1
- load_vocab(vocab_path: str)[source]¶
Load the vocabulary from vocab_path
- Parameters
vocab_path (str) – path to the vocabulary file to load
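The roles of specials, lower, trim_min_count and the bos/eos flags can be sketched with plain Python. This is an illustrative sketch, not EduVocab's implementation: the functions `build_vocab` and `convert_sequence_to_idx_sketch`, the ordering of the special tokens, and the assumption that each corpus item is already a list of tokens are all choices made for this example.

```python
from collections import Counter
from typing import Dict, List


def build_vocab(corpus_items: List[List[str]], specials: List[str],
                lower: bool = False, trim_min_count: int = 1) -> Dict[str, int]:
    """Count tokens and keep those seen at least trim_min_count times."""
    counts = Counter()
    for tokens in corpus_items:
        counts.update(tok.lower() if lower else tok for tok in tokens)
    vocab = list(specials)  # special tokens get the first indices (assumption)
    for token, count in sorted(counts.items()):  # sorted for a deterministic order
        if count >= trim_min_count and token not in vocab:
            vocab.append(token)
    return {token: idx for idx, token in enumerate(vocab)}


def convert_sequence_to_idx_sketch(token2idx: Dict[str, int], tokens: List[str],
                                   bos: bool = False, eos: bool = False) -> List[int]:
    """Map tokens to indices, falling back to [UNK] for unseen tokens."""
    unk = token2idx["[UNK]"]
    ids = [token2idx.get(tok, unk) for tok in tokens]
    if bos:
        ids = [token2idx["[BOS]"]] + ids
    if eos:
        ids = ids + [token2idx["[EOS]"]]
    return ids


specials = ["[PAD]", "[UNK]", "[BOS]", "[EOS]"]
corpus = [["x", "+", "y"], ["x", "=", "7"]]
# With trim_min_count=2 only "x" (seen twice) enters the vocabulary.
token2idx = build_vocab(corpus, specials, trim_min_count=2)
ids = convert_sequence_to_idx_sketch(token2idx, ["x", "+", "y"], bos=True, eos=True)
```

Raising trim_min_count shrinks the vocabulary, so more tokens map to the unknown index.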
- class EduNLP.Pretrain.ElmoDataset(tokenizer: ElmoTokenizer, **kwargs)[source]¶
- class EduNLP.Pretrain.ElmoTokenizer(vocab_path=None, max_length=250, tokenize_method='pure_text', add_specials=True, **kwargs)[source]¶
Examples
>>> t = ElmoTokenizer()
>>> items = ["有公式$\FormFigureID{wrong1?}$,如图$\FigureID{088f15ea-xxx}$,\
... 若$x,y$满足约束条件公式$\FormFigureBase64{wrong2?}$,$\SIFSep$,则$z=x+7 y$的最大值为$\SIFBlank$"]
>>> len(t)
14
>>> t.tokenize(items[0])
['公式', '如图', '[FIGURE]', 'x', ',', 'y', '约束条件', '公式', '[SEP]', 'z', '=', 'x', '+', '7', 'y', '最大值', '[MARK]']
>>> t(items[0])
{'seq_idx': tensor([1, 1, 6, 1, 1, 1, 1, 1, 9, 1, 1, 1, 1, 1, 1, 1, 7]), 'seq_len': tensor(17)}
>>> t.set_vocab(items[0])
['公式', '如图', '[FIGURE]', 'x', ',', 'y', '约束条件', '公式', '[SEP]', 'z', '=', 'x', '+', '7', 'y', '最大值', '[MARK]']
>>> len(t)
45
>>> t(items[0])
{'seq_idx': tensor([ 1, 1, 6, 26, 27, 28, 1, 1, 9, 35, 36, 26, 37, 38, 28, 1, 7]), 'seq_len': tensor(17)}
- class EduNLP.Pretrain.PretrainedEduTokenizer(vocab_path: Optional[str] = None, max_length: int = 250, tokenize_method: str = 'pure_text', add_specials: Tuple[list, bool] = False, **kwargs)[source]¶
This base class is in charge of preparing the inputs for a model
- Parameters
vocab_path (str, optional) – path to the vocabulary file, by default None
max_length (int, optional) – sentences longer than max_length are truncated, by default 250
tokenize_method (str, optional) – by default “pure_text”. When the text is already separated by spaces, use “space”; when the text is a raw string, use a tokenizer defined in get_tokenizer(), such as “pure_text” or “text”
add_specials (Tuple[list, bool], optional) – by default False. For a bool, whether to add EDU_SPYMBOLS to the vocabulary; for a list, the special tokens to add besides EDU_SPYMBOLS
- tokenize(items: ~typing.Tuple[list, str, dict], key=<function PretrainedEduTokenizer.<lambda>>, **kwargs)[source]¶
- Parameters
items (list or str or dict) – the question items
key (function) – determine how to get the text of each item
- Returns
tokens – the token of items
- Return type
list
- encode(items: ~typing.Tuple[str, dict, ~typing.List[str], ~typing.List[dict]], key=<function PretrainedEduTokenizer.<lambda>>, **kwargs)[source]¶
- classmethod from_pretrained(tokenizer_config_dir: str, **kwargs)[source]¶
Load tokenizer from local files
Parameters:¶
- tokenizer_config_dir: str
The dir path containing tokenizer_config.json and vocab.list
- save_pretrained(tokenizer_config_dir: str)[source]¶
Save tokenizer into local files
Parameters:¶
- tokenizer_config_dir: str
The directory in which tokenizer params are saved as tokenizer_config.json and words as vocab.list
- property vocab_size¶
- set_vocab(items: list, key=<function PretrainedEduTokenizer.<lambda>>, lower: bool = False, trim_min_count: int = 1, do_tokenize: bool = True)[source]¶
Update the vocabulary with the tokens in corpus items
- Parameters
items (list) – can be the list of str, or list of dict
key (function, optional) – determine how to get the text of each item
lower (bool, optional) – whether to lowercase the corpus items, by default False
trim_min_count (int, optional) – the minimum count a word needs to be added to the vocabulary, by default 1
do_tokenize (bool, optional) – whether to tokenize items before updating the vocabulary, by default True
- Returns
token_items
- Return type
list
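The save_pretrained / from_pretrained round trip described above can be pictured as writing two files into one directory and reading them back. The sketch below is a simplified illustration, not the library's code: the file names tokenizer_config.json and vocab.list come from the descriptions above, while the helper names `save_pretrained_sketch` and `from_pretrained_sketch`, the JSON layout, and the one-token-per-line vocabulary format are assumptions.

```python
import json
import os
import tempfile
from typing import Dict, List, Tuple


def save_pretrained_sketch(config_dir: str, params: Dict, tokens: List[str]) -> None:
    """Write tokenizer params and vocabulary into config_dir."""
    os.makedirs(config_dir, exist_ok=True)
    with open(os.path.join(config_dir, "tokenizer_config.json"), "w", encoding="utf-8") as f:
        json.dump(params, f, ensure_ascii=False, indent=2)
    # Assumed format: one token per line.
    with open(os.path.join(config_dir, "vocab.list"), "w", encoding="utf-8") as f:
        f.write("\n".join(tokens))


def from_pretrained_sketch(config_dir: str) -> Tuple[Dict, List[str]]:
    """Read back what save_pretrained_sketch wrote."""
    with open(os.path.join(config_dir, "tokenizer_config.json"), encoding="utf-8") as f:
        params = json.load(f)
    with open(os.path.join(config_dir, "vocab.list"), encoding="utf-8") as f:
        tokens = f.read().split("\n")
    return params, tokens


with tempfile.TemporaryDirectory() as tmp_dir:
    save_pretrained_sketch(
        tmp_dir,
        {"max_length": 250, "tokenize_method": "pure_text"},
        ["[PAD]", "[UNK]", "x"],
    )
    loaded_params, loaded_tokens = from_pretrained_sketch(tmp_dir)
```

Keeping the parameters and the vocabulary in separate files lets the vocabulary be inspected or edited without touching the configuration.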
- class EduNLP.Pretrain.TensorType(value)[source]¶
Possible values for the return_tensors argument in [PreTrainedTokenizerBase.__call__]. Useful for tab-completion in an IDE.
- PYTORCH = 'pt'¶
- TENSORFLOW = 'tf'¶
- NUMPY = 'np'¶
- JAX = 'jax'¶
- class EduNLP.Pretrain.TokenizerForHuggingface(pretrained_model='bert-base-chinese', max_length=512, tokenize_method: str = 'pure_text', add_specials: Union[List[str], bool] = False, **kwargs)[source]¶
Parameters¶
- pretrained_model:
used pretrained model
- add_specials:
Whether to add tokens like [FIGURE], [TAG], etc.
- tokenize_method:
Which text tokenizer to use. Must be consistent with TOKENIZER dictionary.
Examples
>>> tokenizer = TokenizerForHuggingface(add_special_tokens=True)
>>> item = "有公式$\FormFigureID{wrong1?}$,如图$\FigureID{088f15ea-xxx}$,\
... 若$x,y$满足约束条件公式$\FormFigureBase64{wrong2?}$,$\SIFSep$,则$z=x+7 y$的最大值为$\SIFBlank$"
>>> token_item = tokenizer(item)
>>> print(token_item.input_ids[:10])
tensor([[ 101, 1062, 2466, 1963, 1745, 138, 100, 140, 166, 117, 167, 5276, 3338, 3340, 816, 1062, 2466, 102, 168, 134, 166, 116, 128, 167, 3297, 1920, 966, 138, 100, 140, 102]])
>>> print(tokenizer.tokenize(item)[:10])
['公', '式', '如', '图', '[', '[UNK]', ']', 'x', ',', 'y']
>>> items = [item, item]
>>> token_items = tokenizer(items, return_tensors='pt')
>>> print(token_items.input_ids.shape)
torch.Size([2, 31])
>>> print(len(tokenizer.tokenize(items)))
2
>>> tokenizer.save_pretrained('test_dir')
>>> tokenizer = TokenizerForHuggingface.from_pretrained('test_dir')
- tokenize(items: ~typing.Union[list, str, dict], key=<function TokenizerForHuggingface.<lambda>>, **kwargs)[source]¶
- encode(items: ~typing.Tuple[str, dict, ~typing.List[str], ~typing.List[dict]], key=<function TokenizerForHuggingface.<lambda>>, **kwargs)[source]¶
- property vocab_size¶
- set_vocab(items: ~typing.Tuple[~typing.List[str], ~typing.List[dict]], key=<function TokenizerForHuggingface.<lambda>>, lower=False, trim_min_count: int = 1, do_tokenize: bool = True)[source]¶
- Parameters
items (list) – can be the list of str, or list of dict
key (function) – determine how to get the text of each item
trim_min_count (int, optional) – the minimum count a word needs to be added to the vocabulary, by default 1
do_tokenize (bool, optional) – whether to tokenize items before updating the vocabulary, by default True
- EduNLP.Pretrain.finetune_bert(items: Union[List[dict], List[str]], output_dir: str, pretrained_model='bert-base-chinese', tokenizer_params=None, data_params=None, model_params=None, train_params=None)[source]¶
- Parameters
items (list, required) – The training corpus, each item could be str or dict
output_dir (str, required) – The directory to save trained model files
pretrained_model (str, optional) – The pretrained model name or path for model and tokenizer
eval_items (list, required) – The evaluating items, each item could be str or dict
tokenizer_params (dict, optional, default=None) – The parameters passed to BertTokenizer
data_params (dict, optional, default=None) – The parameters passed to BertDataset and BertTokenizer
model_params (dict, optional, default=None) – The parameters passed to Trainer
train_params (dict, optional, default=None) –
Examples
>>> stems = ["有公式$\FormFigureID{wrong1?}$,如图$\FigureID{088f15ea-xxx}$",
... "有公式$\FormFigureID{wrong1?}$,如图$\FigureID{088f15ea-xxx}$"]
>>> finetune_bert(stems, "examples/test_model/data/data/bert")
{'train_runtime': ..., ..., 'epoch': 1.0}
- EduNLP.Pretrain.finetune_bert_for_knowledge_prediction(train_items, output_dir, pretrained_model='bert-base-chinese', eval_items=None, tokenizer_params=None, data_params=None, train_params=None, model_params=None)[source]¶
- Parameters
train_items (list, required) – The training corpus, each item could be str or dict
output_dir (str, required) – The directory to save trained model files
pretrained_model (str, optional) – The pretrained model name or path for model and tokenizer
eval_items (list, optional) – The evaluating items, each item could be str or dict
tokenizer_params (dict, optional, default=None) – The parameters passed to BertTokenizer
data_params (dict, optional, default=None) – The parameters passed to BertDataset and BertTokenizer
model_params (dict, optional, default=None) – The parameters passed to Trainer
train_params (dict, optional, default=None) –
- EduNLP.Pretrain.finetune_bert_for_property_prediction(train_items, output_dir, pretrained_model='bert-base-chinese', eval_items=None, tokenizer_params=None, data_params=None, train_params=None, model_params=None)[source]¶
- Parameters
train_items (list, required) – The training corpus, each item could be str or dict
output_dir (str, required) – The directory to save trained model files
pretrained_model (str, optional) – The pretrained model name or path for model and tokenizer
eval_items (list, optional) – The evaluating items, each item could be str or dict
tokenizer_params (dict, optional, default=None) – The parameters passed to BertTokenizer
data_params (dict, optional, default=None) – The parameters passed to BertDataset and BertTokenizer
model_params (dict, optional, default=None) – The parameters passed to Trainer
train_params (dict, optional, default=None) –
- EduNLP.Pretrain.get_tokenizer(name, *args, **kwargs)[source]¶
A unified interface to the different tokenizers.
- Parameters
name (str) – the name of the tokenizer, e.g. text, pure_text.
args – the parameters passed to the tokenizer
kwargs – the parameters passed to the tokenizer
- Returns
tokenizer
- Return type
Examples
>>> items = ["已知集合$A=\\left\\{x \\mid x^{2}-3 x-4<0\\right\\}, \\quad B=\\{-4,1,3,5\\}, \\quad$ 则 $A \\cap B=$"]
>>> tokenizer = get_tokenizer("pure_text")
>>> tokens = tokenizer(items)
>>> next(tokens)
['已知', '集合', 'A', '=', '\\left', '\\{', 'x', '\\mid', 'x', '^', '{', '2', '}', '-', '3', 'x', '-', '4', '<', '0', '\\right', '\\}', ',', '\\quad', 'B', '=', '\\{', '-', '4', ',', '1', ',', '3', ',', '5', '\\}', ',', '\\quad', 'A', '\\cap', 'B', '=']
- EduNLP.Pretrain.train_disenqnet(train_items: List[dict], output_dir: str, pretrained_dir: Optional[str] = None, eval_items=None, tokenizer_params=None, data_params=None, model_params=None, train_params=None, w2v_params=None)[source]¶
- Parameters
train_items (List[dict]) –
output_dir (str) –
pretrained_dir (str, optional) – by default None
tokenizer_params (optional) – by default None
data_params (optional) – by default None
model_params (optional) – by default None
train_params (optional) – by default None
- EduNLP.Pretrain.train_elmo(items: Union[List[dict], List[str]], output_dir: str, pretrained_dir: Optional[str] = None, tokenizer_params=None, data_params=None, model_params=None, train_params=None)[source]¶
- Parameters
items (list, required) – The training corpus, each item could be str or dict
output_dir (str, required) – The directory to save trained model files
pretrained_dir (str, optional) – The pretrained directory for model and tokenizer
tokenizer_params (dict, optional, default=None) – The parameters passed to ElmoTokenizer
data_params (dict, optional, default=None) – The parameters passed to ElmoDataset and ElmoTokenizer, such as stem_key and label_key
model_params (dict, optional, default=None) – The parameters passed to Trainer
train_params (dict, optional, default=None) –
- EduNLP.Pretrain.train_elmo_for_knowledge_prediction(train_items: list, output_dir: str, pretrained_dir=None, eval_items=None, tokenizer_params=None, data_params=None, train_params=None, model_params=None)[source]¶
- Parameters
train_items (list, required) – The training items, each item could be str or dict
output_dir (str, required) – The directory to save trained model files
pretrained_dir (str, optional) – The pretrained directory for model and tokenizer
eval_items (list, optional) – The evaluating items, each item could be str or dict
tokenizer_params (dict, optional, default=None) – The parameters passed to ElmoTokenizer
data_params (dict, optional, default=None) – The parameters passed to ElmoDataset and ElmoTokenizer
model_params (dict, optional, default=None) – The parameters passed to Trainer
train_params (dict, optional, default=None) –
- EduNLP.Pretrain.train_elmo_for_property_prediction(train_items: list, output_dir: str, pretrained_dir=None, eval_items=None, tokenizer_params=None, data_params=None, train_params=None, model_params=None)[source]¶
- Parameters
train_items (list, required) – The training items, each item could be str or dict
output_dir (str, required) – The directory to save trained model files
pretrained_dir (str, optional) – The pretrained directory for model and tokenizer
eval_items (list, optional) – The evaluating items, each item could be str or dict
tokenizer_params (dict, optional, default=None) – The parameters passed to ElmoTokenizer
data_params (dict, optional, default=None) – The parameters passed to ElmoDataset and ElmoTokenizer
model_params (dict, optional, default=None) – The parameters passed to Trainer
train_params (dict, optional, default=None) –
EduNLP.Tokenizer¶
- class EduNLP.Tokenizer.SpaceTokenizer(stop_words='punctuations', **kwargs)[source]¶
Tokenize text by space, e.g. “题目 内容” -> [“题目”, “内容”]
- Parameters
stop_words (str, optional) – stop_words to skip, by default “punctuations”
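Space tokenization with stop-word skipping can be sketched in a few lines. This is a hedged illustration, not SpaceTokenizer's code: the function `space_tokenize` and the contents of `PUNCTUATIONS` (ASCII punctuation plus a handful of full-width marks) are assumptions made for the example.

```python
import string
from typing import Iterable, List

# Assumed stop-word set: ASCII punctuation plus a few common full-width marks.
PUNCTUATIONS = set(string.punctuation) | set(",。?!;:、")


def space_tokenize(text: str, stop_words: Iterable[str] = PUNCTUATIONS) -> List[str]:
    """Split on whitespace and drop tokens that are stop words."""
    stop = set(stop_words)
    return [tok for tok in text.split() if tok not in stop]


plain = space_tokenize("题目 内容")
filtered = space_tokenize("题目 内容 , x = 1")
kept_all = space_tokenize("x = 1", stop_words=[])
```

Passing an empty stop_words collection keeps every whitespace-separated token, which is useful when punctuation carries meaning, as it does in formulas.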
- EduNLP.Tokenizer.get_tokenizer(name, *args, **kwargs)[source]¶
A unified interface to the different tokenizers.
- Parameters
name (str) – the name of the tokenizer, e.g. text, pure_text.
args – the parameters passed to the tokenizer
kwargs – the parameters passed to the tokenizer
- Returns
tokenizer
- Return type
Examples
>>> items = ["已知集合$A=\\left\\{x \\mid x^{2}-3 x-4<0\\right\\}, \\quad B=\\{-4,1,3,5\\}, \\quad$ 则 $A \\cap B=$"]
>>> tokenizer = get_tokenizer("pure_text")
>>> tokens = tokenizer(items)
>>> next(tokens)
['已知', '集合', 'A', '=', '\\left', '\\{', 'x', '\\mid', 'x', '^', '{', '2', '}', '-', '3', 'x', '-', '4', '<', '0', '\\right', '\\}', ',', '\\quad', 'B', '=', '\\{', '-', '4', ',', '1', ',', '3', ',', '5', '\\}', ',', '\\quad', 'A', '\\cap', 'B', '=']
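A name-based factory like get_tokenizer is commonly built on a dictionary from names to classes. The sketch below shows that pattern under stated assumptions: `TOKENIZERS`, `register_tokenizer`, `WhitespaceTokenizer` and `get_tokenizer_sketch` are hypothetical names for this example and do not reproduce EduNLP's internals. The tokenizer returns a generator so that, as in the example above, `next(tokens)` yields the tokens of the first item.

```python
from typing import Callable, Dict, Iterable, Iterator, List

# Hypothetical name -> tokenizer class table.
TOKENIZERS: Dict[str, Callable[..., object]] = {}


def register_tokenizer(name: str):
    """Class decorator that records a tokenizer under the given name."""
    def decorator(cls):
        TOKENIZERS[name] = cls
        return cls
    return decorator


@register_tokenizer("space")
class WhitespaceTokenizer:
    def __call__(self, items: Iterable[str]) -> Iterator[List[str]]:
        # Lazily tokenize one item at a time.
        for item in items:
            yield item.split()


def get_tokenizer_sketch(name: str, *args, **kwargs):
    """Look the name up and forward the remaining arguments to the class."""
    if name not in TOKENIZERS:
        raise KeyError(f"unknown tokenizer {name!r}; available: {sorted(TOKENIZERS)}")
    return TOKENIZERS[name](*args, **kwargs)


tokens = get_tokenizer_sketch("space")(["A = B", "x + y"])
first = next(tokens)
```

Because the lookup fails loudly on an unknown name, a misspelled tokenizer name is reported together with the list of valid choices.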
Vector¶
- class EduNLP.Vector.W2V(filepath, method=None, binary=None)[source]¶
Uses the gensim library, which provides the FastText, Word2Vec and KeyedVectors methods, to convert words to vectors.
- Parameters
filepath – path to the pretrained model file
method (str) – fasttext, or other (Word2Vec)
binary (bool) –
- property vectors¶
- property vector_size¶
- class EduNLP.Vector.D2V(filepath, method='d2v')[source]¶
A collection of the d2v, bow and tfidf methods.
- Parameters
filepath –
method (str) – one of d2v, bow, tfidf
item –
- Returns
d2v model
- Return type
- property vector_size¶
- class EduNLP.Vector.BowLoader(filepath)[source]¶
Uses the doc2bow model, described below.
Convert document (a list of words) into the bag-of-words format = list of (token_id, token_count) 2-tuples. Each word is assumed to be a tokenized and normalized string (either unicode or utf8-encoded). No further preprocessing is done on the words in document; apply tokenization, stemming etc. before calling this method.
If allow_update is set, then also update dictionary in the process: create ids for new words. At the same time, update document frequencies – for each word appearing in this document, increase its document frequency (self.dfs) by one.
If allow_update is not set, this function is const, aka read-only.
- property vector_size¶
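The bag-of-words conversion and the effect of allow_update described above can be sketched in plain Python. This is an illustrative re-implementation of the idea, not gensim's code: the function `doc2bow_sketch`, the explicit `token2id` and `dfs` dictionaries, and the order in which new ids are assigned are assumptions for this example.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple


def doc2bow_sketch(document: List[str], token2id: Dict[str, int],
                   allow_update: bool = False,
                   dfs: Optional[Dict[int, int]] = None) -> List[Tuple[int, int]]:
    """Return sorted (token_id, token_count) pairs for a tokenized document."""
    counts = Counter(document)
    if allow_update:
        for word in sorted(counts):
            if word not in token2id:
                token2id[word] = len(token2id)  # create an id for a new word
            if dfs is not None:
                # Each word counts once per document toward document frequency.
                dfs[token2id[word]] = dfs.get(token2id[word], 0) + 1
    # Words without an id are ignored when the dictionary is read-only.
    return sorted((token2id[w], c) for w, c in counts.items() if w in token2id)


token2id: Dict[str, int] = {}
dfs: Dict[int, int] = {}
bow_first = doc2bow_sketch(["x", "y", "x"], token2id, allow_update=True, dfs=dfs)
bow_second = doc2bow_sketch(["y", "z"], token2id)  # read-only: "z" is skipped
```

The second call leaves `token2id` and `dfs` untouched, which corresponds to the read-only behaviour when allow_update is not set.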
- class EduNLP.Vector.TfidfLoader(filepath)[source]¶
This module implements functionality related to the Term Frequency - Inverse Document Frequency <https://en.wikipedia.org/wiki/Tf%E2%80%93idf> vector space bag-of-words models.
- property vector_size¶
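As a rough illustration of TF-IDF weighting over bag-of-words vectors, the sketch below uses one common variant: the raw term count multiplied by the natural logarithm of N / df, where N is the number of documents and df the number of documents containing the term. The function `tfidf_sketch` is hypothetical, and this variant is an assumption; TfidfLoader and gensim may use a different logarithm base or normalization.

```python
import math
from typing import Dict, List, Tuple

BowVector = List[Tuple[int, int]]


def tfidf_sketch(bow_corpus: List[BowVector]) -> List[List[Tuple[int, float]]]:
    """Weight each (token_id, count) pair by count * ln(N / df)."""
    n_docs = len(bow_corpus)
    df: Dict[int, int] = {}
    for bow in bow_corpus:
        for token_id, _ in bow:
            df[token_id] = df.get(token_id, 0) + 1
    return [
        [(token_id, count * math.log(n_docs / df[token_id])) for token_id, count in bow]
        for bow in bow_corpus
    ]


# Two documents; token 1 occurs in both, so its weight is zero.
bow_corpus = [[(0, 2), (1, 1)], [(1, 1), (2, 3)]]
weights = tfidf_sketch(bow_corpus)
```

A term that appears in every document receives weight zero, since it does not help tell documents apart.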
- class EduNLP.Vector.RNNModel(rnn_type, w2v: (<class 'EduNLP.Vector.gensim_vec.W2V'>, <class 'tuple'>, <class 'list'>, <class 'dict'>, None), hidden_size, freeze_pretrained=True, model_params=None, device=None, **kwargs)[source]¶
Examples
>>> model = RNNModel("BiLSTM", None, 2, vocab_size=4, embedding_dim=3)
>>> seq_idx = [[1, 2, 3], [1, 2, 0], [3, 0, 0]]
>>> output, hn = model(seq_idx, indexing=False, padding=False)
>>> seq_idx = [[1, 2, 3], [1, 2], [3]]
>>> output, hn = model(seq_idx, indexing=False, padding=True)
>>> output.shape
torch.Size([3, 3, 4])
>>> hn.shape
torch.Size([2, 3, 2])
>>> tokens = model.infer_tokens(seq_idx, indexing=False)
>>> tokens.shape
torch.Size([3, 3, 4])
>>> tokens = model.infer_tokens(seq_idx, agg="mean", indexing=False)
>>> tokens.shape
torch.Size([3, 4])
>>> item = model.infer_vector(seq_idx, indexing=False)
>>> item.shape
torch.Size([3, 4])
>>> item = model.infer_vector(seq_idx, agg="mean", indexing=False)
>>> item.shape
torch.Size([3, 2])
>>> item = model.infer_vector(seq_idx, agg=None, indexing=False)
>>> item.shape
torch.Size([2, 3, 2])
- infer_vector(items, agg: (<class 'int'>, <class 'str'>, None) = -1, indexing=True, padding=True, *args, **kwargs) Tensor[source]¶
- property vector_size: int¶
- property is_frozen¶
- class EduNLP.Vector.T2V(model: str, *args, **kwargs)[source]¶
Converts token lists to vectors. If you already have a model, you can use T2V directly. Otherwise, calling get_pretrained_t2v is the easier way to obtain vectors, since it does not require a model of your own.
- Parameters
model (str) – select the model type e.g.: d2v, rnn, lstm, gru, elmo, etc.
Examples
>>> item = [{'ques_content':'有公式$\FormFigureID{wrong1?}$和公式$\FormFigureBase64{wrong2?}$,\
... 如图$\FigureID{088f15ea-8b7c-11eb-897e-b46bfc50aa29}$,若$x,y$满足约束条件$\SIFSep$,则$z=x+7 y$的最大值为$\SIFBlank$'}]
>>> model_dir = "examples/test_model/d2v"
>>> url, model_name, *args = get_pretrained_model_info('d2v_test_256')
>>> (); path = get_data(url, model_dir); ()
(...)
>>> path = path_append(path, os.path.basename(path) + '.bin', to_str=True)
>>> t2v = T2V('d2v',filepath=path)
>>> print(t2v(item))
[array([...dtype=float32)]
- property vector_size: int¶
- EduNLP.Vector.get_pretrained_t2v(name, model_dir='/home/docs/.EduNLP/model', **kwargs)[source]¶
A convenient way to convert token lists to vectors with a pretrained model.
- Parameters
name (str) – the pretrained model to use, e.g. d2v_math_300, w2v_math_300, elmo_math_2048, bert_math_768, bert_taledu_768, disenq_math_256, quesnet_math_512
model_dir (str) – the path of model, default: MODEL_DIR = ‘~/.EduNLP/model’
- Returns
t2v model
- Return type
Examples
>>> item = [{'ques_content':'有公式$\FormFigureID{wrong1?}$和公式$\FormFigureBase64{wrong2?}$,\
... 如图$\FigureID{088f15ea-8b7c-11eb-897e-b46bfc50aa29}$,若$x,y$满足约束条件$\SIFSep$,则$z=x+7 y$的最大值为$\SIFBlank$'}]
>>> i2v = get_pretrained_t2v("d2v_test_256", "examples/test_model/d2v")
>>> print(i2v(item))
[array([...dtype=float32)]
- class EduNLP.Vector.Embedding(w2v: (<class 'EduNLP.Vector.gensim_vec.W2V'>, <class 'tuple'>, <class 'list'>, <class 'dict'>, None), freeze=True, device=None, **kwargs)[source]¶
- indexing(items: List[List[str]], padding=False, indexing=True) tuple[source]¶
- Parameters
items (list of list of str(word/token)) –
padding (bool) – whether to pad the returned lists with the default pad_val so that all items have the same length
indexing (bool) –
- Returns
token_idx (list of list of int) – the list of the tokens of each item
token_len (list of int) – the list of the length of tokens of each item
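The indexing and padding behaviour described above can be pictured with a small pure-Python sketch. It is an assumption-laden illustration rather than the Embedding class's code: the function `index_and_pad`, the `token2idx` table, and the choice of 0 for padding and 1 for unknown tokens are all made up for this example.

```python
from typing import Dict, List, Tuple


def index_and_pad(items: List[List[str]], token2idx: Dict[str, int],
                  unk_idx: int = 1, pad_val: int = 0,
                  padding: bool = False) -> Tuple[List[List[int]], List[int]]:
    """Map tokens to indices and optionally right-pad to the longest item."""
    token_idx = [[token2idx.get(tok, unk_idx) for tok in item] for item in items]
    token_len = [len(seq) for seq in token_idx]  # lengths before padding
    if padding:
        max_len = max(token_len, default=0)
        token_idx = [seq + [pad_val] * (max_len - len(seq)) for seq in token_idx]
    return token_idx, token_len


token2idx = {"[PAD]": 0, "[UNK]": 1, "x": 2, "y": 3}
items = [["x", "y", "x"], ["x", "q"], ["y"]]
raw_idx, raw_len = index_and_pad(items, token2idx)
padded_idx, padded_len = index_and_pad(items, token2idx, padding=True)
```

Returning the original lengths alongside the padded indices lets downstream code ignore the padded positions.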
- class EduNLP.Vector.BertModel(pretrained_model)[source]¶
Examples
>>> from EduNLP.Pretrain import BertTokenizer
>>> tokenizer = BertTokenizer("bert-base-chinese", add_special_tokens=False)
>>> model = BertModel("bert-base-chinese")
>>> item = ["有公式$\FormFigureID{wrong1?}$,如图$\FigureID{088f15ea-xxx}$,若$x,y$满足约束",
... "有公式$\FormFigureID{wrong1?}$,如图$\FigureID{088f15ea-xxx}$,若$x,y$满足约束"]
>>> inputs = tokenizer(item, return_tensors='pt')
>>> output = model(inputs)
>>> output.shape
torch.Size([2, 14, 768])
>>> tokens = model.infer_tokens(inputs)
>>> tokens.shape
torch.Size([2, 12, 768])
>>> tokens = model.infer_tokens(inputs, return_special_tokens=True)
>>> tokens.shape
torch.Size([2, 14, 768])
>>> item = model.infer_vector(inputs)
>>> item.shape
torch.Size([2, 768])
- property vector_size¶
- class EduNLP.Vector.QuesNetModel(pretrained_dir, img_dir=None, device='cpu', **kwargs)[source]¶
- infer_vector(items: Union[dict, list]) Tensor[source]¶
Get the question embedding with QuesNet
- Parameters
items – the encodings produced by the tokenizer
- infer_tokens(items: Union[dict, list]) Tensor[source]¶
Get the token embeddings with QuesNet
- Parameters
items – the encodings produced by the tokenizer
- Returns
word_embs + meta_emb
- Return type
torch.Tensor
- property vector_size¶
- class EduNLP.Vector.DisenQModel(pretrained_dir, device='cpu')[source]¶
- infer_vector(items: dict, vector_type=None, **kwargs) Tensor[source]¶
- Parameters
vector_type (str) – the type of item tensor to return. By default None, which returns both (k_hidden, i_hidden); when vector_type=“k”, returns k_hidden; when vector_type=“i”, returns i_hidden
- property vector_size¶
Pipeline¶
- EduNLP.Pipeline.pipeline(task: Optional[str] = None, model: Optional[Union[BaseModel, str]] = None, tokenizer: Optional[PretrainedEduTokenizer] = None, pipeline_class: Optional[Pipeline] = None, preprocess: Optional[List] = None, **kwargs)[source]¶
- Parameters
task (str, required) –
model (BaseModel or str, optional) –
tokenizer (PretrainedEduTokenizer, optional) –
pipeline_class (Pipeline, optional) – to specify Pipeline class
preprocess (list, optional) – a list of names of pre-process pipes
Examples
>>> processor = pipeline(task="property-prediction")
>>> item = "如图所示,则三角形ABC的面积是_。"
>>> processor(item)