EduNLP

SIF

EduNLP.SIF.sif.is_sif(item, check_formula=True, return_parser=False)[source]

Check whether the input item is in SIF format.

Parameters
  • item (str) – a raw item that represents the stem

  • check_formula (bool) –

    whether to check the formulas when parsing item.

    True: check the validity of formulas in item. False: do not check the validity of formulas in item, which is faster.

  • return_parser (bool) –

    whether to put the parsed item in return.

    When True, the return value is (bool, Parser). When False, the return value is bool.

Returns

When item cannot be parsed correctly, ValueError is raised. When item is originally in standard format, True is returned (together with the Parser of item when return_parser is True). When item is not originally in standard format, False is returned (together with the Parser of item when return_parser is True).

Return type

bool, or (bool, Parser) when return_parser is True

Examples

>>> text = '若$x,y$满足约束条件' \
...        '$\\left\\{\\begin{array}{c}2 x+y-2 \\leq 0 \\\\ x-y-1 \\geq 0 \\\\ y+1 \\geq 0\\end{array}\\right.$,' \
...        '则$z=x+7 y$的最大值$\\SIFUnderline$'
>>> is_sif(text)
True
>>> text = '某校一个课外学习小组为研究某作物的发芽率y和温度x(单位...'
>>> ret = is_sif(text, return_parser=True)
>>> ret 
(False, <EduNLP.SIF.parser.parser.Parser object...>)
EduNLP.SIF.sif.to_sif(item, check_formula=True, parser: Optional[Parser] = None)[source]

Convert an item to SIF format.

Parameters
  • item (str) – a raw item that represents the stem

  • check_formula (bool) – whether to check the formulas when parsing item (only takes effect when parser=None).

  • parser (Parser) – the parser of item returned from is_sif.

Returns

item – the item converted to SIF format

Return type

str

Examples

>>> text = '某校一个课外学习小组为研究某作物的发芽率y和温度x(单位...'
>>> siftext = to_sif(text)
>>> siftext
'某校一个课外学习小组为研究某作物的发芽率$y$和温度$x$(单位...'
>>> ret = is_sif(text, return_parser=True)
>>> ret 
(False, <EduNLP.SIF.parser.parser.Parser object...>)
>>> to_sif(text, parser=ret[1])
'某校一个课外学习小组为研究某作物的发芽率$y$和温度$x$(单位...'
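
The two examples above suggest a common pattern: check the item once, and convert it only when needed while reusing the parser from the check. The helper below is a minimal sketch of that pattern, not part of EduNLP. The name ensure_sif is hypothetical, and the two functions are passed in as arguments so the sketch runs without the library installed; it assumes they behave as documented above.

```python
from typing import Any, Callable


def ensure_sif(item: str,
               is_sif: Callable[..., Any],
               to_sif: Callable[..., str],
               check_formula: bool = True) -> str:
    """Return `item` unchanged if it is already SIF, otherwise convert it."""
    # With return_parser=True the documented return value is (bool, Parser).
    ok, parser = is_sif(item, check_formula=check_formula, return_parser=True)
    if ok:
        return item
    # Hand the parser back so to_sif can presumably reuse the earlier parse;
    # check_formula is documented to matter only when parser is None.
    return to_sif(item, parser=parser)


# With EduNLP installed, usage would presumably look like:
#   from EduNLP.SIF.sif import is_sif, to_sif
#   sif_text = ensure_sif(text, is_sif, to_sif)
```
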
EduNLP.SIF.sif.sif4sci(item: str, figures: (<class 'dict'>, <class 'bool'>) = None, mode: int = 2, symbol: str = None, tokenization=True, tokenization_params=None, errors='raise')[source]

The linear tokenizer is used by default; change the tokenizer by specifying tokenization_params.

Parameters
  • item (str) – a raw item that represents the stem

  • figures (dict or bool) – when it is a dict, it is the id-to-instance mapping for figures in ‘FormFigureID{…}’ format; when it is a bool, it indicates whether to instantiate figures in ‘FormFigureBase64{…}’ format

  • mode (int) –

    • mode = 2: use is_sif and check formulas in item

    • mode = 1: use is_sif but don’t check formulas in item

    • mode = 0: don’t use is_sif and don’t check anything in item

  • symbol (str) –

    select the methods to symbolize:

    "t": text, "f": formula, "g": figure, "m": question mark, "a": tag, "s": sep

  • tokenization (bool) – whether to tokenize item after segmentation

  • tokenization_params (dict) –

    the dict of text_params, formula_params and figure_params used in tokenization.

    For formula_params:

    • method: which tokenizer to use, "linear" or "ast"

    • parameters only useful for "linear":

      skip_figure_formula: whether to skip the formula in figure format

      symbolize_figure_formula: whether to symbolize the formula in figure format

    • parameters only useful for "ast":

      ord2token: whether to transfer the variables (mathord) and constants (textord) to special tokens

      var_numbering: whether to use a number suffix to denote different variables

      return_type: 'list' or 'ast'

    More parameters can be found in the definition in SIF.tokenization.formula

    For figure_params:

    • figure_instance: whether to return instances of figures in tokens

    For text_params (see the definition in SIF.tokenization.text):

    • granularity: word or char

    • stopwords: default or None or list

  • errors (str) – how errors are handled, one of: warn, raise, coerce, strict, ignore
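
The nested layout of tokenization_params described above can be assembled with a small helper. This is a sketch only: the helper name make_tokenization_params is hypothetical, and raising an error for a misplaced key is a choice made here for illustration, not documented library behavior. Only keys named in this section are treated specially; any other formula option is passed through untouched.

```python
from typing import Any, Dict, Optional

# Keys this section documents as specific to one formula tokenizer.
LINEAR_ONLY = {"skip_figure_formula", "symbolize_figure_formula"}
AST_ONLY = {"ord2token", "var_numbering", "return_type"}


def make_tokenization_params(method: str = "linear",
                             text_params: Optional[Dict[str, Any]] = None,
                             figure_params: Optional[Dict[str, Any]] = None,
                             **formula_options: Any) -> Dict[str, Dict[str, Any]]:
    """Build the dict passed as `tokenization_params` (hypothetical helper)."""
    if method not in ("linear", "ast"):
        raise ValueError(f"unknown formula tokenizer: {method!r}")
    # Reject options documented as belonging to the other tokenizer.
    wrong_side = AST_ONLY if method == "linear" else LINEAR_ONLY
    misplaced = sorted(set(formula_options) & wrong_side)
    if misplaced:
        raise ValueError(f"{misplaced} not useful for method={method!r}")
    params: Dict[str, Dict[str, Any]] = {
        "formula_params": {"method": method, **formula_options}
    }
    if text_params is not None:
        params["text_params"] = dict(text_params)
    if figure_params is not None:
        params["figure_params"] = dict(figure_params)
    return params


params = make_tokenization_params("ast", return_type="list", ord2token=True)
```

The resulting dict has the same shape as the tokenization_params values used in the examples of this function.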

Returns

When tokenization is False, return SegmentList; When tokenization is True, return TokenList

Return type

list

Examples

>>> test_item = r"如图所示,则$\bigtriangleup ABC$的面积是$\SIFBlank$。$\FigureID{1}$"
>>> tl = sif4sci(test_item)
>>> tl
['如图所示', '\\bigtriangleup', 'ABC', '面积', '\\SIFBlank', \FigureID{1}]
>>> tl.describe()
{'t': 2, 'f': 2, 'g': 1, 'm': 1}
>>> with tl.filter('fgm'):
...     tl
['如图所示', '面积']
>>> with tl.filter(keep='t'):
...     tl
['如图所示', '面积']
>>> with tl.filter():
...     tl
['如图所示', '\\bigtriangleup', 'ABC', '面积', '\\SIFBlank', \FigureID{1}]
>>> tl.text_tokens
['如图所示', '面积']
>>> tl.formula_tokens
['\\bigtriangleup', 'ABC']
>>> tl.figure_tokens
[\FigureID{1}]
>>> tl.ques_mark_tokens
['\\SIFBlank']
>>> sif4sci(test_item, symbol="gm", tokenization_params={"formula_params": {"method": "ast"}})
['如图所示', <Formula: \bigtriangleup ABC>, '面积', '[MARK]', '[FIGURE]']
>>> sif4sci(test_item, symbol="tfgm")
['[TEXT]', '[FORMULA]', '[TEXT]', '[MARK]', '[TEXT]', '[FIGURE]']
>>> sif4sci(test_item, symbol="gm",
... tokenization_params={"formula_params": {"method": "ast", "return_type": "list"}})
['如图所示', '\\bigtriangleup', 'A', 'B', 'C', '面积', '[MARK]', '[FIGURE]']
>>> test_item_1 = {
...     "stem": r"若$x=2$, $y=\sqrt{x}$,则下列说法正确的是$\SIFChoice$",
...     "options": [r"$x < y$", r"$y = x$", r"$y < x$"]
... }
>>> tls = [
...     sif4sci(e, symbol="gm",
...     tokenization_params={
...         "formula_params": {
...             "method": "ast", "return_type": "list", "ord2token": True, "var_numbering": True,
...             "link_variable": False}
...     })
...     for e in ([test_item_1["stem"]] + test_item_1["options"])
... ]
>>> tls[1:]
[['mathord_0', '<', 'mathord_1'], ['mathord_0', '=', 'mathord_1'], ['mathord_0', '<', 'mathord_1']]
>>> link_formulas(*tls)
>>> tls[1:]
[['mathord_0', '<', 'mathord_1'], ['mathord_1', '=', 'mathord_0'], ['mathord_1', '<', 'mathord_0']]
>>> from EduNLP.utils import dict2str4sif
>>> test_item_1_str = dict2str4sif(test_item_1, tag_mode="head", add_list_no_tag=False)
>>> test_item_1_str  
'$\\SIFTag{stem}$...则下列说法正确的是$\\SIFChoice$$\\SIFTag{options}$$x < y$$\\SIFSep$$y = x$$\\SIFSep$$y < x$'
>>> tl1 = sif4sci(test_item_1_str, symbol="gm",
... tokenization_params={"formula_params": {"method": "ast", "return_type": "list", "ord2token": True}})
>>> tl1.get_segments()[0]
['\\SIFTag{stem}']
>>> tl1.get_segments()[1:3]
[['[TEXT_BEGIN]', '[TEXT_END]'], ['[FORMULA_BEGIN]', 'mathord', '=', 'textord', '[FORMULA_END]']]
>>> tl1.get_segments(add_seg_type=False)[0:3]
[['\\SIFTag{stem}'], ['mathord', '=', 'textord'], ['mathord', '=', 'mathord', '{ }', '\\sqrt']]
>>> test_item_2 = {"options": [r"$x < y$", r"$y = x$", r"$y < x$"]}
>>> test_item_2
{'options': ['$x < y$', '$y = x$', '$y < x$']}
>>> test_item_2_str = dict2str4sif(test_item_2, tag_mode="head", add_list_no_tag=False)
>>> test_item_2_str
'$\\SIFTag{options}$$x < y$$\\SIFSep$$y = x$$\\SIFSep$$y < x$'
>>> tl2 = sif4sci(test_item_2_str, symbol="gms",
... tokenization_params={"formula_params": {"method": "ast", "return_type": "list"}})
>>> tl2
['\\SIFTag{options}', 'x', '<', 'y', '[SEP]', 'y', '=', 'x', '[SEP]', 'y', '<', 'x']
>>> tl2.get_segments(add_seg_type=False)
[['\\SIFTag{options}'], ['x', '<', 'y'], ['[SEP]'], ['y', '=', 'x'], ['[SEP]'], ['y', '<', 'x']]
>>> tl2.get_segments(add_seg_type=False, drop="s")
[['\\SIFTag{options}'], ['x', '<', 'y'], ['y', '=', 'x'], ['y', '<', 'x']]
>>> tl3 = sif4sci(test_item_1["stem"], symbol="gs")
>>> tl3.text_segments
[['说法', '正确']]
>>> tl3.formula_segments
[['x', '=', '2'], ['y', '=', '\\sqrt', '{', 'x', '}']]
>>> tl3.figure_segments
[]
>>> tl3.ques_mark_segments
[['\\SIFChoice']]
>>> test_item_3 = r"已知$y=x$,则以下说法中$\textf{正确,b}$的是"
>>> tl4 = sif4sci(test_item_3)
Warning: there is some chinese characters in formula!
>>> tl4.text_segments
[['已知'], ['说法', '中', '正确']]
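
The effect of the symbol parameter in the examples above can be pictured with a toy model. The sketch below is not the EduNLP implementation: it assumes the item has already been split into (type code, content) pairs, and it only uses the placeholder strings that appear in the example outputs. No placeholder for "a" (tag) is shown in this section, so tags are left unchanged here.

```python
from typing import List, Tuple

# Placeholders taken from the example outputs in this section.
PLACEHOLDERS = {
    "t": "[TEXT]",
    "f": "[FORMULA]",
    "g": "[FIGURE]",
    "m": "[MARK]",
    "s": "[SEP]",
}


def symbolize(segments: List[Tuple[str, str]], symbol: str) -> List[str]:
    """Replace every segment whose type code occurs in `symbol`."""
    result = []
    for kind, content in segments:
        if kind in symbol and kind in PLACEHOLDERS:
            result.append(PLACEHOLDERS[kind])
        else:
            result.append(content)
    return result


# Hand-written segmentation of the first test item (an assumption, not
# produced by the library).
segments = [
    ("t", "如图所示,则"),
    ("f", r"\bigtriangleup ABC"),
    ("t", "的面积是"),
    ("m", r"\SIFBlank"),
    ("t", "。"),
    ("g", r"\FigureID{1}"),
]
all_symbols = symbolize(segments, "tfgm")
only_gm = symbolize(segments, "gm")
```

With symbol "tfgm" every segment becomes a placeholder, which mirrors the sif4sci(test_item, symbol="tfgm") output shown earlier; with "gm" only the question mark and the figure are replaced.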

EduNLP.Formula

EduNLP.Formula.ast.str2ast(formula: str, *args, **kwargs)[source]

Interface for string input.

EduNLP.Formula.ast.get_edges(forest)[source]

Construct the edge set.

Parameters

forest (List[Dict]) – the forest

Returns

edges – the edge set

Return type

list of tuple(src,dst,type)

EduNLP.Formula.ast.ast(formula: (<class 'str'>, typing.List[typing.Dict]), index=0, forest_begin=0, father_tree=None, is_str=False)[source]

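The original code was written by https://github.com/hxwujinze

Parameters
  • formula (str or List[Dict]) – the formula string, or the structure obtained by parsing it with KaTeX

  • index (int) – the position of this subtree in the tree

  • forest_begin (int) – the starting position of this tree in the forest

  • father_tree (List[Dict]) – the parent tree

  • is_str (bool) –

Returns

  • tree (List[Dict]) – the feature tree formed by re-parsing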

  • todo (finish all types)

Notes

Some functions are not supported in KaTeX, e.g.,

  1. tag
    • \begin{equation} \tag{tagName} F=ma \end{equation}

    • \begin{align} \tag{1} y=x+z \end{align}

    • \tag*{hi} x+y^{2x}

  2. dddot
    • \frac{ \dddot y }{ x }

For more information, refer to the KaTeX support table.
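
Since the commands listed in the Notes are reported as unsupported, a formula can be scanned for them before it is parsed. The sketch below is an assumption-laden illustration, not part of EduNLP: the function name find_unsupported_commands is hypothetical, and the pattern covers only the two commands named above (\tag, including \tag*, and \dddot), not the full KaTeX support table.

```python
import re
from typing import List

# Only the two commands named in the Notes; a trailing letter would mean a
# different command (e.g. a hypothetical \tagged), so it is excluded.
_UNSUPPORTED = re.compile(r"\\(tag|dddot)(?![A-Za-z])")


def find_unsupported_commands(formula: str) -> List[str]:
    """Return the sorted, de-duplicated unsupported commands in `formula`."""
    return sorted({"\\" + m.group(1) for m in _UNSUPPORTED.finditer(formula)})


found = find_unsupported_commands(r"\frac{ \dddot y }{ x }")
```
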

Build the forest.

Parameters

forest (List[Dict]) –

Returns

trees

Return type

List[Dict]

EduNLP.Formula.ast.katex_parse(formula)[source]

Pass the formula to KaTeX for syntax parsing.

EduNLP.I2V

class EduNLP.I2V.i2v.I2V(tokenizer, t2v, *args, tokenizer_kwargs: Optional[dict] = None, pretrained_t2v=False, model_dir='/home/docs/.EduNLP/model', **kwargs)[source]

This is only a base API and should not be used directly. To get vectors from items, use a concrete model such as D2V or W2V.

Parameters
  • tokenizer (str) – the name of the tokenizer, e.g. bert, pure_text, …

  • t2v (str) – the name of token2vector model

  • args – the parameters passed to t2v

  • tokenizer_kwargs (dict) – the parameters passed to tokenizer

  • pretrained_t2v (bool) –

    • True: use pretrained t2v model

    • False: use your own t2v model

  • model_dir (str) – local directory for saving downloaded pretrained models; only takes effect when pretrained_t2v=True

  • kwargs – the parameters passed to t2v

Examples

>>> item = {r"如图来自古希腊数学家希波克拉底所研究的几何图形.此图由三个半圆构成,三个半圆的直径分别为直角三角形$ABC$的斜边$BC$,"
...         r"直角边$AB$, $AC$.$\bigtriangleup ABC$的三边所围成的区域记为$I$,黑色部分记为$II$, 其余部分记为$III$.在整个图形中随机取一点,"
...         r"此点取自$I,II,III$的概率分别记为$p_1,p_2,p_3$,则$\SIFChoice$$\FigureID{1}$"}
>>> model_dir = "examples/test_model/d2v"
>>> url, model_name, *args = get_pretrained_model_info('d2v_test_256')
>>> (); path = get_data(url, model_dir); () 
(...)
>>> path = path_append(path, os.path.basename(path) + '.bin', to_str=True)
>>> i2v = D2V("pure_text", "d2v", filepath=path, pretrained_t2v=False)
>>> i2v(item)
([array([ ...dtype=float32)], None)
Returns

i2v model

Return type

I2V

tokenize(items, *args, key=<function I2V.<lambda>>, **kwargs) list[source]
infer_vector(items, key=<function I2V.<lambda>>, **kwargs) tuple[source]
infer_item_vector(tokens, *args, **kwargs) ...[source]
infer_token_vector(tokens, *args, **kwargs) ...[source]
save(config_path)[source]
classmethod load(config_path, *args, **kwargs)[source]
classmethod from_pretrained(name, model_dir='/home/docs/.EduNLP/model', *args, **kwargs)[source]
property vector_size
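
The class above pairs a tokenizer with a token-to-vector (t2v) model, and calling it returns item vectors together with token vectors. The toy below sketches that shape only. It is not the EduNLP class: ToyI2V and both callables are hypothetical stand-ins, and averaging token vectors into an item vector is an assumption made for illustration.

```python
from typing import Any, Callable, List, Sequence, Tuple

Vector = List[float]


class ToyI2V:
    """Toy stand-in for a tokenizer + t2v pipeline (not the EduNLP class)."""

    def __init__(self, tokenizer: Callable[[str], List[str]],
                 t2v: Callable[[str], Vector]) -> None:
        self.tokenizer = tokenizer  # hypothetical: maps text to tokens
        self.t2v = t2v              # hypothetical: maps one token to a vector

    def tokenize(self, items: Sequence[Any],
                 key: Callable[[Any], str] = lambda x: x) -> List[List[str]]:
        # `key` extracts the text of each item, as in the signature above.
        return [self.tokenizer(key(item)) for item in items]

    def infer_vector(self, items: Sequence[Any],
                     key: Callable[[Any], str] = lambda x: x
                     ) -> Tuple[List[Vector], List[List[Vector]]]:
        token_vectors = [[self.t2v(tok) for tok in tokens]
                         for tokens in self.tokenize(items, key=key)]
        # Assumption: an item vector is the mean of its token vectors.
        item_vectors = [
            [sum(col) / len(vecs) for col in zip(*vecs)] if vecs else []
            for vecs in token_vectors
        ]
        return item_vectors, token_vectors

    def __call__(self, items: Sequence[Any], **kwargs: Any):
        return self.infer_vector(items, **kwargs)


toy = ToyI2V(str.split, lambda tok: [float(len(tok)), 1.0])
item_vectors, token_vectors = toy(["a bcd", "xy"])
```
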
class EduNLP.I2V.i2v.D2V(tokenizer, t2v, *args, tokenizer_kwargs: Optional[dict] = None, pretrained_t2v=False, model_dir='/home/docs/.EduNLP/model', **kwargs)[source]

This model converts an item to a vector directly.

Bases

I2V

Parameters
  • tokenizer (str) – the tokenizer name

  • t2v (str) – the name of token2vector model

  • args – the parameters passed to t2v

  • tokenizer_kwargs (dict) – the parameters passed to tokenizer

  • pretrained_t2v (bool) –

    • True: use pretrained t2v model

    • False: use your own t2v model

  • kwargs – the parameters passed to t2v

Examples

>>> item = {r"如图来自古希腊数学家希波克拉底所研究的几何图形.此图由三个半圆构成,三个半圆的直径分别为直角三角形$ABC$的斜边$BC$,"
...         r"直角边$AB$, $AC$.$\bigtriangleup ABC$的三边所围成的区域记为$I$,黑色部分记为$II$, 其余部分记为$III$.在整个图形中随机取一点,"
...         r"此点取自$I,II,III$的概率分别记为$p_1,p_2,p_3$,则$\SIFChoice$$\FigureID{1}$"}
>>> model_dir = "examples/test_model/d2v"
>>> url, model_name, *args = get_pretrained_model_info('d2v_test_256')
>>> (); path = get_data(url, model_dir); () 
(...)
>>> path = path_append(path, os.path.basename(path) + '.bin', to_str=True)
>>> i2v = D2V("pure_text","d2v",filepath=path, pretrained_t2v = False)
>>> i2v(item)
([array([ ...dtype=float32)], None)
Returns

i2v model

Return type

I2V

infer_vector(items, tokenize=True, key=<function D2V.<lambda>>, *args, **kwargs) tuple[source]

Convert items to vectors. The model must be loaded before calling this method.

Parameters
  • items (str) – the text of the question

  • tokenize (bool) – whether to tokenize the item

  • key (function) – determine how to get the text of each item

  • args – the parameters passed to t2v

  • kwargs – the parameters passed to t2v

Returns

vector

Return type

list

classmethod from_pretrained(name, model_dir='/home/docs/.EduNLP/model', *args, **kwargs)[source]
class EduNLP.I2V.i2v.W2V(tokenizer, t2v, *args, tokenizer_kwargs: Optional[dict] = None, pretrained_t2v=False, model_dir='/home/docs/.EduNLP/model', **kwargs)[source]

This model converts tokens to vectors.

Bases

I2V

Parameters
  • tokenizer (str) – the tokenizer name

  • t2v (str) – the name of token2vector model

  • args – the parameters passed to t2v

  • tokenizer_kwargs (dict) – the parameters passed to tokenizer

  • pretrained_t2v (bool) –

    • True: use pretrained t2v model

    • False: use your own t2v model

  • kwargs – the parameters passed to t2v

Examples

>>> (); i2v = get_pretrained_i2v("w2v_test_256", "examples/test_model/w2v"); () 
(...)
>>> item_vector, token_vector = i2v(["有学者认为:‘学习’,必须适应实际"]) 
>>> item_vector 
[array([...], dtype=float32)]
Returns

i2v model

Return type

W2V

infer_vector(items, tokenize=True, key=<function W2V.<lambda>>, *args, **kwargs) tuple[source]

Convert items to vectors. The model must be loaded before calling this method.

Parameters
  • items (str) – the text of the question

  • tokenize (bool) – whether to tokenize the item

  • key (function) – determine how to get the text of each item

  • args – the parameters passed to t2v

  • kwargs – the parameters passed to t2v

Returns

vector

Return type

list

classmethod from_pretrained(name, model_dir='/home/docs/.EduNLP/model', *args, **kwargs)[source]
class EduNLP.I2V.i2v.Elmo(tokenizer, t2v, *args, tokenizer_kwargs: Optional[dict] = None, pretrained_t2v=False, model_dir='/home/docs/.EduNLP/model', **kwargs)[source]

This model converts items and tokens to vectors with Elmo.

Bases

I2V

Parameters
  • tokenizer (str) – the tokenizer name

  • t2v (str) – the name of token2vector model

  • args – the parameters passed to t2v

  • tokenizer_kwargs (dict) – the parameters passed to tokenizer

  • pretrained_t2v (bool) –

    • True: use pretrained t2v model

    • False: use your own t2v model

  • kwargs – the parameters passed to t2v

Returns

i2v model

Return type

Elmo

infer_vector(items: ~typing.Tuple[~typing.List[str], ~typing.List[dict], str, dict], *args, key=<function Elmo.<lambda>>, **kwargs) tuple[source]

Convert items to vectors. The model must be loaded before calling this method.

Parameters
  • items (str or dict or list) – the item of question, or question list

  • return_tensors (str) – tensor type used in tokenizer

  • args – the parameters passed to t2v

  • kwargs – the parameters passed to t2v

Returns

vector

Return type

list

classmethod from_pretrained(name, model_dir='/home/docs/.EduNLP/model', *args, **kwargs)[source]
class EduNLP.I2V.i2v.Bert(tokenizer, t2v, *args, tokenizer_kwargs: Optional[dict] = None, pretrained_t2v=False, model_dir='/home/docs/.EduNLP/model', **kwargs)[source]

This model converts items and tokens to vectors with Bert.

Bases

I2V

Parameters
  • tokenizer (str) – the tokenizer name

  • t2v (str) – the name of token2vector model

  • args – the parameters passed to t2v

  • tokenizer_kwargs (dict) – the parameters passed to tokenizer

  • pretrained_t2v (bool) –

    • True: use pretrained t2v model

    • False: use your own t2v model

  • kwargs – the parameters passed to t2v

Returns

i2v model

Return type

Bert

infer_vector(items: ~typing.Tuple[~typing.List[str], ~typing.List[dict], str, dict], *args, key=<function Bert.<lambda>>, return_tensors='pt', **kwargs) tuple[source]

Convert items to vectors. The model must be loaded before calling this method.

Parameters
  • items (str or dict or list) – the item of question, or question list

  • return_tensors (str) – tensor type used in tokenizer

  • args – the parameters passed to t2v

  • kwargs – the parameters passed to t2v

Returns

vector

Return type

list

classmethod from_pretrained(name, model_dir='/home/docs/.EduNLP/model', *args, **kwargs)[source]
class EduNLP.I2V.i2v.DisenQ(tokenizer, t2v, *args, tokenizer_kwargs: Optional[dict] = None, pretrained_t2v=False, model_dir='/home/docs/.EduNLP/model', **kwargs)[source]

This model converts items and tokens to vectors with DisenQ.

Bases

I2V

Parameters
  • tokenizer (str) – the tokenizer name

  • t2v (str) – the name of token2vector model

  • args – the parameters passed to t2v

  • tokenizer_kwargs (dict) – the parameters passed to tokenizer

  • pretrained_t2v (bool) –

    • True: use pretrained t2v model

    • False: use your own t2v model

  • kwargs – the parameters passed to t2v

Returns

i2v model

Return type

DisenQ

infer_vector(items: ~typing.Tuple[~typing.List[str], ~typing.List[dict], str, dict], *args, key=<function DisenQ.<lambda>>, vector_type=None, **kwargs) tuple[source]

Convert items to vectors. The model must be loaded before calling this method.

Parameters
  • items (str or dict or list) – the item of question, or question list

  • key (function) – determine how to get the text of each item

  • args – the parameters passed to t2v

  • kwargs – the parameters passed to t2v

Returns

vector

Return type

list

classmethod from_pretrained(name, model_dir='/home/docs/.EduNLP/model', **kwargs)[source]
class EduNLP.I2V.i2v.QuesNet(tokenizer, t2v, *args, tokenizer_kwargs: Optional[dict] = None, pretrained_t2v=False, model_dir='/home/docs/.EduNLP/model', **kwargs)[source]

This model converts items and tokens to vectors with QuesNet.

Bases

I2V

infer_vector(items: ~typing.Tuple[~typing.List[str], ~typing.List[dict], str, dict], *args, key=<function QuesNet.<lambda>>, meta=['know_name'], **kwargs)[source]

Convert items to vectors. The model must be loaded before calling this method.

Parameters
  • items (str or dict or list) – the item of question, or question list

  • tokenize (bool, optional) – True: tokenize the item

  • key (function, optional) – determine how to get the text of each item, by default lambda x: x

  • meta (list, optional) – meta information, by default [‘know_name’]

  • args – the parameters passed to t2v

  • kwargs – the parameters passed to t2v

Returns

  • token embeddings

  • question embedding

classmethod from_pretrained(name, model_dir='/home/docs/.EduNLP/model', *args, **kwargs)[source]
EduNLP.I2V.i2v.get_pretrained_i2v(name, model_dir='/home/docs/.EduNLP/model')[source]

A convenient way to convert items to vectors easily.

Parameters
  • name (str) –

    the name of the item2vector model, e.g.:

    d2v_math_300, w2v_math_300, elmo_math_2048, bert_math_768, bert_taledu_768, disenq_math_256, quesnet_math_512

  • model_dir (str) – the path of model, default: MODEL_DIR = ‘~/.EduNLP/model’

Returns

i2v model

Return type

I2V

Examples

>>> item = {r"如图来自古希腊数学家希波克拉底所研究的几何图形.此图由三个半圆构成,三个半圆的直径分别为直角三角形$ABC$的斜边$BC$,"
...         r"直角边$AB$, $AC$.$\bigtriangleup ABC$的三边所围成的区域记为$I$,黑色部分记为$II$, 其余部分记为$III$.在整个图形中随机取一点,"
...         r"此点取自$I,II,III$的概率分别记为$p_1,p_2,p_3$,则$\SIFChoice$$\FigureID{1}$"}
>>> (); i2v = get_pretrained_i2v("d2v_test_256", "examples/test_model/d2v"); () 
(...)
>>> print(i2v(item)) 
([array([ ...dtype=float32)], None)

EduNLP.Pretrain

EduNLP.Pretrain.train_vector(items, w2v_prefix, embedding_dim=None, method='sg', binary=None, train_params=None)[source]
Parameters
  • items (str) – the text of the question

  • w2v_prefix

  • embedding_dim (int) – vector_size

  • method (str) – the method of training, e.g.: sg, cbow, fasttext, d2v, bow, tfidf

  • binary (bool) – the model format; True: bin, False: kv

  • train_params (dict) – the training parameters passed to model

Returns

tokenizer

Return type

Tokenizer

Examples

>>> tokenizer = GensimSegTokenizer(symbol="gms", depth=None)
>>> token_item = tokenizer(r"有公式$\FormFigureID{1}$,如图$\FigureID{088f15ea-xxx}$,"
...                        r"若$x,y$满足约束条件公式$\FormFigureBase64{2}$,$\SIFSep$,则$z=x+7 y$的最大值为$\SIFBlank$")
>>> print(token_item[:10])
[['公式'], [\FormFigureID{1}], ['如图'], ['[FIGURE]'],...['最大值'], ['[MARK]']]
>>> train_vector(token_item[:10], "examples/test_model/w2v/gensim_luna_stem_t_", 100) 
'examples/test_model/w2v/gensim_luna_stem_t_sg_100.kv'
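
The returned path in the example above appears to be built from the prefix, the training method, the embedding dimension and a format suffix. The helper below reproduces that naming as a sketch. It is inferred from this single example plus the binary description (True: bin, False: kv), the name expected_vector_path is hypothetical, and the library's handling of embedding_dim=None or binary=None is not modeled.

```python
def expected_vector_path(w2v_prefix: str, method: str,
                         embedding_dim: int, binary: bool = False) -> str:
    """Guess the output path from the naming seen in the example above."""
    # Assumption: "bin" for the binary format and "kv" otherwise.
    suffix = "bin" if binary else "kv"
    return f"{w2v_prefix}{method}_{embedding_dim}.{suffix}"


path = expected_vector_path("examples/test_model/w2v/gensim_luna_stem_t_", "sg", 100)
```
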
class EduNLP.Pretrain.GensimWordTokenizer(symbol='gm', general=False)[source]
Parameters
  • symbol (str) –

    select the methods to symbolize:

    ”t”: text, “f”: formula, “g”: figure, “m”: question mark, “a”: tag, “s”: sep,

    e.g.: gm, fgm, gmas, fgmas

  • general (bool) –

    True: use when the item isn’t in standard format and formulas (except formulas in figures) should be tokenized linearly.

    False: use the ‘ast’ method to tokenize formulas instead of ‘linear’.

Returns

tokenizer

Return type

Tokenizer

Examples

>>> tokenizer = GensimWordTokenizer(symbol="gmas", general=True)
>>> token_item = tokenizer(r"有公式$\FormFigureID{1}$,如图$\FigureID{088f15ea-xxx}$,"
...                        r"若$x,y$满足约束条件公式$\FormFigureBase64{2}$,$\SIFSep$,则$z=x+7 y$的最大值为$\SIFBlank$")
>>> print(token_item.tokens[:10])
['公式', '[FORMULA]', '如图', '[FIGURE]', 'x', ',', 'y', '约束条件', '公式', '[FORMULA]']
>>> tokenizer = GensimWordTokenizer(symbol="fgmas", general=False)
>>> token_item = tokenizer(r"有公式$\FormFigureID{1}$,如图$\FigureID{088f15ea-xxx}$,"
...                        r"若$x,y$满足约束条件公式$\FormFigureBase64{2}$,$\SIFSep$,则$z=x+7 y$的最大值为$\SIFBlank$")
>>> print(token_item.tokens[:10])
['公式', '[FORMULA]', '如图', '[FIGURE]', '[FORMULA]', '约束条件', '公式', '[FORMULA]', '[SEP]', '[FORMULA]']
batch_process(*items)[source]
class EduNLP.Pretrain.GensimSegTokenizer(symbol='gms', depth=None, flatten=False, **kwargs)[source]
Parameters
  • symbol (str) –

    select the methods to symbolize:

    ”t”: text, “f”: formula, “g”: figure, “m”: question mark, “a”: tag, “s”: sep,

    e.g. gms, fgm

  • depth (int or None) –

    • 0: only separate at SIFSep

    • 1: only separate at SIFTag

    • 2: separate at SIFTag and SIFSep

    • otherwise: separate all segments

Returns

tokenizer

Return type

Tokenizer

Examples

>>> tokenizer = GensimSegTokenizer(symbol="gms", depth=None)
>>> token_item = tokenizer(r"有公式$\FormFigureID{1}$,如图$\FigureID{088f15ea-xxx}$,"
...                        r"若$x,y$满足约束条件公式$\FormFigureBase64{2}$,$\SIFSep$,则$z=x+7 y$的最大值为$\SIFBlank$")
>>> print(token_item[:10])
[['公式'], [\FormFigureID{1}], ['如图'], ['[FIGURE]'],...['最大值'], ['[MARK]']]
>>> tokenizer = GensimSegTokenizer(symbol="fgm", depth=None)
>>> token_item = tokenizer(r"有公式$\FormFigureID{1}$,如图$\FigureID{088f15ea-xxx}$,"
...                        r"若$x,y$满足约束条件公式$\FormFigureBase64{2}$,$\SIFSep$,则$z=x+7 y$的最大值为$\SIFBlank$")
>>> print(token_item[:10])
[['公式'], ['[FORMULA]'], ['如图'], ['[FIGURE]'], ['[FORMULA]'],...['[FORMULA]'], ['最大值'], ['[MARK]']]
class EduNLP.Pretrain.QuesNetTokenizer(vocab_path=None, meta_vocab_dir=None, img_dir: Optional[str] = None, max_length=250, tokenize_method='custom', symbol='mas', add_specials: Optional[list] = None, meta: Optional[List[str]] = None, img_token='<img>', unk_token='<unk>', pad_token='<pad>', **kwargs)[source]

Examples

>>> tokenizer = QuesNetTokenizer(meta=['knowledge'])
>>> test_items = [{"ques_content": r"$\triangle A B C$ 的内角为 $A, \quad B, $\FigureID{test_id}$",
... "knowledge": "['*', '-', '/']"}, {"ques_content": r"$\triangle A B C$ 的内角为 $A, \quad B",
... "knowledge": "['*', '-', '/']"}]
>>> tokenizer.set_vocab(test_items,
... trim_min_count=1, key=lambda x: x["ques_content"], silent=True)
>>> tokenizer.set_meta_vocab(test_items, silent=True)
>>> token_items = [tokenizer(i, key=lambda x: x["ques_content"]) for i in test_items]
>>> print(token_items[0].keys())
dict_keys(['seq_idx', 'meta_idx'])
>>> token_items = tokenizer(test_items, key=lambda x: x["ques_content"])
>>> print(len(token_items["seq_idx"]))
2
load_meta_vocab(meta_vocab_dir)[source]
set_meta_vocab(items: list, meta: Optional[List[str]] = None, silent=True)[source]
set_vocab(items: list, key=<function QuesNetTokenizer.<lambda>>, lower: bool = False, trim_min_count: int = 1, do_tokenize: bool = True, silent=True)[source]
Parameters
  • items (list) – can be the list of str, or list of dict

  • key (function) – determine how to get the text of each item

  • trim_min_count (int, optional) – the lower bound number for adding a word into vocabulary, by default 1

  • silent
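
The trim_min_count parameter described above is a frequency threshold for vocabulary entries. The sketch below shows that idea with the standard library only. It is not the QuesNetTokenizer implementation: build_vocab is a hypothetical name, whitespace splitting stands in for the real tokenizer, and the order and ids of the special tokens are assumptions (the token strings themselves are the defaults from the class signature).

```python
from collections import Counter
from typing import Any, Callable, Dict, Iterable, List, Sequence


def build_vocab(items: Iterable[Any],
                key: Callable[[Any], str] = lambda x: x,
                tokenize: Callable[[str], List[str]] = str.split,
                trim_min_count: int = 1,
                specials: Sequence[str] = ("<pad>", "<unk>")) -> Dict[str, int]:
    """Map each sufficiently frequent token to an integer id."""
    counts: Counter = Counter()
    for item in items:
        # `key` extracts the text of each item, as in set_vocab.
        counts.update(tokenize(key(item)))
    # Special tokens come first (assumed order).
    vocab = {tok: idx for idx, tok in enumerate(specials)}
    # Most frequent first; ties broken alphabetically for determinism.
    for tok, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if n >= trim_min_count and tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab


items = [{"ques_content": "a b a"}, {"ques_content": "a c"}]
vocab = build_vocab(items, key=lambda x: x["ques_content"], trim_min_count=2)
```

Here only "a" occurs at least twice, so "b" and "c" are left out and would map to the unknown token at lookup time.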

classmethod from_pretrained(tokenizer_config_dir, img_dir=None, **kwargs)[source]

Parameters
  • tokenizer_config_dir (str) – must contain tokenizer_config.json, vocab.txt and meta_{meta_name}.txt

  • img_dir (str) – the path of the image directory, default None

save_pretrained(tokenizer_config_dir)[source]

Save the tokenizer to local files.

Parameters
  • tokenizer_config_dir (str) – the directory in which tokenizer params are saved in tokenizer_config.json, words in vocab.txt and metas in meta_{meta_name}.txt

padding(idx, max_length, type='word')[source]
set_img_dir(path)[source]
EduNLP.Pretrain.pretrain_quesnet(path, output_dir, img_dir=None, save_embs=False, train_params=None)[source]

pretrain quesnet

Parameters
  • path (str) – path of question file

  • output_dir (str) – output path

  • tokenizer (QuesNetTokenizer) – quesnet tokenizer

  • save_embs (bool, optional) – whether to save pretrained word/image/meta embeddings separately

  • train_params (dict, optional) –

    the training parameters and model parameters, by default None

    • ”n_epochs”: int, default = 1

      train param, number of epochs

    • ”batch_size”: int, default = 6

      train param, batch size

    • ”lr”: float, default = 1e-3

      train param, learning rate

    • ”save_every”: int, default = 0

      train param, save steps interval

    • ”log_steps”: int, default = 10

      train param, log steps interval

    • ”device”: str, default = ‘cpu’

      train param, ‘cpu’ or ‘cuda’

    • ”max_steps”: int, default = 0

      train param, stop training when reach max steps

    • ”emb_size”: int, default = 256

      model param, the embedding size of word, figure, meta info

    • ”feat_size”: int, default = 256

      model param, the size of question infer vector
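
The defaults listed above can be combined with user overrides before training. This is a minimal sketch of that merge, not the code used inside pretrain_quesnet: resolve_train_params and DEFAULT_TRAIN_PARAMS are hypothetical names, and the values are copied from the list above.

```python
from typing import Any, Dict, Optional

# Defaults copied from the train_params description above.
DEFAULT_TRAIN_PARAMS: Dict[str, Any] = {
    "n_epochs": 1,
    "batch_size": 6,
    "lr": 1e-3,
    "save_every": 0,
    "log_steps": 10,
    "device": "cpu",
    "max_steps": 0,
    "emb_size": 256,
    "feat_size": 256,
}


def resolve_train_params(train_params: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Return a fresh dict of defaults updated with the given overrides."""
    params = dict(DEFAULT_TRAIN_PARAMS)  # copy so the defaults stay untouched
    if train_params:
        params.update(train_params)
    return params


resolved = resolve_train_params({"n_epochs": 3, "device": "cuda"})
```

Only the overridden keys change; the remaining keys keep the documented defaults.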

Examples

>>> tokenizer = QuesNetTokenizer(meta=['know_name'])
>>> items = [{"ques_content": "若复数$z=1+2 i+i^{3}$,则$|z|=$,$\FigureID{000004d6-0479-11ec-829b-797d5eb43535}$",
... "ques_id": "726cdbec-33a9-11ec-909c-98fa9b625adb",
... "know_name": "['代数', '集合', '集合的相等']"
... }]
>>> tokenizer.set_vocab(items, key=lambda x: x['ques_content'], trim_min_count=1, silent=True)
>>> pretrain_quesnet('./data/standard_luna_data.json', './testQuesNet', tokenizer) 
class EduNLP.Pretrain.Question(id, content, answer, false_options, labels)
property answer

Alias for field number 2

property content

Alias for field number 1

property false_options

Alias for field number 3

property id

Alias for field number 0

property labels

Alias for field number 4
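
The "Alias for field number" descriptions are the docstrings that Python's namedtuple generates, which suggests Question is a named tuple with the five fields in the order shown. The stand-in below is defined locally to illustrate field access; it is an assumption about the class, and the sample values are invented purely for illustration.

```python
from collections import namedtuple

# Local stand-in with the same field order as the documented class.
Question = namedtuple("Question", ["id", "content", "answer", "false_options", "labels"])

# Sample values are illustrative only; the real field types are not documented here.
q = Question(id="q1", content="1 + 1 = ?", answer="2",
             false_options=["1", "3"], labels={"difficulty": 0.2})

# Fields can be read by name or by position ("Alias for field number N").
same_content = q.content == q[1]
```
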

class EduNLP.Pretrain.DisenQTokenizer(vocab_path=None, max_length=250, tokenize_method='pure_text', add_specials: Optional[list] = None, num_token='[NUM]', **kwargs)[source]

Examples

>>> tokenizer = DisenQTokenizer()
>>> test_items = [{
...     "content": "甲 数 除以 乙 数 的 商 是 1.5 , 如果 甲 数 增加 20 , 则 甲 数 是 乙 的 4 倍 . 原来 甲 数 = .",
...     "knowledge": ["*", "-", "/"], "difficulty": 0.2, "length": 7}]
>>> tokenizer.set_vocab(test_items,
...     key=lambda x: x["content"], trim_min_count=1)
[['甲', '数', '除以', '乙', '数', '商', '[NUM]', '甲', '数', '增加', '[NUM]', '甲', '数', '乙', '倍', '甲', '数']]
>>> token_items = [tokenizer(i, key=lambda x: x["content"]) for i in test_items]
>>> print(token_items[0].keys())
dict_keys(['seq_idx', 'seq_len'])
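
In the set_vocab output above, the numbers 1.5, 20 and 4 are replaced by the num_token '[NUM]'. The sketch below illustrates that replacement on pre-split tokens. It is not the DisenQTokenizer code: mask_numbers is a hypothetical name, the number pattern is an assumption, and the stopword removal visible in the example output is not modeled.

```python
import re
from typing import List

# Assumed number pattern: optional sign, digits, optional decimal part.
_NUMBER = re.compile(r"[+-]?\d+(?:\.\d+)?")


def mask_numbers(tokens: List[str], num_token: str = "[NUM]") -> List[str]:
    """Replace every token that is entirely a number with `num_token`."""
    return [num_token if _NUMBER.fullmatch(tok) else tok for tok in tokens]


masked = mask_numbers("商 是 1.5 , 增加 20".split())
```
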
class EduNLP.Pretrain.AutoTokenizer[source]

This is a generic tokenizer class that will be instantiated as one of the tokenizer classes of the library when created with the [AutoTokenizer.from_pretrained] class method.

This class cannot be instantiated directly using __init__() (throws an error).

classmethod from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)[source]

Instantiate one of the tokenizer classes of the library from a pretrained model vocabulary.

The tokenizer class to instantiate is selected based on the model_type property of the config object (either passed as an argument or loaded from pretrained_model_name_or_path if possible), or when it’s missing, by falling back to using pattern matching on pretrained_model_name_or_path:

  • albert – [AlbertTokenizerFast] (ALBERT model)

  • bart – [BartTokenizer] or [BartTokenizerFast] (BART model)

  • barthez – [BarthezTokenizerFast] (BARThez model)

  • bartpho – [BartphoTokenizer] (BARTpho model)

  • bert – [BertTokenizer] or [BertTokenizerFast] (BERT model)

  • bert-generation – (Bert Generation model)

  • bert-japanese – [BertJapaneseTokenizer] (BertJapanese model)

  • bertweet – [BertweetTokenizer] (BERTweet model)

  • big_bird – [BigBirdTokenizerFast] (BigBird model)

  • bigbird_pegasus – [PegasusTokenizer] or [PegasusTokenizerFast] (BigBird-Pegasus model)

  • blenderbot – [BlenderbotTokenizer] or [BlenderbotTokenizerFast] (Blenderbot model)

  • blenderbot-small – [BlenderbotSmallTokenizer] (BlenderbotSmall model)

  • bloom – [BloomTokenizerFast] (BLOOM model)

  • byt5 – [ByT5Tokenizer] (ByT5 model)

  • camembert – [CamembertTokenizerFast] (CamemBERT model)

  • canine – [CanineTokenizer] (CANINE model)

  • clip – [CLIPTokenizer] or [CLIPTokenizerFast] (CLIP model)

  • codegen – [CodeGenTokenizer] or [CodeGenTokenizerFast] (CodeGen model)

  • convbert – [ConvBertTokenizer] or [ConvBertTokenizerFast] (ConvBERT model)

  • cpm – [CpmTokenizerFast] (CPM model)

  • ctrl – [CTRLTokenizer] (CTRL model)

  • data2vec-text – [RobertaTokenizer] or [RobertaTokenizerFast] (Data2VecText model)

  • deberta – [DebertaTokenizer] or [DebertaTokenizerFast] (DeBERTa model)

  • deberta-v2 – [DebertaV2TokenizerFast] (DeBERTa-v2 model)

  • distilbert – [DistilBertTokenizer] or [DistilBertTokenizerFast] (DistilBERT model)

  • dpr – [DPRQuestionEncoderTokenizer] or [DPRQuestionEncoderTokenizerFast] (DPR model)

  • electra – [ElectraTokenizer] or [ElectraTokenizerFast] (ELECTRA model)

  • ernie – [BertTokenizer] or [BertTokenizerFast] (ERNIE model)

  • esm – [EsmTokenizer] (ESM model)

  • flaubert – [FlaubertTokenizer] (FlauBERT model)

  • fnet – [FNetTokenizer] or [FNetTokenizerFast] (FNet model)

  • fsmt – [FSMTTokenizer] (FairSeq Machine-Translation model)

  • funnel – [FunnelTokenizer] or [FunnelTokenizerFast] (Funnel Transformer model)

  • gpt2 – [GPT2Tokenizer] or [GPT2TokenizerFast] (OpenAI GPT-2 model)

  • gpt_neo – [GPT2Tokenizer] or [GPT2TokenizerFast] (GPT Neo model)

  • gpt_neox – [GPTNeoXTokenizerFast] (GPT NeoX model)

  • gpt_neox_japanese – [GPTNeoXJapaneseTokenizer] (GPT NeoX Japanese model)

  • gptj – [GPT2Tokenizer] or [GPT2TokenizerFast] (GPT-J model)

  • groupvit – [CLIPTokenizer] or [CLIPTokenizerFast] (GroupViT model)

  • herbert – [HerbertTokenizer] or [HerbertTokenizerFast] (HerBERT model)

  • hubert – [Wav2Vec2CTCTokenizer] (Hubert model)

  • ibert – [RobertaTokenizer] or [RobertaTokenizerFast] (I-BERT model)

  • layoutlm – [LayoutLMTokenizer] or [LayoutLMTokenizerFast] (LayoutLM model)

  • layoutlmv2 – [LayoutLMv2Tokenizer] or [LayoutLMv2TokenizerFast] (LayoutLMv2 model)

  • layoutlmv3 – [LayoutLMv3Tokenizer] or [LayoutLMv3TokenizerFast] (LayoutLMv3 model)

  • layoutxlm – [LayoutXLMTokenizer] or [LayoutXLMTokenizerFast] (LayoutXLM model)

  • led – [LEDTokenizer] or [LEDTokenizerFast] (LED model)

  • lilt – [LayoutLMv3Tokenizer] or [LayoutLMv3TokenizerFast] (LiLT model)

  • longformer – [LongformerTokenizer] or [LongformerTokenizerFast] (Longformer model)

  • longt5 – [T5TokenizerFast] (LongT5 model)

  • luke – [LukeTokenizer] (LUKE model)

  • lxmert – [LxmertTokenizer] or [LxmertTokenizerFast] (LXMERT model)

  • m2m_100 – (M2M100 model)

  • marian – (Marian model)

  • mbart – [MBartTokenizerFast] (mBART model)

  • mbart50 – [MBart50TokenizerFast] (mBART-50 model)

  • megatron-bert – [BertTokenizer] or [BertTokenizerFast] (Megatron-BERT model)

  • mluke – (mLUKE model)

  • mobilebert – [MobileBertTokenizer] or [MobileBertTokenizerFast] (MobileBERT model)

  • mpnet – [MPNetTokenizer] or [MPNetTokenizerFast] (MPNet model)

  • mt5 – [MT5TokenizerFast] (MT5 model)

  • mvp – [MvpTokenizer] or [MvpTokenizerFast] (MVP model)

  • nezha – [BertTokenizer] or [BertTokenizerFast] (Nezha model)

  • nllb – [NllbTokenizerFast] (NLLB model)

  • nystromformer – [AlbertTokenizerFast] (Nyströmformer model)

  • openai-gpt – [OpenAIGPTTokenizer] or [OpenAIGPTTokenizerFast] (OpenAI GPT model)

  • opt – [GPT2Tokenizer] (OPT model)

  • owlvit – [CLIPTokenizer] or [CLIPTokenizerFast] (OWL-ViT model)

  • pegasus – [PegasusTokenizerFast] (Pegasus model)

  • pegasus_x – [PegasusTokenizerFast] (PEGASUS-X model)

  • perceiver – [PerceiverTokenizer] (Perceiver model)

  • phobert – [PhobertTokenizer] (PhoBERT model)

  • plbart – (PLBart model)

  • prophetnet – [ProphetNetTokenizer] (ProphetNet model)

  • qdqbert – [BertTokenizer] or [BertTokenizerFast] (QDQBert model)

  • rag – [RagTokenizer] (RAG model)

  • realm – [RealmTokenizer] or [RealmTokenizerFast] (REALM model)

  • reformer – [ReformerTokenizerFast] (Reformer model)

  • rembert – [RemBertTokenizerFast] (RemBERT model)

  • retribert – [RetriBertTokenizer] or [RetriBertTokenizerFast] (RetriBERT model)

  • roberta – [RobertaTokenizer] or [RobertaTokenizerFast] (RoBERTa model)

  • roformer – [RoFormerTokenizer] or [RoFormerTokenizerFast] (RoFormer model)

  • speech_to_text – (Speech2Text model)

  • speech_to_text_2 – [Speech2Text2Tokenizer] (Speech2Text2 model)

  • splinter – [SplinterTokenizer] or [SplinterTokenizerFast] (Splinter model)

  • squeezebert – [SqueezeBertTokenizer] or [SqueezeBertTokenizerFast] (SqueezeBERT model)

  • t5 – [T5TokenizerFast] (T5 model)

  • tapas – [TapasTokenizer] (TAPAS model)

  • tapex – [TapexTokenizer] (TAPEX model)

  • transfo-xl – [TransfoXLTokenizer] (Transformer-XL model)

  • vilt – [BertTokenizer] or [BertTokenizerFast] (ViLT model)

  • visual_bert – [BertTokenizer] or [BertTokenizerFast] (VisualBERT model)

  • wav2vec2 – [Wav2Vec2CTCTokenizer] (Wav2Vec2 model)

  • wav2vec2-conformer – [Wav2Vec2CTCTokenizer] (Wav2Vec2-Conformer model)

  • wav2vec2_phoneme – [Wav2Vec2PhonemeCTCTokenizer] (Wav2Vec2Phoneme model)

  • whisper – (Whisper model)

  • xclip – [CLIPTokenizer] or [CLIPTokenizerFast] (X-CLIP model)

  • xglm – [XGLMTokenizerFast] (XGLM model)

  • xlm – [XLMTokenizer] (XLM model)

  • xlm-prophetnet – (XLM-ProphetNet model)

  • xlm-roberta – [XLMRobertaTokenizerFast] (XLM-RoBERTa model)

  • xlm-roberta-xl – [XLMRobertaTokenizerFast] (XLM-RoBERTa-XL model)

  • xlnet – [XLNetTokenizerFast] (XLNet model)

  • yoso – [AlbertTokenizerFast] (YOSO model)
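
As a rough illustration of how the list above can be read, the sketch below maps a model type to the tokenizer class names listed for it and prefers the fast variant when one exists. The table `TOKENIZER_NAMES` and the helper `pick_tokenizer_name` are hypothetical names introduced for this example; only a few entries are reproduced, and this is not the library's actual lookup code.

```python
from typing import Dict, Optional, Tuple

# (slow class name, fast class name); None means the list names no such variant.
# Hypothetical excerpt for illustration only.
TOKENIZER_NAMES: Dict[str, Tuple[Optional[str], Optional[str]]] = {
    "ctrl": ("CTRLTokenizer", None),
    "distilbert": ("DistilBertTokenizer", "DistilBertTokenizerFast"),
    "gpt2": ("GPT2Tokenizer", "GPT2TokenizerFast"),
    "t5": (None, "T5TokenizerFast"),
}


def pick_tokenizer_name(model_type: str, use_fast: bool = True) -> str:
    """Return the tokenizer class name to use for a model type."""
    if model_type not in TOKENIZER_NAMES:
        raise ValueError(f"unknown model type: {model_type!r}")
    slow, fast = TOKENIZER_NAMES[model_type]
    # Prefer the fast class when asked for it; otherwise use whichever one exists.
    chosen = fast if (use_fast and fast is not None) else (slow or fast)
    if chosen is None:
        raise ValueError(f"no tokenizer listed for {model_type!r}")
    return chosen


fast_name = pick_tokenizer_name("distilbert")
slow_name = pick_tokenizer_name("distilbert", use_fast=False)
```

In this sketch, a model type with only one listed variant always resolves to that variant, regardless of `use_fast`.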

Params:
pretrained_model_name_or_path (str or os.PathLike):

Can be either:

  • A string, the model id of a predefined tokenizer hosted inside a model repo on huggingface.co. Valid model ids can be located at the root-level, like bert-base-uncased, or namespaced under a user or organization name, like dbmdz/bert-base-german-cased.

  • A path to a directory containing vocabulary files required by the tokenizer, for instance saved using the [~PreTrainedTokenizer.save_pretrained] method, e.g., ./my_model_directory/.

  • A path or url to a single saved vocabulary file if and only if the tokenizer only requires a single vocabulary file (like Bert or XLNet), e.g.: ./my_model_directory/vocab.txt. (Not applicable to all derived classes)

inputs (additional positional arguments, optional):

Will be passed along to the Tokenizer __init__() method.

config ([PretrainedConfig], optional)

The configuration object used to determine the tokenizer class to instantiate.

cache_dir (str or os.PathLike, optional):

Path to a directory in which a downloaded pretrained model configuration should be cached if the standard cache should not be used.

force_download (bool, optional, defaults to False):

Whether or not to force the (re-)download of the model weights and configuration files, overriding the cached versions if they exist.

resume_download (bool, optional, defaults to False):

Whether or not to delete incompletely received files. Will attempt to resume the download if such a file exists.

proxies (Dict[str, str], optional):

A dictionary of proxy servers to use by protocol or endpoint, e.g., {‘http’: ‘foo.bar:3128’, ‘http://hostname’: ‘foo.bar:4012’}. The proxies are used on each request.

revision (str, optional, defaults to “main”):

The specific model version to use. It can be a branch name, a tag name, or a commit id, since we use a git-based system for storing models and other artifacts on huggingface.co, so revision can be any identifier allowed by git.

subfolder (str, optional):

In case the relevant files are located inside a subfolder of the model repo on huggingface.co (e.g. for facebook/rag-token-base), specify it here.

use_fast (bool, optional, defaults to True):

Whether or not to try to load the fast version of the tokenizer.

tokenizer_type (str, optional):

Tokenizer type to be loaded.

trust_remote_code (bool, optional, defaults to False):

Whether or not to allow for custom models defined on the Hub in their own modeling files. This option should only be set to True for repositories you trust and in which you have read the code, as it will execute code present on the Hub on your local machine.

kwargs (additional keyword arguments, optional):

Will be passed to the Tokenizer __init__() method. Can be used to set special tokens like bos_token, eos_token, unk_token, sep_token, pad_token, cls_token, mask_token, additional_special_tokens. See parameters in the __init__() for more details.

Examples:

```python
>>> from transformers import AutoTokenizer

>>> # Download vocabulary from huggingface.co and cache.
>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
>>> # Download vocabulary from huggingface.co (user-uploaded) and cache.
>>> tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-german-cased")
>>> # If vocabulary files are in a directory (e.g. tokenizer was saved using *save_pretrained('./test/saved_model/')*)
>>> tokenizer = AutoTokenizer.from_pretrained("./test/bert_saved_model/")
>>> # Download vocabulary from huggingface.co and define model-specific arguments
>>> tokenizer = AutoTokenizer.from_pretrained("roberta-base", add_prefix_space=True)
```
register(slow_tokenizer_class=None, fast_tokenizer_class=None)[source]

Register a new tokenizer in this mapping.

Parameters
  • config_class ([PretrainedConfig]) – The configuration corresponding to the model to register.

  • slow_tokenizer_class ([PreTrainedTokenizer], optional) – The slow tokenizer to register.

  • fast_tokenizer_class ([PreTrainedTokenizerFast], optional) – The fast tokenizer to register.

class EduNLP.Pretrain.BertDataset(tokenizer, ds_disk_path: Optional[Dataset] = None, items: Optional[Union[List[dict], List[str]]] = None, stem_key: str = 'text', label_key: Optional[str] = None, feature_keys: Optional[List[str]] = None, num_processor: Optional[int] = None, **kwargs)[source]
class EduNLP.Pretrain.BertTokenizer(pretrained_model='bert-base-chinese', max_length=512, tokenize_method: str = 'pure_text', add_specials: Union[List[str], bool] = False, **kwargs)[source]

Examples

>>> tokenizer = BertTokenizer(add_special_tokens=True)
>>> item = "有公式$\FormFigureID{wrong1?}$,如图$\FigureID{088f15ea-xxx}$,\
... 若$x,y$满足约束条件公式$\FormFigureBase64{wrong2?}$,$\SIFSep$,则$z=x+7 y$的最大值为$\SIFBlank$"
>>> token_item = tokenizer(item)
>>> print(token_item.input_ids)
tensor([[ 101, 1062, 2466, 1963, 1745,  138,  100,  140,  166,  117,  167, 5276,
         3338, 3340,  816, 1062, 2466,  102,  168,  134,  166,  116,  128,  167,
         3297, 1920,  966,  138,  100,  140,  102]])
>>> print(tokenizer.tokenize(item)[:10])
['公', '式', '如', '图', '[', '[UNK]', ']', 'x', ',', 'y']
>>> items = [item, item]
>>> token_items = tokenizer(items, return_tensors='pt')
>>> print(token_items.input_ids.shape)
torch.Size([2, 31])
>>> print(len(tokenizer.tokenize(items)))
2
>>> tokenizer.save_pretrained('test_dir') 
>>> tokenizer = BertTokenizer.from_pretrained('test_dir') 
class EduNLP.Pretrain.EduDataset(tokenizer, ds_disk_path: Optional[Dataset] = None, items: Optional[Union[List[dict], List[str]]] = None, stem_key: str = 'text', label_key: Optional[str] = None, feature_keys: Optional[List[str]] = None, num_processor: Optional[int] = None, **kwargs)[source]

The base class implements a Dataset that wraps datasets.Dataset and adds conveniences such as parallel preprocessing and offline loading.

Parameters
  • tokenizer – PretrainedEduTokenizer or model-specific Pretrained Tokenizer

  • ds_disk_path (HFDataset, optional) – the path on disk where the dataset used by datasets.Dataset is saved, by default None

  • items (Union[List[dict], List[str]], optional) – input items to process, by default None

  • stem_key (str, optional) – the content of items to process, by default “text”

  • label_key (Optional[str], optional) – the labels of items to process, by default None

  • feature_keys (Optional[List[str]], optional) – the additional features of items to remain, by default None

  • num_processor (int, optional) – the number of CPUs to use for parallel speedup, by default None

ds

Note: map will break down for very large data (greater than 4 GB).

to_disk(ds_disk_path)[source]

Save the processed dataset into local files

collect_fn()[source]
class EduNLP.Pretrain.EduVocab(vocab_path: Optional[str] = None, corpus_items: Optional[List[str]] = None, bos_token: str = '[BOS]', eos_token: str = '[EOS]', pad_token: str = '[PAD]', unk_token: str = '[UNK]', specials: Optional[List[str]] = None, lower: bool = False, trim_min_count: int = 1, **kwargs)[source]

The vocabulary container for a corpus.

Parameters
  • vocab_path (str, optional) – vocabulary path to initialize this container, by default None

  • corpus_items (List[str], optional) – corpus items to update this vocabulary, by default None

  • bos_token (str, optional) – token representing the start of a sentence, by default “[BOS]”

  • eos_token (str, optional) – token representing the end of a sentence, by default “[EOS]”

  • pad_token (str, optional) – token used for padding, by default “[PAD]”

  • unk_token (str, optional) – token representing an unknown word, by default “[UNK]”

  • specials (List[str], optional) – special tokens in the vocabulary, by default None

  • lower (bool, optional) – whether to lowercase the corpus items, by default False

  • trim_min_count (int, optional) – the lower bound number for adding a word into vocabulary, by default 1

property vocab_size
property special_tokens
property tokens
to_idx(token)[source]

convert token to index

to_token(idx)[source]

convert index to token

convert_sequence_to_idx(tokens, bos=False, eos=False)[source]

convert a sequence of tokens to a sequence of indices

convert_sequence_to_token(idxs, **kwargs)[source]

convert a sequence of indices to a sequence of tokens

set_vocab(corpus_items: List[str], lower: bool = False, trim_min_count: int = 1, silent=True)[source]

Update the vocabulary with the tokens in corpus items

Parameters
  • corpus_items (List[str], optional) – corpus items to update this vocabulary, by default None

  • lower (bool, optional) – whether to lowercase the corpus items, by default False

  • trim_min_count (int, optional) – the lower bound number for adding a word into vocabulary, by default 1

load_vocab(vocab_path: str)[source]

Load the vocabulary from vocab_file

Parameters

vocab_path (str) – path of the vocabulary file to load

save_vocab(vocab_path: str)[source]

Save the vocabulary into vocab_file

Parameters

vocab_path (str) – path to save vocabulary file

add_specials(tokens: List[str])[source]

Add special tokens into vocabulary

add_tokens(tokens: List[str])[source]

Add tokens into vocabulary
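
The conversions described above (token to index, index to token, optional start and end markers, and the `trim_min_count` threshold) can be sketched in plain Python. `TinyVocab` is a hypothetical, simplified stand-in, not the EduVocab implementation; it assumes the corpus is already tokenized into lists of tokens and that unknown tokens map to “[UNK]”.

```python
from collections import Counter
from typing import List, Sequence


class TinyVocab:
    """Hypothetical, simplified vocabulary container for illustration."""

    def __init__(self, corpus_items: List[List[str]], trim_min_count: int = 1,
                 specials: Sequence[str] = ("[PAD]", "[UNK]", "[BOS]", "[EOS]")):
        counts = Counter(tok for item in corpus_items for tok in item)
        # Special tokens come first; corpus tokens follow if frequent enough.
        kept = sorted(tok for tok, n in counts.items()
                      if n >= trim_min_count and tok not in specials)
        self.tokens = list(specials) + kept
        self.idx = {tok: i for i, tok in enumerate(self.tokens)}

    def to_idx(self, token: str) -> int:
        # Assumption: unknown tokens fall back to the index of "[UNK]".
        return self.idx.get(token, self.idx["[UNK]"])

    def to_token(self, idx: int) -> str:
        return self.tokens[idx]

    def convert_sequence_to_idx(self, tokens, bos=False, eos=False):
        ids = [self.to_idx(t) for t in tokens]
        if bos:
            ids = [self.idx["[BOS]"]] + ids
        if eos:
            ids = ids + [self.idx["[EOS]"]]
        return ids


# "x" occurs twice and is kept; "+" and "y" occur once and are trimmed.
vocab = TinyVocab([["x", "+", "y"], ["x"]], trim_min_count=2)
ids = vocab.convert_sequence_to_idx(["x", "y"], bos=True, eos=True)
```

Here `ids` starts with the “[BOS]” index and ends with the “[EOS]” index, and the trimmed token “y” is encoded as “[UNK]”.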

class EduNLP.Pretrain.ElmoDataset(tokenizer: ElmoTokenizer, **kwargs)[source]
collate_fn(batch_data)[source]
class EduNLP.Pretrain.ElmoTokenizer(vocab_path=None, max_length=250, tokenize_method='pure_text', add_specials=True, **kwargs)[source]

Examples

>>> t=ElmoTokenizer()
>>> items = ["有公式$\FormFigureID{wrong1?}$,如图$\FigureID{088f15ea-xxx}$,\
... 若$x,y$满足约束条件公式$\FormFigureBase64{wrong2?}$,$\SIFSep$,则$z=x+7 y$的最大值为$\SIFBlank$"]
>>> len(t)
14
>>> t.tokenize(items[0])
['公式', '如图', '[FIGURE]', 'x', ',', 'y', '约束条件', '公式', '[SEP]', 'z', '=', 'x', '+', '7', 'y', '最大值', '[MARK]']
>>> t(items[0])
{'seq_idx': tensor([1, 1, 6, 1, 1, 1, 1, 1, 9, 1, 1, 1, 1, 1, 1, 1, 7]), 'seq_len': tensor(17)}
>>> t.set_vocab(items[0])
['公式', '如图', '[FIGURE]', 'x', ',', 'y', '约束条件', '公式', '[SEP]', 'z', '=', 'x', '+', '7', 'y', '最大值', '[MARK]']
>>> len(t)
45
>>> t(items[0])
{'seq_idx': tensor([ 1,  1,  6, 26, 27, 28,  1,  1,  9, 35, 36, 26, 37, 38, 28,  1,  7]), 'seq_len': tensor(17)}
class EduNLP.Pretrain.PretrainedEduTokenizer(vocab_path: Optional[str] = None, max_length: int = 250, tokenize_method: str = 'pure_text', add_specials: Tuple[list, bool] = False, **kwargs)[source]

This base class is in charge of preparing the inputs for a model

Parameters
  • vocab_path (str, optional) – path to the vocabulary file used to initialize the tokenizer, by default None

  • max_length (int, optional) – the maximum sentence length; longer sentences are clipped, by default 250

  • tokenize_method (str, optional) – by default “pure_text”. When the text is already separated by spaces, use “space”; when the text is a raw string, use a Tokenizer defined in get_tokenizer(), such as “pure_text” or “text”

  • add_specials (Tuple[list, bool], optional) – by default False. For a bool, whether to add EDU_SPYMBOLS to the vocabulary; for a list, the special tokens to add besides EDU_SPYMBOLS

tokenize(items: ~typing.Tuple[list, str, dict], key=<function PretrainedEduTokenizer.<lambda>>, **kwargs)[source]
Parameters
  • items (list or str or dict) – the question items

  • key (function) – determine how to get the text of each item

Returns

tokens – the tokens of the items

Return type

list
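
To illustrate the role of the `key` argument, the sketch below accepts a single item or a list of items, uses `key` to extract the text of each one, and splits it on whitespace (the “space” method). `tokenize_items` is a hypothetical helper, and returning one token list per item is an assumption made for this example rather than the documented behaviour.

```python
from typing import Callable, List, Union

Item = Union[str, dict]


def tokenize_items(items: Union[Item, List[Item]],
                   key: Callable[[Item], str] = lambda x: x) -> List[List[str]]:
    """Extract the text of each item with `key`, then split it on whitespace."""
    if isinstance(items, (str, dict)):
        items = [items]  # treat a single item as a batch of one
    return [key(item).split() for item in items]


# Plain strings need no key; dict items need a key that selects the text field.
from_str = tokenize_items("x + y")
from_dict = tokenize_items([{"stem": "x + y"}], key=lambda d: d["stem"])
```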

encode(items: ~typing.Tuple[str, dict, ~typing.List[str], ~typing.List[dict]], key=<function PretrainedEduTokenizer.<lambda>>, **kwargs)[source]
decode(token_ids: list, key=<function PretrainedEduTokenizer.<lambda>>, **kwargs)[source]
classmethod from_pretrained(tokenizer_config_dir: str, **kwargs)[source]

Load tokenizer from local files

Parameters:

tokenizer_config_dir: str

The dir path containing tokenizer_config.json and vocab.list

save_pretrained(tokenizer_config_dir: str)[source]

Save tokenizer into local files

Parameters:

tokenizer_config_dir: str

save tokenizer params in /tokenizer_config.json and save words in /vocab.list
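
A save and load round trip over these two files might look like the following sketch. The helpers `save_tokenizer` and `load_tokenizer` are hypothetical, and the on-disk layout of vocab.list (one token per line) is an assumption, not something the documentation above specifies.

```python
import json
import os
import tempfile
from typing import List, Tuple


def save_tokenizer(config_dir: str, params: dict, words: List[str]) -> None:
    """Write params to tokenizer_config.json and words to vocab.list."""
    os.makedirs(config_dir, exist_ok=True)
    config_path = os.path.join(config_dir, "tokenizer_config.json")
    with open(config_path, "w", encoding="utf-8") as f:
        json.dump(params, f, ensure_ascii=False, indent=2)
    # Assumption: vocab.list stores one token per line.
    with open(os.path.join(config_dir, "vocab.list"), "w", encoding="utf-8") as f:
        f.write("\n".join(words))


def load_tokenizer(config_dir: str) -> Tuple[dict, List[str]]:
    """Read back what save_tokenizer wrote."""
    config_path = os.path.join(config_dir, "tokenizer_config.json")
    with open(config_path, encoding="utf-8") as f:
        params = json.load(f)
    with open(os.path.join(config_dir, "vocab.list"), encoding="utf-8") as f:
        words = f.read().split("\n")
    return params, words


with tempfile.TemporaryDirectory() as tmp:
    save_tokenizer(tmp, {"max_length": 250, "tokenize_method": "pure_text"},
                   ["[PAD]", "[UNK]", "x"])
    restored_params, restored_words = load_tokenizer(tmp)
```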

property vocab_size
set_vocab(items: list, key=<function PretrainedEduTokenizer.<lambda>>, lower: bool = False, trim_min_count: int = 1, do_tokenize: bool = True)[source]

Update the vocabulary with the tokens in corpus items

Parameters
  • items (list) – can be the list of str, or list of dict

  • key (function, optional) – determine how to get the text of each item

  • lower (bool, optional) – whether to lowercase the corpus items, by default False

  • trim_min_count (int, optional) – the lower bound number for adding a word into vocabulary, by default 1

  • do_tokenize (bool, optional) – whether to tokenize items before updating the vocabulary, by default True

Returns

token_items

Return type

list

add_specials(tokens)[source]

Add special tokens into vocabulary

add_tokens(tokens)[source]

Add tokens into vocabulary

class EduNLP.Pretrain.TensorType(value)[source]

Possible values for the return_tensors argument in [PreTrainedTokenizerBase.__call__]. Useful for tab-completion in an IDE.

PYTORCH = 'pt'
TENSORFLOW = 'tf'
NUMPY = 'np'
JAX = 'jax'
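
The value-to-member lookup that such an enumeration supports can be sketched with the standard library. `TensorTypeSketch` is a hypothetical stand-in that copies only the four values listed above; it is not the class documented here.

```python
from enum import Enum


class TensorTypeSketch(str, Enum):
    """Hypothetical mirror of the four values listed above."""
    PYTORCH = "pt"
    TENSORFLOW = "tf"
    NUMPY = "np"
    JAX = "jax"


# A string such as return_tensors="pt" can be resolved to a member by value.
member = TensorTypeSketch("pt")
```

Passing a value that is not listed (for example `TensorTypeSketch("xx")`) raises `ValueError`, which is one way a typo in `return_tensors` could be caught early.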
class EduNLP.Pretrain.TokenizerForHuggingface(pretrained_model='bert-base-chinese', max_length=512, tokenize_method: str = 'pure_text', add_specials: Union[List[str], bool] = False, **kwargs)[source]

Parameters

pretrained_model:

used pretrained model

add_specials:

Whether to add tokens like [FIGURE], [TAG], etc.

tokenize_method:

Which text tokenizer to use. Must be consistent with TOKENIZER dictionary.

Examples

>>> tokenizer = TokenizerForHuggingface(add_special_tokens=True)
>>> item = "有公式$\FormFigureID{wrong1?}$,如图$\FigureID{088f15ea-xxx}$,\
... 若$x,y$满足约束条件公式$\FormFigureBase64{wrong2?}$,$\SIFSep$,则$z=x+7 y$的最大值为$\SIFBlank$"
>>> token_item = tokenizer(item)
>>> print(token_item.input_ids[:10])
tensor([[ 101, 1062, 2466, 1963, 1745,  138,  100,  140,  166,  117,  167, 5276,
         3338, 3340,  816, 1062, 2466,  102,  168,  134,  166,  116,  128,  167,
         3297, 1920,  966,  138,  100,  140,  102]])
>>> print(tokenizer.tokenize(item)[:10])
['公', '式', '如', '图', '[', '[UNK]', ']', 'x', ',', 'y']
>>> items = [item, item]
>>> token_items = tokenizer(items, return_tensors='pt')
>>> print(token_items.input_ids.shape)
torch.Size([2, 31])
>>> print(len(tokenizer.tokenize(items)))
2
>>> tokenizer.save_pretrained('test_dir') 
>>> tokenizer = TokenizerForHuggingface.from_pretrained('test_dir') 
tokenize(items: ~typing.Union[list, str, dict], key=<function TokenizerForHuggingface.<lambda>>, **kwargs)[source]
encode(items: ~typing.Tuple[str, dict, ~typing.List[str], ~typing.List[dict]], key=<function TokenizerForHuggingface.<lambda>>, **kwargs)[source]
decode(token_ids: list, key=<function TokenizerForHuggingface.<lambda>>, **kwargs)[source]
classmethod from_pretrained(tokenizer_config_dir, **kwargs)[source]
save_pretrained(tokenizer_config_dir)[source]
property vocab_size
set_vocab(items: ~typing.Tuple[~typing.List[str], ~typing.List[dict]], key=<function TokenizerForHuggingface.<lambda>>, lower=False, trim_min_count: int = 1, do_tokenize: bool = True)[source]
Parameters
  • items (list) – can be the list of str, or list of dict

  • key (function) – determine how to get the text of each item

  • trim_min_count (int, optional) – the lower bound number for adding a word into vocabulary, by default 1

  • do_tokenize (bool, optional) – whether to tokenize items before updating the vocabulary, by default True

add_specials(added_spectials: List[str])[source]
add_tokens(added_tokens: List[str])[source]
EduNLP.Pretrain.finetune_bert(items: Union[List[dict], List[str]], output_dir: str, pretrained_model='bert-base-chinese', tokenizer_params=None, data_params=None, model_params=None, train_params=None)[source]
Parameters
  • items (list, required) – The training corpus, each item could be str or dict

  • output_dir (str, required) – The directory to save trained model files

  • pretrained_model (str, optional) – The pretrained model name or path for model and tokenizer

  • eval_items (list, required) – The evaluating items, each item could be str or dict

  • tokenizer_params (dict, optional, default=None) – The parameters passed to BertTokenizer

  • data_params (dict, optional, default=None) – The parameters passed to BertDataset and BertTokenizer

  • model_params (dict, optional, default=None) – The parameters passed to Trainer

  • train_params (dict, optional, default=None) –

Examples

>>> stems = ["有公式$\FormFigureID{wrong1?}$,如图$\FigureID{088f15ea-xxx}$",
... "有公式$\FormFigureID{wrong1?}$,如图$\FigureID{088f15ea-xxx}$"]
>>> finetune_bert(stems, "examples/test_model/data/data/bert") 
{'train_runtime': ..., ..., 'epoch': 1.0}
EduNLP.Pretrain.finetune_bert_for_knowledge_prediction(train_items, output_dir, pretrained_model='bert-base-chinese', eval_items=None, tokenizer_params=None, data_params=None, train_params=None, model_params=None)[source]
Parameters
  • train_items (list, required) – The training corpus, each item could be str or dict

  • output_dir (str, required) – The directory to save trained model files

  • pretrained_model (str, optional) – The pretrained model name or path for model and tokenizer

  • eval_items (list, optional) – The evaluating items, each item could be str or dict

  • tokenizer_params (dict, optional, default=None) – The parameters passed to BertTokenizer

  • data_params (dict, optional, default=None) – The parameters passed to BertDataset and BertTokenizer

  • model_params (dict, optional, default=None) – The parameters passed to Trainer

  • train_params (dict, optional, default=None) –

EduNLP.Pretrain.finetune_bert_for_property_prediction(train_items, output_dir, pretrained_model='bert-base-chinese', eval_items=None, tokenizer_params=None, data_params=None, train_params=None, model_params=None)[source]
Parameters
  • train_items (list, required) – The training corpus, each item could be str or dict

  • output_dir (str, required) – The directory to save trained model files

  • pretrained_model (str, optional) – The pretrained model name or path for model and tokenizer

  • eval_items (list, optional) – The evaluating items, each item could be str or dict

  • tokenizer_params (dict, optional, default=None) – The parameters passed to BertTokenizer

  • data_params (dict, optional, default=None) – The parameters passed to BertDataset and BertTokenizer

  • model_params (dict, optional, default=None) – The parameters passed to Trainer

  • train_params (dict, optional, default=None) –

EduNLP.Pretrain.get_tokenizer(name, *args, **kwargs)[source]

A unified interface for obtaining different tokenizers.

Parameters
  • name (str) – the name of the tokenizer, e.g. text, pure_text.

  • args – the positional parameters passed to the tokenizer

  • kwargs – the keyword parameters passed to the tokenizer

Returns

tokenizer

Return type

Tokenizer

Examples

>>> items = ["已知集合$A=\\left\\{x \\mid x^{2}-3 x-4<0\\right\\}, \\quad B=\\{-4,1,3,5\\}, \\quad$ 则 $A \\cap B=$"]
>>> tokenizer = get_tokenizer("pure_text")
>>> tokens = tokenizer(items)
>>> next(tokens)  
['已知', '集合', 'A', '=', '\\left', '\\{', 'x', '\\mid', 'x', '^', '{', '2', '}', '-', '3', 'x', '-', '4', '<',
'0', '\\right', '\\}', ',', '\\quad', 'B', '=', '\\{', '-', '4', ',', '1', ',', '3', ',', '5', '\\}', ',',
'\\quad', 'A', '\\cap', 'B', '=']
EduNLP.Pretrain.train_disenqnet(train_items: List[dict], output_dir: str, pretrained_dir: Optional[str] = None, eval_items=None, tokenizer_params=None, data_params=None, model_params=None, train_params=None, w2v_params=None)[source]
Parameters
  • train_items (List[dict]) – _description_

  • output_dir (str) – _description_

  • pretrained_dir (str, optional) – _description_, by default None

  • tokenizer_params (_type_, optional) – _description_, by default None

  • data_params (_type_, optional) – _description_, by default None

  • model_params (_type_, optional) – _description_, by default None

  • train_params (_type_, optional) – _description_, by default None

EduNLP.Pretrain.train_elmo(items: Union[List[dict], List[str]], output_dir: str, pretrained_dir: Optional[str] = None, tokenizer_params=None, data_params=None, model_params=None, train_params=None)[source]
Parameters
  • items (list, required) – The training corpus, each item could be str or dict

  • output_dir (str, required) – The directory to save trained model files

  • pretrained_dir (str, optional) – The pretrained directory for model and tokenizer

  • tokenizer_params (dict, optional, default=None) – The parameters passed to ElmoTokenizer

  • data_params (dict, optional, default=None) – The parameters passed to ElmoDataset and ElmoTokenizer:

    • stem_key

    • label_key

  • model_params (dict, optional, default=None) – The parameters passed to Trainer

  • train_params (dict, optional, default=None) –

EduNLP.Pretrain.train_elmo_for_knowledge_prediction(train_items: list, output_dir: str, pretrained_dir=None, eval_items=None, tokenizer_params=None, data_params=None, train_params=None, model_params=None)[source]
Parameters
  • train_items (list, required) – The training items, each item could be str or dict

  • output_dir (str, required) – The directory to save trained model files

  • pretrained_dir (str, optional) – The pretrained directory for model and tokenizer

  • eval_items (list, optional) – The evaluating items, each item could be str or dict

  • tokenizer_params (dict, optional, default=None) – The parameters passed to ElmoTokenizer

  • data_params (dict, optional, default=None) – The parameters passed to ElmoDataset and ElmoTokenizer

  • model_params (dict, optional, default=None) – The parameters passed to Trainer

  • train_params (dict, optional, default=None) –

EduNLP.Pretrain.train_elmo_for_property_prediction(train_items: list, output_dir: str, pretrained_dir=None, eval_items=None, tokenizer_params=None, data_params=None, train_params=None, model_params=None)[source]
Parameters
  • train_items (list, required) – The training items, each item could be str or dict

  • output_dir (str, required) – The directory to save trained model files

  • pretrained_dir (str, optional) – The pretrained directory for model and tokenizer

  • eval_items (list, optional) – The evaluating items, each item could be str or dict

  • tokenizer_params (dict, optional, default=None) – The parameters passed to ElmoTokenizer

  • data_params (dict, optional, default=None) – The parameters passed to ElmoDataset and ElmoTokenizer

  • model_params (dict, optional, default=None) – The parameters passed to Trainer

  • train_params (dict, optional, default=None) –

EduNLP.Tokenizer

class EduNLP.Tokenizer.AstFormulaTokenizer(symbol='gmas', figures=None, **kwargs)[source]
class EduNLP.Tokenizer.CharTokenizer(stop_words='punctuations', **kwargs)[source]
class EduNLP.Tokenizer.CustomTokenizer(symbol='gmas', figures=None, **kwargs)[source]
class EduNLP.Tokenizer.PureTextTokenizer(handle_figure_formula='skip', **kwargs)[source]
class EduNLP.Tokenizer.SpaceTokenizer(stop_words='punctuations', **kwargs)[source]

Tokenize text by space, e.g. “题目 内容” -> [“题目”, “内容”]

Parameters

stop_words (str, optional) – stop words to skip, by default “punctuations”
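
A minimal sketch of space tokenization with stop-word filtering is shown below. The function `space_tokenize` and the small `STOP_WORDS` set are hypothetical; the set is only an illustrative stand-in and not the library's “punctuations” list.

```python
from typing import Iterable, List

# Assumption: a tiny illustrative stop-word set (ASCII and full-width punctuation).
STOP_WORDS = frozenset({",", ".", "?", "!", "\uff0c", "\u3002", "\uff1f", "\uff01"})


def space_tokenize(text: str, stop_words: Iterable[str] = STOP_WORDS) -> List[str]:
    """Split text on whitespace and drop tokens that are stop words."""
    stop = set(stop_words)
    return [tok for tok in text.split() if tok not in stop]


tokens = space_tokenize("题目 , 内容")
```

With this stop-word set, the standalone comma is dropped and the two words are kept.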

class EduNLP.Tokenizer.Tokenizer[source]

Iterator generator for tokenization

EduNLP.Tokenizer.get_tokenizer(name, *args, **kwargs)[source]

A unified interface for obtaining different tokenizers.

Parameters
  • name (str) – the name of the tokenizer, e.g. text, pure_text.

  • args – the positional parameters passed to the tokenizer

  • kwargs – the keyword parameters passed to the tokenizer

Returns

tokenizer

Return type

Tokenizer

Examples

>>> items = ["已知集合$A=\\left\\{x \\mid x^{2}-3 x-4<0\\right\\}, \\quad B=\\{-4,1,3,5\\}, \\quad$ 则 $A \\cap B=$"]
>>> tokenizer = get_tokenizer("pure_text")
>>> tokens = tokenizer(items)
>>> next(tokens)  
['已知', '集合', 'A', '=', '\\left', '\\{', 'x', '\\mid', 'x', '^', '{', '2', '}', '-', '3', 'x', '-', '4', '<',
'0', '\\right', '\\}', ',', '\\quad', 'B', '=', '\\{', '-', '4', ',', '1', ',', '3', ',', '5', '\\}', ',',
'\\quad', 'A', '\\cap', 'B', '=']
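
The name-based lookup, and the fact that calling a tokenizer yields a generator (hence `next(tokens)` in the example above), can be sketched as a small factory. `SketchTokenizer`, `_REGISTRY` and `get_sketch_tokenizer` are hypothetical names, and the two registered methods are simplified stand-ins rather than the library's tokenizers.

```python
from typing import Callable, Dict, Iterable, Iterator, List


class SketchTokenizer:
    """Hypothetical tokenizer: calling it lazily yields one token list per item."""

    def __init__(self, split: Callable[[str], List[str]]):
        self._split = split

    def __call__(self, items: Iterable[str]) -> Iterator[List[str]]:
        for item in items:
            yield self._split(item)


# Hypothetical registry mapping a name to a constructor.
_REGISTRY: Dict[str, Callable[..., SketchTokenizer]] = {
    "space": lambda: SketchTokenizer(str.split),
    "char": lambda: SketchTokenizer(lambda text: [c for c in text if not c.isspace()]),
}


def get_sketch_tokenizer(name: str, *args, **kwargs) -> SketchTokenizer:
    """Look up a tokenizer by name and forward any extra arguments to it."""
    if name not in _REGISTRY:
        raise KeyError(f"unknown tokenizer: {name!r}")
    return _REGISTRY[name](*args, **kwargs)


tokenizer = get_sketch_tokenizer("space")
tokens = tokenizer(["x + y", "a b"])
first = next(tokens)  # tokens of the first item only; the rest are produced on demand
```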

Vector

class EduNLP.Vector.W2V(filepath, method=None, binary=None)[source]

This class uses the gensim library, which provides the FastText, Word2Vec and KeyedVectors methods, to convert words to vectors.

Parameters
  • filepath – path to the pretrained model file

  • method (str) – fasttext, or other (Word2Vec)

  • binary (bool) –

key_to_index(word)[source]
property vectors
property vector_size
infer_vector(items, agg='mean', *args, **kwargs) list[source]
infer_tokens(items, *args, **kwargs) list[source]
class EduNLP.Vector.D2V(filepath, method='d2v')[source]

A collection that includes the d2v, bow and tfidf methods.

Parameters
  • filepath

  • method (str) – one of d2v, bow, tfidf

  • item

Returns

d2v model

Return type

D2V

property vector_size
infer_vector(items, *args, **kwargs) list[source]
infer_tokens(item, *args, **kwargs) ...[source]
class EduNLP.Vector.BowLoader(filepath)[source]

Uses the doc2bow model to convert documents into bag-of-words vectors.

Convert document (a list of words) into the bag-of-words format = list of (token_id, token_count) 2-tuples. Each word is assumed to be a tokenized and normalized string (either unicode or utf8-encoded). No further preprocessing is done on the words in document; apply tokenization, stemming etc. before calling this method.

If allow_update is set, then also update dictionary in the process: create ids for new words. At the same time, update document frequencies – for each word appearing in this document, increase its document frequency (self.dfs) by one.

If allow_update is not set, this function is const, aka read-only.
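
The conversion described above can be sketched in plain Python. This `doc2bow` is a simplified, hypothetical re-implementation for illustration and not the gensim method: it uses a plain `token2id` dict, assumes ids are contiguous starting at 0, and omits the document-frequency bookkeeping.

```python
from collections import Counter
from typing import Dict, List, Tuple


def doc2bow(document: List[str], token2id: Dict[str, int],
            allow_update: bool = False) -> List[Tuple[int, int]]:
    """Return sorted (token_id, token_count) pairs for the known tokens."""
    counts = Counter(document)
    if allow_update:
        # Create ids for new words; existing ids are left untouched.
        for token in sorted(counts):
            token2id.setdefault(token, len(token2id))
    return sorted((token2id[t], n) for t, n in counts.items() if t in token2id)


token2id = {"x": 0, "y": 1}
read_only = doc2bow(["x", "z", "x"], token2id)                    # "z" is ignored
updated = doc2bow(["x", "z", "x"], token2id, allow_update=True)  # "z" gets id 2
```

Without `allow_update`, the mapping is left unchanged and unknown words are skipped; with it, unknown words are assigned the next free id.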

infer_vector(item, return_vec=False)[source]
property vector_size
class EduNLP.Vector.TfidfLoader(filepath)[source]

This module implements functionality related to the Term Frequency - Inverse Document Frequency <https://en.wikipedia.org/wiki/Tf%E2%80%93idf> vector space bag-of-words models.

infer_vector(item, return_vec=False)[source]
property vector_size
class EduNLP.Vector.RNNModel(rnn_type, w2v: (<class 'EduNLP.Vector.gensim_vec.W2V'>, <class 'tuple'>, <class 'list'>, <class 'dict'>, None), hidden_size, freeze_pretrained=True, model_params=None, device=None, **kwargs)[source]

Examples

>>> model = RNNModel("BiLSTM", None, 2, vocab_size=4, embedding_dim=3)
>>> seq_idx = [[1, 2, 3], [1, 2, 0], [3, 0, 0]]
>>> output, hn = model(seq_idx, indexing=False, padding=False)
>>> seq_idx = [[1, 2, 3], [1, 2], [3]]
>>> output, hn = model(seq_idx, indexing=False, padding=True)
>>> output.shape
torch.Size([3, 3, 4])
>>> hn.shape
torch.Size([2, 3, 2])
>>> tokens = model.infer_tokens(seq_idx, indexing=False)
>>> tokens.shape
torch.Size([3, 3, 4])
>>> tokens = model.infer_tokens(seq_idx, agg="mean", indexing=False)
>>> tokens.shape
torch.Size([3, 4])
>>> item = model.infer_vector(seq_idx, indexing=False)
>>> item.shape
torch.Size([3, 4])
>>> item = model.infer_vector(seq_idx, agg="mean", indexing=False)
>>> item.shape
torch.Size([3, 2])
>>> item = model.infer_vector(seq_idx, agg=None, indexing=False)
>>> item.shape
torch.Size([2, 3, 2])
infer_vector(items, agg: (<class 'int'>, <class 'str'>, None) = -1, indexing=True, padding=True, *args, **kwargs) Tensor[source]
infer_tokens(items, agg=None, *args, **kwargs) Tensor[source]
property vector_size: int
set_device(device)[source]
save(filepath, save_embedding=False)[source]
freeze(*args, **kwargs)[source]
property is_frozen
eval()[source]
train(mode=True)[source]
class EduNLP.Vector.T2V(model: str, *args, **kwargs)[source]

This class converts token lists to vectors. If you already have a model, you can use T2V directly. Otherwise, calling get_pretrained_t2v is a better way to obtain vectors, since it does not require a model of your own.

Parameters

model (str) – select the model type e.g.: d2v, rnn, lstm, gru, elmo, etc.

Examples

>>> item = [{'ques_content':'有公式$\FormFigureID{wrong1?}$和公式$\FormFigureBase64{wrong2?}$,\
... 如图$\FigureID{088f15ea-8b7c-11eb-897e-b46bfc50aa29}$,若$x,y$满足约束条件$\SIFSep$,则$z=x+7 y$的最大值为$\SIFBlank$'}]
>>> model_dir = "examples/test_model/d2v"
>>> url, model_name, *args = get_pretrained_model_info('d2v_test_256')
>>> (); path = get_data(url, model_dir); () 
(...)
>>> path = path_append(path, os.path.basename(path) + '.bin', to_str=True)
>>> t2v = T2V('d2v',filepath=path)
>>> print(t2v(item))
[array([...dtype=float32)]
infer_vector(items, *args, **kwargs)[source]
infer_tokens(items, *args, **kwargs)[source]
property vector_size: int
EduNLP.Vector.get_pretrained_t2v(name, model_dir='/home/docs/.EduNLP/model', **kwargs)[source]

A convenient way to convert token lists to vectors.

Parameters
  • name (str) – select the pretrained model, e.g.: d2v_math_300, w2v_math_300, elmo_math_2048, bert_math_768, bert_taledu_768, disenq_math_256, quesnet_math_512

  • model_dir (str) – the directory where the model is stored; default: MODEL_DIR = '~/.EduNLP/model'

Returns

t2v model

Return type

T2V

Examples

>>> item = [{'ques_content':'有公式$\FormFigureID{wrong1?}$和公式$\FormFigureBase64{wrong2?}$,'
...     '如图$\FigureID{088f15ea-8b7c-11eb-897e-b46bfc50aa29}$,若$x,y$满足约束条件$\SIFSep$,则$z=x+7 y$的最大值为$\SIFBlank$'}]
>>> t2v = get_pretrained_t2v("d2v_test_256", "examples/test_model/d2v")
>>> print(t2v(item))
[array([...dtype=float32)]
EduNLP.Vector.get_pretrained_model_info(name)[source]
EduNLP.Vector.get_all_pretrained_models()[source]
class EduNLP.Vector.Embedding(w2v: (<class 'EduNLP.Vector.gensim_vec.W2V'>, <class 'tuple'>, <class 'list'>, <class 'dict'>, None), freeze=True, device=None, **kwargs)[source]
infer_token_vector(items: List[List[str]], indexing=True) tuple[source]
indexing(items: List[List[str]], padding=False, indexing=True) tuple[source]
Parameters
  • items (list of list of str(word/token)) –

  • padding (bool) – whether to pad the returned lists with the default pad_val so that all items have the same length

  • indexing (bool) –

Returns

  • token_idx (list of list of int) – the token indices of each item

  • token_len (list of int) – the number of tokens in each item
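
The indexing-and-padding behaviour described above can be illustrated with a minimal pure-Python sketch. It is not the library's implementation: the function name index_tokens, the vocab dictionary, and the pad_val and unk_val defaults are hypothetical, and the sketch assumes token_len reports lengths before padding.

```python
from typing import Dict, List, Tuple


def index_tokens(
    items: List[List[str]],
    vocab: Dict[str, int],
    padding: bool = False,
    pad_val: int = 0,   # hypothetical padding id
    unk_val: int = 1,   # hypothetical id for out-of-vocabulary tokens
) -> Tuple[List[List[int]], List[int]]:
    """Map tokens to integer ids and optionally right-pad to a common length."""
    # Look up every token; unknown tokens fall back to unk_val.
    token_idx = [[vocab.get(tok, unk_val) for tok in item] for item in items]
    # Record the unpadded length of each item (an assumption of this sketch).
    token_len = [len(idx) for idx in token_idx]
    if padding and token_idx:
        max_len = max(token_len)
        token_idx = [idx + [pad_val] * (max_len - len(idx)) for idx in token_idx]
    return token_idx, token_len


vocab = {"<pad>": 0, "<unk>": 1, "x": 2, "y": 3, "+": 4}
token_idx, token_len = index_tokens([["x", "+", "y"], ["x", "z"]], vocab, padding=True)
print(token_idx)  # [[2, 4, 3], [2, 1, 0]]
print(token_len)  # [3, 2]
```

Keeping the unpadded lengths alongside the padded indices lets a downstream model ignore the padding positions.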

set_device(device)[source]
class EduNLP.Vector.BertModel(pretrained_model)[source]

Examples

>>> from EduNLP.Pretrain import BertTokenizer
>>> tokenizer = BertTokenizer("bert-base-chinese", add_special_tokens=False)
>>> model = BertModel("bert-base-chinese")
>>> item = ["有公式$\FormFigureID{wrong1?}$,如图$\FigureID{088f15ea-xxx}$,若$x,y$满足约束",
... "有公式$\FormFigureID{wrong1?}$,如图$\FigureID{088f15ea-xxx}$,若$x,y$满足约束"]
>>> inputs = tokenizer(item, return_tensors='pt')
>>> output = model(inputs)
>>> output.shape
torch.Size([2, 14, 768])
>>> tokens = model.infer_tokens(inputs)
>>> tokens.shape
torch.Size([2, 12, 768])
>>> tokens = model.infer_tokens(inputs, return_special_tokens=True)
>>> tokens.shape
torch.Size([2, 14, 768])
>>> item = model.infer_vector(inputs)
>>> item.shape
torch.Size([2, 768])
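
The shapes in the example above are consistent with infer_tokens dropping two positions per item (presumably the [CLS] and [SEP] tokens) and with infer_vector reducing each item to a single vector. The following pure-Python sketch illustrates that idea on nested lists. The helper names are hypothetical, and the masked-average variant is shown only as a common alternative to 'CLS' pooling; the text above does not confirm that the library offers it.

```python
from typing import List

Vector = List[float]


def strip_special_tokens(token_vectors: List[Vector]) -> List[Vector]:
    """Drop the first and last positions (assumed to be [CLS] and [SEP])."""
    return token_vectors[1:-1]


def cls_pooling(token_vectors: List[Vector]) -> Vector:
    """Use the vector at position 0 (the [CLS] token) as the item vector."""
    return token_vectors[0]


def average_pooling(token_vectors: List[Vector], attention_mask: List[int]) -> Vector:
    """Average the vectors whose attention-mask entry is non-zero."""
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m]
    dim = len(token_vectors[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


# One item with four positions: [CLS], two word tokens, [SEP]; hidden size 2.
seq = [[1.0, 0.0], [2.0, 2.0], [4.0, 6.0], [0.0, 1.0]]
print(len(strip_special_tokens(seq)))      # 2
print(cls_pooling(seq))                    # [1.0, 0.0]
print(average_pooling(seq, [1, 1, 1, 1]))  # [1.75, 2.25]
```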
infer_vector(items: dict, pooling_strategy='CLS') Tensor[source]
infer_tokens(items: dict, return_special_tokens=False) Tensor[source]
property vector_size
class EduNLP.Vector.QuesNetModel(pretrained_dir, img_dir=None, device='cpu', **kwargs)[source]
infer_vector(items: Union[dict, list]) Tensor[source]

Get the question embedding with QuesNet.

Parameters

items – the encodings produced by the tokenizer

infer_tokens(items: Union[dict, list]) Tensor[source]

Get the token embeddings with QuesNet.

Parameters

items – the encodings produced by the tokenizer

Returns

word_embs + meta_emb

Return type

torch.Tensor

property vector_size
class EduNLP.Vector.DisenQModel(pretrained_dir, device='cpu')[source]
infer_vector(items: dict, vector_type=None, **kwargs) Tensor[source]
Parameters

vector_type (str) – the type of item tensor to return. The default is None, which returns both (k_hidden, i_hidden); when vector_type="k", return k_hidden; when vector_type="i", return i_hidden.
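
The selection rule for vector_type can be sketched as a small dispatch function. This is an illustration only: select_hidden is a hypothetical helper, plain lists stand in for tensors, and raising ValueError on an unrecognised value is an assumption rather than documented behaviour.

```python
from typing import List, Optional, Tuple, Union

Hidden = List[float]


def select_hidden(
    k_hidden: Hidden,
    i_hidden: Hidden,
    vector_type: Optional[str] = None,
) -> Union[Hidden, Tuple[Hidden, Hidden]]:
    """Return both hidden states, or only the one named by vector_type."""
    if vector_type is None:
        return k_hidden, i_hidden
    if vector_type == "k":
        return k_hidden
    if vector_type == "i":
        return i_hidden
    # Assumption of this sketch: any other value is rejected.
    raise ValueError(f"vector_type must be None, 'k' or 'i', got {vector_type!r}")


k, i = [0.1, 0.2], [0.3, 0.4]
print(select_hidden(k, i))       # ([0.1, 0.2], [0.3, 0.4])
print(select_hidden(k, i, "k"))  # [0.1, 0.2]
print(select_hidden(k, i, "i"))  # [0.3, 0.4]
```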

infer_tokens(items: dict, **kwargs) Tensor[source]
property vector_size
class EduNLP.Vector.ElmoModel(pretrained_dir: str)[source]
infer_vector(items: Tuple[dict, List[dict]], *args, **kwargs) Tensor[source]
infer_tokens(items, *args, **kwargs) Tensor[source]
property vector_size

Pipeline

EduNLP.Pipeline.pipeline(task: Optional[str] = None, model: Optional[Union[BaseModel, str]] = None, tokenizer: Optional[PretrainedEduTokenizer] = None, pipeline_class: Optional[Pipeline] = None, preprocess: Optional[List] = None, **kwargs)[source]
Parameters
  • task (str, required) –

  • model (BaseModel or str, optional) –

  • tokenizer (PretrainedEduTokenizer, optional) –

  • pipeline_class (Pipeline, optional) – to specify Pipeline class

  • preprocess (list, optional) – a list of names of pre-process pipes

Examples

>>> processor = pipeline(task="property-prediction") 
>>> item = "如图所示,则三角形ABC的面积是_。"
>>> processor(item)