EduNLP.Pretrain

EduNLP.Pretrain.pretrian_utils

class EduNLP.Pretrain.pretrian_utils.EduVocab(vocab_path: Optional[str] = None, corpus_items: Optional[List[str]] = None, bos_token: str = '[BOS]', eos_token: str = '[EOS]', pad_token: str = '[PAD]', unk_token: str = '[UNK]', specials: Optional[List[str]] = None, lower: bool = False, trim_min_count: int = 1, **kwargs)[source]

The vocabulary container for a corpus.

Parameters:
  • vocab_path (str, optional) – vocabulary path to initialize this container, by default None

  • corpus_items (List[str], optional) – corpus items to update this vocabulary, by default None

  • bos_token (str, optional) – token representing the start of a sentence, by default “[BOS]”

  • eos_token (str, optional) – token representing the end of a sentence, by default “[EOS]”

  • pad_token (str, optional) – token used for padding, by default “[PAD]”

  • unk_token (str, optional) – token representing an unknown word, by default “[UNK]”

  • specials (List[str], optional) – special tokens in the vocabulary, by default None

  • lower (bool, optional) – whether to lowercase the corpus items, by default False

  • trim_min_count (int, optional) – the minimum number of occurrences required for a word to be added to the vocabulary, by default 1

property vocab_size
property special_tokens
property tokens
to_idx(token)[source]

convert token to index

to_token(idx)[source]

convert index to token

convert_sequence_to_idx(tokens, bos=False, eos=False)[source]

convert sentence of tokens to sentence of indices

convert_sequence_to_token(idxs, **kwargs)[source]

convert sentence of indices to sentence of tokens
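The conversions above can be illustrated with a minimal sketch. The `TinyVocab` class below is a hypothetical stand-in written for this example, not EduNLP's implementation; it assumes that special tokens occupy the first indices and that unknown tokens fall back to the index of “[UNK]”.

```python
from typing import Dict, List


class TinyVocab:
    """Hypothetical stand-in for a vocabulary container (not EduVocab itself)."""

    def __init__(self, tokens: List[str]):
        # Assumed ordering: special tokens first, then corpus tokens.
        self.specials = ["[PAD]", "[UNK]", "[BOS]", "[EOS]"]
        self.idx_to_token: List[str] = self.specials + list(tokens)
        self.token_to_idx: Dict[str, int] = {
            tok: i for i, tok in enumerate(self.idx_to_token)
        }

    def to_idx(self, token: str) -> int:
        # Assumption: unknown tokens map to the index of "[UNK]".
        return self.token_to_idx.get(token, self.token_to_idx["[UNK]"])

    def to_token(self, idx: int) -> str:
        return self.idx_to_token[idx]

    def convert_sequence_to_idx(self, tokens, bos=False, eos=False):
        ids = [self.to_idx(t) for t in tokens]
        if bos:
            ids = [self.token_to_idx["[BOS]"]] + ids
        if eos:
            ids = ids + [self.token_to_idx["[EOS]"]]
        return ids

    def convert_sequence_to_token(self, idxs):
        return [self.to_token(i) for i in idxs]


vocab = TinyVocab(["x", "y", "="])
# "z" is not in the vocabulary, so it is encoded as "[UNK]".
ids = vocab.convert_sequence_to_idx(["x", "=", "z"], bos=True, eos=True)
tokens = vocab.convert_sequence_to_token(ids)
```

Note that the round trip is lossy for out-of-vocabulary tokens: once a token has been mapped to the unknown index, the original string cannot be recovered.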

set_vocab(corpus_items: List[str], lower: bool = False, trim_min_count: int = 1, silent=True)[source]

Update the vocabulary with the tokens in corpus items

Parameters:
  • corpus_items (List[str]) – corpus items to update this vocabulary

  • lower (bool, optional) – whether to lowercase the corpus items, by default False

  • trim_min_count (int, optional) – the minimum number of occurrences required for a word to be added to the vocabulary, by default 1
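As a rough sketch of what lowercasing and frequency trimming mean here, the hypothetical `build_vocab` helper below counts tokens and keeps those seen at least `trim_min_count` times. It assumes each corpus item is already a list of tokens; the function name and the first-seen ordering are choices made for this illustration, not EduNLP behavior.

```python
from collections import Counter
from typing import List


def build_vocab(corpus_items: List[List[str]], lower: bool = False,
                trim_min_count: int = 1) -> List[str]:
    """Hypothetical helper: collect tokens that occur often enough."""
    counts = Counter()
    for item in corpus_items:
        for token in item:
            counts[token.lower() if lower else token] += 1
    # Keep tokens seen at least `trim_min_count` times, in first-seen order.
    return [tok for tok, n in counts.items() if n >= trim_min_count]


corpus = [["X", "=", "y"], ["x", "+", "y"]]
kept_all = build_vocab(corpus)
kept_frequent = build_vocab(corpus, lower=True, trim_min_count=2)
```

With `lower=True`, “X” and “x” are counted as the same word, which is why both reach the threshold of 2 in the second call.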

load_vocab(vocab_path: str)[source]

Load the vocabulary from vocab_file

Parameters:

vocab_path (str) – path of the vocabulary file to load

save_vocab(vocab_path: str)[source]

Save the vocabulary into vocab_file

Parameters:

vocab_path (str) – path to save vocabulary file

add_specials(tokens: List[str])[source]

Add special tokens into vocabulary

add_tokens(tokens: List[str])[source]

Add tokens into vocabulary

class EduNLP.Pretrain.pretrian_utils.EduDataset(tokenizer, ds_disk_path: Optional[Dataset] = None, items: Optional[Union[List[dict], List[str]]] = None, stem_key: str = 'text', label_key: Optional[str] = None, feature_keys: Optional[List[str]] = None, num_processor: Optional[int] = None, **kwargs)[source]

The base class that implements a Dataset. It wraps datasets.Dataset and adds conveniences such as parallel preprocessing and offline loading.

Parameters:
  • tokenizer – PretrainedEduTokenizer or model-specific Pretrained Tokenizer

  • ds_disk_path (HFDataset, optional) – the path on disk where the dataset used by datasets.Dataset is saved, by default None

  • items (Union[List[dict], List[str]], optional) – input items to process, by default None

  • stem_key (str, optional) – the key of the item content to process, by default “text”

  • label_key (Optional[str], optional) – the key of the item labels to process, by default None

  • feature_keys (Optional[List[str]], optional) – the keys of additional item features to keep, by default None

  • num_processor (int, optional) – specifies the number of CPUs used for parallel speedup, by default None

ds

Note: map will break down for very large data (greater than 4 GB).

to_disk(ds_disk_path)[source]

Save the processed dataset into local files

collect_fn()[source]
class EduNLP.Pretrain.pretrian_utils.PretrainedEduTokenizer(vocab_path: Optional[str] = None, max_length: int = 250, tokenize_method: str = 'pure_text', add_specials: Tuple[list, bool] = False, **kwargs)[source]

This base class is in charge of preparing the inputs for a model

Parameters:
  • vocab_path (str, optional) – path of the vocabulary file used to initialize the tokenizer, by default None

  • max_length (int, optional) – sentences longer than max_length are clipped, by default 250

  • tokenize_method (str, optional) – by default “pure_text”. When the text is already separated by spaces, use “space”; when the text is a raw string, use a Tokenizer defined in get_tokenizer(), such as “pure_text” or “text”

  • add_specials (Tuple[list, bool], optional) – by default False. For a bool, whether to add EDU_SPYMBOLS to the vocabulary; for a list, the special tokens to add besides EDU_SPYMBOLS

tokenize(items: ~typing.Tuple[list, str, dict], key=<function PretrainedEduTokenizer.<lambda>>, **kwargs)[source]
Parameters:
  • items (list or str or dict) – the question items

  • key (function) – determine how to get the text of each item

Returns:

tokens – the tokens of the items

Return type:

list
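To make the role of `key` concrete, here is a small sketch of how a key function can pull the text out of items that are either plain strings or dicts. The `extract_texts` helper and its identity default are assumptions made for illustration; the documented default is only shown as a lambda.

```python
from typing import Callable, List, Union

Item = Union[str, dict]


def extract_texts(items: Union[Item, List[Item]],
                  key: Callable = lambda x: x) -> List[str]:
    """Hypothetical helper: apply `key` to one item or to a list of items."""
    # Accept a single item as well as a list of items.
    if isinstance(items, (str, dict)):
        items = [items]
    return [key(item) for item in items]


# Plain strings need no key; dict items need a key that selects the text field.
texts_from_str = extract_texts("x + y = 2")
texts_from_dict = extract_texts(
    [{"content": "甲 数 除以 乙 数"}, {"content": "x + y"}],
    key=lambda x: x["content"],
)
```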

encode(items: ~typing.Tuple[str, dict, ~typing.List[str], ~typing.List[dict]], key=<function PretrainedEduTokenizer.<lambda>>, **kwargs)[source]
decode(token_ids: list, key=<function PretrainedEduTokenizer.<lambda>>, **kwargs)[source]
classmethod from_pretrained(tokenizer_config_dir: str, **kwargs)[source]

Load tokenizer from local files

Parameters:

tokenizer_config_dir: str

The dir path containing tokenizer_config.json and vocab.list

save_pretrained(tokenizer_config_dir: str)[source]

Save tokenizer into local files

Parameters:

tokenizer_config_dir: str

The directory in which tokenizer parameters are saved to tokenizer_config.json and words are saved to vocab.list
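The save and load pair can be pictured as a round trip through two files in one directory. The sketch below uses only the standard library and the two file names mentioned above; the helper names, the config keys, and the one-word-per-line layout of vocab.list are assumptions for illustration rather than the library's actual on-disk format.

```python
import json
import os
import tempfile
from typing import List, Tuple


def save_pretrained_sketch(config: dict, words: List[str],
                           tokenizer_config_dir: str) -> None:
    """Hypothetical helper: write the config and the vocabulary side by side."""
    os.makedirs(tokenizer_config_dir, exist_ok=True)
    config_path = os.path.join(tokenizer_config_dir, "tokenizer_config.json")
    with open(config_path, "w", encoding="utf-8") as f:
        json.dump(config, f, ensure_ascii=False, indent=2)
    vocab_path = os.path.join(tokenizer_config_dir, "vocab.list")
    with open(vocab_path, "w", encoding="utf-8") as f:
        # Assumed layout: one word per line.
        f.write("\n".join(words))


def from_pretrained_sketch(tokenizer_config_dir: str) -> Tuple[dict, List[str]]:
    """Hypothetical helper: read back what save_pretrained_sketch wrote."""
    config_path = os.path.join(tokenizer_config_dir, "tokenizer_config.json")
    with open(config_path, encoding="utf-8") as f:
        config = json.load(f)
    vocab_path = os.path.join(tokenizer_config_dir, "vocab.list")
    with open(vocab_path, encoding="utf-8") as f:
        words = f.read().splitlines()
    return config, words


config = {"max_length": 250, "tokenize_method": "pure_text"}
words = ["[PAD]", "[UNK]", "公式", "x"]
with tempfile.TemporaryDirectory() as tmp_dir:
    save_pretrained_sketch(config, words, tmp_dir)
    restored_config, restored_words = from_pretrained_sketch(tmp_dir)
```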

property vocab_size
set_vocab(items: list, key=<function PretrainedEduTokenizer.<lambda>>, lower: bool = False, trim_min_count: int = 1, do_tokenize: bool = True)[source]

Update the vocabulary with the tokens in corpus items

Parameters:
  • items (list) – can be the list of str, or list of dict

  • key (function, optional) – determine how to get the text of each item

  • lower (bool, optional) – whether to lowercase the corpus items, by default False

  • trim_min_count (int, optional) – the minimum number of occurrences required for a word to be added to the vocabulary, by default 1

  • do_tokenize (bool, optional) – whether to tokenize items before updating the vocabulary, by default True

Returns:

token_items

Return type:

list

add_specials(tokens)[source]

Add special tokens into vocabulary

add_tokens(tokens)[source]

Add tokens into vocabulary

EduNLP.Pretrain.hugginface_utils

class EduNLP.Pretrain.hugginface_utils.TokenizerForHuggingface(pretrained_model='bert-base-chinese', max_length=512, tokenize_method: str = 'pure_text', add_specials: Union[List[str], bool] = False, **kwargs)[source]

Parameters

pretrained_model:

used pretrained model

add_specials:

Whether to add tokens like [FIGURE], [TAG], etc.

tokenize_method:

Which text tokenizer to use. Must be consistent with TOKENIZER dictionary.

Examples

>>> tokenizer = TokenizerForHuggingface(add_special_tokens=True)
>>> item = "有公式$\FormFigureID{wrong1?}$,如图$\FigureID{088f15ea-xxx}$,\
... 若$x,y$满足约束条件公式$\FormFigureBase64{wrong2?}$,$\SIFSep$,则$z=x+7 y$的最大值为$\SIFBlank$"
>>> token_item = tokenizer(item)
>>> print(token_item.input_ids[:10])
tensor([[ 101, 1062, 2466, 1963, 1745,  138,  100,  140,  166,  117,  167, 5276,
         3338, 3340,  816, 1062, 2466,  102,  168,  134,  166,  116,  128,  167,
         3297, 1920,  966,  138,  100,  140,  102]])
>>> print(tokenizer.tokenize(item)[:10])
['公', '式', '如', '图', '[', '[UNK]', ']', 'x', ',', 'y']
>>> items = [item, item]
>>> token_items = tokenizer(items, return_tensors='pt')
>>> print(token_items.input_ids.shape)
torch.Size([2, 31])
>>> print(len(tokenizer.tokenize(items)))
2
>>> tokenizer.save_pretrained('test_dir') 
>>> tokenizer = TokenizerForHuggingface.from_pretrained('test_dir') 
tokenize(items: ~typing.Union[list, str, dict], key=<function TokenizerForHuggingface.<lambda>>, **kwargs)[source]
encode(items: ~typing.Tuple[str, dict, ~typing.List[str], ~typing.List[dict]], key=<function TokenizerForHuggingface.<lambda>>, **kwargs)[source]
decode(token_ids: list, key=<function TokenizerForHuggingface.<lambda>>, **kwargs)[source]
classmethod from_pretrained(tokenizer_config_dir, **kwargs)[source]
save_pretrained(tokenizer_config_dir)[source]
property vocab_size
set_vocab(items: ~typing.Tuple[~typing.List[str], ~typing.List[dict]], key=<function TokenizerForHuggingface.<lambda>>, lower=False, trim_min_count: int = 1, do_tokenize: bool = True)[source]
Parameters:
  • items (list) – can be the list of str, or list of dict

  • key (function) – determine how to get the text of each item

  • trim_min_count (int, optional) – the minimum number of occurrences required for a word to be added to the vocabulary, by default 1

  • do_tokenize (bool, optional) – whether to tokenize items before updating the vocabulary, by default True

add_specials(added_spectials: List[str])[source]
add_tokens(added_tokens: List[str])[source]

EduNLP.Pretrain.gensim_vec

class EduNLP.Pretrain.gensim_vec.GensimWordTokenizer(symbol='gm', general=False)[source]
Parameters:
  • symbol (str) –

    select the methods to symbolize:

    ”t”: text, “f”: formula, “g”: figure, “m”: question mark, “a”: tag, “s”: sep,

    e.g.: gm, fgm, gmas, fgmas

  • general (bool) –

    True: use when the item is not in the standard format and formulas (except formulas in figures) should be tokenized linearly.

    False: use the ‘ast’ method to tokenize formulas instead of ‘linear’.

Returns:

tokenizer

Return type:

Tokenizer

Examples

>>> tokenizer = GensimWordTokenizer(symbol="gmas", general=True)
>>> token_item = tokenizer("有公式$\FormFigureID{1}$,如图$\FigureID{088f15ea-xxx}$,\
... 若$x,y$满足约束条件公式$\FormFigureBase64{2}$,$\SIFSep$,则$z=x+7 y$的最大值为$\SIFBlank$")
>>> print(token_item.tokens[:10])
['公式', '[FORMULA]', '如图', '[FIGURE]', 'x', ',', 'y', '约束条件', '公式', '[FORMULA]']
>>> tokenizer = GensimWordTokenizer(symbol="fgmas", general=False)
>>> token_item = tokenizer("有公式$\FormFigureID{1}$,如图$\FigureID{088f15ea-xxx}$,\
... 若$x,y$满足约束条件公式$\FormFigureBase64{2}$,$\SIFSep$,则$z=x+7 y$的最大值为$\SIFBlank$")
>>> print(token_item.tokens[:10])
['公式', '[FORMULA]', '如图', '[FIGURE]', '[FORMULA]', '约束条件', '公式', '[FORMULA]', '[SEP]', '[FORMULA]']
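The `symbol` string reads as a set of one-letter switches, one per element type. The sketch below shows one way to expand such a string into placeholder tokens. The mapping is an assumption for illustration: “[FORMULA]”, “[FIGURE]”, “[MARK]” and “[SEP]” appear in the example outputs above, while “[TEXT]” and “[TAG]” are guessed by analogy, and `placeholders_for` is a hypothetical helper, not part of EduNLP.

```python
from typing import List

# Assumed letter-to-placeholder mapping; "[TEXT]" and "[TAG]" are guesses.
SYMBOL_TO_PLACEHOLDER = {
    "t": "[TEXT]",
    "f": "[FORMULA]",
    "g": "[FIGURE]",
    "m": "[MARK]",
    "a": "[TAG]",
    "s": "[SEP]",
}


def placeholders_for(symbol: str) -> List[str]:
    """Hypothetical helper: list the placeholders selected by a symbol string."""
    unknown = sorted(set(symbol) - set(SYMBOL_TO_PLACEHOLDER))
    if unknown:
        raise ValueError(f"unsupported symbol letters: {unknown}")
    return [SYMBOL_TO_PLACEHOLDER[letter] for letter in symbol]


gm = placeholders_for("gm")
fgmas = placeholders_for("fgmas")
```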
batch_process(*items)[source]
EduNLP.Pretrain.gensim_vec.train_vector(items, w2v_prefix, embedding_dim=None, method='sg', binary=None, train_params=None)[source]
Parameters:
  • items (str) – the text of the question

  • w2v_prefix

  • embedding_dim (int) – vector_size

  • method (str) – the method of training, e.g.: sg, cbow, fasttext, d2v, bow, tfidf

  • binary (model format) – True:bin; False:kv

  • train_params (dict) – the training parameters passed to model

Returns:

the path of the saved model file

Return type:

str

Examples

>>> tokenizer = GensimSegTokenizer(symbol="gms", depth=None)
>>> token_item = tokenizer("有公式$\FormFigureID{1}$,如图$\FigureID{088f15ea-xxx}$,\
... 若$x,y$满足约束条件公式$\FormFigureBase64{2}$,$\SIFSep$,则$z=x+7 y$的最大值为$\SIFBlank$")
>>> print(token_item[:10])
[['公式'], [\FormFigureID{1}], ['如图'], ['[FIGURE]'],...['最大值'], ['[MARK]']]
>>> train_vector(token_item[:10], "examples/test_model/w2v/gensim_luna_stem_t_", 100) 
'examples/test_model/w2v/gensim_luna_stem_t_sg_100.kv'
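The returned path in the example suggests a naming pattern of prefix, method, and embedding dimension, followed by an extension chosen by `binary`. The sketch below reproduces that pattern; it is inferred from this single example and the description of `binary`, so other methods (for instance d2v, bow or tfidf) may well be named differently. `model_path` is a hypothetical helper.

```python
def model_path(w2v_prefix: str, method: str, embedding_dim: int,
               binary: bool = False) -> str:
    """Hypothetical helper: compose an output path like the one shown above."""
    # Per the parameter description: True -> bin, False -> kv.
    suffix = "bin" if binary else "kv"
    return f"{w2v_prefix}{method}_{embedding_dim}.{suffix}"


kv_path = model_path("examples/test_model/w2v/gensim_luna_stem_t_", "sg", 100)
bin_path = model_path("examples/test_model/w2v/gensim_luna_stem_t_", "sg", 100,
                      binary=True)
```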
class EduNLP.Pretrain.gensim_vec.GensimSegTokenizer(symbol='gms', depth=None, flatten=False, **kwargs)[source]
Parameters:
  • symbol (str) –

    select the methods to symbolize:

    ”t”: text, “f”: formula, “g”: figure, “m”: question mark, “a”: tag, “s”: sep,

    e.g. gms, fgm

  • depth (int or None) – 0: only separate at SIFSep; 1: only separate at SIFTag; 2: separate at SIFTag and SIFSep; otherwise, separate all segments

Returns:

tokenizer

Return type:

Tokenizer

Examples

>>> tokenizer = GensimSegTokenizer(symbol="gms", depth=None)
>>> token_item = tokenizer("有公式$\FormFigureID{1}$,如图$\FigureID{088f15ea-xxx}$,\
... 若$x,y$满足约束条件公式$\FormFigureBase64{2}$,$\SIFSep$,则$z=x+7 y$的最大值为$\SIFBlank$")
>>> print(token_item[:10])
[['公式'], [\FormFigureID{1}], ['如图'], ['[FIGURE]'],...['最大值'], ['[MARK]']]
>>> tokenizer = GensimSegTokenizer(symbol="fgm", depth=None)
>>> token_item = tokenizer("有公式$\FormFigureID{1}$,如图$\FigureID{088f15ea-xxx}$,\
... 若$x,y$满足约束条件公式$\FormFigureBase64{2}$,$\SIFSep$,则$z=x+7 y$的最大值为$\SIFBlank$")
>>> print(token_item[:10])
[['公式'], ['[FORMULA]'], ['如图'], ['[FIGURE]'], ['[FORMULA]'],...['[FORMULA]'], ['最大值'], ['[MARK]']]

EduNLP.Pretrain.elmo_vec

class EduNLP.Pretrain.elmo_vec.ElmoTokenizer(vocab_path=None, max_length=250, tokenize_method='pure_text', add_specials=True, **kwargs)[source]

Examples

>>> t=ElmoTokenizer()
>>> items = ["有公式$\FormFigureID{wrong1?}$,如图$\FigureID{088f15ea-xxx}$,\
... 若$x,y$满足约束条件公式$\FormFigureBase64{wrong2?}$,$\SIFSep$,则$z=x+7 y$的最大值为$\SIFBlank$"]
>>> len(t)
14
>>> t.tokenize(items[0])
['公式', '如图', '[FIGURE]', 'x', ',', 'y', '约束条件', '公式', '[SEP]', 'z', '=', 'x', '+', '7', 'y', '最大值', '[MARK]']
>>> t(items[0])
{'seq_idx': tensor([1, 1, 6, 1, 1, 1, 1, 1, 9, 1, 1, 1, 1, 1, 1, 1, 7]), 'seq_len': tensor(17)}
>>> t.set_vocab(items[0])
['公式', '如图', '[FIGURE]', 'x', ',', 'y', '约束条件', '公式', '[SEP]', 'z', '=', 'x', '+', '7', 'y', '最大值', '[MARK]']
>>> len(t)
45
>>> t(items[0])
{'seq_idx': tensor([ 1,  1,  6, 26, 27, 28,  1,  1,  9, 35, 36, 26, 37, 38, 28,  1,  7]), 'seq_len': tensor(17)}
class EduNLP.Pretrain.elmo_vec.ElmoDataset(tokenizer: ElmoTokenizer, **kwargs)[source]
collate_fn(batch_data)[source]
EduNLP.Pretrain.elmo_vec.train_elmo(items: Union[List[dict], List[str]], output_dir: str, pretrained_dir: Optional[str] = None, tokenizer_params=None, data_params=None, model_params=None, train_params=None)[source]
Parameters:
  • items (list, required) – The training corpus, each item could be str or dict

  • output_dir (str, required) – The directory to save trained model files

  • pretrained_dir (str, optional) – The pretrained directory for model and tokenizer

  • tokenizer_params (dict, optional, default=None) – The parameters passed to ElmoTokenizer

  • data_params (dict, optional, default=None) – The parameters passed to ElmoDataset and ElmoTokenizer, including:

    • stem_key

    • label_key

  • model_params (dict, optional, default=None) – The parameters passed to Trainer

  • train_params (dict, optional, default=None) –

EduNLP.Pretrain.elmo_vec.train_elmo_for_property_prediction(train_items: list, output_dir: str, pretrained_dir=None, eval_items=None, tokenizer_params=None, data_params=None, train_params=None, model_params=None)[source]
Parameters:
  • train_items (list, required) – The training items, each item could be str or dict

  • output_dir (str, required) – The directory to save trained model files

  • pretrained_dir (str, optional) – The pretrained directory for model and tokenizer

  • eval_items (list, optional) – The evaluation items, each item could be str or dict, by default None

  • tokenizer_params (dict, optional, default=None) – The parameters passed to ElmoTokenizer

  • data_params (dict, optional, default=None) – The parameters passed to ElmoDataset and ElmoTokenizer

  • model_params (dict, optional, default=None) – The parameters passed to Trainer

  • train_params (dict, optional, default=None) –

EduNLP.Pretrain.elmo_vec.train_elmo_for_knowledge_prediction(train_items: list, output_dir: str, pretrained_dir=None, eval_items=None, tokenizer_params=None, data_params=None, train_params=None, model_params=None)[source]
Parameters:
  • train_items (list, required) – The training items, each item could be str or dict

  • output_dir (str, required) – The directory to save trained model files

  • pretrained_dir (str, optional) – The pretrained directory for model and tokenizer

  • eval_items (list, optional) – The evaluation items, each item could be str or dict, by default None

  • tokenizer_params (dict, optional, default=None) – The parameters passed to ElmoTokenizer

  • data_params (dict, optional, default=None) – The parameters passed to ElmoDataset and ElmoTokenizer

  • model_params (dict, optional, default=None) – The parameters passed to Trainer

  • train_params (dict, optional, default=None) –

EduNLP.Pretrain.bert_vec

class EduNLP.Pretrain.bert_vec.BertTokenizer(pretrained_model='bert-base-chinese', max_length=512, tokenize_method: str = 'pure_text', add_specials: Union[List[str], bool] = False, **kwargs)[source]

Examples

>>> tokenizer = BertTokenizer(add_special_tokens=True)
>>> item = "有公式$\FormFigureID{wrong1?}$,如图$\FigureID{088f15ea-xxx}$,    ... 若$x,y$满足约束条件公式$\FormFigureBase64{wrong2?}$,$\SIFSep$,则$z=x+7 y$的最大值为$\SIFBlank$"
>>> token_item = tokenizer(item)
>>> print(token_item.input_ids)
tensor([[ 101, 1062, 2466, 1963, 1745,  138,  100,  140,  166,  117,  167, 5276,
         3338, 3340,  816, 1062, 2466,  102,  168,  134,  166,  116,  128,  167,
         3297, 1920,  966,  138,  100,  140,  102]])
>>> print(tokenizer.tokenize(item)[:10])
['公', '式', '如', '图', '[', '[UNK]', ']', 'x', ',', 'y']
>>> items = [item, item]
>>> token_items = tokenizer(items, return_tensors='pt')
>>> print(token_items.input_ids.shape)
torch.Size([2, 31])
>>> print(len(tokenizer.tokenize(items)))
2
>>> tokenizer.save_pretrained('test_dir') 
>>> tokenizer = BertTokenizer.from_pretrained('test_dir') 
class EduNLP.Pretrain.bert_vec.BertDataset(tokenizer, ds_disk_path: Optional[Dataset] = None, items: Optional[Union[List[dict], List[str]]] = None, stem_key: str = 'text', label_key: Optional[str] = None, feature_keys: Optional[List[str]] = None, num_processor: Optional[int] = None, **kwargs)[source]
EduNLP.Pretrain.bert_vec.finetune_bert(items: Union[List[dict], List[str]], output_dir: str, pretrained_model='bert-base-chinese', tokenizer_params=None, data_params=None, model_params=None, train_params=None)[source]
Parameters:
  • items (list, required) – The training corpus, each item could be str or dict

  • output_dir (str, required) – The directory to save trained model files

  • pretrained_model (str, optional) – The pretrained model name or path for model and tokenizer

  • eval_items (list, required) – The evaluating items, each item could be str or dict

  • tokenizer_params (dict, optional, default=None) – The parameters passed to BertTokenizer

  • data_params (dict, optional, default=None) – The parameters passed to BertDataset and BertTokenizer

  • model_params (dict, optional, default=None) – The parameters passed to Trainer

  • train_params (dict, optional, default=None) –

Examples

>>> stems = ["有公式$\FormFigureID{wrong1?}$,如图$\FigureID{088f15ea-xxx}$",
... "有公式$\FormFigureID{wrong1?}$,如图$\FigureID{088f15ea-xxx}$"]
>>> finetune_bert(stems, "examples/test_model/data/data/bert") 
{'train_runtime': ..., ..., 'epoch': 1.0}
EduNLP.Pretrain.bert_vec.finetune_bert_for_property_prediction(train_items, output_dir, pretrained_model='bert-base-chinese', eval_items=None, tokenizer_params=None, data_params=None, train_params=None, model_params=None)[source]
Parameters:
  • train_items (list, required) – The training corpus, each item could be str or dict

  • output_dir (str, required) – The directory to save trained model files

  • pretrained_model (str, optional) – The pretrained model name or path for model and tokenizer

  • eval_items (list, optional) – The evaluation items, each item could be str or dict, by default None

  • tokenizer_params (dict, optional, default=None) – The parameters passed to BertTokenizer

  • data_params (dict, optional, default=None) – The parameters passed to BertDataset and BertTokenizer

  • model_params (dict, optional, default=None) – The parameters passed to Trainer

  • train_params (dict, optional, default=None) –

EduNLP.Pretrain.bert_vec.finetune_bert_for_knowledge_prediction(train_items, output_dir, pretrained_model='bert-base-chinese', eval_items=None, tokenizer_params=None, data_params=None, train_params=None, model_params=None)[source]
Parameters:
  • train_items (list, required) – The training corpus, each item could be str or dict

  • output_dir (str, required) – The directory to save trained model files

  • pretrained_model (str, optional) – The pretrained model name or path for model and tokenizer

  • eval_items (list, optional) – The evaluation items, each item could be str or dict, by default None

  • tokenizer_params (dict, optional, default=None) – The parameters passed to BertTokenizer

  • data_params (dict, optional, default=None) – The parameters passed to BertDataset and BertTokenizer

  • model_params (dict, optional, default=None) – The parameters passed to Trainer

  • train_params (dict, optional, default=None) –

EduNLP.Pretrain.disenqnet_vec

class EduNLP.Pretrain.disenqnet_vec.DisenQTokenizer(vocab_path=None, max_length=250, tokenize_method='pure_text', add_specials: Optional[list] = None, num_token='[NUM]', **kwargs)[source]

Examples

>>> tokenizer = DisenQTokenizer()
>>> test_items = [{
...     "content": "甲 数 除以 乙 数 的 商 是 1.5 , 如果 甲 数 增加 20 , 则 甲 数 是 乙 的 4 倍 . 原来 甲 数 = .",
...     "knowledge": ["*", "-", "/"], "difficulty": 0.2, "length": 7}]
>>> tokenizer.set_vocab(test_items,
...     key=lambda x: x["content"], trim_min_count=1)
[['甲', '数', '除以', '乙', '数', '商', '[NUM]', '甲', '数', '增加', '[NUM]', '甲', '数', '乙', '倍', '甲', '数']]
>>> token_items = [tokenizer(i, key=lambda x: x["content"]) for i in test_items]
>>> print(token_items[0].keys())
dict_keys(['seq_idx', 'seq_len'])
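In the example output above, the numeric tokens 1.5 and 20 show up as “[NUM]”. A minimal sketch of that substitution is given below. The regular expression and the `replace_numbers` helper are assumptions for illustration; the sketch only covers number replacement and does not reproduce the stop-word and punctuation filtering that the example output also shows.

```python
import re
from typing import List

# Assumed notion of a number: optional sign, digits, optional decimal part.
NUM_PATTERN = re.compile(r"^[+-]?\d+(\.\d+)?$")


def replace_numbers(tokens: List[str], num_token: str = "[NUM]") -> List[str]:
    """Hypothetical helper: swap every purely numeric token for `num_token`."""
    return [num_token if NUM_PATTERN.match(tok) else tok for tok in tokens]


tokens = "甲 数 除以 乙 数 的 商 是 1.5 , 如果 甲 数 增加 20".split()
replaced = replace_numbers(tokens)
```

Collapsing all numbers into one token keeps the vocabulary small, at the cost of discarding the numeric values themselves.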
class EduNLP.Pretrain.disenqnet_vec.DisenQDataset(items: List[Dict], tokenizer: DisenQTokenizer, data_formation: Dict, mode='train', concept_to_idx=None, **kwargs)[source]
collate_fn(batch_data)[source]
EduNLP.Pretrain.disenqnet_vec.train_disenqnet(train_items: List[dict], output_dir: str, pretrained_dir: Optional[str] = None, eval_items=None, tokenizer_params=None, data_params=None, model_params=None, train_params=None, w2v_params=None)[source]
Parameters:
  • train_items (List[dict]) – The training items

  • output_dir (str) – The directory to save trained model files

  • pretrained_dir (str, optional) – The pretrained directory for model and tokenizer, by default None

  • tokenizer_params (dict, optional) – The parameters passed to DisenQTokenizer, by default None

  • data_params (dict, optional) – The parameters passed to DisenQDataset, by default None

  • model_params (dict, optional) – The model parameters, by default None

  • train_params (dict, optional) – The training parameters, by default None

EduNLP.Pretrain.disenqnet_vec.finetune_disenqnet_for_property_prediction(train_items, output_dir, pretrained_model, eval_items=None, tokenizer_params=None, data_params=None, train_params=None, model_params=None)[source]
Parameters:
  • train_items (list, required) – The training corpus, each item could be str or dict

  • output_dir (str, required) – The directory to save trained model files

  • pretrained_model (str, required) – The pretrained model name or path for model and tokenizer

  • eval_items (list, optional) – The evaluation items, each item could be str or dict, by default None

  • tokenizer_params (dict, optional, default=None) – The parameters passed to DisenQTokenizer

  • data_params (dict, optional, default=None) – The parameters passed to DisenQDataset and DisenQTokenizer

  • model_params (dict, optional, default=None) – The parameters passed to Trainer

  • train_params (dict, optional, default=None) –

EduNLP.Pretrain.disenqnet_vec.finetune_disenqnet_for_knowledge_prediction(train_items, output_dir, pretrained_model='bert-base-chinese', eval_items=None, tokenizer_params=None, data_params=None, train_params=None, model_params=None)[source]
Parameters:
  • train_items (list, required) – The training corpus, each item could be str or dict

  • output_dir (str, required) – The directory to save trained model files

  • pretrained_model (str, optional) – The pretrained model name or path for model and tokenizer

  • eval_items (list, optional) – The evaluation items, each item could be str or dict, by default None

  • tokenizer_params (dict, optional, default=None) – The parameters passed to DisenQTokenizer

  • data_params (dict, optional, default=None) – The parameters passed to DisenQDataset and DisenQTokenizer

  • model_params (dict, optional, default=None) – The parameters passed to Trainer

  • train_params (dict, optional, default=None) –

EduNLP.Pretrain.quesnet_vec

Pre-process input text, tokenize it, build vocabularies, and pre-train word-level vectors.

class EduNLP.Pretrain.quesnet_vec.Question(id, content, answer, false_options, labels)
property answer

Alias for field number 2

property content

Alias for field number 1

property false_options

Alias for field number 3

property id

Alias for field number 0

property labels

Alias for field number 4
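The “Alias for field number N” descriptions are the docstrings that `collections.namedtuple` generates for its fields, so `Question` behaves like a five-field named tuple. The sketch below rebuilds an equivalent type to show the correspondence between field names and positions; the field values are made up for illustration.

```python
from collections import namedtuple

# Same field order as in the signature above: id is field 0, labels is field 4.
Question = namedtuple(
    "Question", ["id", "content", "answer", "false_options", "labels"]
)

# Hypothetical values, only to demonstrate access by name and by position.
q = Question(
    id="q1",
    content=["x", "+", "y"],
    answer=["2"],
    false_options=[["1"], ["3"]],
    labels={"difficulty": 0.2},
)

same_content = q.content is q[1]
fields = Question._fields
```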

EduNLP.Pretrain.quesnet_vec.save_list(item2index, path)[source]
class EduNLP.Pretrain.quesnet_vec.QuesNetTokenizer(vocab_path=None, meta_vocab_dir=None, img_dir: Optional[str] = None, max_length=250, tokenize_method='custom', symbol='mas', add_specials: Optional[list] = None, meta: Optional[List[str]] = None, img_token='<img>', unk_token='<unk>', pad_token='<pad>', **kwargs)[source]

Examples

>>> tokenizer = QuesNetTokenizer(meta=['knowledge'])
>>> test_items = [{"ques_content": "$\triangle A B C$ 的内角为 $A, \quad B, $\FigureID{test_id}$",
... "knowledge": "['*', '-', '/']"}, {"ques_content": "$\triangle A B C$ 的内角为 $A, \quad B",
... "knowledge": "['*', '-', '/']"}]
>>> tokenizer.set_vocab(test_items,
... trim_min_count=1, key=lambda x: x["ques_content"], silent=True)
>>> tokenizer.set_meta_vocab(test_items, silent=True)
>>> token_items = [tokenizer(i, key=lambda x: x["ques_content"]) for i in test_items]
>>> print(token_items[0].keys())
dict_keys(['seq_idx', 'meta_idx'])
>>> token_items = tokenizer(test_items, key=lambda x: x["ques_content"])
>>> print(len(token_items["seq_idx"]))
2
load_meta_vocab(meta_vocab_dir)[source]
set_meta_vocab(items: list, meta: Optional[List[str]] = None, silent=True)[source]
set_vocab(items: list, key=<function QuesNetTokenizer.<lambda>>, lower: bool = False, trim_min_count: int = 1, do_tokenize: bool = True, silent=True)[source]
Parameters:
  • items (list) – can be the list of str, or list of dict

  • key (function) – determine how to get the text of each item

  • trim_min_count (int, optional) – the minimum number of occurrences required for a word to be added to the vocabulary, by default 1

  • silent

classmethod from_pretrained(tokenizer_config_dir, img_dir=None, **kwargs)[source]

Parameters:

tokenizer_config_dir: str

The directory, which must contain tokenizer_config.json, vocab.txt, and meta_{meta_name}.txt

img_dir: str

The path of the image directory, by default None

save_pretrained(tokenizer_config_dir)[source]

Save tokenizer into local files

Parameters:

tokenizer_config_dir: str

The directory in which tokenizer parameters are saved to tokenizer_config.json, words to vocab.txt, and metas to meta_{meta_name}.txt

padding(idx, max_length, type='word')[source]
set_img_dir(path)[source]
EduNLP.Pretrain.quesnet_vec.clip(v, low, high)[source]
class EduNLP.Pretrain.quesnet_vec.Lines(filename, skip=0, preserve_newline=False)[source]
class EduNLP.Pretrain.quesnet_vec.QuestionLoader(ques: ~EduNLP.Pretrain.quesnet_vec.Lines, tokenizer: ~EduNLP.Pretrain.quesnet_vec.QuesNetTokenizer, pipeline=None, range=None, meta: ~typing.Optional[list] = None, content_key=<function QuestionLoader.<lambda>>, meta_key=<function QuestionLoader.<lambda>>, answer_key=<function QuestionLoader.<lambda>>, option_key=<function QuestionLoader.<lambda>>, skip=0)[source]
split_(split_ratio)[source]
EduNLP.Pretrain.quesnet_vec.optimizer(*models, **kwargs)[source]
class EduNLP.Pretrain.quesnet_vec.PrefetchIter(data, *label, length=None, batch_size=1, shuffle=True)[source]

Iterator on data and labels, with states for save and restore.

produce()[source]
class EduNLP.Pretrain.quesnet_vec.EmbeddingDataset(data, data_type='image')[source]
EduNLP.Pretrain.quesnet_vec.pretrain_iter(ques, batch_size)[source]
EduNLP.Pretrain.quesnet_vec.critical(f)[source]
EduNLP.Pretrain.quesnet_vec.pretrain_embedding_layer(dataset: EmbeddingDataset, ae: AE, lr: float = 0.001, log_step: int = 1, epochs: int = 3, batch_size: int = 4, device=device(type='cpu'))[source]
EduNLP.Pretrain.quesnet_vec.pretrain_quesnet(path, output_dir, img_dir=None, save_embs=False, train_params=None)[source]

pretrain quesnet

Parameters:
  • path (str) – path of question file

  • output_dir (str) – output path

  • tokenizer (QuesNetTokenizer) – quesnet tokenizer

  • save_embs (bool, optional) – whether to save pretrained word/image/meta embeddings separately

  • train_params (dict, optional) –

    the training parameters and model parameters, by default None

    • ”n_epochs”: int, default = 1

      train param, number of epochs

    • ”batch_size”: int, default = 6

      train param, batch size

    • ”lr”: float, default = 1e-3

      train param, learning rate

    • ”save_every”: int, default = 0

      train param, save steps interval

    • ”log_steps”: int, default = 10

      train param, log steps interval

    • ”device”: str, default = ‘cpu’

      train param, ‘cpu’ or ‘cuda’

    • ”max_steps”: int, default = 0

      train param, stop training when reach max steps

    • ”emb_size”: int, default = 256

      model param, the embedding size of word, figure, meta info

    • ”feat_size”: int, default = 256

      model param, the size of question infer vector
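The defaults listed above can be collected into one dict and overlaid with user-supplied values. The sketch below shows that pattern; the default values are copied from the list, while the `resolve_train_params` helper and the merge strategy are assumptions for illustration, since how `pretrain_quesnet` combines them internally is not stated here.

```python
from typing import Optional

# Defaults copied from the parameter list above.
DEFAULT_TRAIN_PARAMS = {
    "n_epochs": 1,
    "batch_size": 6,
    "lr": 1e-3,
    "save_every": 0,
    "log_steps": 10,
    "device": "cpu",
    "max_steps": 0,
    "emb_size": 256,
    "feat_size": 256,
}


def resolve_train_params(train_params: Optional[dict] = None) -> dict:
    """Hypothetical helper: overlay user values on top of the defaults."""
    params = dict(DEFAULT_TRAIN_PARAMS)  # copy so the defaults stay untouched
    if train_params is not None:
        params.update(train_params)
    return params


defaults_only = resolve_train_params()
custom = resolve_train_params({"n_epochs": 3, "device": "cuda"})
```

Passing only the keys that differ from the defaults keeps call sites short, for example `train_params={"n_epochs": 3}`.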

Examples

>>> tokenizer = QuesNetTokenizer(meta=['know_name'])
>>> items = [{"ques_content": "若复数$z=1+2 i+i^{3}$,则$|z|=$,$\FigureID{000004d6-0479-11ec-829b-797d5eb43535}$",
... "ques_id": "726cdbec-33a9-11ec-909c-98fa9b625adb",
... "know_name": "['代数', '集合', '集合的相等']"
... }]
>>> tokenizer.set_vocab(items, key=lambda x: x['ques_content'], trim_min_count=1, silent=True)
>>> tokenizer.set_meta_vocab(items, silent=True)
>>> pretrain_quesnet('./data/standard_luna_data.json', './testQuesNet', tokenizer)