EduNLP.Pretrain¶
EduNLP.Pretrain.pretrian_utils¶
- class EduNLP.Pretrain.pretrian_utils.EduVocab(vocab_path: Optional[str] = None, corpus_items: Optional[List[str]] = None, bos_token: str = '[BOS]', eos_token: str = '[EOS]', pad_token: str = '[PAD]', unk_token: str = '[UNK]', specials: Optional[List[str]] = None, lower: bool = False, trim_min_count: int = 1, **kwargs)[source]¶
The vocabulary container for a corpus.
- Parameters:
vocab_path (str, optional) – vocabulary path to initialize this container, by default None
corpus_items (List[str], optional) – corpus items to update this vocabulary, by default None
bos_token (str, optional) – token marking the start of a sentence, by default “[BOS]”
eos_token (str, optional) – token marking the end of a sentence, by default “[EOS]”
pad_token (str, optional) – token used for padding, by default “[PAD]”
unk_token (str, optional) – token standing in for unknown words, by default “[UNK]”
specials (List[str], optional) – special tokens in the vocabulary, by default None
lower (bool, optional) – whether to lowercase the corpus items, by default False
trim_min_count (int, optional) – the minimum count a word needs to be added to the vocabulary, by default 1
- property vocab_size¶
- property special_tokens¶
- property tokens¶
- convert_sequence_to_idx(tokens, bos=False, eos=False)[source]¶
Convert a sequence of tokens to a sequence of indices
- set_vocab(corpus_items: List[str], lower: bool = False, trim_min_count: int = 1, silent=True)[source]¶
Update the vocabulary with the tokens in corpus items
- Parameters:
corpus_items (List[str]) – corpus items to update this vocabulary
lower (bool, optional) – whether to lowercase the corpus items, by default False
trim_min_count (int, optional) – the minimum count a word needs to be added to the vocabulary, by default 1
- load_vocab(vocab_path: str)[source]¶
Load the vocabulary from vocab_path
- Parameters:
vocab_path (str) – path of the vocabulary file to load
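The behaviour described by these parameters can be illustrated with a minimal stand-in written with the standard library only. `MiniVocab` is a hypothetical class, not EduVocab itself. The order of the special tokens, the trimming rule (keep a word when its count is at least trim_min_count) and the assumption that each corpus item is an already tokenized list of tokens are choices made for this sketch.

```python
from collections import Counter
from typing import List, Optional


class MiniVocab:
    """Hypothetical stand-in for a vocabulary container (not EduVocab)."""

    def __init__(self, corpus_items: Optional[List[List[str]]] = None,
                 bos_token: str = "[BOS]", eos_token: str = "[EOS]",
                 pad_token: str = "[PAD]", unk_token: str = "[UNK]",
                 lower: bool = False, trim_min_count: int = 1):
        self.bos_token, self.eos_token = bos_token, eos_token
        self.pad_token, self.unk_token = pad_token, unk_token
        # Assumed order of the special tokens; the real order may differ.
        self.special_tokens = [pad_token, unk_token, bos_token, eos_token]
        self.idx_to_token = list(self.special_tokens)
        self.token_to_idx = {t: i for i, t in enumerate(self.idx_to_token)}
        if corpus_items is not None:
            self.set_vocab(corpus_items, lower=lower, trim_min_count=trim_min_count)

    @property
    def vocab_size(self) -> int:
        return len(self.idx_to_token)

    def set_vocab(self, corpus_items, lower=False, trim_min_count=1):
        # Count every token, then keep those seen at least trim_min_count times.
        counts = Counter()
        for tokens in corpus_items:
            counts.update(t.lower() if lower else t for t in tokens)
        for token, count in counts.items():
            if count >= trim_min_count and token not in self.token_to_idx:
                self.token_to_idx[token] = len(self.idx_to_token)
                self.idx_to_token.append(token)

    def convert_sequence_to_idx(self, tokens, bos=False, eos=False):
        # Unknown tokens fall back to the index of unk_token.
        unk = self.token_to_idx[self.unk_token]
        ids = [self.token_to_idx.get(t, unk) for t in tokens]
        if bos:
            ids = [self.token_to_idx[self.bos_token]] + ids
        if eos:
            ids = ids + [self.token_to_idx[self.eos_token]]
        return ids


vocab = MiniVocab([["x", "y", "x"], ["z"]], trim_min_count=2)
ids = vocab.convert_sequence_to_idx(["x", "z"], bos=True, eos=True)
```

With trim_min_count=2 only "x" is kept, so "z" maps to the unknown index in this sketch.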
- class EduNLP.Pretrain.pretrian_utils.EduDataset(tokenizer, ds_disk_path: Optional[Dataset] = None, items: Optional[Union[List[dict], List[str]]] = None, stem_key: str = 'text', label_key: Optional[str] = None, feature_keys: Optional[List[str]] = None, num_processor: Optional[int] = None, **kwargs)[source]¶
The base class implements a Dataset that wraps datasets.Dataset and adds conveniences such as parallel preprocessing and offline loading.
- Parameters:
tokenizer – PretrainedEduTokenizer or model-specific Pretrained Tokenizer
ds_disk_path (HFDataset, optional) – the disk path of a saved dataset, as used by datasets.Dataset, by default None
items (Union[List[dict], List[str]], optional) – input items to process, by default None
stem_key (str, optional) – the key of the item content to process, by default “text”
label_key (Optional[str], optional) – the key of the item labels to process, by default None
feature_keys (Optional[List[str]], optional) – the keys of additional item features to keep, by default None
num_processor (int, optional) – the number of CPUs to use for parallel speedup, by default None
- ds¶
Note: map will break down for very large data (greater than 4GB).
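A rough sketch of how stem_key, label_key and feature_keys could select fields from raw items before tokenization. The helper `select_fields` is hypothetical and does not reflect the internals of EduDataset; plain-string items are assumed to carry only the stem text.

```python
from typing import Dict, List, Optional, Union


def select_fields(items: List[Union[dict, str]],
                  stem_key: str = "text",
                  label_key: Optional[str] = None,
                  feature_keys: Optional[List[str]] = None) -> List[Dict]:
    """Keep the stem text, the optional label and any requested extra features."""
    selected = []
    for item in items:
        if isinstance(item, str):
            # A plain string is treated as the stem itself (assumption).
            selected.append({stem_key: item})
            continue
        record = {stem_key: item[stem_key]}
        if label_key is not None:
            record[label_key] = item[label_key]
        for name in feature_keys or []:
            record[name] = item[name]
        selected.append(record)
    return selected


items = [{"text": "1 + 1 = ?", "difficulty": 0.2, "source": "demo", "id": 7}, "x + y"]
rows = select_fields(items, label_key="difficulty", feature_keys=["source"])
```

Fields that are not named by any of the three arguments (here "id") are dropped in this sketch.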
- class EduNLP.Pretrain.pretrian_utils.PretrainedEduTokenizer(vocab_path: Optional[str] = None, max_length: int = 250, tokenize_method: str = 'pure_text', add_specials: Tuple[list, bool] = False, **kwargs)[source]¶
This base class is in charge of preparing the inputs for a model
- Parameters:
vocab_path (str, optional) – path of the vocabulary file used to initialize the tokenizer, by default None
max_length (int, optional) – sentences longer than max_length are clipped, by default 250
tokenize_method (str, optional) – by default “pure_text”
- when the text is already separated by spaces, use “space”
- when the text is a raw string, use a tokenizer defined in get_tokenizer(), such as “pure_text” or “text”
add_specials (Tuple[list, bool], optional) – by default False
- For bool, it means whether to add EDU_SPYMBOLS to the vocabulary
- For list, it means the special tokens added besides EDU_SPYMBOLS
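The two accepted forms of add_specials can be sketched as follows. `EDU_SYMBOLS_DEMO` is a hypothetical stand-in for EDU_SPYMBOLS (its real contents may differ) and `resolve_specials` is an illustrative helper, not part of the library.

```python
from typing import List, Union

# Hypothetical stand-in for EDU_SPYMBOLS; the real list may differ.
EDU_SYMBOLS_DEMO = ["[FIGURE]", "[FORMULA]", "[TAG]", "[SEP]", "[MARK]"]


def resolve_specials(add_specials: Union[List[str], bool] = False) -> List[str]:
    """Return the extra special tokens implied by add_specials."""
    if isinstance(add_specials, bool):
        # True adds the built-in symbols, False adds nothing.
        return list(EDU_SYMBOLS_DEMO) if add_specials else []
    # A list contributes its tokens besides the built-in symbols, without duplicates.
    extra = [t for t in add_specials if t not in EDU_SYMBOLS_DEMO]
    return list(EDU_SYMBOLS_DEMO) + extra


with_custom = resolve_specials(["[NUM]", "[SEP]"])
```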
- tokenize(items: ~typing.Tuple[list, str, dict], key=<function PretrainedEduTokenizer.<lambda>>, **kwargs)[source]¶
- Parameters:
items (list or str or dict) – the question items
key (function) – determine how to get the text of each item
- Returns:
tokens – the tokens of the items
- Return type:
list
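A small sketch of how the key argument picks the text out of each item, using whitespace splitting as in the “space” method. `tokenize_items` is a hypothetical helper; returning a list of token lists even for a single item is an assumption of this sketch.

```python
from typing import Callable, List, Union


def tokenize_items(items: Union[str, dict, list],
                   key: Callable = lambda x: x,
                   max_length: int = 250) -> List[List[str]]:
    """Whitespace-tokenize the text of each item, clipped to max_length tokens."""
    if isinstance(items, (str, dict)):
        items = [items]  # a single item is wrapped into a list
    return [key(item).split()[:max_length] for item in items]


tokens = tokenize_items([{"content": "x + y"}], key=lambda x: x["content"])
```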
- encode(items: ~typing.Tuple[str, dict, ~typing.List[str], ~typing.List[dict]], key=<function PretrainedEduTokenizer.<lambda>>, **kwargs)[source]¶
- classmethod from_pretrained(tokenizer_config_dir: str, **kwargs)[source]¶
Load tokenizer from local files
- Parameters:
tokenizer_config_dir (str) – The directory containing tokenizer_config.json and vocab.list
- save_pretrained(tokenizer_config_dir: str)[source]¶
Save tokenizer into local files
- Parameters:
tokenizer_config_dir (str) – the directory in which the tokenizer parameters are saved as tokenizer_config.json and the words as vocab.list
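The two-file layout used by save_pretrained and from_pretrained can be sketched with the standard library. The function names, the JSON keys and the one-word-per-line format of vocab.list are assumptions of this sketch, not the library's confirmed format.

```python
import json
import os
import tempfile


def save_tokenizer(config_dir, params, words):
    """Write params to tokenizer_config.json and one word per line to vocab.list."""
    os.makedirs(config_dir, exist_ok=True)
    with open(os.path.join(config_dir, "tokenizer_config.json"), "w", encoding="utf-8") as f:
        json.dump(params, f, ensure_ascii=False, indent=2)
    with open(os.path.join(config_dir, "vocab.list"), "w", encoding="utf-8") as f:
        f.write("\n".join(words))


def load_tokenizer(config_dir):
    """Read back the parameters and the word list written by save_tokenizer."""
    with open(os.path.join(config_dir, "tokenizer_config.json"), encoding="utf-8") as f:
        params = json.load(f)
    with open(os.path.join(config_dir, "vocab.list"), encoding="utf-8") as f:
        words = [line.rstrip("\n") for line in f]
    return params, words


with tempfile.TemporaryDirectory() as tmp:
    save_tokenizer(tmp, {"max_length": 250, "tokenize_method": "pure_text"},
                   ["[PAD]", "[UNK]", "x"])
    params, words = load_tokenizer(tmp)
```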
- property vocab_size¶
- set_vocab(items: list, key=<function PretrainedEduTokenizer.<lambda>>, lower: bool = False, trim_min_count: int = 1, do_tokenize: bool = True)[source]¶
Update the vocabulary with the tokens in corpus items
- Parameters:
items (list) – can be the list of str, or list of dict
key (function, optional) – determine how to get the text of each item
lower (bool, optional) – whether to lowercase the corpus items, by default False
trim_min_count (int, optional) – the minimum count a word needs to be added to the vocabulary, by default 1
do_tokenize (bool, optional) – whether to tokenize the items before updating the vocabulary, by default True
- Returns:
token_items
- Return type:
list
EduNLP.Pretrain.hugginface_utils¶
- class EduNLP.Pretrain.hugginface_utils.TokenizerForHuggingface(pretrained_model='bert-base-chinese', max_length=512, tokenize_method: str = 'pure_text', add_specials: Union[List[str], bool] = False, **kwargs)[source]¶
- Parameters:
pretrained_model – the pretrained model to use
add_specials – Whether to add tokens like [FIGURE], [TAG], etc.
tokenize_method – Which text tokenizer to use. Must be consistent with the TOKENIZER dictionary.
Examples
>>> tokenizer = TokenizerForHuggingface(add_special_tokens=True)
>>> item = "有公式$\FormFigureID{wrong1?}$,如图$\FigureID{088f15ea-xxx}$,\
... 若$x,y$满足约束条件公式$\FormFigureBase64{wrong2?}$,$\SIFSep$,则$z=x+7 y$的最大值为$\SIFBlank$"
>>> token_item = tokenizer(item)
>>> print(token_item.input_ids[:10])
tensor([[ 101, 1062, 2466, 1963, 1745, 138, 100, 140, 166, 117, 167, 5276, 3338, 3340, 816, 1062, 2466, 102, 168, 134, 166, 116, 128, 167, 3297, 1920, 966, 138, 100, 140, 102]])
>>> print(tokenizer.tokenize(item)[:10])
['公', '式', '如', '图', '[', '[UNK]', ']', 'x', ',', 'y']
>>> items = [item, item]
>>> token_items = tokenizer(items, return_tensors='pt')
>>> print(token_items.input_ids.shape)
torch.Size([2, 31])
>>> print(len(tokenizer.tokenize(items)))
2
>>> tokenizer.save_pretrained('test_dir')
>>> tokenizer = TokenizerForHuggingface.from_pretrained('test_dir')
- tokenize(items: ~typing.Union[list, str, dict], key=<function TokenizerForHuggingface.<lambda>>, **kwargs)[source]¶
- encode(items: ~typing.Tuple[str, dict, ~typing.List[str], ~typing.List[dict]], key=<function TokenizerForHuggingface.<lambda>>, **kwargs)[source]¶
- property vocab_size¶
- set_vocab(items: ~typing.Tuple[~typing.List[str], ~typing.List[dict]], key=<function TokenizerForHuggingface.<lambda>>, lower=False, trim_min_count: int = 1, do_tokenize: bool = True)[source]¶
- Parameters:
items (list) – can be the list of str, or list of dict
key (function) – determine how to get the text of each item
trim_min_count (int, optional) – the minimum count a word needs to be added to the vocabulary, by default 1
do_tokenize (bool, optional) – whether to tokenize the items before updating the vocabulary, by default True
EduNLP.Pretrain.gensim_vec¶
- class EduNLP.Pretrain.gensim_vec.GensimWordTokenizer(symbol='gm', general=False)[source]¶
- Parameters:
symbol (str) –
- select the methods to symbolize:
”t”: text, “f”: formula, “g”: figure, “m”: question mark, “a”: tag, “s”: sep,
e.g.: gm, fgm, gmas, fgmas
general (bool) –
True: when the item is not in standard format and formulas (except formulas in figures) should be tokenized linearly.
False: when the ‘ast’ method is used to tokenize formulas instead of ‘linear’.
- Returns:
tokenizer
- Return type:
Examples
>>> tokenizer = GensimWordTokenizer(symbol="gmas", general=True)
>>> token_item = tokenizer("有公式$\FormFigureID{1}$,如图$\FigureID{088f15ea-xxx}$,\
... 若$x,y$满足约束条件公式$\FormFigureBase64{2}$,$\SIFSep$,则$z=x+7 y$的最大值为$\SIFBlank$")
>>> print(token_item.tokens[:10])
['公式', '[FORMULA]', '如图', '[FIGURE]', 'x', ',', 'y', '约束条件', '公式', '[FORMULA]']
>>> tokenizer = GensimWordTokenizer(symbol="fgmas", general=False)
>>> token_item = tokenizer("有公式$\FormFigureID{1}$,如图$\FigureID{088f15ea-xxx}$,\
... 若$x,y$满足约束条件公式$\FormFigureBase64{2}$,$\SIFSep$,则$z=x+7 y$的最大值为$\SIFBlank$")
>>> print(token_item.tokens[:10])
['公式', '[FORMULA]', '如图', '[FIGURE]', '[FORMULA]', '约束条件', '公式', '[FORMULA]', '[SEP]', '[FORMULA]']
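The symbol string is read one letter at a time using the letter codes listed in the parameters. The helper `parse_symbol` below is hypothetical and only decodes the string; it does not perform any symbolization itself.

```python
from typing import List

# Letter codes as listed in the parameter description.
SYMBOL_CODES = {
    "t": "text",
    "f": "formula",
    "g": "figure",
    "m": "question mark",
    "a": "tag",
    "s": "sep",
}


def parse_symbol(symbol: str) -> List[str]:
    """Translate a symbol string such as "gmas" into the element types it selects."""
    unknown = [c for c in symbol if c not in SYMBOL_CODES]
    if unknown:
        raise ValueError(f"unknown symbol codes: {unknown}")
    return [SYMBOL_CODES[c] for c in symbol]


selected = parse_symbol("gmas")
```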
- EduNLP.Pretrain.gensim_vec.train_vector(items, w2v_prefix, embedding_dim=None, method='sg', binary=None, train_params=None)[source]¶
- Parameters:
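items (str) – the text of the question
w2v_prefix – prefix of the output model path (in the example, the method name and the embedding dimension are appended to it)
embedding_dim (int) – vector_size
method (str) – the method of training, e.g.: sg, cbow, fasttext, d2v, bow, tfidf
binary (bool) – the model format, True: bin; False: kv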
train_params (dict) – the training parameters passed to model
- Returns:
the path of the saved model file
- Return type:
str
Examples
>>> tokenizer = GensimSegTokenizer(symbol="gms", depth=None)
>>> token_item = tokenizer("有公式$\FormFigureID{1}$,如图$\FigureID{088f15ea-xxx}$,\
... 若$x,y$满足约束条件公式$\FormFigureBase64{2}$,$\SIFSep$,则$z=x+7 y$的最大值为$\SIFBlank$")
>>> print(token_item[:10])
[['公式'], [\FormFigureID{1}], ['如图'], ['[FIGURE]'],...['最大值'], ['[MARK]']]
>>> train_vector(token_item[:10], "examples/test_model/w2v/gensim_luna_stem_t_", 100)
'examples/test_model/w2v/gensim_luna_stem_t_sg_100.kv'
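The returned path in the example suggests a naming pattern of prefix, method and embedding dimension. The sketch below reproduces that pattern; `w2v_output_path` is a hypothetical helper, and only the sg case with a .kv file is confirmed by the example. Other methods (for instance d2v, bow or tfidf) may be named differently.

```python
def w2v_output_path(w2v_prefix: str, method: str = "sg",
                    embedding_dim: int = 100, binary: bool = False) -> str:
    """Build an output path following the pattern seen in the example (assumed)."""
    # binary selects the file format: True for bin, False for kv.
    suffix = "bin" if binary else "kv"
    return f"{w2v_prefix}{method}_{embedding_dim}.{suffix}"


path = w2v_output_path("examples/test_model/w2v/gensim_luna_stem_t_", "sg", 100)
```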
- class EduNLP.Pretrain.gensim_vec.GensimSegTokenizer(symbol='gms', depth=None, flatten=False, **kwargs)[source]¶
- Parameters:
symbol (str) –
- select the methods to symbolize:
”t”: text, “f”: formula, “g”: figure, “m”: question mark, “a”: tag, “s”: sep,
e.g. gms, fgm
depth (int or None) –
0: only separate at SIFSep;
1: only separate at SIFTag;
2: separate at SIFTag and SIFSep;
otherwise, separate all segments
- Returns:
tokenizer
- Return type:
Examples
>>> tokenizer = GensimSegTokenizer(symbol="gms", depth=None)
>>> token_item = tokenizer("有公式$\FormFigureID{1}$,如图$\FigureID{088f15ea-xxx}$,\
... 若$x,y$满足约束条件公式$\FormFigureBase64{2}$,$\SIFSep$,则$z=x+7 y$的最大值为$\SIFBlank$")
>>> print(token_item[:10])
[['公式'], [\FormFigureID{1}], ['如图'], ['[FIGURE]'],...['最大值'], ['[MARK]']]
>>> tokenizer = GensimSegTokenizer(symbol="fgm", depth=None)
>>> token_item = tokenizer("有公式$\FormFigureID{1}$,如图$\FigureID{088f15ea-xxx}$,\
... 若$x,y$满足约束条件公式$\FormFigureBase64{2}$,$\SIFSep$,则$z=x+7 y$的最大值为$\SIFBlank$")
>>> print(token_item[:10])
[['公式'], ['[FORMULA]'], ['如图'], ['[FIGURE]'], ['[FORMULA]'],...['[FORMULA]'], ['最大值'], ['[MARK]']]
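The effect of depth can be sketched on a flat list of segments. `group_segments` is a hypothetical helper; the marker strings "[SEP]" and "[TAG]" and the choice to drop a marker when it acts as a boundary are assumptions of this sketch, not confirmed behaviour of GensimSegTokenizer.

```python
from typing import List, Optional


def group_segments(segments: List[str], depth: Optional[int] = None) -> List[List[str]]:
    """Group segments according to the depth rule described above."""
    if depth == 0:
        boundaries = {"[SEP]"}
    elif depth == 1:
        boundaries = {"[TAG]"}
    elif depth == 2:
        boundaries = {"[SEP]", "[TAG]"}
    else:
        # Any other value: every segment forms its own group.
        return [[s] for s in segments]
    groups, current = [], []
    for seg in segments:
        if seg in boundaries:
            # Close the running group at a boundary marker (marker dropped).
            if current:
                groups.append(current)
            current = []
        else:
            current.append(seg)
    if current:
        groups.append(current)
    return groups


segs = ["a", "[TAG]", "b", "[SEP]", "c"]
by_sep = group_segments(segs, depth=0)
```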
EduNLP.Pretrain.elmo_vec¶
- class EduNLP.Pretrain.elmo_vec.ElmoTokenizer(vocab_path=None, max_length=250, tokenize_method='pure_text', add_specials=True, **kwargs)[source]¶
Examples
>>> t = ElmoTokenizer()
>>> items = ["有公式$\FormFigureID{wrong1?}$,如图$\FigureID{088f15ea-xxx}$,\
... 若$x,y$满足约束条件公式$\FormFigureBase64{wrong2?}$,$\SIFSep$,则$z=x+7 y$的最大值为$\SIFBlank$"]
>>> len(t)
14
>>> t.tokenize(items[0])
['公式', '如图', '[FIGURE]', 'x', ',', 'y', '约束条件', '公式', '[SEP]', 'z', '=', 'x', '+', '7', 'y', '最大值', '[MARK]']
>>> t(items[0])
{'seq_idx': tensor([1, 1, 6, 1, 1, 1, 1, 1, 9, 1, 1, 1, 1, 1, 1, 1, 7]), 'seq_len': tensor(17)}
>>> t.set_vocab(items[0])
['公式', '如图', '[FIGURE]', 'x', ',', 'y', '约束条件', '公式', '[SEP]', 'z', '=', 'x', '+', '7', 'y', '最大值', '[MARK]']
>>> len(t)
45
>>> t(items[0])
{'seq_idx': tensor([ 1, 1, 6, 26, 27, 28, 1, 1, 9, 35, 36, 26, 37, 38, 28, 1, 7]), 'seq_len': tensor(17)}
- class EduNLP.Pretrain.elmo_vec.ElmoDataset(tokenizer: ElmoTokenizer, **kwargs)[source]¶
- EduNLP.Pretrain.elmo_vec.train_elmo(items: Union[List[dict], List[str]], output_dir: str, pretrained_dir: Optional[str] = None, tokenizer_params=None, data_params=None, model_params=None, train_params=None)[source]¶
- Parameters:
items (list, required) – The training corpus, each item could be str or dict
output_dir (str, required) – The directory to save trained model files
pretrained_dir (str, optional) – The pretrained directory for model and tokenizer
tokenizer_params (dict, optional, default=None) – The parameters passed to ElmoTokenizer
data_params (dict, optional, default=None) – The parameters passed to ElmoDataset and ElmoTokenizer, including stem_key and label_key
model_params (dict, optional, default=None) – The parameters passed to Trainer
train_params (dict, optional, default=None) –
- EduNLP.Pretrain.elmo_vec.train_elmo_for_property_prediction(train_items: list, output_dir: str, pretrained_dir=None, eval_items=None, tokenizer_params=None, data_params=None, train_params=None, model_params=None)[source]¶
- Parameters:
train_items (list, required) – The training items, each item could be str or dict
output_dir (str, required) – The directory to save trained model files
pretrained_dir (str, optional) – The pretrained directory for model and tokenizer
eval_items (list, optional) – The evaluation items, each item could be str or dict, by default None
tokenizer_params (dict, optional, default=None) – The parameters passed to ElmoTokenizer
data_params (dict, optional, default=None) – The parameters passed to ElmoDataset and ElmoTokenizer
model_params (dict, optional, default=None) – The parameters passed to Trainer
train_params (dict, optional, default=None) –
- EduNLP.Pretrain.elmo_vec.train_elmo_for_knowledge_prediction(train_items: list, output_dir: str, pretrained_dir=None, eval_items=None, tokenizer_params=None, data_params=None, train_params=None, model_params=None)[source]¶
- Parameters:
train_items (list, required) – The training items, each item could be str or dict
output_dir (str, required) – The directory to save trained model files
pretrained_dir (str, optional) – The pretrained directory for model and tokenizer
eval_items (list, optional) – The evaluation items, each item could be str or dict, by default None
tokenizer_params (dict, optional, default=None) – The parameters passed to ElmoTokenizer
data_params (dict, optional, default=None) – The parameters passed to ElmoDataset and ElmoTokenizer
model_params (dict, optional, default=None) – The parameters passed to Trainer
train_params (dict, optional, default=None) –
EduNLP.Pretrain.bert_vec¶
- class EduNLP.Pretrain.bert_vec.BertTokenizer(pretrained_model='bert-base-chinese', max_length=512, tokenize_method: str = 'pure_text', add_specials: Union[List[str], bool] = False, **kwargs)[source]¶
Examples
>>> tokenizer = BertTokenizer(add_special_tokens=True)
>>> item = "有公式$\FormFigureID{wrong1?}$,如图$\FigureID{088f15ea-xxx}$,\
... 若$x,y$满足约束条件公式$\FormFigureBase64{wrong2?}$,$\SIFSep$,则$z=x+7 y$的最大值为$\SIFBlank$"
>>> token_item = tokenizer(item)
>>> print(token_item.input_ids)
tensor([[ 101, 1062, 2466, 1963, 1745, 138, 100, 140, 166, 117, 167, 5276, 3338, 3340, 816, 1062, 2466, 102, 168, 134, 166, 116, 128, 167, 3297, 1920, 966, 138, 100, 140, 102]])
>>> print(tokenizer.tokenize(item)[:10])
['公', '式', '如', '图', '[', '[UNK]', ']', 'x', ',', 'y']
>>> items = [item, item]
>>> token_items = tokenizer(items, return_tensors='pt')
>>> print(token_items.input_ids.shape)
torch.Size([2, 31])
>>> print(len(tokenizer.tokenize(items)))
2
>>> tokenizer.save_pretrained('test_dir')
>>> tokenizer = BertTokenizer.from_pretrained('test_dir')
- class EduNLP.Pretrain.bert_vec.BertDataset(tokenizer, ds_disk_path: Optional[Dataset] = None, items: Optional[Union[List[dict], List[str]]] = None, stem_key: str = 'text', label_key: Optional[str] = None, feature_keys: Optional[List[str]] = None, num_processor: Optional[int] = None, **kwargs)[source]¶
- EduNLP.Pretrain.bert_vec.finetune_bert(items: Union[List[dict], List[str]], output_dir: str, pretrained_model='bert-base-chinese', tokenizer_params=None, data_params=None, model_params=None, train_params=None)[source]¶
- Parameters:
items (list, required) – The training corpus, each item could be str or dict
output_dir (str, required) – The directory to save trained model files
pretrained_model (str, optional) – The pretrained model name or path for model and tokenizer
eval_items (list, required) – The evaluating items, each item could be str or dict
tokenizer_params (dict, optional, default=None) – The parameters passed to BertTokenizer
data_params (dict, optional, default=None) – The parameters passed to BertDataset and BertTokenizer
model_params (dict, optional, default=None) – The parameters passed to Trainer
train_params (dict, optional, default=None) –
Examples
>>> stems = ["有公式$\FormFigureID{wrong1?}$,如图$\FigureID{088f15ea-xxx}$",
... "有公式$\FormFigureID{wrong1?}$,如图$\FigureID{088f15ea-xxx}$"]
>>> finetune_bert(stems, "examples/test_model/data/data/bert")
{'train_runtime': ..., ..., 'epoch': 1.0}
- EduNLP.Pretrain.bert_vec.finetune_bert_for_property_prediction(train_items, output_dir, pretrained_model='bert-base-chinese', eval_items=None, tokenizer_params=None, data_params=None, train_params=None, model_params=None)[source]¶
- Parameters:
train_items (list, required) – The training corpus, each item could be str or dict
output_dir (str, required) – The directory to save trained model files
pretrained_model (str, optional) – The pretrained model name or path for model and tokenizer
eval_items (list, optional) – The evaluation items, each item could be str or dict, by default None
tokenizer_params (dict, optional, default=None) – The parameters passed to BertTokenizer
data_params (dict, optional, default=None) – The parameters passed to BertDataset and BertTokenizer
model_params (dict, optional, default=None) – The parameters passed to Trainer
train_params (dict, optional, default=None) –
- EduNLP.Pretrain.bert_vec.finetune_bert_for_knowledge_prediction(train_items, output_dir, pretrained_model='bert-base-chinese', eval_items=None, tokenizer_params=None, data_params=None, train_params=None, model_params=None)[source]¶
- Parameters:
train_items (list, required) – The training corpus, each item could be str or dict
output_dir (str, required) – The directory to save trained model files
pretrained_model (str, optional) – The pretrained model name or path for model and tokenizer
eval_items (list, optional) – The evaluation items, each item could be str or dict, by default None
tokenizer_params (dict, optional, default=None) – The parameters passed to BertTokenizer
data_params (dict, optional, default=None) – The parameters passed to BertDataset and BertTokenizer
model_params (dict, optional, default=None) – The parameters passed to Trainer
train_params (dict, optional, default=None) –
EduNLP.Pretrain.disenqnet_vec¶
- class EduNLP.Pretrain.disenqnet_vec.DisenQTokenizer(vocab_path=None, max_length=250, tokenize_method='pure_text', add_specials: Optional[list] = None, num_token='[NUM]', **kwargs)[source]¶
Examples
>>> tokenizer = DisenQTokenizer()
>>> test_items = [{
...     "content": "甲 数 除以 乙 数 的 商 是 1.5 , 如果 甲 数 增加 20 , 则 甲 数 是 乙 的 4 倍 . 原来 甲 数 = .",
...     "knowledge": ["*", "-", "/"], "difficulty": 0.2, "length": 7}]
>>> tokenizer.set_vocab(test_items,
...     key=lambda x: x["content"], trim_min_count=1)
[['甲', '数', '除以', '乙', '数', '商', '[NUM]', '甲', '数', '增加', '[NUM]', '甲', '数', '乙', '倍', '甲', '数']]
>>> token_items = [tokenizer(i, key=lambda x: x["content"]) for i in test_items]
>>> print(token_items[0].keys())
dict_keys(['seq_idx', 'seq_len'])
- class EduNLP.Pretrain.disenqnet_vec.DisenQDataset(items: List[Dict], tokenizer: DisenQTokenizer, data_formation: Dict, mode='train', concept_to_idx=None, **kwargs)[source]¶
- EduNLP.Pretrain.disenqnet_vec.train_disenqnet(train_items: List[dict], output_dir: str, pretrained_dir: Optional[str] = None, eval_items=None, tokenizer_params=None, data_params=None, model_params=None, train_params=None, w2v_params=None)[source]¶
- Parameters:
train_items (List[dict]) – The training items, each item is a dict
output_dir (str) – The directory to save trained model files
pretrained_dir (str, optional) – The pretrained directory for model and tokenizer, by default None
tokenizer_params (dict, optional) – by default None
data_params (dict, optional) – by default None
model_params (dict, optional) – by default None
train_params (dict, optional) – by default None
- EduNLP.Pretrain.disenqnet_vec.finetune_disenqnet_for_property_prediction(train_items, output_dir, pretrained_model, eval_items=None, tokenizer_params=None, data_params=None, train_params=None, model_params=None)[source]¶
- Parameters:
train_items (list, required) – The training corpus, each item could be str or dict
output_dir (str, required) – The directory to save trained model files
pretrained_model (str, required) – The pretrained model name or path for model and tokenizer
eval_items (list, optional) – The evaluation items, each item could be str or dict, by default None
tokenizer_params (dict, optional, default=None) – The parameters passed to DisenQTokenizer
data_params (dict, optional, default=None) – The parameters passed to DisenQDataset and DisenQTokenizer
model_params (dict, optional, default=None) – The parameters passed to Trainer
train_params (dict, optional, default=None) –
- EduNLP.Pretrain.disenqnet_vec.finetune_disenqnet_for_knowledge_prediction(train_items, output_dir, pretrained_model='bert-base-chinese', eval_items=None, tokenizer_params=None, data_params=None, train_params=None, model_params=None)[source]¶
- Parameters:
train_items (list, required) – The training corpus, each item could be str or dict
output_dir (str, required) – The directory to save trained model files
pretrained_model (str, optional) – The pretrained model name or path for model and tokenizer
eval_items (list, optional) – The evaluation items, each item could be str or dict, by default None
tokenizer_params (dict, optional, default=None) – The parameters passed to DisenQTokenizer
data_params (dict, optional, default=None) – The parameters passed to DisenQDataset and DisenQTokenizer
model_params (dict, optional, default=None) – The parameters passed to Trainer
train_params (dict, optional, default=None) –
EduNLP.Pretrain.quesnet_vec¶
Pre-process input text, tokenize it, build vocabularies, and pre-train word-level vectors.
- class EduNLP.Pretrain.quesnet_vec.Question(id, content, answer, false_options, labels)¶
- property answer¶
Alias for field number 2
- property content¶
Alias for field number 1
- property false_options¶
Alias for field number 3
- property id¶
Alias for field number 0
- property labels¶
Alias for field number 4
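The "Alias for field number N" entries are the standard documentation of a named tuple, so each field can be read by name or by position. The sketch below rebuilds an equivalent type with collections.namedtuple; the field values are made-up illustrations, not the types the library requires.

```python
from collections import namedtuple

# Same field order as documented: id=0, content=1, answer=2, false_options=3, labels=4.
Question = namedtuple("Question", ["id", "content", "answer", "false_options", "labels"])

q = Question(
    id="q1",                        # illustrative values only
    content=["x", "+", "y"],
    answer=["2"],
    false_options=[["1"], ["3"]],
    labels={"difficulty": 0.2},
)

# Access by name and by position refer to the same object.
same_content = q.content is q[1]
```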
- class EduNLP.Pretrain.quesnet_vec.QuesNetTokenizer(vocab_path=None, meta_vocab_dir=None, img_dir: Optional[str] = None, max_length=250, tokenize_method='custom', symbol='mas', add_specials: Optional[list] = None, meta: Optional[List[str]] = None, img_token='<img>', unk_token='<unk>', pad_token='<pad>', **kwargs)[source]¶
Examples
>>> tokenizer = QuesNetTokenizer(meta=['knowledge'])
>>> test_items = [{"ques_content": "$\triangle A B C$ 的内角为 $A, \quad B, $\FigureID{test_id}$",
... "knowledge": "['*', '-', '/']"}, {"ques_content": "$\triangle A B C$ 的内角为 $A, \quad B",
... "knowledge": "['*', '-', '/']"}]
>>> tokenizer.set_vocab(test_items,
... trim_min_count=1, key=lambda x: x["ques_content"], silent=True)
>>> tokenizer.set_meta_vocab(test_items, silent=True)
>>> token_items = [tokenizer(i, key=lambda x: x["ques_content"]) for i in test_items]
>>> print(token_items[0].keys())
dict_keys(['seq_idx', 'meta_idx'])
>>> token_items = tokenizer(test_items, key=lambda x: x["ques_content"])
>>> print(len(token_items["seq_idx"]))
2
- set_vocab(items: list, key=<function QuesNetTokenizer.<lambda>>, lower: bool = False, trim_min_count: int = 1, do_tokenize: bool = True, silent=True)[source]¶
- Parameters:
items (list) – can be the list of str, or list of dict
key (function) – determine how to get the text of each item
trim_min_count (int, optional) – the minimum count a word needs to be added to the vocabulary, by default 1
silent –
- classmethod from_pretrained(tokenizer_config_dir, img_dir=None, **kwargs)[source]¶
- Parameters:
tokenizer_config_dir (str) – must contain tokenizer_config.json, vocab.txt and meta_{meta_name}.txt
img_dir (str) – the path of the image directory, by default None
- class EduNLP.Pretrain.quesnet_vec.QuestionLoader(ques: ~EduNLP.Pretrain.quesnet_vec.Lines, tokenizer: ~EduNLP.Pretrain.quesnet_vec.QuesNetTokenizer, pipeline=None, range=None, meta: ~typing.Optional[list] = None, content_key=<function QuestionLoader.<lambda>>, meta_key=<function QuestionLoader.<lambda>>, answer_key=<function QuestionLoader.<lambda>>, option_key=<function QuestionLoader.<lambda>>, skip=0)[source]¶
- class EduNLP.Pretrain.quesnet_vec.PrefetchIter(data, *label, length=None, batch_size=1, shuffle=True)[source]¶
Iterator on data and labels, with states for save and restore.
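An iterator "with states for save and restore" can be pictured as one that exposes its visiting order and current position. `ResumableIter` below is a hypothetical sketch, not PrefetchIter: labels and background prefetching are omitted, and the method names state and load_state are assumptions.

```python
import random


class ResumableIter:
    """Hypothetical batch iterator whose position can be saved and restored."""

    def __init__(self, data, batch_size=1, shuffle=True, seed=0):
        self.data = list(data)
        self.batch_size = batch_size
        self.order = list(range(len(self.data)))
        if shuffle:
            # A seeded shuffle keeps the sketch reproducible.
            random.Random(seed).shuffle(self.order)
        self.pos = 0

    def __iter__(self):
        return self

    def __next__(self):
        if self.pos >= len(self.order):
            raise StopIteration
        idx = self.order[self.pos:self.pos + self.batch_size]
        self.pos += len(idx)
        return [self.data[i] for i in idx]

    def state(self):
        # Everything needed to resume: the visiting order and the position.
        return {"order": list(self.order), "pos": self.pos}

    def load_state(self, state):
        self.order = list(state["order"])
        self.pos = state["pos"]


it = ResumableIter(range(10), batch_size=3, seed=1)
first = next(it)
saved = it.state()           # snapshot taken after the first batch
rest = list(it)              # the remaining batches

resumed = ResumableIter(range(10), batch_size=3, seed=99)
resumed.load_state(saved)    # continue from the snapshot
resumed_rest = list(resumed)
```

Restoring the snapshot into a fresh iterator yields the same remaining batches as the original one, regardless of the seed the new iterator was created with.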
- EduNLP.Pretrain.quesnet_vec.pretrain_embedding_layer(dataset: EmbeddingDataset, ae: AE, lr: float = 0.001, log_step: int = 1, epochs: int = 3, batch_size: int = 4, device=device(type='cpu'))[source]¶
- EduNLP.Pretrain.quesnet_vec.pretrain_quesnet(path, output_dir, img_dir=None, save_embs=False, train_params=None)[source]¶
Pretrain QuesNet.
- Parameters:
path (str) – path of the question file
output_dir (str) – output path
tokenizer (QuesNetTokenizer) – quesnet tokenizer
save_embs (bool, optional) – whether to save pretrained word/image/meta embeddings separately
train_params (dict, optional) – the training parameters and model parameters, by default None
- “n_epochs”: int, default = 1
train param, number of epochs
- “batch_size”: int, default = 6
train param, batch size
- “lr”: float, default = 1e-3
train param, learning rate
- “save_every”: int, default = 0
train param, save steps interval
- “log_steps”: int, default = 10
train param, log steps interval
- “device”: str, default = ‘cpu’
train param, ‘cpu’ or ‘cuda’
- “max_steps”: int, default = 0
train param, stop training when reach max steps
- “emb_size”: int, default = 256
model param, the embedding size of word, figure, meta info
- “feat_size”: int, default = 256
model param, the size of question infer vector
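The defaults listed above can be collected in a dict, with user-supplied keys overriding them. `resolve_train_params` is a hypothetical helper; that pretrain_quesnet merges train_params in exactly this way is an assumption of this sketch.

```python
# Default values copied from the parameter list.
DEFAULT_TRAIN_PARAMS = {
    "n_epochs": 1,
    "batch_size": 6,
    "lr": 1e-3,
    "save_every": 0,
    "log_steps": 10,
    "device": "cpu",
    "max_steps": 0,
    "emb_size": 256,
    "feat_size": 256,
}


def resolve_train_params(train_params=None):
    """Return the defaults with any user-supplied values applied on top."""
    merged = dict(DEFAULT_TRAIN_PARAMS)  # copy so the defaults stay untouched
    merged.update(train_params or {})
    return merged


params = resolve_train_params({"n_epochs": 3, "device": "cuda"})
```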
Examples
>>> tokenizer = QuesNetTokenizer(meta=['know_name'])
>>> items = [{"ques_content": "若复数$z=1+2 i+i^{3}$,则$|z|=$,$\FigureID{000004d6-0479-11ec-829b-797d5eb43535}$",
... "ques_id": "726cdbec-33a9-11ec-909c-98fa9b625adb",
... "know_name": "['代数', '集合', '集合的相等']"
... }]
>>> tokenizer.set_vocab(items, key=lambda x: x['ques_content'], trim_min_count=1, silent=True)
>>> tokenizer.set_meta_vocab(items, silent=True)
>>> pretrain_quesnet('./data/standard_luna_data.json', './testQuesNet', tokenizer)