EduNLP.Vector¶

EduNLP.Vector.t2v¶

class EduNLP.Vector.t2v.T2V(model: str, *args, **kwargs)[source]¶

T2V converts a list of tokens into vectors. If you already have a model file of your own, use T2V directly. Otherwise, get_pretrained_t2v is the more convenient route: it returns a ready-to-use T2V for a named pretrained model, so you do not need a model of your own.

Parameters: model (str) – the model type, e.g. d2v, rnn, lstm, gru, elmo

Examples

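>>> import os
>>> from EduNLP.Vector.t2v import T2V, get_pretrained_model_info
>>> # get_data and path_append are helper functions assumed to be imported already
>>> # The question text is Chinese; roughly: "Given formula ... and formula ..., as shown in figure ..., if x, y satisfy the constraints ..., the maximum of z = x + 7y is ..."
>>> item = [{'ques_content': r'有公式$\FormFigureID{wrong1?}$和公式$\FormFigureBase64{wrong2?}$，如图$\FigureID{088f15ea-8b7c-11eb-897e-b46bfc50aa29}$,若$x,y$满足约束条件$\SIFSep$，则$z=x+7 y$的最大值为$\SIFBlank$'}]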
>>> model_dir = "examples/test_model/d2v"
>>> url, model_name, *args = get_pretrained_model_info('d2v_test_256')
>>> (); path = get_data(url, model_dir); () 
(...)
>>> path = path_append(path, os.path.basename(path) + '.bin', to_str=True)
>>> t2v = T2V('d2v',filepath=path)
>>> print(t2v(item))
[array([...dtype=float32)]

infer_vector(items, *args, **kwargs)[source]¶

infer_tokens(items, *args, **kwargs)[source]¶

property vector_size: int¶

EduNLP.Vector.t2v.get_pretrained_model_info(name)[source]¶

EduNLP.Vector.t2v.get_all_pretrained_models()[source]¶

EduNLP.Vector.t2v.get_pretrained_t2v(name, model_dir=MODEL_DIR, **kwargs)[source]¶

A convenient way to convert a token list to vectors: select a pretrained model by name and get back a ready-to-use T2V.

Parameters

name (str) – the name of the pretrained model, e.g. d2v_math_300, w2v_math_300, elmo_math_2048, bert_math_768, bert_taledu_768, disenq_math_256, quesnet_math_512
model_dir (str) – the directory where the model is stored; defaults to MODEL_DIR ('~/.EduNLP/model')

Returns

t2v model

Return type

T2V

Examples

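>>> from EduNLP.Vector.t2v import get_pretrained_t2v
>>> # The question text is Chinese; roughly: "Given formula ... and formula ..., as shown in figure ..., if x, y satisfy the constraints ..., the maximum of z = x + 7y is ..."
>>> item = [{'ques_content': r'有公式$\FormFigureID{wrong1?}$和公式$\FormFigureBase64{wrong2?}$，如图$\FigureID{088f15ea-8b7c-11eb-897e-b46bfc50aa29}$,若$x,y$满足约束条件$\SIFSep$，则$z=x+7 y$的最大值为$\SIFBlank$'}]
>>> t2v = get_pretrained_t2v("d2v_test_256", "examples/test_model/d2v")
>>> print(t2v(item))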
[array([...dtype=float32)]

EduNLP.Vector.disenqnet¶

class EduNLP.Vector.disenqnet.disenqnet.DisenQModel(pretrained_dir, device='cpu')[source]¶

infer_vector(items: dict, vector_type=None, **kwargs) → Tensor[source]¶

Parameters: vector_type (str) – selects which tensor to return. The default, None, returns both: (k_hidden, i_hidden). With vector_type="k", only k_hidden is returned; with vector_type="i", only i_hidden is returned.
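The selection rule described for vector_type can be sketched in plain Python. This is only an illustration of the documented behavior, not the DisenQModel implementation: select_hidden is a hypothetical helper, the two hidden states are stand-in strings rather than tensors, and raising ValueError for any other value is an assumption.

```python
def select_hidden(k_hidden, i_hidden, vector_type=None):
    """Pick which hidden state(s) to return, following the documented rule."""
    if vector_type is None:
        # Default: return both hidden states together.
        return k_hidden, i_hidden
    if vector_type == "k":
        return k_hidden
    if vector_type == "i":
        return i_hidden
    # Assumption: anything else is treated as an error.
    raise ValueError(f"unsupported vector_type: {vector_type!r}")


# Stand-in values; a real model would produce tensors here.
both = select_hidden("K", "I")
only_k = select_hidden("K", "I", vector_type="k")
only_i = select_hidden("K", "I", vector_type="i")
```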

infer_tokens(items: dict, **kwargs) → Tensor[source]¶

property vector_size¶

EduNLP.Vector.quesnet¶

class EduNLP.Vector.quesnet.quesnet.QuesNetModel(pretrained_dir, img_dir=None, device='cpu', **kwargs)[source]¶

infer_vector(items: Union[dict, list]) → Tensor[source]¶

Get the question embedding with QuesNet.

Parameters: items – the encoded output of the tokenizer

infer_tokens(items: Union[dict, list]) → Tensor[source]¶

Get the token embeddings with QuesNet.

Parameters: items – the encoded output of the tokenizer
Returns: word_embs + meta_emb
Return type: torch.Tensor

property vector_size¶

EduNLP.Vector.elmo_vec¶

class EduNLP.Vector.elmo_vec.ElmoModel(pretrained_dir: str)[source]¶

infer_vector(items: Tuple[dict, List[dict]], *args, **kwargs) → Tensor[source]¶

infer_tokens(items, *args, **kwargs) → Tensor[source]¶

property vector_size¶

EduNLP.Vector.gensim_vec¶

class EduNLP.Vector.gensim_vec.W2V(filepath, method=None, binary=None)[source]¶

Uses the gensim library's FastText, Word2Vec and KeyedVectors models to convert words to vectors.

Parameters

filepath – path to the pretrained model file
method (str) – fasttext selects FastText; any other value selects Word2Vec
binary (bool) –

key_to_index(word)[source]¶

property vectors¶

property vector_size¶

infer_vector(items, agg='mean', *args, **kwargs) → list[source]¶
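The agg='mean' default suggests that per-token vectors are averaged into one vector per item. The sketch below illustrates that idea in plain Python. It is an assumption about what the aggregation means, not the W2V implementation: mean_pool and the lookup table are hypothetical, and skipping tokens that are missing from the table is a choice made for this example only.

```python
def mean_pool(tokens, table):
    """Average the vectors of all tokens that appear in the lookup table."""
    vectors = [table[token] for token in tokens if token in table]
    if not vectors:
        return []
    dim = len(vectors[0])
    # Component-wise mean over the collected token vectors.
    return [sum(vec[d] for vec in vectors) / len(vectors) for d in range(dim)]


# Hypothetical two-dimensional word vectors.
table = {"x": [1.0, 3.0], "y": [3.0, 5.0]}
item_vector = mean_pool(["x", "y", "unknown"], table)
```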

infer_tokens(items, *args, **kwargs) → list[source]¶

class EduNLP.Vector.gensim_vec.BowLoader(filepath)[source]¶

Uses gensim's doc2bow to represent an item as a bag of words. doc2bow behaves as follows.

Convert document (a list of words) into the bag-of-words format = list of (token_id, token_count) 2-tuples. Each word is assumed to be a tokenized and normalized string (either unicode or utf8-encoded). No further preprocessing is done on the words in document; apply tokenization, stemming etc. before calling this method.

If allow_update is set, then also update dictionary in the process: create ids for new words. At the same time, update document frequencies – for each word appearing in this document, increase its document frequency (self.dfs) by one.

If allow_update is not set, this function is const, aka read-only.
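The behavior described above can be mimicked in a few lines of plain Python. This is a minimal sketch of the described semantics, not gensim's code: doc2bow_sketch is a hypothetical function, and token2id and dfs are ordinary dicts standing in for the dictionary's internal state.

```python
from collections import Counter


def doc2bow_sketch(document, token2id, dfs, allow_update=False):
    """Return sorted (token_id, token_count) pairs for a tokenized document."""
    counts = Counter(document)
    if allow_update:
        # Create ids for unseen words (sorted for a deterministic assignment).
        for word in sorted(counts):
            if word not in token2id:
                token2id[word] = len(token2id)
        # Each distinct word in this document raises its document frequency by one.
        for word in counts:
            word_id = token2id[word]
            dfs[word_id] = dfs.get(word_id, 0) + 1
    # Words without an id are skipped; the result is sorted by token id.
    return sorted((token2id[w], n) for w, n in counts.items() if w in token2id)


token2id, dfs = {"x": 0, "y": 1}, {}
read_only = doc2bow_sketch(["x", "y", "x", "z"], token2id, dfs)
updated = doc2bow_sketch(["x", "z"], token2id, dfs, allow_update=True)
```

In the read-only call the unknown word "z" is ignored and neither dict changes; in the second call "z" receives a new id and the document frequencies are updated.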

infer_vector(item, return_vec=False)[source]¶

property vector_size¶

class EduNLP.Vector.gensim_vec.TfidfLoader(filepath)[source]¶

This module implements functionality related to the Term Frequency-Inverse Document Frequency (TF-IDF, https://en.wikipedia.org/wiki/Tf%E2%80%93idf) vector space bag-of-words model.
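As a small worked illustration of TF-IDF weighting, the sketch below multiplies each raw term count by log2(N / df), where N is the number of documents and df is the number of documents containing the term. This is one common variant chosen for the example; the exact weighting and normalization used by TfidfLoader are not stated here, and tfidf_sketch is a hypothetical helper.

```python
import math
from collections import Counter


def tfidf_sketch(docs):
    """Weight each term count by log2(N / df) for a list of tokenized documents."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each word occur?
    df = Counter(word for doc in docs for word in set(doc))
    weighted = []
    for doc in docs:
        tf = Counter(doc)
        weighted.append({w: tf[w] * math.log2(n_docs / df[w]) for w in tf})
    return weighted


docs = [["a", "b"], ["a", "c"], ["a", "b", "b"]]
weights = tfidf_sketch(docs)
```

A word that occurs in every document, such as "a" here, gets weight zero, while rarer words are weighted up.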

infer_vector(item, return_vec=False)[source]¶

property vector_size¶

class EduNLP.Vector.gensim_vec.D2V(filepath, method='d2v')[source]¶

A wrapper that bundles the d2v, bow and tfidf methods behind one interface.

Parameters

filepath –
method (str) – one of d2v, bow, tfidf
item –

Returns

d2v model

Return type

D2V

property vector_size¶

infer_vector(items, *args, **kwargs) → list[source]¶

infer_tokens(item, *args, **kwargs) → ...[source]¶

EduNLP.Vector.embedding¶

class EduNLP.Vector.embedding.Embedding(w2v: (W2V, tuple, list, dict, None), freeze=True, device=None, **kwargs)[source]¶

infer_token_vector(items: List[List[str]], indexing=True) → tuple[source]¶

indexing(items: List[List[str]], padding=False, indexing=True) → tuple[source]¶

Parameters

items (list of list of str(word/token)) –
padding (bool) – whether to pad the returned lists with the default pad_val so that every item in items has the same length
indexing (bool) –

Returns

token_idx (list of list of int) – the token indices of each item
token_len (list of int) – the number of tokens in each item
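The relationship between token_idx, token_len and padding can be sketched as follows. This is an illustration under stated assumptions, not the Embedding implementation: index_and_pad, the vocab dict, and the pad_val and unk_val values are hypothetical, and reporting the lengths before padding is an assumption.

```python
def index_and_pad(items, vocab, pad_val=0, unk_val=1, padding=False):
    """Map tokens to indices and optionally pad every item to the same length."""
    token_idx = [[vocab.get(token, unk_val) for token in item] for item in items]
    # Lengths are recorded before any padding is applied.
    token_len = [len(indices) for indices in token_idx]
    if padding:
        max_len = max(token_len, default=0)
        token_idx = [
            indices + [pad_val] * (max_len - len(indices)) for indices in token_idx
        ]
    return token_idx, token_len


# Hypothetical vocabulary; indices 0 and 1 are reserved for padding and unknowns.
vocab = {"a": 2, "b": 3, "c": 4}
items = [["a", "b", "c"], ["a", "zzz"]]
padded_idx, lengths = index_and_pad(items, vocab, padding=True)
raw_idx, _ = index_and_pad(items, vocab)
```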

set_device(device)[source]¶