EduNLP.Vector

EduNLP.Vector.t2v

class EduNLP.Vector.t2v.T2V(model: str, *args, **kwargs)

Converts token lists to vectors. If you already have a model of your own, use T2V directly. Otherwise, get_pretrained_t2v is the easier route: it returns a T2V built from a named pretrained model, so you do not need to supply a model yourself.

Parameters

model (str) – the model type, e.g. d2v, rnn, lstm, gru, elmo.
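The model string selects which backend does the work. A minimal sketch of that kind of name-based dispatch follows; the classes DummyD2V, DummyElmo, and TinyT2V and the BACKENDS registry are hypothetical stand-ins for illustration, not EduNLP's own code.

```python
# Hypothetical stand-ins: real backends would load a model file.
class DummyD2V:
    def __init__(self, filepath):
        self.filepath = filepath

    def infer_vector(self, items):
        # Return one fixed-size placeholder vector per item.
        return [[0.0, 0.0, 0.0] for _ in items]


class DummyElmo(DummyD2V):
    pass


# Registry mapping a model-type string to the class that handles it.
BACKENDS = {"d2v": DummyD2V, "elmo": DummyElmo}


class TinyT2V:
    """Chooses a backend by name and forwards calls to it."""

    def __init__(self, model, *args, **kwargs):
        if model not in BACKENDS:
            raise ValueError(f"unknown model type: {model!r}")
        self.backend = BACKENDS[model](*args, **kwargs)

    def __call__(self, items):
        return self.backend.infer_vector(items)


t2v = TinyT2V("d2v", filepath="model.bin")
vectors = t2v([["x", "+", "y"], ["z"]])
print(len(vectors), len(vectors[0]))
```

Extra positional and keyword arguments are passed straight through to the chosen backend, which mirrors the `*args, **kwargs` in the T2V signature.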

Examples

>>> import os
>>> from EduData import get_data  # assumed source of get_data
>>> from longling import path_append  # assumed source of path_append
>>> from EduNLP.Vector.t2v import T2V, get_pretrained_model_info
>>> item = [{'ques_content': r'有公式$\FormFigureID{wrong1?}$和公式$\FormFigureBase64{wrong2?}$,'
...          r'如图$\FigureID{088f15ea-8b7c-11eb-897e-b46bfc50aa29}$,若$x,y$满足约束条件$\SIFSep$,则$z=x+7 y$的最大值为$\SIFBlank$'}]
>>> model_dir = "examples/test_model/d2v"
>>> url, model_name, *args = get_pretrained_model_info('d2v_test_256')
>>> (); path = get_data(url, model_dir); ()
(...)
>>> path = path_append(path, os.path.basename(path) + '.bin', to_str=True)
>>> t2v = T2V('d2v', filepath=path)
>>> print(t2v(item))
[array([...dtype=float32)]
infer_vector(items, *args, **kwargs)
infer_tokens(items, *args, **kwargs)
property vector_size: int
EduNLP.Vector.t2v.get_pretrained_model_info(name)
EduNLP.Vector.t2v.get_all_pretrained_models()
EduNLP.Vector.t2v.get_pretrained_t2v(name, model_dir='~/.EduNLP/model', **kwargs)

A convenient way to convert token lists to vectors: it returns a T2V for the pretrained model you name.

Parameters
  • name (str) – the pretrained model to use, e.g. d2v_math_300, w2v_math_300, elmo_math_2048, bert_math_768, bert_taledu_768, disenq_math_256, quesnet_math_512

  • model_dir (str) – the directory where the model is stored; defaults to MODEL_DIR ('~/.EduNLP/model')
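The default model_dir starts with `~`, so the concrete directory depends on the home directory of the user running the code. A minimal sketch of how such a path resolves, assuming the usual `os.path.expanduser` behaviour; the helper resolve_model_dir and the constant DEFAULT_MODEL_DIR are hypothetical names used only for this illustration.

```python
import os

# Assumed default, copied from the parameter description above.
DEFAULT_MODEL_DIR = "~/.EduNLP/model"


def resolve_model_dir(model_dir=DEFAULT_MODEL_DIR):
    """Expand a leading '~' and return an absolute path."""
    return os.path.abspath(os.path.expanduser(model_dir))


resolved = resolve_model_dir()
print(resolved)  # machine-dependent, e.g. somewhere under your home directory
```

Passing an explicit directory, as the example below does with "examples/test_model/d2v", avoids any dependence on the home directory.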

Returns

t2v model

Return type

T2V

Examples

>>> from EduNLP.Vector.t2v import get_pretrained_t2v
>>> item = [{'ques_content': r'有公式$\FormFigureID{wrong1?}$和公式$\FormFigureBase64{wrong2?}$,'
...          r'如图$\FigureID{088f15ea-8b7c-11eb-897e-b46bfc50aa29}$,若$x,y$满足约束条件$\SIFSep$,则$z=x+7 y$的最大值为$\SIFBlank$'}]
>>> t2v = get_pretrained_t2v("d2v_test_256", "examples/test_model/d2v")
>>> print(t2v(item))
[array([...dtype=float32)]

EduNLP.Vector.disenqnet

class EduNLP.Vector.disenqnet.disenqnet.DisenQModel(pretrained_dir, device='cpu')
infer_vector(items: dict, vector_type=None, **kwargs) → Tensor
Parameters

vector_type (str) – which item tensor to return. The default, None, returns both as (k_hidden, i_hidden); vector_type="k" returns only k_hidden; vector_type="i" returns only i_hidden.
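The three cases of vector_type can be sketched as a small selection function. This is an illustration under stated assumptions: select_hidden is a hypothetical helper, plain lists stand in for tensors, and raising ValueError for any other value is this sketch's choice, not documented behaviour.

```python
def select_hidden(k_hidden, i_hidden, vector_type=None):
    """Pick which hidden representation(s) to return.

    The names k_hidden and i_hidden follow the parameter description;
    the real method works on tensors rather than lists.
    """
    if vector_type is None:
        return k_hidden, i_hidden
    if vector_type == "k":
        return k_hidden
    if vector_type == "i":
        return i_hidden
    # Assumption: unknown values are rejected in this sketch.
    raise ValueError(f'vector_type must be "k", "i" or None, got {vector_type!r}')


k, i = [0.1, 0.2], [0.3, 0.4]
both = select_hidden(k, i)
only_k = select_hidden(k, i, "k")
only_i = select_hidden(k, i, "i")
```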

infer_tokens(items: dict, **kwargs) → Tensor
property vector_size

EduNLP.Vector.quesnet

class EduNLP.Vector.quesnet.quesnet.QuesNetModel(pretrained_dir, img_dir=None, device='cpu', **kwargs)
infer_vector(items: Union[dict, list]) → Tensor

Get the question embedding with QuesNet.

Parameters

items – encodings produced by the tokenizer

infer_tokens(items: Union[dict, list]) → Tensor

Get the token embeddings with QuesNet.

Parameters

items – encodings produced by the tokenizer

Returns

word_embs + meta_emb

Return type

torch.Tensor

property vector_size

EduNLP.Vector.elmo_vec

class EduNLP.Vector.elmo_vec.ElmoModel(pretrained_dir: str)
infer_vector(items: Tuple[dict, List[dict]], *args, **kwargs) → Tensor
infer_tokens(items, *args, **kwargs) → Tensor
property vector_size

EduNLP.Vector.gensim_vec

class EduNLP.Vector.gensim_vec.W2V(filepath, method=None, binary=None)

Uses the gensim library, which provides FastText, Word2Vec, and KeyedVectors, to convert words to vectors.

Parameters
  • filepath – path to the pretrained model file

  • method (str) – "fasttext" for a FastText model; any other value selects Word2Vec

  • binary (bool) –

key_to_index(word)
property vectors
property vector_size
infer_vector(items, agg='mean', *args, **kwargs) → list
infer_tokens(items, *args, **kwargs) → list
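The default agg='mean' on infer_vector suggests that per-token vectors are averaged into a single item vector. A minimal sketch of that idea, assuming "mean" means an element-wise average; the aggregate function is hypothetical, uses plain lists, and supports only "mean".

```python
def aggregate(token_vectors, agg="mean"):
    """Combine per-token vectors into one item vector.

    Assumption: "mean" is an element-wise average. Other values are
    rejected here because the source does not describe them.
    """
    if not token_vectors:
        raise ValueError("need at least one token vector")
    if agg != "mean":
        raise ValueError(f"unsupported agg: {agg!r}")
    # zip(*rows) walks the vectors column by column.
    return [sum(column) / len(token_vectors) for column in zip(*token_vectors)]


tokens = [[1.0, 2.0], [3.0, 6.0]]
item_vector = aggregate(tokens)
print(item_vector)  # [2.0, 4.0]
```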
class EduNLP.Vector.gensim_vec.BowLoader(filepath)

Uses gensim's doc2bow method, which works as follows.

Convert document (a list of words) into the bag-of-words format = list of (token_id, token_count) 2-tuples. Each word is assumed to be a tokenized and normalized string (either unicode or utf8-encoded). No further preprocessing is done on the words in document; apply tokenization, stemming etc. before calling this method.

If allow_update is set, then also update dictionary in the process: create ids for new words. At the same time, update document frequencies – for each word appearing in this document, increase its document frequency (self.dfs) by one.

If allow_update is not set, this function is const, aka read-only.
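The doc2bow behaviour described above can be sketched in plain Python. TinyDictionary is a hypothetical, simplified stand-in written for this illustration and is not gensim's implementation; in particular, skipping unknown words in read-only mode and assigning new ids in sorted word order are choices made by this sketch.

```python
from collections import Counter


class TinyDictionary:
    """Simplified stand-in for a gensim-style dictionary."""

    def __init__(self):
        self.token2id = {}  # word -> integer id
        self.dfs = {}       # id -> number of documents containing the word

    def doc2bow(self, document, allow_update=False):
        counts = Counter(document)
        if allow_update:
            for word in sorted(counts):
                if word not in self.token2id:
                    # Create an id for a word seen for the first time.
                    self.token2id[word] = len(self.token2id)
                token_id = self.token2id[word]
                # Each word counts once per document, however often it occurs.
                self.dfs[token_id] = self.dfs.get(token_id, 0) + 1
        # Unknown words are skipped; the result is sorted by token id.
        return sorted(
            (self.token2id[word], count)
            for word, count in counts.items()
            if word in self.token2id
        )


dictionary = TinyDictionary()
first = dictionary.doc2bow(["x", "y", "x"], allow_update=True)
second = dictionary.doc2bow(["y", "z"])  # read-only: "z" gets no id
print(first)   # [(0, 2), (1, 1)]
print(second)  # [(1, 1)]
```

The second call leaves token2id and dfs untouched, which is the read-only behaviour the description refers to.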

infer_vector(item, return_vec=False)
property vector_size
class EduNLP.Vector.gensim_vec.TfidfLoader(filepath)

This module implements functionality related to the Term Frequency - Inverse Document Frequency <https://en.wikipedia.org/wiki/Tf%E2%80%93idf> vector space bag-of-words models.
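As a rough illustration of TF-IDF weighting on a bag-of-words vector, the sketch below uses one common variant: term count times log2(N / df), scaled to unit length. Whether TfidfLoader uses exactly this weighting is not stated here; the tfidf function and the sample document frequencies are hypothetical.

```python
import math


def tfidf(bow, dfs, num_docs):
    """Weight a bag-of-words vector by count * log2(num_docs / df).

    bow:  list of (token_id, count) pairs
    dfs:  dict mapping token_id to its document frequency
    The result is scaled to unit Euclidean length.
    """
    weights = [
        (token_id, count * math.log2(num_docs / dfs[token_id]))
        for token_id, count in bow
    ]
    # Terms that occur in every document get weight 0 and are dropped.
    weights = [(t, w) for t, w in weights if w > 0]
    norm = math.sqrt(sum(w * w for _, w in weights))
    return [(t, w / norm) for t, w in weights] if norm else []


dfs = {0: 1, 1: 4, 2: 2}  # hypothetical document frequencies
vector = tfidf([(0, 2), (1, 1), (2, 1)], dfs, num_docs=4)
print(vector)
```

Token 1 appears in all four documents, so it carries no information and drops out; token 0 is both rarer and more frequent in this document, so it dominates the result.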

infer_vector(item, return_vec=False)
property vector_size
class EduNLP.Vector.gensim_vec.D2V(filepath, method='d2v')

A wrapper class that covers the d2v, bow, and tfidf methods.

Parameters
  • filepath

  • method (str) – one of d2v, bow, tfidf

  • item

Returns

d2v model

Return type

D2V

property vector_size
infer_vector(items, *args, **kwargs) → list
infer_tokens(item, *args, **kwargs) → ...

EduNLP.Vector.embedding

class EduNLP.Vector.embedding.Embedding(w2v: (W2V, tuple, list, dict, None), freeze=True, device=None, **kwargs)
infer_token_vector(items: List[List[str]], indexing=True) → tuple
indexing(items: List[List[str]], padding=False, indexing=True) → tuple
Parameters
  • items (list of list of str(word/token)) –

  • padding (bool) – whether to pad the returned lists with the default pad_val so that all items have the same length

  • indexing (bool) –

Returns

  • token_idx (list of list of int) – the token indices of each item

  • token_len (list of int) – the number of tokens in each item
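The indexing step and its two return values can be sketched as follows. This is an illustration only: index_tokens, the vocab mapping, and the pad_val and unk_val defaults are hypothetical, and how the real class derives its vocabulary from the w2v argument is not shown.

```python
def index_tokens(items, vocab, padding=False, pad_val=0, unk_val=1):
    """Map tokens to integer ids and report each item's original length.

    vocab, pad_val and unk_val are assumptions made for this sketch.
    """
    token_idx = [[vocab.get(tok, unk_val) for tok in item] for item in items]
    token_len = [len(item) for item in items]
    if padding:
        # Right-pad every row to the length of the longest item.
        max_len = max(token_len, default=0)
        token_idx = [row + [pad_val] * (max_len - len(row)) for row in token_idx]
    return token_idx, token_len


vocab = {"x": 2, "+": 3, "y": 4}
idx, lengths = index_tokens([["x", "+", "y"], ["y", "?"]], vocab, padding=True)
print(idx)      # [[2, 3, 4], [4, 1, 0]]
print(lengths)  # [3, 2]
```

Returning the original lengths alongside the padded indices lets downstream code tell real tokens from padding.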

set_device(device)