
Embeddings

Word embeddings are learned using a lookup table. Each word is assigned a random vector in this table, which is simply updated with the gradients coming from the network.
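For instance, this is essentially what a lookup table does in Torch's nn package (a minimal sketch; the vocabulary size, dimensions and learning rate are illustrative):

require 'nn'

-- A table mapping 10,000 word ids to 500-dimensional vectors,
-- randomly initialized.
local lookup = nn.LookupTable(10000, 500)

-- A sequence of 3 word ids selects 3 rows of the table.
local input = torch.LongTensor({4, 8, 15})
local vectors = lookup:forward(input) -- 3x500 tensor

-- Backpropagation only updates the rows of the words that were seen.
lookup:zeroGradParameters()
lookup:backward(input, torch.randn(3, 500))
lookup:updateParameters(0.1) -- plain SGD step with learning rate 0.1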

Pretrained

When training with a small amount of data, performance can be improved by starting with pretrained embeddings. The arguments -pre_word_vecs_dec and -pre_word_vecs_enc can be used to specify these files.

The pretrained embeddings must be manually constructed Torch serialized tensors that correspond to the source and target dictionary files. For example:

require 'torch'

-- Size of the dictionary (including special tokens) and of each embedding.
local vocab_size = 50004
local embedding_size = 500

-- Here the values are random; fill the tensor with actual pretrained values.
local embeddings = torch.Tensor(vocab_size, embedding_size):uniform()

torch.save('enc_embeddings.t7', embeddings)

where embeddings[i] is the embedding of the i-th word in the vocabulary.
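The saved tensors can then be passed to the training script, for example (a sketch; the data and model paths are illustrative, and dec_embeddings.t7 is assumed to have been built the same way against the target dictionary):

th train.lua -data data/demo-train.t7 -save_model demo-model \
             -pre_word_vecs_enc enc_embeddings.t7 -pre_word_vecs_dec dec_embeddings.t7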

To automate this process, OpenNMT provides a script tools/embeddings.lua that can download pretrained embeddings from Polyglot, or convert trained embeddings from word2vec, GloVe or FastText to match the word vocabularies generated by preprocess.lua. Supported formats are:

  • word2vec-bin (default): binary format generated by word2vec.
  • word2vec-txt: textual word2vec format - starts with a header line containing the number of words and the embedding size, followed by one line per embedding: the first token is the word, and the following fields are the embedding values (see the example after this list).
  • glove: text format - same format as word2vec-txt but without the header line.
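As an illustration, a word2vec-txt file holding 3 words with 4-dimensional embeddings could look like this (the values are made up); the glove format is identical minus the first line:

3 4
the 0.418 0.250 -0.412 0.122
of 0.709 0.571 -0.472 0.180
to 0.680 -0.039 0.302 -0.178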

Note

The script requires the lua-zlib package.

For example, to generate pretrained English word embeddings:

th tools/embeddings.lua -lang en -dict_file data/demo.src.dict -save_data data/demo-src-emb

Note

Language codes are Polyglot's Wikipedia Language Codes.

Or to map pretrained word2vec vectors to the built vocabulary:

th tools/embeddings.lua -embed_type word2vec -embed_file data/GoogleNews-vectors-negative300.bin -dict_file data/demo.src.dict \
                        -save_data data/demo-src-emb

Tip

If vocabulary words are not found as-is in the embeddings file, you can use the -approximate option to also look for uppercase variants and variants without possible joiner marks. You can dump the words that were not found by setting the -save_unknown_dict parameter.
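For example, to retry the previous conversion with variant lookup and dump the words that were not found (a sketch, assuming both options are boolean flags):

th tools/embeddings.lua -embed_type word2vec -embed_file data/GoogleNews-vectors-negative300.bin \
                        -dict_file data/demo.src.dict -save_data data/demo-src-emb \
                        -approximate -save_unknown_dict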

Fixed

By default these embeddings will be updated during training, but they can be held fixed using the -fix_word_vecs_enc and -fix_word_vecs_dec options. These options can be enabled or disabled during a retraining.
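For example (a sketch; the paths are illustrative and the exact value syntax of the option may differ between versions):

th train.lua -data data/demo-train.t7 -save_model demo-model \
             -pre_word_vecs_enc enc_embeddings.t7 -fix_word_vecs_enc 1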

Tip

When using pretrained word embeddings, if you declare a larger -word_vec_size then the difference is uniformly initialized, and you can use -fix_word_vecs_enc pretrained (or -fix_word_vecs_dec pretrained) to fix the pretrained part and optimize only the remaining part.
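For example, 500-dimensional pretrained vectors can be extended with 100 trainable dimensions like this (a sketch; paths are illustrative):

th train.lua -data data/demo-train.t7 -save_model demo-model \
             -word_vec_size 600 -pre_word_vecs_enc enc_embeddings.t7 \
             -fix_word_vecs_enc pretrained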

Extraction

The tools/extract_embeddings.lua script can be used to extract a trained model's word embeddings into text files. These can then easily be converted to other formats for visualization or further processing.
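For example (a sketch; the -model option name is an assumption and the model file name is illustrative, see th tools/extract_embeddings.lua -h for the exact interface):

th tools/extract_embeddings.lua -model demo-model_final.t7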