Word features

OpenNMT supports additional features on source and target words in the form of discrete labels.

  • On the source side, these features act as additional information for the encoder. An embedding is first optimized for each label and is then fed as additional source input alongside the word it annotates.
  • On the target side, these features will be predicted by the network. The decoder is then able to decode a sentence and annotate each word.

To use additional features, directly modify your data by appending labels to each word with the special character │ (unicode character FFE8). There can be an arbitrary number of additional features in the form word│feat1│feat2│...│featN but each word must have the same number of features and in the same order. Source and target data can have a different number of additional features.

data/src-train-case.txt is an example that uses a separate feature to mark the case of each word. Using case as a feature is a way to optimize the word dictionary (no duplicated words like "the" and "The") and gives the system additional information that can be useful to optimize its objective function.

it│C is│l not│l acceptable│l that│l ,│n with│l the│l help│l of│l the│l national│l bureaucracies│l ,│n parliament│C 's│l legislative│l prerogative│l should│l be│l made│l null│l and│l void│l by│l means│l of│l implementing│l provisions│l whose│l content│l ,│n purpose│l and│l extent│l are│l not│l laid│l down│l in│l advance│l .│n

You can generate this case feature with OpenNMT's tokenization script and the -case_feature flag.
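
For instance, assuming the tokenizer script lives at tools/tokenize.lua and the raw source file is data/src-train.txt (both paths are assumptions for illustration), the invocation could look like:

# annotate each word with a case feature while tokenizing
th tools/tokenize.lua -case_feature < data/src-train.txt > data/src-train-case.txt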

Time-shifting

By default, word features on the target side are automatically shifted relative to the words so that their prediction directly depends on the word they annotate. This way, the decoder architecture is similar to an RNN-based sequence tagger, with the output at timestep t being the tag of the input at timestep t.

More precisely, at timestep t:

  • the inputs are w_t and f_(t-1)
  • the outputs are w_(t+1) and f_t

To reuse available vocabulary, f_0 is set to the end of sentence token </s>.
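
For example, with the sentence above (it│C is│l not│l acceptable│l ...), the shifted inputs and outputs at the first timesteps would be:

timestep:      1      2      3
input word:    it     is     not
input feat:    </s>   C      l
output word:   is     not    acceptable
output feat:   C      l      l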

Vocabularies

By default, the feature vocabularies are unlimited in size. Depending on the type of features you are using, you may want to limit their vocabulary during the preprocessing with the -src_vocab_size and -tgt_vocab_size options in the format word_vocab_size[ feat1_vocab_size[ feat2_vocab_size[ ...]]]. For example:

# unlimited source features vocabulary size
-src_vocab_size 50000

# first feature vocabulary is limited to 60, others are unlimited
-src_vocab_size 50000 60

# second feature vocabulary is limited to 100, others are unlimited
-src_vocab_size 50000 0 100

# limit the vocabulary size of both the first and second features
-src_vocab_size 50000 60 100

You can similarly use -src_words_min_frequency and -tgt_words_min_frequency to limit vocabulary by frequency instead of absolute size.
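
Presumably these options follow the same list format; a sketch with arbitrary thresholds:

# no frequency threshold on source words, but values of the first
# feature must occur at least 10 times
-src_words_min_frequency 0 10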

Like words, word feature vocabularies can be reused across datasets with the -features_vocabs_prefix option. For example, if the preprocessing generates these feature dictionaries:

  • data/demo.source_feature_1.dict
  • data/demo.source_feature_2.dict
  • data/demo.source_feature_3.dict

you have to set -features_vocabs_prefix data/demo as a command line option.
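
For example, a full preprocessing command reusing these dictionaries could look like this sketch (all file paths other than data/demo are placeholders):

th preprocess.lua -train_src data/src-train.txt -train_tgt data/tgt-train.txt \
-valid_src data/src-val.txt -valid_tgt data/tgt-val.txt \
-save_data data/demo -features_vocabs_prefix data/demo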

Embeddings

By default, the size of each feature embedding is automatically computed based on the number of values the feature takes. This default size reduction works well for features with few values, such as case or POS.

For other features, you can manually choose the embedding size with the -src_word_vec_size and -tgt_word_vec_size options. They behave similarly to -src_vocab_size, taking a list of embedding sizes: word_vec_size[ feat1_vec_size[ feat2_vec_size[ ...]]].
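
For example (the sizes are arbitrary):

# 500-dimensional word embeddings, with 10- and 20-dimensional
# embeddings for the first and second features
-src_word_vec_size 500 10 20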

By default, the feature embeddings are concatenated with each other. You can instead choose to sum them by setting -feat_merge sum. Finally, the resulting merged embedding is concatenated to the word embedding.

Warning

In the sum case, each feature embedding must have the same dimension. You can use the -feat_vec_size option to set this common embedding size.
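
For example, to sum all feature embeddings into a common 20-dimensional space:

# merge features by summation; -feat_vec_size sets the shared dimension
-feat_merge sum -feat_vec_size 20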

During decoding, beam search is applied only to the target word space, not to the word features. When a beam path is complete, the associated features are selected along that path.