Unknown words

The default translation mode allows the model to produce the <unk> symbol when it is not sure of the specific target word.

Often these <unk> symbols correspond to proper names that can be directly transposed between languages. The -replace_unk option substitutes each <unk> with the source word that has the highest attention weight. The -replace_unk_tagged option does the same, but wraps the token in a ⦅unk:xxxxx⦆ tag.
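The replacement described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the toolkit's implementation: the function name replace_unk, its parameters, and the attention[i][j] layout (weight of target position i on source position j) are assumptions made for this example.

```python
def replace_unk(source_tokens, target_tokens, attention, unk="<unk>", tagged=False):
    """Replace each unknown target token with the most attended source token.

    Assumption for this sketch: attention[i][j] is the weight that target
    position i places on source position j.
    """
    output = []
    for i, token in enumerate(target_tokens):
        if token != unk:
            output.append(token)
            continue
        weights = attention[i]
        # Index of the source token with the highest attention weight.
        best = max(range(len(source_tokens)), key=lambda j: weights[j])
        word = source_tokens[best]
        # The tagged variant wraps the copied word instead of emitting it bare.
        output.append(f"⦅unk:{word}⦆" if tagged else word)
    return output


source = ["Herr", "Schmidt", "schläft"]
target = ["Mr.", "<unk>", "sleeps"]
attention = [
    [0.80, 0.10, 0.10],
    [0.10, 0.85, 0.05],
    [0.05, 0.15, 0.80],
]
plain = replace_unk(source, target, attention)
wrapped = replace_unk(source, target, attention, tagged=True)
```

Here the second target position attends mostly to "Schmidt", so the proper name is copied over unchanged.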

Phrase table

Alternatively, advanced users may prefer to provide a pre-constructed phrase table from an external aligner (such as fast_align) using the -phrase_table option to allow for non-identity replacement.

Instead of copying the source token with the highest attention, the translator looks up a possible translation in the phrase table. The source token is copied only if no valid replacement is found.

The phrase table is a file with one translation per line, in the following format:

source|||target

Here source and target are case-sensitive, and each is a single token.
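A minimal sketch of how such a table could be read and consulted is shown below. The helper names load_phrase_table and lookup_replacement are hypothetical, and two behaviors are assumptions of this example rather than documented facts: surrounding whitespace is stripped, and the first entry wins when a source token appears twice.

```python
def load_phrase_table(lines):
    """Parse 'source|||target' lines into a dict (case-sensitive keys)."""
    table = {}
    for line in lines:
        source, sep, target = line.strip().partition("|||")
        if not sep or not source.strip() or not target.strip():
            continue  # skip blank or malformed lines
        # Assumption: keep the first translation seen for a given source token.
        table.setdefault(source.strip(), target.strip())
    return table


def lookup_replacement(source_token, table):
    """Return the table translation, or copy the source token if none exists."""
    return table.get(source_token, source_token)


table = load_phrase_table(["Haus|||house", "Katze|||cat", "", "malformed line"])
```

With this table, lookup_replacement("Haus", table) yields the table entry, while "haus" falls back to being copied because the match is case-sensitive. In practice the lines would come from the file passed to -phrase_table.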

Workarounds

Several techniques exist to minimize the out-of-vocabulary issue:

  • sub-tokenization, such as BPE or "wordpiece", to simulate an open vocabulary
  • a mixed word/character model, as described in Wu et al. (2016)
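To illustrate the first workaround, the sketch below applies a list of BPE merge operations to a single word. It is a simplified illustration: the merge list is invented for the example, end-of-word markers are omitted, and apply_bpe is a hypothetical name, not a function from any particular toolkit.

```python
def apply_bpe(word, merges):
    """Split a word into subword units by applying merges in priority order.

    merges is an ordered list of symbol pairs; earlier pairs have higher
    priority. Assumption: no end-of-word marker is used in this sketch.
    """
    ranks = {pair: i for i, pair in enumerate(merges)}
    symbols = list(word)
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        # Pick the adjacent pair with the best (lowest) merge rank.
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no learned merge applies any more
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Hypothetical merge list, as if learned from a training corpus.
merges = [("l", "o"), ("lo", "w"), ("e", "r")]
known = apply_bpe("lower", merges)
rare = apply_bpe("lowest", merges)
```

A word never seen in training, such as "lowest" here, still decomposes into units that are in the vocabulary, so the model does not need to emit an unknown-word symbol for it.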