Quickstart

Step 1: Preprocess the data

th preprocess.lua -train_src data/src-train.txt -train_tgt data/tgt-train.txt -valid_src data/src-val.txt -valid_tgt data/tgt-val.txt -save_data data/demo

We will be working with some example data in the data/ folder.

The data consists of parallel source (src) and target (tgt) data containing one sentence per line with tokens separated by a space:

  • src-train.txt
  • tgt-train.txt
  • src-val.txt
  • tgt-val.txt

Validation files are required and are used to evaluate the convergence of the training. The validation set usually contains no more than 5,000 sentences.

$ head -n 3 data/src-train.txt
It is not acceptable that , with the help of the national bureaucracies , Parliament 's legislative prerogative should be made null and void by means of implementing provisions whose content , purpose and extent are not laid down in advance .
Federal Master Trainer and Senior Instructor of the Italian Federation of Aerobic Fitness , Group Fitness , Postural Gym , Stretching and Pilates; from 2004 , he has been collaborating with Antiche Terme as personal Trainer and Instructor of Stretching , Pilates and Postural Gym .
" Two soldiers came up to me and told me that if I refuse to sleep with them , they will kill me . They beat me and ripped my clothes .

After running the preprocessing, the following files are generated:

  • demo.src.dict: Dictionary of source vocabulary to index mappings.
  • demo.tgt.dict: Dictionary of target vocabulary to index mappings.
  • demo-train.t7: Serialized Torch file containing the vocabulary, training, and validation data.

The *.dict files are needed to check or reuse the vocabularies. They are simple human-readable dictionaries.

$ head -n 11 data/demo.src.dict
<blank> 1
<unk> 2
<s> 3
</s> 4
It 5
is 6
not 7
acceptable 8
that 9
, 10
with 11

Internally the system never touches the words themselves; it works only with these indices.
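
Because the dictionaries are plain text, you can look up the index assigned to any token directly. For example, matching the sample above (the grep call is just an illustration, not part of the workflow):

$ grep '^acceptable ' data/demo.src.dict
acceptable 8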

Note

If the corpus is not tokenized, you can use OpenNMT's tokenizer.
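
A minimal sketch, assuming the standalone tokenization script shipped with OpenNMT (tools/tokenize.lua), which reads raw text on standard input and writes tokenized text to standard output; the raw input filename here is only for illustration:

th tools/tokenize.lua < data/src-train.raw.txt > data/src-train.txt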

Step 2: Train the model

th train.lua -data data/demo-train.t7 -save_model demo-model

The main training command is quite simple. Minimally, it takes a data file and a save-file prefix. It runs the default model, which consists of a 2-layer LSTM with 500 hidden units on both the encoder and the decoder. You can also add -gpuid 1 to use (say) GPU 1.
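
For reference, the same default architecture can be requested explicitly. The command below is a sketch that assumes the standard train.lua options -layers, -rnn_size and -gpuid; it is equivalent to the command above, run on GPU 1:

th train.lua -data data/demo-train.t7 -save_model demo-model -layers 2 -rnn_size 500 -gpuid 1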

Step 3: Translate

th translate.lua -model demo-model_epochX_PPL.t7 -src data/src-test.txt -output pred.txt

You now have a model which you can use to predict on new data. This is done by running beam search. This will output predictions into pred.txt.
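
The beam search behaviour can be tuned at translation time. As a sketch, assuming translate.lua exposes a -beam_size option, a wider beam can be requested like this (not required for the demo):

th translate.lua -model demo-model_epochX_PPL.t7 -src data/src-test.txt -output pred.txt -beam_size 10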

Note

The predictions are going to be quite poor, because the demo dataset is small. Try running on a larger dataset! For example, you can download millions of parallel sentences for translation or summarization.