Retraining

By default, OpenNMT saves a checkpoint every 5000 iterations and at the end of each epoch. For more or less frequent saves, you can use the -save_every and -save_every_epochs options, which define the number of iterations and epochs after which the training saves a checkpoint.
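
For example, assuming the demo data and model names used later on this page, the following run would checkpoint every 1000 iterations and every 2 epochs (the values are illustrative):

# save a checkpoint every 1000 iterations and every 2 epochs
th train.lua -gpuid 1 -data data/demo-train.t7 -save_model demo -save_every 1000 -save_every_epochs 2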

In several situations, you may need to train from a saved model using the -train_from option:

  • resume a stopped training
  • continue training with a smaller batch size (see the sketch after this list)
  • train the model on new data (incremental adaptation)
  • start a training from pre-trained parameters
  • etc.
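
For the second case, a minimal sketch could be the following, assuming a checkpoint file named demo-model_epoch7.t7 and using -max_batch_size to reduce the batch size (both the file name and the value are illustrative):

# restart from a checkpoint with a smaller batch size
th train.lua -gpuid 1 -data data/demo-train.t7 -save_model demo -train_from demo-model_epoch7.t7 -max_batch_size 32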

Considerations

When training from an existing model, some settings cannot be changed:

  • the model topology (layers, hidden size, etc.)
  • the vocabularies

Exceptions

-dropout, -fix_word_vecs_enc and -fix_word_vecs_dec are model options that can be changed when retraining.
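
For instance, a retraining that lowers the dropout rate could look like this (the checkpoint name and the dropout value are illustrative):

# retrain from a checkpoint with a different dropout rate
th train.lua -gpuid 1 -data data/demo-train.t7 -save_model demo -train_from demo-model_epoch7.t7 -dropout 0.2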

Resuming a stopped training

It is common for a training to be interrupted (a crash, a server reboot, a user action, etc.). In this case, you may want to continue the training for more epochs using the -continue flag. For example:

# start the initial training
th train.lua -gpuid 1 -data data/demo-train.t7 -save_model demo -save_every 50

# train for several epochs...

# need to reboot the server!

# continue the training from the last checkpoint
th train.lua -gpuid 1 -data data/demo-train.t7 -save_model demo -save_every 50 -train_from demo_checkpoint.t7 -continue

The -continue flag ensures that the training continues with the same configuration and optimization states. In particular, the following options are set to their last known values:

  • -curriculum
  • -decay
  • -learning_rate_decay
  • -learning_rate
  • -max_grad_norm
  • -min_learning_rate
  • -optim
  • -start_decay_at
  • -start_decay_ppl_delta
  • -start_epoch
  • -start_iteration

Note

The -end_epoch value is not automatically set, as the user may want to continue the training for more epochs past the previous end.
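
For instance, to extend a training past its original end up to epoch 20, a sketch could be (the epoch value is illustrative):

# continue the training and extend it to epoch 20
th train.lua -gpuid 1 -data data/demo-train.t7 -save_model demo -train_from demo_checkpoint.t7 -continue -end_epoch 20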

Additionally, the -continue flag retrieves from the previous training:

  • the non-SGD optimizer states
  • the random generator states
  • the batch order (when continuing from an intermediate checkpoint)

Training from pre-trained parameters

Another use case is to take an existing model and continue training it with new options (in particular the optimization method and the learning rate). Using -train_from without the -continue flag will start a new training with parameters initialized from the pre-trained model.
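
A minimal sketch, assuming a pre-trained checkpoint demo-model_epoch13.t7 and switching to the Adam optimizer (the file name and hyperparameter values are illustrative):

# start a new training from pre-trained parameters with a different optimizer
th train.lua -gpuid 1 -data data/demo-train.t7 -save_model demo-retrained -train_from demo-model_epoch13.t7 -optim adam -learning_rate 0.0002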