preprocess.lua

preprocess.lua options:

  • -h [] (default: false)
    This help.
  • -md [] (default: false)
    Dump help in Markdown format.
  • -config (default: '')
    Load options from this file.
  • -save_config (default: '')
    Save options to this file.

Preprocess options

  • -data_type (accepted: bitext, monotext, feattext; default: bitext)
    Type of data to preprocess. Use 'monotext' for monolingual data. This option affects which of the other options apply.
  • -save_data (required)
    Output file for the prepared data.

Data options

  • -train_dir (default: '')
    Path to training files directory.
  • -train_src (default: '')
    Path to the training source data.
  • -train_tgt (default: '')
    Path to the training target data.
  • -valid_src (default: '')
    Path to the validation source data.
  • -valid_tgt (default: '')
    Path to the validation target data.
  • -src_vocab (default: '')
    Path to an existing source vocabulary.
  • -src_suffix (default: .src)
    Suffix for source files in train/valid directories.
  • -src_vocab_size (default: 50000)
    List of source vocabulary sizes: word[ feat1[ feat2[ ...] ] ]. If 0, vocabularies are not pruned.
  • -src_words_min_frequency (default: 0)
    List of minimum frequencies for source words: word[ feat1[ feat2[ ...] ] ]. If 0, vocabularies are pruned by size.
  • -tgt_vocab (default: '')
    Path to an existing target vocabulary.
  • -tgt_suffix (default: .tgt)
    Suffix for target files in train/valid directories.
  • -tgt_vocab_size (default: 50000)
    List of target vocabulary sizes: word[ feat1[ feat2[ ...] ] ]. If 0, vocabularies are not pruned.
  • -tgt_words_min_frequency (default: 0)
    List of minimum frequencies for target words: word[ feat1[ feat2[ ...] ] ]. If 0, vocabularies are pruned by size.
  • -src_seq_length (default: 50)
    Maximum source sequence length.
  • -tgt_seq_length (default: 50)
    Maximum target sequence length.
  • -check_plength [] (default: false)
    Check that source and target have the same length (for sequence tagging).
  • -features_vocabs_prefix (default: '')
    Path prefix to existing features vocabularies.
  • -time_shift_feature [] (default: true)
    Time shift features on the decoder side.
  • -keep_frequency [] (default: false)
    Keep frequency of words in dictionary.
  • -gsample (default: 0)
    If not zero, extract a new sample from the corpus. In training mode, file sampling is done at each epoch. Values between 0 and 1 indicate a ratio; values greater than 1 indicate a data size.
  • -gsample_dist (default: '')
    Configuration file with data class distribution to use for sampling training corpus. If not set, sampling is uniform.
  • -sort [] (default: true)
    If set, sort the sequences by size to build batches without source padding.
  • -shuffle [] (default: true)
    If set, shuffle the data (prior to sorting).
  • -idx_files [] (default: false)
    If set, source and target files contain 'key value' lines, with keys matching between source and target.
  • -report_progress_every (default: 100000)
    Report status every this many sentences.
  • -preprocess_pthreads (default: 4)
    Number of parallel threads for preprocessing.

Tokenizer options

  • -tok_src_mode (accepted: conservative, aggressive, space; default: space)
    Define how aggressive the tokenization should be. 'space' is space-tokenization.
  • -tok_tgt_mode (accepted: conservative, aggressive, space; default: space)
    Define how aggressive the tokenization should be. 'space' is space-tokenization.
  • -tok_src_joiner_annotate [] (default: false)
    Include joiner annotation using -joiner character.
  • -tok_tgt_joiner_annotate [] (default: false)
    Include joiner annotation using -joiner character.
  • -tok_src_joiner (default: )
    Character used to annotate joiners.
  • -tok_tgt_joiner (default: )
    Character used to annotate joiners.
  • -tok_src_joiner_new [] (default: false)
    In -joiner_annotate mode, -joiner is an independent token.
  • -tok_tgt_joiner_new [] (default: false)
    In -joiner_annotate mode, -joiner is an independent token.
  • -tok_src_case_feature [] (default: false)
    Generate case feature.
  • -tok_tgt_case_feature [] (default: false)
    Generate case feature.
  • -tok_src_segment_case [] (default: false)
    Segment case feature: splits AbC into Ab C so that case can be restored.
  • -tok_tgt_segment_case [] (default: false)
    Segment case feature: splits AbC into Ab C so that case can be restored.
  • -tok_src_segment_alphabet (accepted: Tagalog, Hanunoo, Limbu, Yi, Hebrew, Latin, Devanagari, Thaana, Lao, Sinhala, Georgian, Kannada, Cherokee, Kanbun, Buhid, Malayalam, Han, Thai, Katakana, Telugu, Greek, Myanmar, Armenian, Hangul, Cyrillic, Ethiopic, Tagbanwa, Gurmukhi, Ogham, Khmer, Arabic, Oriya, Hiragana, Mongolian, Kangxi, Syriac, Gujarati, Braille, Bengali, Tamil, Bopomofo, Tibetan)
    Segment all letters from the indicated alphabet.
  • -tok_tgt_segment_alphabet (accepted: Tagalog, Hanunoo, Limbu, Yi, Hebrew, Latin, Devanagari, Thaana, Lao, Sinhala, Georgian, Kannada, Cherokee, Kanbun, Buhid, Malayalam, Han, Thai, Katakana, Telugu, Greek, Myanmar, Armenian, Hangul, Cyrillic, Ethiopic, Tagbanwa, Gurmukhi, Ogham, Khmer, Arabic, Oriya, Hiragana, Mongolian, Kangxi, Syriac, Gujarati, Braille, Bengali, Tamil, Bopomofo, Tibetan)
    Segment all letters from the indicated alphabet.
  • -tok_src_segment_alphabet_change [] (default: false)
    Segment if the alphabet changes between two letters.
  • -tok_tgt_segment_alphabet_change [] (default: false)
    Segment if the alphabet changes between two letters.
  • -tok_src_bpe_model (default: '')
    Apply Byte Pair Encoding if the BPE model path is given. If the BPE model specified by -bpe_model was learnt using learn_bpe.lua, the BPE-related options are overridden/set automatically.
  • -tok_tgt_bpe_model (default: '')
    Apply Byte Pair Encoding if the BPE model path is given. If the BPE model specified by -bpe_model was learnt using learn_bpe.lua, the BPE-related options are overridden/set automatically.
  • -tok_src_EOT_marker (default: )
    Marker used to mark the end of token.
  • -tok_tgt_EOT_marker (default: )
    Marker used to mark the end of token.
  • -tok_src_BOT_marker (default: )
    Marker used to mark the beginning of token.
  • -tok_tgt_BOT_marker (default: )
    Marker used to mark the beginning of token.
  • -tok_src_bpe_case_insensitive [] (default: false)
    Apply BPE internally in lowercase, but still output the truecase units. This option will be overridden/set automatically if the BPE model specified by -bpe_model is learnt using learn_bpe.lua.
  • -tok_tgt_bpe_case_insensitive [] (default: false)
    Apply BPE internally in lowercase, but still output the truecase units. This option will be overridden/set automatically if the BPE model specified by -bpe_model is learnt using learn_bpe.lua.
  • -tok_src_bpe_mode (accepted: suffix, prefix, both, none; default: suffix)
    Define the BPE mode. This option will be overridden/set automatically if the BPE model specified by -bpe_model is learnt using learn_bpe.lua. prefix: prepend -BOT_marker to each word to learn prefix-oriented pair statistics; suffix: append -EOT_marker to each word to learn suffix-oriented pair statistics, as in the original Python script; both: suffix and prefix; none: neither suffix nor prefix.
  • -tok_tgt_bpe_mode (accepted: suffix, prefix, both, none; default: suffix)
    Define the BPE mode. This option will be overridden/set automatically if the BPE model specified by -bpe_model is learnt using learn_bpe.lua. prefix: prepend -BOT_marker to each word to learn prefix-oriented pair statistics; suffix: append -EOT_marker to each word to learn suffix-oriented pair statistics, as in the original Python script; both: suffix and prefix; none: neither suffix nor prefix.
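
Two of the tokenizer behaviours described above, case segmentation and the BPE marker modes, can be sketched in Python. This is an illustrative approximation, not the tokenizer's actual code: the function names are hypothetical, the split rule (cut before an uppercase letter that follows a lowercase one) is an assumption inferred from the 'AbC to Ab C' example, and the marker strings `<BOT>` and `<EOT>` are placeholders rather than the tool's defaults.

```python
def segment_case(token):
    """Split a token wherever a lowercase letter is followed by an uppercase one."""
    parts = []
    start = 0
    for i in range(1, len(token)):
        # Assumed rule: a lower-to-upper transition starts a new segment.
        if token[i - 1].islower() and token[i].isupper():
            parts.append(token[start:i])
            start = i
    parts.append(token[start:])
    return parts


def mark_word(word, mode="suffix", bot_marker="<BOT>", eot_marker="<EOT>"):
    """Attach begin/end-of-token markers according to the BPE mode."""
    if mode not in ("suffix", "prefix", "both", "none"):
        raise ValueError("unknown BPE mode: " + mode)
    # The prefix marker is used in 'prefix' and 'both' modes.
    prefix = bot_marker if mode in ("prefix", "both") else ""
    # The suffix marker is used in 'suffix' and 'both' modes.
    suffix = eot_marker if mode in ("suffix", "both") else ""
    return prefix + word + suffix


print(segment_case("AbC"))
for mode in ("suffix", "prefix", "both", "none"):
    print(mode, mark_word("low", mode))
```

Each segment produced by `segment_case` has a simple casing pattern, which is what lets a per-token case feature restore the original form.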

Logger options

  • -log_file (default: '')
    Output logs to a file under this path instead of stdout.
  • -disable_logs [] (default: false)
    If set, output nothing.
  • -log_level (accepted: DEBUG, INFO, WARNING, ERROR, NOERROR; default: INFO)
    Output logs at this level and above.
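
The 'this level and above' rule for -log_level can be pictured as a threshold over an ordered list of levels. The Python sketch below is only an illustration: `should_log` is a hypothetical helper, and the ordering (with NOERROR placed above ERROR, so that it suppresses every message) is an assumption based on the order in which the accepted values are listed.

```python
# Assumed severity order, taken from the order of the accepted values.
LEVELS = ["DEBUG", "INFO", "WARNING", "ERROR", "NOERROR"]


def should_log(message_level, log_level="INFO", disable_logs=False):
    """Return True if a message at message_level would be written."""
    if disable_logs:
        # -disable_logs silences everything regardless of level.
        return False
    # A message is written when its level is at or above the threshold.
    return LEVELS.index(message_level) >= LEVELS.index(log_level)


for level in ("DEBUG", "INFO", "WARNING", "ERROR"):
    print(level, should_log(level, log_level="WARNING"))
```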

Other options

  • -seed (default: 3425)
    Random seed.