tools/tokenize.lua

tokenize.lua options:

  • -h [] (default: false)
    This help.
  • -md [] (default: false)
    Dump help in Markdown format.
  • -config (default: '')
    Load options from this file.
  • -save_config (default: '')
    Save options to this file.

Tokenizer options

  • -mode (accepted: space, conservative, aggressive; default: conservative)
    Define how aggressive the tokenization should be. aggressive only keeps sequences of letters/numbers; conservative allows a mix of alphanumeric characters, as in "2,000", "E65", "soft-landing", etc.; space performs whitespace tokenization.
  • -joiner_annotate [] (default: false)
    Include joiner annotation using the -joiner character.
  • -joiner (default: )
    Character used to annotate joiners.
  • -joiner_new [] (default: false)
    In -joiner_annotate mode, -joiner is an independent token.
  • -case_feature [] (default: false)
    Generate case feature.
  • -segment_case [] (default: false)
    Segment case feature: splits AbC into Ab C so that case can be restored.
  • -segment_alphabet (accepted: Tagalog, Hanunoo, Limbu, Yi, Hebrew, Latin, Devanagari, Thaana, Lao, Sinhala, Georgian, Kannada, Cherokee, Kanbun, Buhid, Malayalam, Han, Thai, Katakana, Telugu, Greek, Myanmar, Armenian, Hangul, Cyrillic, Ethiopic, Tagbanwa, Gurmukhi, Ogham, Khmer, Arabic, Oriya, Hiragana, Mongolian, Kangxi, Syriac, Gujarati, Braille, Bengali, Tamil, Bopomofo, Tibetan)
    Segment all letters from indicated alphabet.
  • -segment_alphabet_change [] (default: false)
    Segment when the alphabet changes between two letters.
  • -bpe_model (default: '')
    Apply Byte Pair Encoding if the BPE model path is given. When this option is used and the BPE model specified by -bpe_model was learnt using learn_bpe.lua, the BPE-related options are overridden/set automatically.
  • -EOT_marker (default: )
    Marker used to mark the end of token.
  • -BOT_marker (default: )
    Marker used to mark the beginning of token.
  • -bpe_case_insensitive [] (default: false)
    Apply BPE internally in lowercase, but still output the truecase units. This option will be overridden/set automatically if the BPE model specified by -bpe_model is learnt using learn_bpe.lua.
  • -bpe_mode (accepted: suffix, prefix, both, none; default: suffix)
    Define the BPE mode. This option will be overridden/set automatically if the BPE model specified by -bpe_model is learnt using learn_bpe.lua. prefix: prepend -BOT_marker to each word to learn prefix-oriented pair statistics; suffix: append -EOT_marker to the end of each word to learn suffix-oriented pair statistics, as in the original Python script; both: suffix and prefix; none: neither suffix nor prefix.
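
The four -bpe_mode values differ only in which marker is attached to each word before pair statistics are collected. The following Python sketch illustrates that marker placement; it is not the code in tokenize.lua or learn_bpe.lua. The function name mark_word is hypothetical, and the marker strings "<w>" and "</w>" are placeholders chosen for this example, since the listing above does not show the defaults of -BOT_marker and -EOT_marker.

```python
def mark_word(word, bpe_mode="suffix", bot_marker="<w>", eot_marker="</w>"):
    """Attach begin/end-of-token markers to a word according to bpe_mode.

    bot_marker and eot_marker are placeholder values for this sketch,
    not confirmed defaults of -BOT_marker / -EOT_marker.
    """
    if bpe_mode not in ("suffix", "prefix", "both", "none"):
        raise ValueError("unknown bpe_mode: %r" % (bpe_mode,))

    marked = word
    # prefix and both: put the beginning-of-token marker in front of the word
    if bpe_mode in ("prefix", "both"):
        marked = bot_marker + marked
    # suffix and both: put the end-of-token marker after the word
    if bpe_mode in ("suffix", "both"):
        marked = marked + eot_marker
    # none: the word is left unmarked
    return marked


if __name__ == "__main__":
    for mode in ("suffix", "prefix", "both", "none"):
        print(mode, mark_word("lower", mode))
```

How the marked word is then split into symbols and counted is left out here, because the option description does not specify it.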

Other options

  • -nparallel (default: 1)
    Number of parallel threads to run the tokenization.
  • -batchsize (default: 1000)
    Size of each parallel batch; do not change this unless memory is low.
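
To make the relationship between -nparallel and -batchsize concrete, here is a minimal Python sketch of batch-parallel processing. It is an assumption about the general pattern, not a description of how tokenize.lua schedules its threads. The names batches, tokenize_batch and tokenize_parallel are hypothetical, and the whitespace split stands in for the real tokenizer.

```python
from concurrent.futures import ThreadPoolExecutor


def batches(lines, batchsize=1000):
    """Yield consecutive slices of at most batchsize lines."""
    for start in range(0, len(lines), batchsize):
        yield lines[start:start + batchsize]


def tokenize_batch(batch):
    """Stand-in tokenizer: normalise whitespace in every line of a batch."""
    return [" ".join(line.split()) for line in batch]


def tokenize_parallel(lines, nparallel=1, batchsize=1000):
    """Tokenize lines in batches using nparallel worker threads.

    pool.map returns results in submission order, so the output lines
    stay aligned with the input lines.
    """
    with ThreadPoolExecutor(max_workers=nparallel) as pool:
        results = pool.map(tokenize_batch, batches(lines, batchsize))
        return [line for batch in results for line in batch]


if __name__ == "__main__":
    sample = ["Hello   world", "soft-landing  in  2,000", "E65"]
    print(tokenize_parallel(sample, nparallel=2, batchsize=2))
```

In a scheme like this, each worker handles one batch at a time, so a smaller batch size would reduce the number of lines a worker holds at once, which is presumably why the option matters only when memory is tight.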