SYNOPSIS
phonetisaurus-align --input=dictionary.bsf --ofile=training.corpus [OPTIONS]
DESCRIPTION
phonetisaurus-align
This tool reads an input dictionary and produces an aligned corpus that can be used to train a grapheme-to-phoneme (G2P) conversion model.
INPUT FORMAT
The input is a two-column plain-text file. The first column contains a grapheme sequence (e.g., the orthographic form of a word). The second column contains the corresponding phoneme sequence.
By default, the two columns are separated by a TAB character (change the separator with --delim). Each character of the first column is treated as a grapheme (specify a grapheme separator with --s1_char_delim). Phonemes in the second column are separated by spaces (change the phoneme separator with --s2_char_delim).
Input example:
ABBREVIATE AH B R IY V IY EY T
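The input format described above can be illustrated with a minimal Python sketch. It is not part of phonetisaurus-align; the function name parse_dictionary is hypothetical, and its keyword arguments are simply named after the --delim, --s1_char_delim and --s2_char_delim options listed under OPTIONS, using the defaults stated in this section (TAB between columns, no grapheme separator, space between phonemes).

```python
def parse_dictionary(text, delim="\t", s1_char_delim="", s2_char_delim=" "):
    """Split two-column dictionary text into (graphemes, phonemes) pairs.

    Hypothetical helper for illustration only; it mimics the input
    conventions described above, not the tool's actual parser.
    """
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        word, pronunciation = line.split(delim, 1)
        # With an empty grapheme delimiter, every character is one grapheme.
        if s1_char_delim == "":
            graphemes = list(word)
        else:
            graphemes = word.split(s1_char_delim)
        phonemes = pronunciation.split(s2_char_delim)
        entries.append((graphemes, phonemes))
    return entries


# The example entry from this section, with an explicit TAB between columns.
sample = "ABBREVIATE\tAH B R IY V IY EY T\n"
entries = parse_dictionary(sample)
```

Here entries holds one pair: the ten graphemes of ABBREVIATE and the eight phonemes of its pronunciation.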
OPTIONS
- --help=<bool> (default: false)
-
- show usage information
- --helpshort=<bool> (default: false)
-
- show brief usage information
- --tmpdir=<string> (default: "/tmp/")
-
- temporary directory
- --v=<int32> (default: 0)
-
- verbose level
- --fst_align=<bool> (default: false)
-
- Write FST data aligned where appropriate
- --fst_default_cache_gc=<bool> (default: true)
-
- Enable garbage collection of cache
- --fst_default_cache_gc_limit=<int64> (default: 1048576)
-
- Cache byte size that triggers garbage collection
- --fst_verify_properties=<bool> (default: false)
-
- Verify fst properties queried by TestProperties
- --fst_weight_parentheses=<string> (default: "")
-
- Characters enclosing the first weight of a printed composite weight (e.g. pair weight, tuple weight and derived classes) to ensure proper I/O of nested composite weights; must have size 0 (none) or 2 (open and close parenthesis)
- --fst_weight_separator=<string> (default: "")
-
- Character separator between printed composite weights; must be a single character
- --save_relabel_ipairs=<string> (default: "")
-
- Save input relabel pairs to file
- --save_relabel_opairs=<string> (default: "")
-
- Save output relabel pairs to file
- --delim=<string> (default: " ")
-
- Delimiter used to separate input and output tokens.
- --eps=<string> (default: "<eps>")
-
- Epsilon symbol.
- --fb=<bool> (default: false)
-
- Use forward-backward pruning for the alignment lattices.
- --input=<string> (default: "")
-
- Two-column input file to align.
- --iter=<int32> (default: 11)
-
- Maximum number of EM iterations to perform.
- --lattice=<bool> (default: false)
-
- Write out the alignment lattices as an fst archive (.far).
- --model=<bool> (default: true)
-
- Load a pre-trained model for use.
- --mbr=<bool> (default: false)
-
- Use the LMBR decoder (not yet implemented).
- --model_file=<string> (default: "")
-
- FST-format alignment model to load.
- --nbest=<int32> (default: 1)
-
- Output the N-best alignments given the model.
- --ofile=<string> (default: "")
-
- Output file to write the aligned dictionary to.
- --penalize=<bool> (default: true)
-
- Penalize scores.
- --penalize_em=<bool> (default: false)
-
- Penalize links during EM training.
- --pthresh=<double> (default: -99)
-
- Pruning threshold. Use to prune unlikely N-best candidates when using multiple alignments.
- --restrict=<bool> (default: true)
-
- Restrict links to M-1, 1-N during initialization.
- --s1_char_delim=<string> (default: "")
-
- Sequence one input delimiter.
- --s1s2_sep=<string> (default: "}")
-
- Token used to separate input-output subsequences in the g2p model.
- --s2_char_delim=<string> (default: " ")
-
- Sequence two input delimiter.
- --seq1_del=<bool> (default: true)
-
- Allow deletions in sequence one.
- --seq1_max=<int32> (default: 2)
-
- Maximum subsequence length for sequence one.
- --seq1_sep=<string> (default: "|")
-
- Multi-token separator for input tokens.
- --seq2_del=<bool> (default: true)
-
- Allow deletions in sequence two.
- --seq2_max=<int32> (default: 2)
-
- Maximum subsequence length for sequence two.
- --seq2_sep=<string> (default: "|")
-
- Multi-token separator for output tokens.
- --skip=<string> (default: "_")
-
- Skip token used to represent null transitions. Distinct from epsilon.
- --thresh=<double> (default: 1e-10)
-
- Delta threshold for EM training termination.
- --write_model=<string> (default: "")
-
- Write out the alignment model in OpenFst format to filename.
- --fst_compat_symbols=<bool> (default: true)
-
- Require symbol tables to match when appropriate
- --fst_field_separator=<string> (default: " ")
-
- Set of characters used as a separator between printed fields
- --fst_error_fatal=<bool> (default: true)
-
- FST errors are fatal; o.w. return objects flagged as bad: e.g., FSTs - kError prop. true, FST weights - not a Member()
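As a rough illustration of how the options above combine on the command line, the following Python sketch assembles an argument list in the --name=value form used in this section. The helper build_align_command is hypothetical and not shipped with the tool; rendering booleans as true/false is an assumption based on the <bool> notation above. The sketch only builds the list and does not run the binary.

```python
def build_align_command(input_path, output_path, **options):
    """Assemble a phonetisaurus-align argument list (hypothetical helper).

    Keyword arguments are option names without the leading dashes,
    e.g. seq1_max=2 becomes --seq1_max=2.
    """
    cmd = [
        "phonetisaurus-align",
        f"--input={input_path}",
        f"--ofile={output_path}",
    ]
    for name, value in sorted(options.items()):
        # Assumption: boolean options are written as true/false.
        if isinstance(value, bool):
            value = "true" if value else "false"
        cmd.append(f"--{name}={value}")
    return cmd


cmd = build_align_command(
    "dictionary.bsf",
    "training.corpus",
    seq1_max=2,
    seq2_max=2,
    iter=11,
    seq1_del=False,
)
```

If the binary is installed, the resulting list could be passed to subprocess.run(cmd, check=True); building the list rather than a single string avoids shell quoting problems with separators such as "|" or "}".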