SYNOPSIS
ucto [options] [input-file] [output-file]
DESCRIPTION
ucto tokenizes text files: it separates words from punctuation, splits sentences (and optionally paragraphs), and finds paired quotes. It comes preconfigured with tokenization rules for several languages.
OPTIONS
-c configfile
- read settings from a file
-d value
- set debug mode to 'value'
-e value
- set input encoding. (default UTF8)
-N value
- set UTF8 output normalization. (default NFC)
-f
- disable filtering of special characters
-L language
- automatically select a configuration file by language code, e.g. 'fr' selects the file tokconfig-fr from the installation directory
-l
- Convert to all lowercase
-u
- Convert to all uppercase
-n
- Emit one sentence per line on output
-m
- Assume one sentence per line on input
--passthru
- Don't tokenize, but perform input decoding and simple token role detection
--filterpunct
- remove most of the punctuation from the output (but not from abbreviations)
-P
- Disable Paragraph Detection
-Q
- Enable Quote Detection (experimental; may lead to unexpected results)
-S
- Disable Sentence Detection
-s <string>
- Set End-of-sentence marker. (Default <utt>)
-V
- Show version information
-v
- set Verbose mode
-F
- Read a FoLiA XML document, tokenize it, and output the modified doc. (this disables usage of most other options: -nulPQvsS)
--textclass cls
- When tokenizing a FoLiA XML document, search for text nodes of class 'cls'
-X
- Output FoLiA XML. (this disables usage of most other options: -nulPQvsS)
--id <DocId>
- Use the specified Document ID for the FoLiA XML
-x <DocId> (obsolete)
- Output FoLiA XML, using the specified Document ID (this disables usage of most other options: -nulPQvsS). Obsolete: use -X and --id instead.
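EXAMPLES
The options above can be combined as in the following sketch. The file names (input.txt, sentences.txt, output.xml) and the document ID (mydoc) are hypothetical, and the helper function run_or_show is only an illustration: it runs the command when ucto is installed and the input file exists, and otherwise just prints the command line it would have run.

```shell
#!/bin/sh
# Hypothetical input file; replace with your own.
input="input.txt"

# Illustrative helper (not part of ucto): run the command if ucto is
# installed and the input file exists, otherwise only print it.
run_or_show() {
    if command -v ucto >/dev/null 2>&1 && [ -f "$input" ]; then
        "$@"
    else
        echo "would run: $*"
    fi
}

# Tokenize French text (-L fr selects tokconfig-fr) and
# emit one sentence per line (-n).
run_or_show ucto -L fr -n "$input" sentences.txt

# Same input, but write FoLiA XML (-X) with an explicit document ID (--id).
run_or_show ucto -L fr -X --id mydoc "$input" output.xml
```

Note that -n is left out of the second command, since -X disables most of the plain-text output options.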
BUGS
likely