ucto(1) Unicode Tokenizer


ucto [options] [input-file] [output-file]


ucto tokenizes text files: it separates words from punctuation, splits sentences (and optionally paragraphs), and finds paired quotes. Ucto is preconfigured with tokenization rules for several languages.
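As an illustration of the synopsis, the following minimal Python sketch assembles a ucto command line. The helper name build_ucto_command and the file names are hypothetical; the only options it uses are -L, -e and -N, which this page documents. It only builds the argument list and does not run ucto.

```python
import shlex


def build_ucto_command(input_file, output_file=None, language=None,
                       encoding=None, normalization=None):
    """Assemble an argument list of the form: ucto [options] input [output].

    Hypothetical helper for illustration; it is not part of ucto.
    """
    argv = ["ucto"]
    if language is not None:
        argv += ["-L", language]        # select a configuration by language code
    if encoding is not None:
        argv += ["-e", encoding]        # input encoding
    if normalization is not None:
        argv += ["-N", normalization]   # UTF8 output normalization
    argv.append(input_file)
    if output_file is not None:
        argv.append(output_file)
    return argv


# Hypothetical file names, used only to show the resulting command line.
cmd = build_ucto_command("input.txt", "output.txt", language="fr")
print(shlex.join(cmd))  # ucto -L fr input.txt output.txt
```

The resulting list can be passed to subprocess.run(cmd, check=True) on a system where ucto is installed.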


-c configfile

Read settings from a file

-d value

Set debug mode to 'value'

-e value

Set input encoding. (Default UTF8)

-N value

Set UTF8 output normalization. (Default NFC)


Disable filtering of special characters

-L language

Automatically select a configuration file by language code. For example, 'fr' selects the file tokconfig-fr from the installation directory


Convert to all lowercase


Convert to all uppercase


Emit one sentence per line on output


Assume one sentence per line on input


Don't tokenize, but perform input decoding and simple token role detection


Remove most of the punctuation from the output. (Not from abbreviations!)


Disable paragraph detection


Enable quote detection. (This is experimental and may lead to unexpected results)


Disable sentence detection

-s <string>

Set end-of-sentence marker. (Default <utt>)


Show version information


Set verbose mode


Read a FoLiA XML document, tokenize it, and output the modified document. (This disables usage of most other options: -nulPQvsS)


When tokenizing a FoLiA XML document, search for text nodes of class 'cls'


Output FoLiA XML. (This disables usage of most other options: -nulPQvsS)

--id <DocId>

Use the specified Document ID for the FoLiA XML

-x <DocId> (obsolete)

Output FoLiA XML, using the specified Document ID. (This disables usage of most other options: -nulPQvsS)

Obsolete: use -X and --id instead
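The end-of-sentence marker set with -s (default <utt>) can be used to recover sentence boundaries from tokenized output. The sketch below assumes that the output is whitespace-separated tokens with the marker appearing as a token of its own at the end of each sentence. The function name split_sentences and the sample string are hypothetical and do not come from an actual ucto run.

```python
def split_sentences(tokenized, marker="<utt>"):
    """Group whitespace-separated tokens into sentences.

    Assumes the end-of-sentence marker appears as a separate token.
    """
    sentences = []
    current = []
    for token in tokenized.split():
        if token == marker:
            # Marker reached: close the current sentence if it has tokens.
            if current:
                sentences.append(current)
            current = []
        else:
            current.append(token)
    # Keep trailing tokens that were not followed by a marker.
    if current:
        sentences.append(current)
    return sentences


# Hypothetical sample in the assumed output format.
sample = "Hello , world ! <utt> How are you ? <utt>"
for sentence in split_sentences(sample):
    print(" ".join(sentence))
```

If a different marker was chosen with -s, pass the same string as the marker argument.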




Maarten van Gompel

Ko van der Sloot