man ucto (1): Unicode Tokenizer

SYNOPSIS

ucto [options] [input-file] [output-file]

DESCRIPTION

ucto tokenizes text files: it separates words from punctuation, splits sentences (and optionally paragraphs), and finds paired quotes. Ucto is preconfigured with tokenization rules for several languages.

OPTIONS

-c configfile

: Read settings from a file

-d value

: Set debug mode to 'value'

-e value

: Set the input encoding (default: UTF8)

-N value

: Set the UTF8 output normalization (default: NFC)

-f

: Disable filtering of special characters

-L language

: Automatically select a configuration file by language code. For example, 'fr' selects the file tokconfig-fr from the installation directory

-l

: Convert to all lowercase

-u

: Convert to all uppercase

-n

: Emit one sentence per line on output

-m

: Assume one sentence per line on input

--passthru

: Don't tokenize, but perform input decoding and simple token role detection

--filterpunct

: Remove most of the punctuation from the output (but not from abbreviations)

-P

: Disable Paragraph Detection

-Q

: Enable Quote Detection (experimental; may lead to unexpected results)

-S

: Disable Sentence Detection

-s <string>

: Set the end-of-sentence marker (default: <utt>)

-V

: Show version information

-v

: Set verbose mode

-F

: Read a FoLiA XML document, tokenize it, and output the modified document (this disables most other options: -nulPQvsS)

--textclass cls

: When tokenizing a FoLiA XML document, search for text nodes of class 'cls'

-X

: Output FoLiA XML (this disables most other options: -nulPQvsS)

--id <DocId>

: Use the specified Document ID for the FoLiA XML

-x <DocId> (obsolete)

: Output FoLiA XML, using the specified Document ID (this disables most other options: -nulPQvsS). This option is obsolete; use -X and --id instead
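
EXAMPLES

The options above can be combined on one command line. The following is a minimal sketch, not an official example: the file names (input.txt, output.txt, output.xml), the document ID (example-doc), and the helper function run are hypothetical, and it assumes that the configuration file tokconfig-fr is present in the installation directory. By default the script only prints the commands it would execute.

```shell
#!/bin/sh
# Hypothetical file names and document ID; replace with your own.
in_txt="input.txt"
out_txt="output.txt"
out_xml="output.xml"
doc_id="example-doc"

# Dry run by default: print the commands. Set DRY_RUN=0 to execute them.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command in dry-run mode, otherwise run it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# French rules (-L fr), one sentence per line on output (-n).
run ucto -L fr -n "$in_txt" "$out_txt"

# Same input, FoLiA XML output (-X) with an explicit document ID (--id).
run ucto -L fr -X --id "$doc_id" "$in_txt" "$out_xml"
```

The second command omits -n because -X disables it, as noted in the description of -X.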

BUGS

likely

AUTHORS

Maarten van Gompel

Ko van der Sloot
