Unicode::CaseFold(3) Unicode case-folding for case-insensitive lookups.

VERSION

version 1.00

SYNOPSIS


    use Unicode::CaseFold;

    my $folded = fc $string;

What is Case-Folding?

In non-Unicode contexts, a common idiom to compare two strings case-insensitively is "lc($this) eq lc($that)". Before comparing two strings we normalize them to an all-lowercase version. "Hello", "HELLO", and "HeLlO" all have the same lowercase form ("hello"), so it doesn't matter which one we start with; they are all equal to one another after "lc".

In Unicode, things aren't so simple. A Unicode character might have mappings for uppercase, lowercase, and titlecase, and the lowercase mapping of the uppercase mapping of a given character might not be the character that you started with. For example, "lc(uc("\N{LATIN SMALL LETTER SHARP S}"))" is "ss", not the eszett we started with. Case-folding is the part of the Unicode standard that maps any two strings differing only by case to the same ``case-folded'' form, even when those strings include characters with complex case mappings.
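The round-trip problem described above can be reproduced with core Perl alone. The following is a minimal sketch; the variable names are illustrative, and it assumes the string is handled with Unicode semantics (which the "\N{...}" escape arranges).

```perl
use strict;
use warnings;
use charnames ':full';   # needed for \N{...} names on perls before 5.16

my $eszett     = "\N{LATIN SMALL LETTER SHARP S}";
my $round_trip = lc(uc($eszett));   # uc maps the eszett to "SS", lc then yields "ss"

# The round trip does not return the original character ...
my $same_char  = $round_trip eq $eszett ? "yes" : "no";

# ... so the lc() idiom fails to match two spellings that differ only by case.
my $lc_matches = lc("Wei$eszett") eq lc("WEISS") ? "yes" : "no";

print "round trip returns the original: $same_char\n";
print "lc() comparison matches: $lc_matches\n";
```

Both lines report "no", which is the gap that case-folding is meant to close.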

Use for Case-insensitive Comparison

Simply write "fc($this) eq fc($that)" instead of "lc($this) eq lc($that)". You can also use "index" for case-insensitive substring search, as long as both the string and the substring are case-folded first.
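A minimal sketch of both uses, with illustrative sample strings and variable names:

```perl
use strict;
use warnings;
use Unicode::CaseFold;   # exports fc by default

my $this = "Hello World";
my $that = "HELLO WORLD";

# Case-insensitive equality: fold both sides before comparing.
my $equal = fc($this) eq fc($that) ? 1 : 0;

# Case-insensitive substring search: fold the haystack and the needle.
my $pos = index(fc($this), fc("WORLD"));

print "equal: $equal, found at offset: $pos\n";
```

Note that the offset returned by "index" refers to the folded string. Because full folding can change a string's length (the sharp s folds to "ss"), that offset does not necessarily match a position in the original string.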

Use for String Lookups

Frequently we want to store data in a hash, a database, or an external file for later retrieval. Sometimes we want to be able to match the keys in this data case-insensitively; that is, we should be able to store some data under the key ``hello'' and later retrieve it with the key ``HELLO''. Some databases have complete support for collation, but in other databases the support is missing or broken, and Perl hashes don't support it at all. By making case-folding part of the process you use to normalize your keys before using them to access a database or data structure, you get case-insensitive lookup.

    my %roles;
    $roles{fc "Samuel L. Jackson"} = ["Gin Rummy", "Nick Fury", "Mace Windu"];

    my $roles = $roles{fc "Samuel l. JACKSON"}; # Gets the data stored above.

DESCRIPTION

This module provides Unicode case-folding for Perl. Case-folding is a tool that allows a program to make case-insensitive string comparisons or do case-insensitive lookups.

EXPORTS

fc($str)

Exported by default when you use the module. Write "use Unicode::CaseFold ()" or "use Unicode::CaseFold qw(case_fold !fc)" if you don't want it to be exported.

Returns the case-folded version of $str. This function is prototyped to act as much as possible like the built-ins "lc" and "uc"; it imposes a scalar context on its argument, and if called with no argument it will return the case-folded version of $_.
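A short sketch of the $_ default described above, using an illustrative list of names:

```perl
use strict;
use warnings;
use Unicode::CaseFold;

my @names = ("Alice", "BOB", "cArOl");

my @folded;
for (@names) {
    push @folded, fc();   # no argument given, so $_ is folded
}

print join(",", @folded), "\n";
```

Because "fc" imposes scalar context on its argument, passing an array such as "fc(@names)" would fold the element count, not the elements; fold lists one item at a time as shown.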

case_fold($str)

Exported on request. Just like "fc", except that it has no prototype and won't case-fold $_ if called without an argument.
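As a sketch of "case_fold" in use, the following removes case-insensitive duplicates from a list. The word list and variable names are illustrative; the import line uses the form quoted above, so "fc" is not imported.

```perl
use strict;
use warnings;
use Unicode::CaseFold qw(case_fold !fc);

my @words = ("Perl", "PERL", "perl", "Raku");

# Keep the first spelling seen for each case-folded key.
# case_fold does not default to $_, so the argument is passed explicitly.
my %seen;
my @unique = grep { !$seen{ case_fold($_) }++ } @words;

print join(",", @unique), "\n";
```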

VARIABLES

$Unicode::CaseFold::XS

True if the XS extension is in use. The pure-perl implementation is 5-10 times slower than the XS extension, and on versions of perl before 5.10.0 it uses simple case-folding instead of full case-folding (see below).

$Unicode::CaseFold::SIMPLE_FOLDING

Set to true if the perl version is prior to 5.10.0 and the XS extension is not available. In this case, "fc" performs a simple case-folding instead of a full case-folding. Although relatively few characters are affected, strings case-folded using simple folding might not compare equal to the corresponding strings case-folded with full folding, which may cause compatibility issues.

Furthermore, when simple folding is in use, some strings that would have case-folded to the same value when using full folding will instead case-fold to different values. For example, "fc("Wei\x{df}")" and "fc("Weiss")" both produce "weiss" when full folding is in effect, but the former produces "wei\x{df}" when using simple folding.

If you want to check for this potentially dangerous situation, consult the $Unicode::CaseFold::SIMPLE_FOLDING variable.
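One way such a check might look is sketched below. The variable names and the wording of the warning are illustrative; only the two package variables and "fc" come from the module.

```perl
use strict;
use warnings;
use Unicode::CaseFold;

# Report which implementation and folding mode are active.
my $backend = $Unicode::CaseFold::XS             ? "XS"     : "pure-perl";
my $mode    = $Unicode::CaseFold::SIMPLE_FOLDING ? "simple" : "full";

warn "Unicode::CaseFold is using simple folding; keys folded here may not "
   . "match keys folded elsewhere with full folding\n"
    if $mode eq "simple";

# Under full folding both spellings fold to the same string.
my $weiss_matches = fc("Wei\x{df}") eq fc("Weiss") ? 1 : 0;

print "backend: $backend, folding: $mode, Weiss matches: $weiss_matches\n";
```

A program that shares folded keys with other systems might prefer to die instead of warn when simple folding is detected; which is appropriate depends on the application.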

COMPATIBILITY

  • "Unicode::CaseFold" requires Perl 5.8.1 or newer.
  • Different versions of perl include different versions of the Unicode database, which is revised over time. If you are likely to be comparing strings that have been folded using different versions of perl, you may need to consult the changes for intervening Unicode standard versions to find out whether your code will work correctly.
  • "Unicode::CaseFold" uses ``simple'' rather than ``full'' case-folding when operating in pure-perl mode on perl versions before 5.10.0. For compatibility implications, see ``$Unicode::CaseFold::SIMPLE_FOLDING''.

AUTHOR

Andrew Rodland

COPYRIGHT AND LICENSE

This software is copyright (c) 2014 by Andrew Rodland.

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.