man Encode::IMAPUTF7 (3): modification of UTF-7 encoding for IMAP

SYNOPSIS

use utf8;    # the source below contains a literal non-ASCII character
use Encode qw/encode decode/;
use Encode::IMAPUTF7;
print encode('IMAP-UTF-7', 'Répertoire');
print decode('IMAP-UTF-7', 'R&AOk-pertoire');

ABSTRACT

IMAP mailbox names are encoded in a modified UTF-7 when they contain international characters outside the printable ASCII range. The modified UTF-7 encoding is defined in RFC2060 (section 5.1.3).

There is another CPAN module with the same purpose, Unicode::IMAPUtf7. However, it works correctly only with strings whose encoded form does not contain a plus sign. For example, the Cyrillic string \x{043f}\x{0440}\x{0435}\x{0434}\x{043b}\x{043e}\x{0433} is represented in UTF-7 as +BD8EQAQ1BDQEOwQ+BDM- (note the second plus sign, 4 characters before the end). Unicode::IMAPUtf7 encodes the above string as +BD8EQAQ1BDQEOwQ&BDM- which is not valid modified UTF-7 (the ampersand and the plus are swapped); the valid form is &BD8EQAQ1BDQEOwQ+BDM- instead. The current module solves this problem. It is a slightly modified Encode::Unicode::UTF7 and has nothing in common with Unicode::IMAPUtf7.
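
The two forms of the Cyrillic example can be reproduced with core modules only. The following is a minimal sketch, not the implementation used by Encode::IMAPUTF7: it builds the base64 body by hand from the UTF-16BE bytes, and all variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(encode);
use MIME::Base64 qw(encode_base64);

# The Cyrillic example string from the paragraph above.
my $text = "\x{043f}\x{0440}\x{0435}\x{0434}\x{043b}\x{043e}\x{0433}";

# Base64 of the UTF-16BE bytes, without line breaks or '=' padding.
my $body = encode_base64(encode('UTF-16BE', $text), '');
$body =~ s/=+\z//;

# Plain UTF-7 shifts into base64 with '+'.
my $utf7 = "+$body-";

# Modified UTF-7 shifts with '&' and replaces '/' with ',' in the body.
# A '+' inside the body is ordinary base64 data and must stay untouched.
(my $imap_body = $body) =~ tr{/}{,};
my $modified = "&$imap_body-";

print "$utf7\n";        # +BD8EQAQ1BDQEOwQ+BDM-
print "$modified\n";    # &BD8EQAQ1BDQEOwQ+BDM-
```

Only the leading shift character differs between the two results here, which is why replacing every plus sign with an ampersand produces an invalid name.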

RFC2060 - section 5.1.3 - Mailbox International Naming Convention

By convention, international mailbox names are specified using a modified version of the UTF-7 encoding described in [UTF-7]. The purpose of these modifications is to correct the following problems with UTF-7:

1) UTF-7 uses the ``+'' character for shifting; this conflicts with
the common use of ``+'' in mailbox names, in particular USENET
newsgroup names.

2) UTF-7's encoding is BASE64 which uses the ``/'' character; this
conflicts with the use of ``/'' as a popular hierarchy delimiter.

3) UTF-7 prohibits the unencoded usage of ``\''; this conflicts with
the use of ``\'' as a popular hierarchy delimiter.

4) UTF-7 prohibits the unencoded usage of ``~''; this conflicts with
the use of ``~'' in some servers as a home directory indicator.

5) UTF-7 permits multiple alternate forms to represent the same
string; in particular, printable US-ASCII characters can be
represented in encoded form.

In modified UTF-7, printable US-ASCII characters except for ``&'' represent themselves; that is, characters with octet values 0x20-0x25 and 0x27-0x7e. The character ``&'' (0x26) is represented by the two-octet sequence ``&-''.

All other characters (octet values 0x00-0x1f, 0x7f-0xff, and all Unicode 16-bit octets) are represented in modified BASE64, with a further modification from [UTF-7] that ``,'' is used instead of ``/''. Modified BASE64 MUST NOT be used to represent any printing US-ASCII character which can represent itself.

``&'' is used to shift to modified BASE64 and ``-'' to shift back to US-ASCII. All names start in US-ASCII, and MUST end in US-ASCII (that is, a name that ends with a Unicode 16-bit octet MUST end with a ``-'').
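
The encoding rules above can be sketched in a few lines of Perl. This is an illustration under stated assumptions, not the code of Encode::IMAPUTF7: the function name imap_utf7_encode_sketch is hypothetical, the input is assumed to be a string of Perl characters, and only the core modules Encode and MIME::Base64 are used.

```perl
use strict;
use warnings;
use Encode qw(encode);
use MIME::Base64 qw(encode_base64);

# Hypothetical helper: encode a character string as modified UTF-7.
sub imap_utf7_encode_sketch {
    my ($name) = @_;
    my $out = '';

    # Walk the name as alternating runs of printable US-ASCII
    # (0x20-0x7e) and runs of everything else.
    while ($name =~ /\G([\x20-\x7e]+|[^\x20-\x7e]+)/gc) {
        my $run = $1;
        if ($run =~ /\A[\x20-\x7e]/) {
            # Printable ASCII represents itself, except '&' becomes '&-'.
            $run =~ s/&/&-/g;
            $out .= $run;
        }
        else {
            # Everything else: UTF-16BE bytes in base64, no padding,
            # ',' instead of '/', wrapped in '&' ... '-'.
            my $b64 = encode_base64(encode('UTF-16BE', $run), '');
            $b64 =~ s/=+\z//;
            $b64 =~ tr{/}{,};
            $out .= "&$b64-";
        }
    }
    return $out;
}

print imap_utf7_encode_sketch("R\x{e9}pertoire"), "\n";    # R&AOk-pertoire
print imap_utf7_encode_sketch("Tom & Jerry"), "\n";        # Tom &- Jerry
```

Because every base64 run is closed with ``-'', a name that ends in a non-ASCII character still ends in US-ASCII, as the convention requires.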

For example, here is a mailbox name which mixes English, Japanese, and Chinese text: ~peter/mail/&ZeVnLIqe-/&U,BTFw-
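
The example name can be decoded by reversing the rules. The sketch below is a hand-written illustration, not the module's decoder: the function names imap_utf7_decode_sketch and decode_run are hypothetical, and the input is assumed to be well-formed modified UTF-7.

```perl
use strict;
use warnings;
use Encode qw(decode);
use MIME::Base64 qw(decode_base64);

# Hypothetical helper: turn the text between '&' and '-' into characters.
sub decode_run {
    my ($b64) = @_;
    return '&' if $b64 eq '';                        # '&-' is a literal ampersand
    $b64 =~ tr{,}{/};                                # back to standard base64
    $b64 .= '=' x ((4 - length($b64) % 4) % 4);      # restore stripped padding
    my $bytes = decode_base64($b64);
    return decode('UTF-16BE', $bytes);
}

# Hypothetical helper: decode a well-formed modified UTF-7 mailbox name.
sub imap_utf7_decode_sketch {
    my ($encoded) = @_;
    $encoded =~ s/&([A-Za-z0-9+,]*)-/decode_run($1)/ge;
    return $encoded;
}

my $name = imap_utf7_decode_sketch('~peter/mail/&ZeVnLIqe-/&U,BTFw-');

# Show the code points of the non-ASCII characters in the result.
print join(' ', map { sprintf 'U+%04X', ord } grep { ord > 0x7e } split //, $name), "\n";
# U+65E5 U+672C U+8A9E U+53F0 U+5317
```

The five code points are the Japanese and Chinese characters of the example; the ``/'' hierarchy delimiters and the leading ``~'' pass through unchanged.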

REQUESTS & BUGS

Please report any requests, suggestions or bugs via the RT bug-tracking system at http://rt.cpan.org/ or email to [email protected].

http://rt.cpan.org/NoAuth/Bugs.html?Dist=Encode-IMAPUTF7 is the RT queue for Encode::IMAPUTF7. Please check to see if your bug has already been reported.

COPYRIGHT

Sava Chankov, [email protected]

This software may be freely copied and distributed under the same terms and conditions as Perl.

AUTHORS

Peter Makholm <[email protected]>, current maintainer

Sava Chankov <[email protected]>, original author