Why is Email So Complicated?
Reason # 501: Human Communication is Absurdly Complex!

In this series of blog posts -- which started on the Mimecast site before migrating here -- I've been writing primarily about the technical complexities that make email a much more interesting business than it seems at first glance. But some of the most daunting complications are not technical; email needs to support the complexity of social interaction in general.

A straightforward example of this is the MIME protocol, which I co-designed twenty years ago. Some of the complexities in MIME (such as 7-bit encodings or multipart boundaries) might be called "contingent" or even "fundamentally unnecessary" because they exist only for backwards compatibility with the pre-MIME email world. If one redesigned email from scratch, these things would probably go away. However, much of the complexity of MIME comes from the fact that people want to be able to communicate a wide range of information. It needs to represent text, images, sounds, video, and so on -- to the point where there are now over a thousand registered MIME types, each of which needs to be handled differently when displayed to the user. The world of MIME types is complicated not because we failed to make it simpler, but because human communication requires vast numbers of data types.

However, the single most complicated aspect of email -- or any other computer-mediated communication, although email always seems to wrestle with the problems first -- is the lingering effect of the Tower of Babel. There are an estimated 6,700 languages in the world, and even though thousands are in the process of dying out, that still leaves thousands to support. People want to be able to send and receive email in their own languages, and this leads to staggering complexity.

Languages vary impressively. Most western languages go from left to right, but others such as Arabic and Hebrew go from right to left, and some Asian languages go from top to bottom. English speakers tend to think of text as a simple series of characters, but in other languages there are special marks that need to be added above or underneath some letters. In some languages, there are different representations of the same character in different contexts, for example when the letter is at the end of a word. Increasingly, non-English speaking people are, understandably, demanding the ability to represent their languages "properly."

The real-world diversity of languages and their scripts is further complicated by the introduction, in the world of computers, of the notion of "character sets." Contrary to casual assumption, character sets and languages do not map onto each other simply. There are dozens of character sets in which English can be represented, for example, and there are dozens of character sets that can represent more than one language. Worse still, there are "character sets" in common use that do not come from any standards body or process, but reflect the unilateral choices of a single vendor.
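To make the mismatch between character sets and languages concrete, here is a small Python sketch. The charset labels are Python's codec names, and the helper name decode_under is made up for this illustration. It shows English text encoding identically under several character sets, and a single byte turning into three different letters depending on which character set the reader assumes.

```python
def decode_under(raw, charsets):
    """Decode the same bytes under each charset label and collect the results."""
    return {name: raw.decode(name) for name in charsets}

# English text is plain ASCII, so many character sets encode it identically.
english = "Hello"
english_encodings = {
    name: english.encode(name)
    for name in ("ascii", "latin-1", "cp1252", "utf-8")
}

# One byte, three character sets, three different letters
# (Western European, Greek, and Russian respectively).
readings = decode_under(b"\xe9", ("latin-1", "iso-8859-7", "koi8-r"))

print(english_encodings)
print(readings)
```

This is why a message that does not label its character set correctly can arrive as gibberish: the bytes are intact, but the reader is guessing at what they mean.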

Some of you are tapping your fingers impatiently at this point -- why am I nattering about character sets now that we have Unicode, a single, huge character set that is intended to supersede all of them? Alas, not all software supports Unicode yet, and even when it does we will still have the bane of email protocol design: backward compatibility. Even when every computer in the world understands Unicode, the world will be full of email archives with messages in every other character set imaginable. If users are to be able to view such archives, implementors need to deal with all the historical complexities.

Besides, Unicode is a big step forward, but it introduces new complexities of its own. To begin with, there are a host of ways, known as character encodings, to represent it digitally; Internet protocols generally use the representation known as UTF-8, which is by no means the simplest, but which has total backward compatibility with US-ASCII, the default character set for MIME and most older applications. Other character encodings exist: some are obsolete but may still be needed for backward compatibility, while others have certain advantages for other applications, and are here to stay.
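A short Python sketch may help show what that backward compatibility means in practice. The function name encoded_sizes and the sample characters are chosen only for illustration. Pure ASCII text produces exactly the same bytes in UTF-8 as in US-ASCII, while other characters take two, three, or four bytes; the fixed-width alternatives are simpler to reason about but are not ASCII-compatible.

```python
def encoded_sizes(ch):
    """Return the byte length of one character in UTF-8, UTF-16 and UTF-32."""
    return tuple(len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le"))

# ASCII text is byte-for-byte identical in US-ASCII and UTF-8.
ascii_is_unchanged = "MIME".encode("utf-8") == "MIME".encode("ascii")
print(ascii_is_unchanged)

# A Latin letter, an accented letter, the euro sign, and an emoji.
for ch in ("A", "\u00e9", "\u20ac", "\U0001F600"):
    print(repr(ch), encoded_sizes(ch))
```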

Another problem with Unicode is that it isn't -- and probably never will be -- completely finished. Every couple of years, the standard is extended to include a bunch of characters that someone, somewhere on this planet considers essential, though it's sometimes hard to see why. This means that even a perfect Unicode implementation needs to be updated every few years.

Unfortunately, the success of the Unicode standard opens a bit of a security vulnerability. With over 100,000 characters and growing, it's not surprising that some of the characters in Unicode look almost identical. This makes it much harder to perform accurate string comparisons, which can lead to odd results. If software reports that the two strings "Unicode" and "Unicode" are different, that might mean that the first i is from the English character code points, while the second comes from the Turkish code points. Writing code that takes all of this into account is fiendishly complicated.
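Here is a minimal Python sketch of the comparison problem. As an assumption for the example, the look-alike is the Cyrillic letter U+0456, which most fonts draw identically to the Latin "i"; the helper name differing_characters is invented for this post. The two strings print the same but compare as different, and Unicode normalization does not merge them, because they really are different letters.

```python
import unicodedata

latin = "Unicode"
mixed = "Un\u0456code"  # U+0456 is a Cyrillic letter that resembles Latin "i"

def differing_characters(a, b):
    """List (position, name in a, name in b) wherever two equal-length strings differ."""
    return [
        (i, unicodedata.name(x), unicodedata.name(y))
        for i, (x, y) in enumerate(zip(a, b))
        if x != y
    ]

print(latin == mixed)
print(differing_characters(latin, mixed))

# Normalization folds equivalent forms together, but not distinct letters.
still_different = (
    unicodedata.normalize("NFKC", latin) != unicodedata.normalize("NFKC", mixed)
)
print(still_different)
```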

All of this creates a nasty opening for phishers, among others. Consider a standard phishing scam: an email message wants you to click on:

http://www.cit1bank.com

Many people (not the sophisticated patrons of this site, of course) are fooled by the substitution of "1" for "i" and go to the malicious site. But with Unicode, it will be hard for even the most sophisticated of us to recognize that there is a difference between

http://www.citibank.com

and

http://www.citibank.com

You can stop trying to spot the difference; there isn't one. So it becomes the responsibility of security software to understand the differences and warn you away from even the cleverest phishing scams.
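One check such software might perform is to flag any hostname label that mixes letters from more than one script. The Python sketch below is a crude heuristic, not a real defense: it guesses the script from the first word of each character's Unicode name, which is a shortcut I am assuming for illustration (production code would use the Unicode script property and a proper confusables table). The function names and the spoofed hostname are invented for the example.

```python
import unicodedata

def scripts_in_label(label):
    """Guess the scripts used in one hostname label from Unicode character names."""
    scripts = set()
    for ch in label:
        if ch.isalpha():
            # First word of the name, e.g. "LATIN" or "CYRILLIC" (rough approximation).
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts

def looks_suspicious(hostname):
    """Flag hostnames in which any single label mixes more than one script."""
    return any(len(scripts_in_label(label)) > 1 for label in hostname.split("."))

genuine = "www.citibank.com"
spoofed = "www.c\u0456tibank.com"  # second letter is Cyrillic U+0456, not Latin "i"

print(looks_suspicious(genuine))
print(looks_suspicious(spoofed))
```

A name written entirely in one non-Latin script passes this check, as it should; the heuristic only objects to mixtures within a label, which is where the most convincing forgeries tend to hide.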

The advent of Unicode is particularly vexing for email, where the syntactic treatment of characters varies within the message structure, with radically different syntactic restrictions. MIME made it straightforward to put almost any language in the body, and MIME part two provided an ugly but workable mechanism for putting almost any language in almost any part of the message header. But left out, for the last twenty-odd years, has been the ability to put international characters in domain names or email addresses. It appears it is finally going to happen; I'll say more in a future post.
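For readers who have never seen the "ugly but workable" header mechanism, here is a sketch using Python's standard email.header module. The mechanism wraps non-ASCII header text in ASCII-only "encoded words" (currently specified in RFC 2047). The sample subject line is made up, and I print the encoded form rather than quote it, since the library chooses between the base64 and quoted-printable variants on its own.

```python
from email.header import Header, decode_header, make_header

subject = "Caf\u00e9 menu"  # hypothetical subject line with one non-ASCII letter

# Encode for transport: the result contains only ASCII characters.
encoded = Header(subject, "utf-8").encode()
print(encoded)

# Decode on the receiving side to recover the original text.
decoded = str(make_header(decode_header(encoded)))
print(decoded)
```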

Email software won't get any less complicated as it tries to deal with these problems. Ultimately, email needs to be complicated enough to meet all the needs of human communication.