For months, we’ve had ongoing challenges with odd characters at the very beginning of certain e-mails. At the same time, we’ve had (seemingly unrelated) challenges with characters like apostrophes (’) appearing as other odd characters in e-mails. Open the source documents in plain-text editors (TextEdit, Notepad, Coda), and they look fine. No problem. Send them through our e-mail generation system, and crazy characters appear. Must be the e-mail generation system, right?
Wrong.
It came down to the encoding our plain-text editors use. All of these text editors automatically encode the plain text based on whether non-ASCII characters are used. Include a curly apostrophe, and TextEdit encodes in Unicode. Unless you’ve got superhero vision, you won’t notice that the document includes a curly apostrophe. Now send that Unicode document to a system that’s expecting straight ASCII, and you get crazy characters in place of the curly apostrophe, since ASCII doesn’t include the curly apostrophe.
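Here is a minimal sketch of that failure, in Python. The sample sentence is made up, and I'm assuming the editor saved the file as UTF-8 and the receiving system reads it as Windows-1252 or plain ASCII; your mix of encodings may differ.

```python
# Illustrative only: the sample text and the encodings chosen are assumptions.
text = "Don\u2019t miss out"  # U+2019 is the curly apostrophe

# What the text editor writes to disk when it quietly switches to UTF-8.
utf8_bytes = text.encode("utf-8")

# A reader that assumes Windows-1252 sees three bytes as three separate characters.
garbled = utf8_bytes.decode("cp1252")
print(garbled)  # Donâ€™t miss out

# A strict ASCII reader has no character for those bytes at all.
ascii_attempt = utf8_bytes.decode("ascii", errors="replace")
print(ascii_attempt)

# A quick way to tell whether a string is safe for an ASCII-only system.
print(text.isascii())  # False
```

One curly apostrophe becomes three junk characters because UTF-8 stores it as three bytes.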
But wait, there’s more.
A Unicode-encoded file also often has a Byte Order Mark (BOM) at the beginning that signals which encoding is used. Since the BOM is part of the description of the file, but not the body of the file, Unicode-enabled text editors don’t display it. (It’s actually more complicated than that, but this description helps me keep it straight in my head. Check out the Wikipedia entry for the real explanation.) The point is, our poor HTML coder can’t see that there’s a problem.
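Since the BOM is invisible in the editor, the reliable way to spot it is to look at the raw bytes. The sketch below is one way to do that in Python; `split_bom` and the sample bytes are names I made up for illustration, and the BOM constants come from Python's standard `codecs` module.

```python
import codecs

# Known byte order marks. UTF-32 comes first because its little-endian BOM
# starts with the same two bytes as the UTF-16 little-endian BOM.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def split_bom(data: bytes):
    """Return (encoding name or None, data with any leading BOM removed)."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, data[len(bom):]
    return None, data


# Hypothetical file contents: a UTF-8 BOM followed by a bit of HTML.
sample = codecs.BOM_UTF8 + "<p>Hello</p>".encode("utf-8")

print(split_bom(sample))        # ('utf-8', b'<p>Hello</p>')
print(sample.decode("cp1252"))  # ï»¿<p>Hello</p>
```

The last line shows what a system that assumes Windows-1252 does with a UTF-8 BOM: it prints the three BOM bytes as three odd characters at the very beginning of the text.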
So if you inadvertently include a non-ASCII character in a supposedly plain-text file, the file will likely get saved as Unicode and show crazy characters when opened in an ASCII-only application, like most e-mail systems.
UnicodeChecker to the Rescue
Now that we know what the problem is, it’s easy to adjust our process to keep us clean. The main change: UnicodeChecker, a great little app from Steffen Kamp and Sven-S. Porst at earthlingsoft. It installs as a service on the Mac. Highlight your code in your text editor, click Services-UnicodeChecker, and it will convert any Unicode symbols to their HTML/ASCII equivalents.
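If you're not on a Mac, the same kind of cleanup can be approximated in a few lines of Python. To be clear, this is not how UnicodeChecker works internally (I don't know that); it's a rough stand-in that uses Python's built-in `xmlcharrefreplace` error handler, and `to_ascii_html` is a hypothetical helper name.

```python
def to_ascii_html(text: str) -> str:
    """Return an ASCII-only copy of text, with non-ASCII characters
    replaced by HTML numeric character references."""
    # Drop a leading BOM character if the text was decoded with one attached.
    text = text.lstrip("\ufeff")
    # xmlcharrefreplace turns each non-ASCII character into &#NNNN; form.
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


print(to_ascii_html("\ufeffDon\u2019t miss out"))  # Don&#8217;t miss out
```

The result contains only ASCII characters, so an ASCII-only e-mail system passes it through untouched, and the recipient's mail client renders `&#8217;` as a curly apostrophe.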
