UTF-8 is NOT ASCII!

At my current client we had an interesting (and probably very common issue).  The project we're on is to integrate my client's internal accounting applications (a mixture of Microsoft Dynamics 6.5 and custom applications) with "GetPaid" from Sungard.  The vendor asked for a file, we asked what encoding they wanted and they stated "UTF-8".  We produced the software to extract the data and run the vendor's utility code to load it - and it fails because of "bad data" in the comments field.

The comments field is a free form field for a customer.  My client's intrepid QA Tester showed screen shots of the data in the Accounting System that had no issues, but did include a backwards quote (apostrophe, accent grave or whatever you want to call it).  My colleague from Magenic Scott Janssens had written a general-purpose C# utility that took an XML document as input which would specify a stored procedure name and specifications to convert columns from the query to text.  As instructed, Scott specified UTF-8 encoding.

Scott did some research (a cautious developer always takes into consideration THEY did something wrong).  A further bit of confusion being the database in question was SQL 2000.  SQL 2000 doesn't use UTF-8 but the now obsolete UCS-2.  But ADO.NET makes that conversion without any issue.

Scott researched the issue and explains it as well (or better) than I can as follows:

Here's the skinny on encoding:

Firstly, UTF-8 is not ASCII.  I had this wrong and I suspect SunGard is mistaken in the same way I was.

ASCII (aka ANSI, aka Plain Text) stores a character using a single byte.

UTF-8 is a Unicode encoding which stores a character using 1, 2, 3, or 4 bytes.

In UTF-8 all the ASCII characters are represented with one byte.  This makes UTF-8 backwards compatible with ASCII.

The problem we're seeing is when the UTF-8 text in the database contains non-ASCII characters.  In the case below, the character is a Unicode apostrophe.  This is different than the ASCII apostrophe.  The Unicode apostrophe is represented by three bytes.  When read with the proper encoding these three bytes display as an apostrophe.  When read with the ASCII encoding, each of the three bytes is displayed (because the encoding assumes each byte is a character) with whatever ASCII characters the byte values correspond to.  The result is gibberish.

These values in the database are probably the result of the user cutting and pasting text from Word or Outlook.

The file we're sending Sungard is currently encoded in UTF-8.  I suspect they want it in ASCII.

Depending on the needs of the project sending the data in ASCII may or may not be an issue.  When converting to ASCII, any character that doesn't exist in ASCII is replaced with a question mark.  So in our example below, "doesn't" becomes "doesn?t".  If we need to keep the strings as they appear in our database, Sungard will need to properly read UTF-8.  If we don't care about non-ASCII characters appearing as question marks, then we can send the data to Sungard in ASCII format.

The change to make the converter output ASCII is a one line change, changing the encoding object from UTF8Encoding to ASCIIEncoding.

Interesting stuff - good work Scott!

Published 18 June 08 11:42 by DanielS

Comments

# Fuzzy's Blogic said on August 27, 2008 3:35 PM:

I just finished implementing VB6 build for my client. It was pretty easy, but I learned a few things.

Anonymous comments are disabled