TextFileEncodingDetector project
Modified on 2012-05-02 09:00 by TaoK
Categorized as C#, dotNet 2
There's an awkward situation on Windows machines (and, I suspect, more generally): text files, and text-based files like CSV files, can be saved in any number of encodings - Windows codepages, less-common encodings such as EBCDIC, and more modern encodings like UTF-8 and UTF-16. The newer Unicode formats have a standard way of "self-describing" the encoding, in the form of a Byte Order Mark (BOM), but the BOM is often not present, and in the case of UTF-8 is in fact actively discouraged by the Unicode Consortium. For UTF-8 in particular this poses a problem, because UTF-8 encoding looks a whole lot like ASCII/ANSI/Windows-1252/Latin-1, a family of related encodings commonly used (and confused) on Windows systems, and nowadays globally.

The "correct" thing to do, when presented with a text file, is to:

# Check for a BOM, indicating a Unicode file of some specific type (see the sketch below).
# If no BOM is found, ask the user what encoding was used (preferably offering suggestions in "most likely first" order).

Or at least, this is the opinion of many developers - see [http://stackoverflow.com/questions/90838/how-can-i-detect-the-encoding-codepage-of-a-text-file|this Stack Overflow question] and the linked seminal rant by Joel Spolsky, [http://www.joelonsoftware.com/printerFriendly/articles/Unicode.html|The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)].
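To make step 1 concrete, here is a minimal BOM check in C#. This is only an illustrative sketch, not the code from the class below; the BomSniffer/DetectBom names are made up, and it deliberately covers only the signatures relevant here (UTF-8 and UTF-16, ignoring e.g. UTF-32):

<pre>
using System.Text;

static class BomSniffer
{
    // Returns the encoding indicated by a leading BOM, or null if there is none.
    public static Encoding DetectBom(byte[] buffer, int length)
    {
        if (length >= 3 && buffer[0] == 0xEF && buffer[1] == 0xBB && buffer[2] == 0xBF)
            return Encoding.UTF8;              // UTF-8 BOM: EF BB BF
        if (length >= 2 && buffer[0] == 0xFF && buffer[1] == 0xFE)
            return Encoding.Unicode;           // UTF-16 LE BOM: FF FE
        if (length >= 2 && buffer[0] == 0xFE && buffer[1] == 0xFF)
            return Encoding.BigEndianUnicode;  // UTF-16 BE BOM: FE FF
        return null;                           // no BOM: fall back to heuristics, or ask the user
    }
}
</pre>

A null result is exactly the awkward case this article is about: a BOM-less file that could be Windows-1252, UTF-8, or something else entirely.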
Now in the real world, most users don't know what encoding their files use, and on Windows machines in Western and particularly English-speaking countries the number of options commonly encountered is quite limited:

* Windows-1252 (a superset of Latin-1, which is itself a superset of US-ASCII)
* UTF-8, with or without BOM
* UTF-16, LE or BE, with or without BOM

Automatically determining which of these a text file uses is, 99% of the time, quite straightforward - but I couldn't find any library that does it: nothing in the .Net framework, no usable code snippets online (beyond trivial BOM detection), simply no easy way to do it.

So there it is: a simple class that automatically detects which of these encodings a file probably uses, for when your users don't have a clue. If they do get a choice, please, please get them to use Unicode - or UTF-8 with a BOM! It makes things sooo much easier...

Now, some caveats:

* If your application design permits it, it's still preferable to provide some sort of preview-and-selection dialog.
* After writing this, I came across a library on CodeProject that wraps MLang to do something very similar: [http://www.codeproject.com/KB/recipes/DetectEncoding.aspx|Detect Encoding for In- and Outgoing Text]. I haven't tested it, but it may be more appropriate in some situations (especially in multilingual environments).
* Just today, I read about another project that does something that sounds very similar: [http://utf8checker.codeplex.com/|UTF8Checker] on CodePlex. Again, I haven't tested it, although it sounds like a subset of what the class below does.

I may take the time to run some tests and turn this snippet into an actual library (assuming the MLang-based solution doesn't beat the pants off it) at some point. Any feedback would be wonderful! (Note: this is a Gist on GitHub, feel free to fork/edit/etc.)

'''Please Note:''' A couple of additional considerations have come up recently:

* Eric Popivker reported an exception under some circumstances; the fix should be checked in soon.
* He also noted that MLang doesn't always detect Unicode encodings correctly, and that a hybrid approach worked best for him: first checking for Unicode encodings with the code below, and then falling back to unmanaged MLang (nicely wrapped in [http://www.codeproject.com/Articles/17201/Detect-Encoding-for-In-and-Outgoing-Text|Carsten Zeumer's famous "EncodingTools.dll" project]). This is done in [http://findandreplace.codeplex.com/workitem/2|his open-source find-and-replace tool, fnr.exe].
* He's also noted that the code below (and MLang) does nothing to avoid binary files, which you <u>usually</u> don't want to treat as text files (chances are that if you're trying to auto-detect the encoding, you're not planning to handle arbitrary binary content). [http://www.entechsolutions.com/how-to-detect-if-file-is-text-or-binary-using-c|He mentions] a simple detection heuristic - looking for a sequence of 4 binary nulls in the raw byte stream - as a so-far-reliable way to separate binary files from text files; a sketch of that heuristic appears after the code link below.
* I'm hoping / planning to wrap this hybrid-and-binary-detection approach into a small encoding-detection library at some point, but I have no timeline established (weeks/months/years).

The class itself is maintained as a Gist on GitHub: [https://gist.github.com/945127|TextFileEncodingDetector.cs].
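For reference, here is a minimal sketch of that four-consecutive-nulls heuristic in C#. The BinaryFileGuard/LooksBinary names and the 4 KB sample size are my own illustrative choices, not taken from Eric's implementation:

<pre>
using System.IO;

static class BinaryFileGuard
{
    const int SampleSize = 4096;  // assumption: sampling the start of the file is enough

    // Heuristic: a run of 4 consecutive zero bytes essentially never occurs in
    // Windows-1252, UTF-8 or UTF-16 text, but is very common in binary files.
    public static bool LooksBinary(string path)
    {
        byte[] buffer = new byte[SampleSize];
        int bytesRead;
        using (FileStream stream = File.OpenRead(path))
            bytesRead = stream.Read(buffer, 0, buffer.Length);

        int consecutiveNulls = 0;
        for (int i = 0; i < bytesRead; i++)
        {
            consecutiveNulls = (buffer[i] == 0) ? consecutiveNulls + 1 : 0;
            if (consecutiveNulls >= 4)
                return true;
        }
        return false;
    }
}
</pre>

Note that UTF-16 text consisting mostly of ASCII characters contains plenty of isolated null bytes, which is why the heuristic requires a run of four rather than a single null.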