Page History: TextFileEncodingDetector project
Page Revision: 2011-04-29 05:39
There's an awkward situation on Windows machines (and, I suspect, more generally). Text files, and text-based files such as CSV, can be saved in any number of encodings: Windows code pages, less common encodings such as EBCDIC, and more modern encodings like UTF-8 and UTF-16.
The newer Unicode formats have a standard way of "self-describing" the encoding, in the form of a Byte Order Mark (BOM), but it is often absent, and in the case of UTF-8 the Unicode Consortium does not even recommend using one.
For UTF-8 in particular, this poses a problem because UTF-8 without a BOM looks a whole lot like ASCII/ANSI/Windows-1252/Latin-1, a family of related encodings that are commonly used, and commonly confused, on Windows systems and nowadays pretty much everywhere.
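To make the resemblance concrete, here is a small Python illustration (not part of the project, which targets .NET). It only uses the standard codecs; `cp1252` is Python's name for Windows-1252, and the sample strings are made up for the demonstration.

```python
# Pure ASCII text is byte-for-byte identical in ASCII, Windows-1252 and UTF-8,
# so such a file gives no hint about which encoding was intended.
ascii_text = "plain text"
same_bytes = (
    ascii_text.encode("ascii")
    == ascii_text.encode("cp1252")
    == ascii_text.encode("utf-8")
)
print(same_bytes)  # True

# A non-ASCII character such as U+00E9 becomes two bytes in UTF-8 ...
utf8_bytes = "caf\u00e9".encode("utf-8")  # b'caf\xc3\xa9'

# ... and both of those bytes are also valid Windows-1252 characters,
# so decoding with the wrong encoding succeeds silently and yields mojibake.
misread = utf8_bytes.decode("cp1252")
print(misread)  # cafÃ©
```

The wrong decode raises no error at all, which is why a file has to be inspected rather than simply opened and hoped for.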
The "correct" thing to do, when presented with a text file, is to:
- Check for a BOM, indicating a Unicode file of some specific type
- If not found, ask the user what encoding was used (preferably providing suggestions in "most likely" order).
Or at least, this is the opinion of many developers; see this Stack Overflow question and the linked seminal rant by Joel Spolsky, "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)".
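The BOM check in the first step is the easy part. A minimal Python sketch, assuming only the UTF-8 and UTF-16 marks discussed here matter, might look like the following; `detect_bom` and the `BOMS` table are hypothetical names for this illustration, while the `codecs.BOM_*` constants come from the Python standard library.

```python
import codecs

# (BOM bytes, encoding name) pairs for the encodings discussed on this page.
# UTF-32 is deliberately left out; a table that included it would need to
# test the UTF-32 LE mark before the UTF-16 LE mark, which it starts with.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
]


def detect_bom(data: bytes):
    """Return (encoding, bom_length) if data starts with a known BOM, else None."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None
```

Returning the BOM length lets the caller skip the mark before decoding, for example `data[length:].decode(encoding)`.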
Now in the real world, most users don't know what encoding their files use, and on Windows machines in Western and particularly English-speaking countries, the number of options commonly encountered is quite limited:
- Windows-1252 (a superset of the printable characters of Latin-1, which itself is a superset of US-ASCII)
- UTF-8, with or without BOM
- UTF-16, LE or BE, with or without BOM
Automatically determining which of these a text file uses is, 99% of the time, quite straightforward, but I couldn't find any libraries that do it - nothing in the .NET framework, no usable code snippets online (beyond trivial BOM detection), simply no easy way to do it.
So there it is. A simple class that automatically detects which of these encodings a file probably uses, for when your users don't have a clue. If they do get a choice, please please get them to use "Unicode" (UTF-16) or UTF-8 with BOM! It makes things sooo much easier...
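For readers who want a feel for how such a guess can work, here is a rough Python sketch of one possible heuristic. It is not the project's .NET class, and the function name `guess_encoding`, the 4096-byte sample size and the 60%/10% thresholds are all assumptions chosen for illustration: look for a BOM, then for the zero-byte pattern typical of BOM-less UTF-16, then try a strict UTF-8 decode, and otherwise fall back to Windows-1252.

```python
import codecs


def guess_encoding(data: bytes) -> str:
    """Guess among UTF-8, UTF-16 LE/BE and Windows-1252 (illustrative only)."""
    # 1. A BOM, when present, settles the question.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"  # this codec strips the BOM when decoding
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"     # this codec reads the BOM to pick the byte order

    # 2. BOM-less UTF-16 of mostly Latin text has a zero byte in every
    #    other position: odd offsets for LE, even offsets for BE.
    sample = data[:4096]    # sample size is an arbitrary assumption
    pairs = len(sample) // 2
    if pairs:
        even_nulls = sample[0::2].count(0)
        odd_nulls = sample[1::2].count(0)
        if odd_nulls > 0.6 * pairs and even_nulls < 0.1 * pairs:
            return "utf-16-le"
        if even_nulls > 0.6 * pairs and odd_nulls < 0.1 * pairs:
            return "utf-16-be"

    # 3. Windows-1252 text containing non-ASCII bytes is very unlikely to
    #    form valid UTF-8 sequences, so a clean strict decode suggests UTF-8.
    #    Pure ASCII also lands here, which is harmless.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # 4. Fallback for the Western/English-speaking Windows case.
        return "cp1252"
```

The returned name can be passed straight to `data.decode(...)`. The result is still only a guess: text in some other single-byte code page would be reported as Windows-1252.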
Now, some caveats:
- If your application design permits it, it's still preferable to provide some sort of preview and selection dialog.
- After writing this, I came across a library on CodeProject that wraps MLang to do something very similar: Detect Encoding for In- and Outgoing Text. I haven't tested this, but it may be more appropriate in some situations (especially in multilingual environments).
I may take the time to run some tests and turn this snippet into an actual library (assuming the MLang-based solution doesn't beat the pants off it) at some point.
Any feedback would be wonderful!