Home Page
Archive > Posts > Tags > UTF8
Search:

Weird filename encoding issues on windows

So somehow all of the file names in my Rammstein music directory, and some in my Daft Punk, had characters with diacritics replaced with an invalid character. I pasted one of such filenames into a hex editor to evaluate what the problem was. First, I should note that Windows encodes its filenames (and pretty much everything) in UTF16. Everything else in the world (mostly) has settled on UTF8, which is a much better encoding for many reasons. So during some file copy/conversion at some point in the directories’ lifetime, the file names had done a freakish (utf16*)(utf16->utf8) rename, or something to that extent. I had noticed that all I needed to do was to replace the first 2 bytes of the diacritic character with a different byte. Namely “EF 8x” to “Cx”, and the rest of the bytes for the character were fine. So if anyone ever needs it, here is the bash script.

LANG=;
IFS=$'\n'
for i in `find -type f | grep -P '\xEF[\x80-\x8F]'`; do
	FROM="$i";
	TO=$(echo "$i" | perl -pi -e 's/\xEF([\x80-\x8F])/pack("C", ord($1)+(0xC0-0x80))/e');
	echo Renaming "'$FROM'" to "'$TO'"
	mv "$FROM" "$TO"
done

I may need to expand the range beyond the x80-x8F range, but am unsure at this point. I only confirmed the range x82-x83.

UTF8 fix for iOS mail
Thanks apple</sarcasm>

Nowadays, it just makes sense to do everything in utf8 from the start. Most everything of consequence now supports it, it keeps byte count down for most anyone using an alphabet that is primarily latin, it is compatible with a old of older software, and can help lead to some useful shortcuts in programming. A very good number of libraries and languages even assume it as the default character encoding now.

It’s especially nice for html and web pages, so content can just be pasted in as-is, except of course for single and double quotes (which I always replace with ’ “ ” anyways) and < > &. However, I was recently thrown for a loop when my utf8 encoding was not being read correctly in Apple’s iOS mail client. After lots of testing and research, I came to the conclusion that it just plain doesn’t support utf-8 by default! I think I read somewhere you have to install foreign language packs to make it available, but that doesn’t help our normal latin-based-alphabet users.


The simple solution for diacritic (and other html encodable) characters is to just encode them as html: (only supported for php>=5.3.4)
function EncodeForIOS($String) //This assumes the following characters are already encoded (line 2): & < >
{
	$MailHTMLConvertArray=get_html_translation_table(HTML_ENTITIES, ENT_NOQUOTES, 'UTF-8');
	unset($MailHTMLConvertArray['&'], $MailHTMLConvertArray['<'], $MailHTMLConvertArray['>']);
	return strtr($String, $MailHTMLConvertArray);
}
UTF8 BOM
When a good idea is still considered too much by some

While UTF-8 has almost universally been accepted as the de-facto standard for Unicode character encoding in most non-Windows systems (mmmmmm Plan 9 ^_^), the BOM (Byte Order Marker) still has large adoption problems. While I have been allowing my text editors to add the UTF8 BOM to the beginning of all my text files for years, I have finally decided to rescind this practice for compatibility reasons.

While the UTF8 BOM is useful so that editors know for sure what the character encoding of a file is, and don’t have to guess, they are not really supported, for their reasons, in Unixland. Having to code solutions around this was becoming cumbersome. Programs like vi and pico/nano seem to ignore a file’s character encoding anyways and adopt the character encoding of the current terminal session.

The main culprit in which I was running into this problem a lot with is PHP. The funny thing about it too was that I had a solution for it working properly in Linux, but not Windows :-).

Web browsers do not expect to receive the BOM marker at the beginning of files, and if they encounter it, may have serious problems. For example, in a certain browser (*cough*IE*cough*) having a BOM on a file will cause the browser to not properly read the DOCTYPE, which can cause all sorts of nasty compatibility issues.

Something in my LAMP setup on my cPanel systems was removing the initial BOM at the beginning of outputted PHP contents, but through some preliminary research I could not find out why this was not occurring in Windows. However, both systems were receiving multiple BOMs at the beginning of the output due to PHP’s include/require functions not stripping the BOM from those included files. My solution to this was a simple overload of these include functions as follows (only required when called from any directly opened [non-included] PHP file):

<?
/*Safe include/require functions that make sure UTF8 BOM is not output
Use like: eval(safe_INCLUDETYPE($INCLUDE_FILE_NAME));
where INCLUDETYPE is one of the following: include, require, include_once, require_once
An eval statement is used to maintain current scope
*/

//The different include type functions
function safe_include($FileName)	{ return real_safe_include($FileName, 'include'); }
function safe_require($FileName)	{ return real_safe_include($FileName, 'require'); }
function safe_include_once($FileName)	{ return real_safe_include($FileName, 'include_once'); }
function safe_require_once($FileName)	{ return real_safe_include($FileName, 'require_once'); }

//Start the processing and return the eval statement
function real_safe_include($FileName, $IncludeType)
{
	ob_start();
	return "$IncludeType('".strtr($FileName, Array("\\"=>"\\\\", "'", "\\'"))."'); safe_output_handler();";
}

//Do the actual processing and return the include data
function safe_output_handler()
{
	$Output=ob_get_clean();
	while(substr($Output, 0, 3)=='?') //Remove all instances of UTF8 BOM at the beginning of the output
		$Output=substr($Output, 3);
	print $Output;
}
?>

I would have like to have used PHP’s output_handler ini setting to catch even the root file’s BOM and not require include function overloads, but, as php.net puts it “Only built-in functions can be used with this directive. For user defined functions, use ob_start().”.

As a bonus, the following bash command can be used to find all PHP files in the current directory tree with a UTF8 BOM:

grep -rlP "^\xef\xbb\xbf" . | grep -iP "\.php\$"

[Edit on 2015-11-27]
Better UTF8 BOM file find code (Cygwin compatible):
 find . -name '*.php' -print0 | xargs -0 -n1000 grep -l $'^\xef\xbb\xbf'
And to remove the BOMs (Cygwin compatible):
find . -name '*.php' -print0 | xargs -0 -n1000 grep -l $'^\xef\xbb\xbf' | xargs -i perl -i.bak -pe 'BEGIN{ @d=@ARGV } s/^\xef\xbb\xbf//; END{ unlink map "$_$^I", @d }' "{}"
Simpler remove BOMs (not Cygwin/Windows compatible):
find . -name '*.php' -print0 | xargs -0 -n1000 grep -l $'^\xef\xbb\xbf' | xargs -i perl -i -pe 's/^\xef\xbb\xbf//' "{}"