Monday, October 03, 2011

open expects the filename as binary data encoded in the system character set

I guess this is not a surprise to anyone who thought about how this is supposed to work, but for the sake of being systematic, here is the code:
use strict;
use warnings;
use autodie;
use HTML::Entities;
use Encode;

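# decode() turns the entity into the single character U+00F1 (ñ)
my $a = HTML::Entities::decode( '&ntilde;' );

# Character string passed as-is: the filename becomes the single byte 0xF1
open(my $fh, '>', $a );
print $fh "Without encoding\n";
close $fh;

# Explicitly encoded: the filename becomes the UTF-8 bytes 0xC3 0xB1
open(my $fh1, '>', encode( 'UTF-8', $a ) );
print $fh1 "With encoding\n";
close $fh1;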

And here is the result when run on a system with a UTF-8 locale:
zby@zby:~/myopera/tmp$ ls
? a.pl ñ
zby@zby:~/myopera/tmp$ cat ñ
With encoding

'a.pl' is the script itself. The '?' stands for a file whose name is the single byte 0xF1 (ñ in Latin-1); that byte is not valid UTF-8 on its own, so ls cannot display it. That file contains 'Without encoding'.
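To handle this explicitly, here is a minimal sketch of encoding names on the way into open and decoding them on the way out of readdir. It assumes a Linux-like system where filenames are stored as UTF-8; that is an assumption, not something Perl or the filesystem guarantees. The helpers filename_to_bytes and bytes_to_filename are hypothetical names made up for this example.

```perl
use strict;
use warnings;
use Encode qw(encode decode);
use File::Temp qw(tempdir);

# Assumption: filenames on this system are UTF-8 encoded.
my $FS_ENCODING = 'UTF-8';

# Hypothetical helpers (not part of Perl or Encode):
# convert between character strings and the bytes the system calls see.
sub filename_to_bytes { my ($chars) = @_; return encode($FS_ENCODING, $chars) }
sub bytes_to_filename { my ($bytes) = @_; return decode($FS_ENCODING, $bytes) }

my $dir  = tempdir(CLEANUP => 1);   # scratch directory, removed at exit
my $name = "\x{F1}";                # the character ñ (U+00F1)

# Encode the name before it reaches open()
open(my $out, '>', "$dir/" . filename_to_bytes($name)) or die "open: $!";
print {$out} "With encoding\n";
close($out) or die "close: $!";

# readdir() hands back bytes, so decode them to get characters again
opendir(my $dh, $dir) or die "opendir: $!";
my @names = map { bytes_to_filename($_) } grep { !/^\.\.?$/ } readdir($dh);
closedir($dh);

print "round trip ", ($names[0] eq $name ? "ok" : "failed"), "\n";
```

Whether the round trip succeeds depends on the filesystem: one that normalizes names (HFS+ decomposes ñ, for example) can return different bytes from the ones that were written.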

6 comments:

LeoNerd said...

Character-to-byte encodings keep cropping up in many places, far more than just the "bytes of a file stream" that people tend to think of. As far as I'm aware there's no control in Perl to set the encoding of characters to pass to, or expect from, filename-related syscalls (open, stat, readdir, etc.), so you have to do the Encode dance yourself here.

zby said...

Yeah - this is my point. I try to show concrete examples each time - but I think it is now more or less evident that there are lots of cases like this.

dolmen said...

The heart of the problem is that some systems have decided that filenames in system API calls are bytes, not characters, and that it is the task of the underlying filesystem to interpret them as characters (or not).
This is the case for Linux.

abraxxa said...

A cross-OS language like Perl should be platform independent wherever possible, definitely for such common things as file system access.
So what's the best way to deal with that?
Compile time options that define the default encoding for filenames, STDIN/STDOUT, etc. or a startup detection?

Sid Burn said...

Well, it is cross-platform. Nobody can tell you the encoding of the filename. Even if your default is set to UTF-8, the filename can be in some other encoding. So treating everything as binary is the only correct way.

What do you expect? Create all possible encodings and then open a random file that matches? The same string can have multiple different encodings, and all of them can be in the same directory, because filesystems don't know anything about encodings.

zby said...

@Sid Burn - what would I expect? I would expect the documentation for the 'open' function to define this. I think bytes is the only practical solution for now - but it should be documented. The examples you mention are a complete straw man, but the compile-time options mentioned by abraxxa above could work.