Tuesday, September 20, 2011

URI->path expects binary data

Update: changed new to path - with new it would be reasonable to require that the uri fed to the parser is already an ASCI string containing the already URI encoded url.
Consider this code:
use 5.010;
use Encode 'encode';
use URI;

my $uri = URI->new( 'http://example.com/' );
say $uri->path( encode("UTF-8", "can\x{00B4}t-make-it-work" ) );
say $uri->path( "can\x{00B4}t-make-it-work" );

The output (in perl 5.14.0) is:
http://example.com/can%C2%B4t-make-it-work
http://example.com/can%B4t-make-it-work

If your page is encoded in UTF8 - then the first one is correct: %C2%B4 is the URI encoded UTF8 encoding of Unicode Character 'ACUTE ACCENT' (U+00B4). If your page encoding is Latin1 - then the second one would be correct - but this is only by accident - in that case you should still use encode("iso-8859-1", ...).

There are probably many other string manipulating libs that should document if their input should be binary encoded data or decoded character strings.

No comments: