URL Encoding

What is URL encoding?

URL encoding is the practice of translating unprintable characters, or characters with special meaning within URLs, to a representation that is unambiguous and universally accepted by web browsers and servers. The characters that need encoding include:

ASCII control characters
Unprintable characters typically used for output control. Character ranges 00-1F hex (0-31 decimal) and 7F (127 decimal).
Non-ASCII characters
These are characters beyond the 128-character ASCII set, at positions 80-FF hex (128-255 decimal). This range is the upper half of the ISO-Latin character set.
Reserved characters
These are special characters such as the dollar sign, ampersand, plus, comma, forward slash, colon, semicolon, equals sign, question mark, and "at" symbol. All of these can have different meanings inside a URL, so they need to be encoded.
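
As a rough illustration of the three categories above, the following Python sketch sorts a single character into one of them. The function name classify and the category labels are made up for this example, and the reserved set contains only the characters named in the list above. The label "other" does not mean "safe": a space, for instance, lands there but still needs encoding.

```python
# Reserved characters named in the list above.
RESERVED = set("$&+,/:;=?@")

def classify(ch: str) -> str:
    """Return the category of a single character, following the list above."""
    code = ord(ch)
    if code <= 0x1F or code == 0x7F:
        return "ascii-control"   # 00-1F hex and 7F hex
    if code >= 0x80:
        return "non-ascii"       # beyond the 128 ASCII characters
    if ch in RESERVED:
        return "reserved"        # special meaning inside a URL
    return "other"               # everything else, including space

for ch in ["\t", "é", "&", "a"]:
    print(repr(ch), classify(ch))
```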

Why encode URLs?

Unprintable characters and characters with special meaning can produce results you do not expect, for two reasons:

  1. The URL specification, RFC 1738, specifies a limited set of allowable characters in a URL, but HTML allows the entire ISO-8859-1 (ISO-Latin) character range (and HTML4 goes further to include the Unicode character set).
  2. In practice, not every browser and web server will necessarily react the same way to "unexpected" characters in a URL. Control characters may be ignored, and characters with special meaning may cause the web server to interpret your URL in ways you did not expect.

In effect, if a URL uses any character other than alphanumerics, the hyphen, and the underscore, it is at risk of misinterpretation and that sequence needs to be URL encoded.

When should I encode URLs?

Anywhere a URL can be defined is a candidate for encoding: in all HTML tags, in direct entry in the address window of a browser, and in POST and GET requests from forms.
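
For form data in particular, a minimal Python sketch might look like the following. The field names and the example.com address are invented for illustration. Note that urlencode() from the standard library uses the form-style convention, in which a space becomes a plus sign rather than %20.

```python
from urllib.parse import urlencode

# Hypothetical form fields; the names and values are made up for this example.
form_fields = {"name": "John Smith", "topic": "Q&A"}

# Build an encoded query string suitable for a GET request or a POST body.
query = urlencode(form_fields)
print(query)  # name=John+Smith&topic=Q%26A

# Appended to a (hypothetical) address for a GET request.
print("http://example.com/search?" + query)
```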

You should consider any character that is not a letter or number a candidate. Spaces in particular should be encoded.

Note that you must never encode the http:// portion - it needs to remain in the clear as part of the syntax of URLs.
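
One way to respect this is to split the URL first and encode only the parts after the scheme and host. The sketch below uses Python's standard urllib.parse module; the helper name encode_url and the example address are assumptions for illustration, and it assumes the input has not already been encoded (an existing % would be encoded a second time).

```python
from urllib.parse import quote, urlsplit, urlunsplit

def encode_url(url: str) -> str:
    """Encode the path, query, and fragment; leave scheme and host untouched."""
    parts = urlsplit(url)
    path = quote(parts.path, safe="/")      # keep "/" as the path separator
    query = quote(parts.query, safe="=&")   # keep "=" and "&" as query syntax
    fragment = quote(parts.fragment)
    return urlunsplit((parts.scheme, parts.netloc, path, query, fragment))

print(encode_url("http://example.com/my docs/report 1.html?name=John Smith"))
# http://example.com/my%20docs/report%201.html?name=John%20Smith
```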

How to encode URLs

By hand

Conceptually this is very simple. The URL encoding of any character consists of a leading percent symbol followed by the 2-digit hexadecimal code of that character in the ISO-Latin character set.

For example, the space character is at position 20 hexadecimal (32 decimal), so its URL encoding is %20.
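
The rule above can be sketched in a few lines of Python. The function name percent_encode and the safe set (alphanumerics, hyphen, underscore) are choices made for this example, not a standard API. The default charset follows the ISO-Latin description above; characters outside that set raise a UnicodeEncodeError.

```python
import string

# Characters left as-is in this sketch: alphanumerics, hyphen, underscore.
SAFE = set(string.ascii_letters + string.digits + "-_")

def percent_encode(text: str, charset: str = "latin-1") -> str:
    """Replace every unsafe byte with % followed by its 2-digit hex code."""
    out = []
    for byte in text.encode(charset):
        ch = chr(byte)
        if ch in SAFE:
            out.append(ch)
        else:
            out.append("%{:02X}".format(byte))
    return "".join(out)

print(percent_encode("a b"))   # a%20b
print(percent_encode("café"))  # caf%E9
```

Passing charset="utf-8" instead encodes each UTF-8 byte separately, so "café" becomes caf%C3%A9.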

PHP

Use the library functions urlencode() and urldecode(), available in PHP since version 3.

Perl

There are a million ways to craft regular expressions and replacements in Perl, or you can use the CGI Perl module.

Encode with the escape() function. The related escapeHTML() function escapes HTML markup characters rather than URL encoding them.

use CGI;
my $url = "foo<b>bar.html";
print CGI::escapeHTML($url), "\n";  # HTML-escaped: foo&lt;b&gt;bar.html
print CGI::escape($url), "\n";      # URL-encoded:  foo%3Cb%3Ebar.html

To decode, let CGI.pm handle this transparently for you.

use CGI;
my $query = CGI->new;              # parses the incoming request
my $data = $query->param("name");  # value is already URL decoded

The $data string will have all URL encoded values decoded for you.

Java

Use the static encode() method of java.net.URLEncoder. Note that it produces form-style encoding, in which a space becomes a plus sign rather than %20.

JavaScript

Use the String.charCodeAt and String.fromCharCode functions, but note that they are only available in JavaScript version 1.2 or higher, so not all browsers support them. Any version of Opera, Netscape, or Internet Explorer below version 3 will probably not work.

Python

from urllib.parse import quote  # Python 3; in Python 2 this was urllib.quote
print(quote("string"))
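
A slightly fuller sketch for Python 3, where quote() and unquote() live in urllib.parse, shows a round trip. The sample string is made up for illustration. By default quote() leaves the forward slash unencoded; passing safe="" encodes it as well.

```python
from urllib.parse import quote, unquote

text = "a b&c/d"  # sample input, chosen for this example

encoded = quote(text)
print(encoded)               # a%20b%26c/d
print(quote(text, safe=""))  # a%20b%26c%2Fd

# Decoding restores the original string.
print(unquote(encoded))      # a b&c/d
```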

References