URL encoding converts characters that are not allowed in a URL into a format that can be transmitted over the Internet. Each such character is replaced with a “%” followed by two hexadecimal digits. URLs cannot contain spaces, so a space is encoded as %20; in HTML form data it is commonly written as a plus (+) sign instead.
URL encoding is a mechanism for translating unprintable or special characters into a format that is universally accepted by web servers and browsers. It can be applied to Uniform Resource Names (URNs), Uniform Resource Identifiers (URIs) and Uniform Resource Locators (URLs): selected characters are replaced by one or more character triplets, each consisting of the percent character and two hexadecimal digits. The hexadecimal digits in each triplet represent the numeric byte value of the character that is replaced. URL encoding is widely used in HTML form data submission in HTTP requests.
URL encoding is also known as percent-encoding.
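As a quick illustration of the two space conventions, here is a minimal sketch using Python's standard `urllib.parse` module. The sample string is made up for the example.

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus

text = "hello world & more"  # hypothetical sample input

# quote() uses %20 for a space (suitable for path components).
path_style = quote(text)

# quote_plus() uses "+" for a space (the HTML form convention).
form_style = quote_plus(text)

print(path_style)  # hello%20world%20%26%20more
print(form_style)  # hello+world+%26+more

# Each decoder reverses its own convention.
print(unquote(path_style))       # hello world & more
print(unquote_plus(form_style))  # hello world & more
```

Note that `unquote()` leaves a literal “+” alone, so form-encoded data should be decoded with `unquote_plus()`.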
Details of URL encoding
Types of URI characters
The characters allowed in a URI are either reserved or unreserved (or a percent character as part of a percent-encoding). Reserved characters are characters that sometimes have special meaning. For example, forward slash characters are used to separate different parts of a URL (or more generally, a URI). Unreserved characters have no such special meanings. Using percent-encoding, reserved characters are represented using special character sequences. The sets of reserved and unreserved characters and the circumstances under which certain reserved characters have special meaning have changed slightly with each new revision of specifications that govern URIs and URI schemes.
RFC 3986 section 2.2 Reserved Characters (January 2005)

! | * | ' | ( | ) | ; | : | @ | & | = | + | $ | , | / | ? | # | [ | ]
RFC 3986 section 2.3 Unreserved Characters (January 2005)

A | B | C | D | E | F | G | H | I | J | K | L | M | N | O | P | Q | R | S | T | U | V | W | X | Y | Z
a | b | c | d | e | f | g | h | i | j | k | l | m | n | o | p | q | r | s | t | u | v | w | x | y | z
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | - | _ | . | ~
All other characters in a URI must be percent-encoded.
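The two character sets above can be written down directly in code. The following is a small sketch, not a full URI validator; the `classify` helper and its category labels are names invented for this example.

```python
import string

# Character sets as listed in RFC 3986 sections 2.2 and 2.3.
UNRESERVED = set(string.ascii_letters + string.digits + "-_.~")
RESERVED = set("!*'();:@&=+$,/?#[]")

def classify(ch: str) -> str:
    """Return a label describing how a single character is treated in a URI."""
    if ch in UNRESERVED:
        return "unreserved"
    if ch in RESERVED:
        return "reserved"
    if ch == "%":
        return "percent"
    # Anything else (spaces, quotes, non-ASCII, ...) must be percent-encoded.
    return "must-encode"

for ch in ["a", "~", "/", "%", " ", "é"]:
    print(repr(ch), classify(ch))
```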
Percent-encoding reserved characters
When a character from the reserved set (a “reserved character”) has special meaning (a “reserved purpose”) in a particular context and a URI scheme says that it is necessary to use that character for some other purpose, then the character must be percent-encoded. Percent-encoding a reserved character means converting the character to its corresponding byte value in ASCII and then representing that value as a pair of hexadecimal digits. The digits, preceded by a percent sign (“%”), are then used in the URI in place of the reserved character. A non-ASCII character is typically converted to its byte sequence in UTF-8 first, and each byte value is then represented in the same way.
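The conversion described above can be sketched by hand in a few lines. The `percent_encode` function is a hypothetical helper written for this illustration, and it assumes UTF-8 as the byte encoding.

```python
def percent_encode(ch: str) -> str:
    """Percent-encode every byte of the given text, assuming UTF-8."""
    # Each byte becomes "%" followed by two uppercase hexadecimal digits.
    return "".join(f"%{byte:02X}" for byte in ch.encode("utf-8"))

print(percent_encode("/"))  # %2F        (one ASCII byte)
print(percent_encode("é"))  # %C3%A9     (two UTF-8 bytes)
print(percent_encode("€"))  # %E2%82%AC  (three UTF-8 bytes)
```

In real code a library routine such as `urllib.parse.quote` is usually a better choice, since it also knows which characters to leave untouched.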
The reserved character “/”, for example, if used in the “path” component of a URI, has the special meaning of being a delimiter between path segments. If, according to a given URI scheme, “/” needs to be in a path segment, then the three characters “%2F” (or “%2f”) must be used in the segment instead of a “/”.
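To see why this matters, consider a path segment whose data contains a slash. This sketch uses `urllib.parse.quote` with `safe=""` so that “/” is encoded too; the segment value and the `/artists/` prefix are made up for the example.

```python
from urllib.parse import quote, unquote

segment = "AC/DC"  # hypothetical data containing a slash

# By default quote() treats "/" as safe; safe="" forces it to be encoded.
encoded_segment = quote(segment, safe="")
path = "/artists/" + encoded_segment
print(path)  # /artists/AC%2FDC

# Split on the delimiter first, then decode each segment.
parts = [unquote(p) for p in path.split("/")]
print(parts)  # ['', 'artists', 'AC/DC']
```

Decoding before splitting would turn “%2F” back into a delimiter and produce an extra segment, which is why the order of the two steps matters.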
Reserved characters after percent-encoding

! | # | $ | & | ' | ( | ) | * | + | , | / | : | ; | = | ? | @ | [ | ]
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
%21 | %23 | %24 | %26 | %27 | %28 | %29 | %2A | %2B | %2C | %2F | %3A | %3B | %3D | %3F | %40 | %5B | %5D
Reserved characters that have no reserved purpose in a particular context may also be percent-encoded, but the encoded form is not semantically different from the unencoded one.
In the “query” component of a URI (the part after a “?” character), for example, “/” is still considered a reserved character but it normally has no reserved purpose (unless a particular URI scheme says otherwise). The character does not need to be percent-encoded when it has no reserved purpose.
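A short sketch of this point, using `urllib.parse.urlencode`: the slash in a query value can be written either way and a typical query parser recovers the same data. The parameter name `file` and its value are assumptions for the example.

```python
from urllib.parse import urlencode, parse_qs

params = {"file": "docs/readme"}  # hypothetical query parameter

# Default behaviour: the slash is percent-encoded.
encoded_default = urlencode(params)
# With safe="/", the slash is left as a literal character.
encoded_literal = urlencode(params, safe="/")

print(encoded_default)  # file=docs%2Freadme
print(encoded_literal)  # file=docs/readme

# Both forms parse back to the same value.
print(parse_qs(encoded_default) == parse_qs(encoded_literal))  # True
```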
URIs that differ only by whether a reserved character is percent-encoded or not are normally considered not equivalent (that is, they do not denote the same resource) unless the reserved characters in question have no reserved purpose. This determination depends on the rules established for reserved characters by individual URI schemes.
Percent-encoding unreserved characters
Characters from the unreserved set never need to be percent-encoded.
URIs that differ only by whether an unreserved character is percent-encoded or not are equivalent by definition, but URI processors, in practice, may not always treat them equivalently. For example, URI consumers shouldn’t treat “%41” differently from “A” (“%41” is the percent-encoding of “A”) or “%7E” differently from “~”, but some do. For maximum interoperability, URI producers are therefore discouraged from percent-encoding unreserved characters.
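A consumer that wants to compare such URIs can normalize them first. The sketch below decodes only those triplets that stand for unreserved characters and uppercases the hex digits of the rest, in the spirit of the normalization described in RFC 3986 section 6.2.2; `normalize_unreserved` is a hypothetical helper, not a standard library function.

```python
import re
import string

UNRESERVED = set(string.ascii_letters + string.digits + "-_.~")

def normalize_unreserved(uri: str) -> str:
    """Decode percent-encoded unreserved characters; leave the rest encoded."""
    def replace(match: re.Match) -> str:
        ch = chr(int(match.group(1), 16))
        if ch in UNRESERVED:
            return ch                    # e.g. %41 -> A, %7E -> ~
        return match.group(0).upper()    # e.g. %2f -> %2F (stays encoded)
    return re.sub(r"%([0-9A-Fa-f]{2})", replace, uri)

print(normalize_unreserved("/%7Euser/%41bc%2fdef"))  # /~user/Abc%2Fdef
```

Reserved characters such as “%2F” are deliberately left encoded, because decoding them could change the meaning of the URI.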
Percent-encoding the percent character
Because the percent (“%”) character serves as the indicator for percent-encoded octets, it must be percent-encoded as “%25” for that octet to be used as data within a URI.
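The following sketch shows the percent character being encoded, and also what happens when already-encoded text is encoded a second time, which is a common source of bugs. The sample strings are made up.

```python
from urllib.parse import quote, unquote

once = quote("100% free")
print(once)           # 100%25%20free
print(unquote(once))  # 100% free

# Encoding twice escapes the "%" of the first pass as well.
twice = quote(quote("100%"))
print(twice)                    # 100%2525
print(unquote(twice))           # 100%25  (one decode is not enough)
print(unquote(unquote(twice)))  # 100%
```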
Percent-encoding arbitrary data
Most URI schemes involve the representation of arbitrary data, such as an IP address or file system path, as components of a URI. URI scheme specifications should, but often don’t, provide an explicit mapping between URI characters and all possible data values being represented by those characters.
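When the data is raw bytes rather than text, one possible mapping is to percent-encode the bytes directly. This is only a sketch of that approach using `urllib.parse`; the byte string is an arbitrary example, and a given URI scheme may define a different mapping.

```python
from urllib.parse import quote_from_bytes, unquote_to_bytes

raw = b"\x00\xffabc"  # arbitrary example bytes, not valid UTF-8 text

# Bytes outside the unreserved set become %XX triplets.
encoded_bytes = quote_from_bytes(raw)
print(encoded_bytes)  # %00%FFabc

# Decoding to bytes round-trips without assuming any text encoding.
print(unquote_to_bytes(encoded_bytes) == raw)  # True
```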