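Abstract: RFC 1738, the Network Working Group's specification for URLs, stipulates that URLs may only use the US-ASCII character set. (RFC stands for Request For Comments. From Wiki: a numbered series of documents that collects information about the Internet, along with software documentation from the UNIX and Internet communities. The basic Internet communication protocols are all specified in detail in RFCs.) This post excerpts which characters must be encoded and the rules for encoding them, as a primer on URL encoding.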

RFC 1738: The URL Specification

RFC 1738 describes the character set allowed in URLs as follows: “…Only alphanumerics [0-9a-zA-Z], the special characters "$-_.+!*'()," [not including the quotes – ed], and reserved characters used for their reserved purposes may be used unencoded within a URL.”

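The resources that URLs point to, however, allow a wider character set. HTML, for example, allows ISO-8859-1 (ISO-Latin), and HTML 4 even covers the entire Unicode character set.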

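Therefore every URL that appears in HTML and points to a specific resource (for example in the A, APPLET, AREA, BASE, BGSOUND, BODY, EMBED, FORM, FRAME, IFRAME, ILAYER, IMG, ISINDEX, INPUT, LAYER, LINK, OBJECT, SCRIPT, SOUND, TABLE, TD, TH, and TR elements) should be encoded.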

Which characters should be encoded, and why

ASCII Control characters
Why: These characters are not printable.
Characters: Includes the ISO-8859-1 (ISO-Latin) character ranges 00-1F hex (0-31 decimal) and 7F (127 decimal.)
Non-ASCII characters
Why: These are by definition not legal in URLs since they are not in the ASCII set.
Characters: Includes the entire “top half” of the ISO-Latin set 80-FF hex (128-255 decimal.)
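
The two ranges above can be checked mechanically. The following is a minimal Python sketch, not taken from the original article; the function name `must_encode_by_range` is made up for illustration.

```python
def must_encode_by_range(ch: str) -> bool:
    """Return True if ch is an ASCII control character or lies outside US-ASCII.

    Hypothetical helper: it only covers the two ranges listed above
    (00-1F and 7F hex for controls, 80 hex and up for non-ASCII).
    """
    code = ord(ch)
    is_control = code <= 0x1F or code == 0x7F
    is_non_ascii = code >= 0x80
    return is_control or is_non_ascii


if __name__ == "__main__":
    for sample in ["\t", "\x7f", "\u00e9", "a"]:
        print(repr(sample), must_encode_by_range(sample))
```

A tab, the DEL character, and "é" all fall in these ranges, while a plain letter such as "a" does not.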
“Reserved characters”
Why: URLs use some characters for special use in defining their syntax. When these characters are not used in their special role inside a URL, they need to be encoded.
Characters:
Character                         Code Points (Hex)   Code Points (Dec)
Dollar (“$”)                      24                  36
Ampersand (“&”)                   26                  38
Plus (“+”)                        2B                  43
Comma (“,”)                       2C                  44
Forward slash/Virgule (“/”)       2F                  47
Colon (“:”)                       3A                  58
Semi-colon (“;”)                  3B                  59
Equals (“=”)                      3D                  61
Question mark (“?”)               3F                  63
‘At’ symbol (“@”)                 40                  64
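
As a rough illustration of the reserved-character rule, the Python sketch below encodes a value before placing it in a query string, so that its “&”, “/”, “?” and “=” are not mistaken for URL syntax. It uses `urllib.parse.quote` from the standard library. The host `example.com`, the parameter name `q`, and the helper name `build_search_url` are assumptions made up for this example.

```python
from urllib.parse import quote


def build_search_url(value: str) -> str:
    """Build a URL whose query value has every reserved character encoded.

    safe="" tells quote() not to leave "/" (its default safe character)
    unencoded, because here "/" is data, not a path separator.
    """
    return "http://example.com/search?q=" + quote(value, safe="")


if __name__ == "__main__":
    # The "?" and "=" that belong to the URL syntax stay as they are;
    # the same characters inside the value are percent-encoded.
    print(build_search_url("AT&T/home?x=1"))
    # http://example.com/search?q=AT%26T%2Fhome%3Fx%3D1
```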
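“Unsafe characters”
Why: Some characters present the possibility of being misunderstood within URLs for various reasons. These characters should also always be encoded.
Characters:
Character                         Code Points (Hex)   Code Points (Dec)   Why encode?
Space                             20                  32                  Significant sequences of spaces may be lost in some uses (especially multiple spaces).
Quotation marks (“"”)             22                  34                  These characters are often used to delimit URLs in plain text.
‘Less Than’ symbol (“<”)          3C                  60                  These characters are often used to delimit URLs in plain text.
‘Greater Than’ symbol (“>”)       3E                  62                  These characters are often used to delimit URLs in plain text.
‘Pound’ character (“#”)           23                  35                  This is used in URLs to indicate where a fragment identifier (bookmarks/anchors in HTML) begins.
Percent character (“%”)           25                  37                  This is used to URL encode/escape other characters, so it should itself also be encoded.
Misc. characters:
Left Curly Brace (“{”)            7B                  123                 Some systems can possibly modify these characters.
Right Curly Brace (“}”)           7D                  125                 Some systems can possibly modify these characters.
Vertical Bar/Pipe (“|”)           7C                  124                 Some systems can possibly modify these characters.
Backslash (“\”)                   5C                  92                  Some systems can possibly modify these characters.
Caret (“^”)                       5E                  94                  Some systems can possibly modify these characters.
Tilde (“~”)                       7E                  126                 Some systems can possibly modify these characters.
Left Square Bracket (“[”)         5B                  91                  Some systems can possibly modify these characters.
Right Square Bracket (“]”)        5D                  93                  Some systems can possibly modify these characters.
Grave Accent (“`”)                60                  96                  Some systems can possibly modify these characters.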
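
The list of unsafe characters can be turned into a small escaping routine. This is a hand-rolled Python sketch for illustration only, not a replacement for a library encoder. The names `UNSAFE` and `escape_unsafe` are hypothetical, and the set is exactly the characters listed above.

```python
# The "unsafe" characters from the list above: space, quotation mark,
# < > # % { } | \ ^ ~ [ ] and the grave accent.
UNSAFE = ' "<>#%{}|\\^~[]`'


def escape_unsafe(text: str) -> str:
    """Replace each unsafe character with "%" plus its two-digit hex code."""
    return "".join(
        f"%{ord(ch):02X}" if ch in UNSAFE else ch
        for ch in text
    )


if __name__ == "__main__":
    print(escape_unsafe("a b<c>"))   # a%20b%3Cc%3E
    print(escape_unsafe("100%"))     # 100%25
```

Because the input is processed one character at a time, a literal “%” in the input becomes “%25”, while the “%” signs produced by the escaping itself are not encoded a second time.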

How URLs are encoded

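An encoded character consists of a “%” symbol followed by the two-digit hexadecimal representation (case-insensitive) of the character's ISO-Latin code point.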

Example
  • Space = decimal code point 32 in the ISO-Latin set.
  • 32 decimal = 20 in hexadecimal
  • The URL encoded representation will be “%20”
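
The same steps can be written out in Python. This is a minimal sketch that follows the ISO-Latin rule described in this post. The function names `percent_encode_char` and `percent_decode` are made up for illustration, and a character outside ISO-8859-1 raises `UnicodeEncodeError` here.

```python
def percent_encode_char(ch: str) -> str:
    """Encode one character as "%" plus its two-digit hex ISO-Latin code point."""
    code = ch.encode("latin-1")[0]  # e.g. " " -> 32
    return f"%{code:02X}"           # 32 decimal -> "20" hex -> "%20"


def percent_decode(triplet: str) -> str:
    """Decode one "%XX" triplet; int(..., 16) accepts upper- and lower-case hex."""
    return bytes([int(triplet[1:], 16)]).decode("latin-1")


if __name__ == "__main__":
    print(percent_encode_char(" "))   # %20
    print(percent_decode("%3a"))      # :
    print(percent_decode("%3A"))      # :
```

Decoding “%3a” and “%3A” yields the same character, which shows that the hexadecimal digits are not case-sensitive.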