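Abstract: RFC 1738, the URL specification published by the Network Working Group, states that URLs may use only the US-ASCII character set. (From Wikipedia: RFCs, or Requests for Comments, are a series of numbered documents that collect information about the Internet as well as software documentation from the UNIX and Internet communities. The basic Internet communication protocols are all specified in detail in RFCs.) This article excerpts which characters must be URL-encoded and the rules for encoding them, as a quick primer.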
RFC 1738: what it says about URLs
RFC 1738 on the character set that URLs may use: “...Only alphanumerics [0-9a-zA-Z], the special characters “$-_.+!*'(),” [not including the quotes - ed], and reserved characters used for their reserved purposes may be used unencoded within a URL.”
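The quoted rule can be sketched as a small check. This is a minimal illustration in Python, not code from RFC 1738: the names `ALPHANUMERICS`, `SPECIALS`, and `unencoded_ok` are hypothetical. Reserved characters are deliberately left out, because whether they may appear unencoded depends on the role they play in a given URL.

```python
import string

# Alphanumerics and the special characters named in the RFC 1738 quote.
ALPHANUMERICS = set(string.ascii_letters + string.digits)
SPECIALS = set("$-_.+!*'(),")

def unencoded_ok(ch: str) -> bool:
    """Return True if ch may always appear unencoded under the quoted rule."""
    return ch in ALPHANUMERICS or ch in SPECIALS

# List the characters in a sample string that fall outside the always-allowed set.
print([c for c in "a b/é" if not unencoded_ok(c)])  # [' ', '/', 'é']
```

The "/" shows up in the output only because this sketch ignores context. In a real URL it is legal unencoded when it acts as a path separator.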
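The resources that URLs name, however, may allow a wider character set: HTML permits ISO-8859-1 (ISO-Latin), and HTML 4 extends this to the entire Unicode character set.
Therefore every URL that appears in HTML and points to a specific resource, for example in the A, APPLET, AREA, BASE, BGSOUND, BODY, EMBED, FORM, FRAME, IFRAME, ILAYER, IMG, ISINDEX, INPUT, LAYER, LINK, OBJECT, SCRIPT, SOUND, TABLE, TD, TH, and TR elements, should be encoded.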
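As a rough illustration of encoding a URL before it goes into an HTML attribute, the sketch below uses `urllib.parse.quote` from the Python standard library. The path `/images/café menu.png` is a made-up example. The choice between ISO-8859-1 and UTF-8 is an assumption about what the receiving server expects: this article works in ISO-Latin code points, while `quote` defaults to UTF-8.

```python
from urllib.parse import quote

# Hypothetical path containing a space and a non-ASCII character.
path = "/images/café menu.png"

# Encode using ISO-8859-1 code points, the character set this article uses.
latin1_path = quote(path, encoding="latin-1")

# Encode using UTF-8, which is the default for urllib.parse.quote.
utf8_path = quote(path)

print(latin1_path)  # /images/caf%E9%20menu.png
print(utf8_path)    # /images/caf%C3%A9%20menu.png

# The encoded path can then be placed in an attribute such as IMG SRC.
print(f'<img src="{latin1_path}">')
```

`quote` leaves "/" untouched by default, so the path separators survive. The space and the "é" are the only characters that change.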
Which characters should be encoded, and why
ASCII control characters
Why: These characters are not printable.
Characters: The ISO-8859-1 (ISO-Latin) character ranges 00-1F hex (0-31 decimal) and 7F hex (127 decimal).
Non-ASCII characters
Why: These are by definition not legal in URLs, since they are not in the ASCII set.
Characters: The entire “top half” of the ISO-Latin set, 80-FF hex (128-255 decimal).
“Reserved characters”
Why: URLs use some characters for special purposes in defining their syntax. When these characters are not used in their special role inside a URL, they need to be encoded.
Characters: RFC 1738 (section 2.2) names ; / ? : @ = & as the characters that may be reserved for special meaning within a scheme.
“Unsafe characters”
Why: Some characters can be misunderstood within URLs for various reasons. These characters should also always be encoded.
Characters: RFC 1738 (section 2.2) lists the space character and < > " # % { } | \ ^ ~ [ ] ` as unsafe.
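The four categories above can be sketched as a small classifier. This is only an illustration: the function name `classify` and the labels it returns are hypothetical. The reserved and unsafe sets are taken from RFC 1738 section 2.2, which is an assumption of this sketch and not necessarily the exact lists the original tables contained.

```python
# Character lists as given in RFC 1738 section 2.2 (an assumption of this sketch).
RESERVED = set(";/?:@=&")
UNSAFE = set(' <>"#%{}|\\^~[]`')

def classify(ch: str) -> str:
    """Place a single character into one of the categories described above."""
    code = ord(ch)
    if code <= 0x1F or code == 0x7F:
        return "control"      # 00-1F hex and 7F hex: not printable
    if code >= 0x80:
        return "non-ascii"    # 80 hex and above: outside the ASCII set
    if ch in RESERVED:
        return "reserved"     # encode unless used in its reserved role
    if ch in UNSAFE:
        return "unsafe"       # should always be encoded
    return "plain"            # everything else

for ch in ["\t", "é", "?", " ", "a"]:
    print(repr(ch), classify(ch))
```

The order of the checks matters only for readability here, since the categories do not overlap. The code-point tests come first because they are cheap range comparisons.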
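How a URL is encoded
An encoded character consists of a “%” sign followed by the two-digit hexadecimal representation (case-insensitive) of the character's ISO-Latin code point.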
Example:
- Space = decimal code point 32 in the ISO-Latin set.
- 32 decimal = 20 in hexadecimal.
- The URL-encoded representation is “%20”.
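The steps in the example can be reproduced with a few lines of Python. This is a minimal sketch of the rule as stated here, for single characters in the ISO-Latin range. The function name `percent_encode_char` is hypothetical, and real code would normally call `urllib.parse.quote` instead.

```python
from urllib.parse import unquote

def percent_encode_char(ch: str) -> str:
    """Encode one ISO-Latin character as '%' plus two hexadecimal digits."""
    code = ord(ch)  # for a space, the decimal code point is 32
    if code > 0xFF:
        raise ValueError("character is outside the ISO-Latin range")
    return f"%{code:02X}"  # 32 decimal is 20 hexadecimal, giving '%20'

print(percent_encode_char(" "))  # %20
print(percent_encode_char("/"))  # %2F

# The hexadecimal digits are case-insensitive when decoding.
print(unquote("%2F") == unquote("%2f"))  # True
```

The sketch emits uppercase hexadecimal digits. Lowercase digits decode to the same character, which is what the last line checks.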