The pitfalls in the HTTP specification

Posted May 28, 2020 · 9 min read

HTTP is arguably the network protocol developers know best. Being simple to understand and easy to extend has made it the most widely used application-layer protocol.

Despite these advantages, the protocol's definition is full of trade-offs and restrictions, and they hide many pitfalls that are easy to fall into if you are not careful. This article summarizes several common pitfalls in the HTTP specification. I hope it helps you avoid them deliberately and makes development smoother.

1. Referer

The HTTP standard spells Referrer as Referer (one r short), arguably the most famous typo in computing history.

The main purpose of Referer is to carry the address of the page that issued the current request. It is commonly used for anti-crawler checks and hotlink protection. The recent, much-discussed Sina image hosting incident happened because Sina's image host suddenly started checking the HTTP Referer header and stopped returning images to requests from non-Sina domains, which broke the images on many small and medium blogs that had been borrowing its bandwidth.
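A hotlink check of this kind can be sketched in a few lines. This is only a minimal sketch under stated assumptions, not Sina's actual implementation: the function name `isAllowedReferer` and the allow-list are hypothetical, and letting requests without a Referer through is a policy choice. In Node.js the value would come from `req.headers.referer`, misspelling included.

```javascript
// Hypothetical hotlink check: allow a request only if its Referer host is on the allow-list.
function isAllowedReferer(refererHeader, allowedHosts) {
  // Many clients omit the header entirely; this sketch lets such requests through.
  if (!refererHeader) return true;
  let host;
  try {
    host = new URL(refererHeader).hostname;
  } catch {
    return false; // malformed Referer value
  }
  // Accept the listed host itself and any of its subdomains.
  return allowedHosts.some(
    (allowed) => host === allowed || host.endsWith("." + allowed)
  );
}

const allowedHosts = ["example.com"];
console.log(isAllowedReferer("https://img.example.com/page", allowedHosts)); // true
console.log(isAllowedReferer("https://blog.example.org/post", allowedHosts)); // false
```

Because the client controls the Referer header, a check like this only discourages casual hotlinking; it is not a security boundary.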

Although "Referer" is misspelled in the HTTP standard, the other standards that control the Referer do not repeat the mistake.

For example, the meta tag that stops a web page from automatically sending the Referer header uses the correctly spelled keyword:

<!-- Globally disable sending the referrer -->
<meta name="referrer" content="no-referrer" />

Browser network requests are also worth noting. For security and stability, request headers such as Referer are controlled by the browser during a request and cannot be set directly by script. We can only influence them through certain options. With the Fetch API, for example, the options are referrer and referrerPolicy, and both are spelled correctly:

fetch('/page', {
  headers: {
    "Content-Type": "text/plain;charset=UTF-8"
  },
  referrer: "https://demo.com/anotherpage", // <- spelled correctly
  referrerPolicy: "no-referrer-when-downgrade", // <- spelled correctly
});

Summary in one sentence:

Wherever Referrer appears, only the HTTP header field is misspelled; the related configuration fields in the browser are all spelled correctly.

2. The "magical" space

1. %20 or +?

This is an epic pitfall. This convention once cost me an entire day.

Before explaining, let's try a small test: enter blank test in the browser (with a space between blank and test) and see how the browser handles it.

The browser turns the space into a plus sign "+".

Does that feel strange? Let's run another test with a few functions the browser provides:

encodeURIComponent("blank test") // "blank%20test"
encodeURI("q=blank test") // "q=blank%20test"
new URLSearchParams("q=blank test").toString() // "q=blank+test"


The code does not lie. In fact, all of the results above are correct. The encoded results differ because the URI specification and the W3C specification conflict, which is what produces this confusing mix-up.

2. Conflicting specifications

Let's first look at the [reserved characters](https://tools.ietf.org/html/rfc3986#section-2.2) in the URI specification. Reserved characters are not encoded. They fall into two main categories:

  • gen-delims: : / ? # [ ] @
  • sub-delims: ! $ & ' ( ) * + , ; =

The URI encoding rule is also simple: a character outside the allowed set is converted to its hexadecimal byte values, and each byte is prefixed with a percent sign.

For an unsafe character such as the space, the hexadecimal value is 0x20, so with the percent sign in front it becomes %20.


Seen in this light, the results of encodeURIComponent and encodeURI are completely correct.
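The rule can be written out by hand to make it concrete. The following is a minimal sketch, assuming a deliberately strict reading in which only the RFC 3986 "unreserved" characters are left alone; the name `percentEncode` is made up for illustration. It is stricter than encodeURIComponent, which also leaves characters such as ! ' ( ) * untouched.

```javascript
// Percent-encode every UTF-8 byte of a string, except RFC 3986 "unreserved" characters.
function percentEncode(text) {
  const unreserved = /[A-Za-z0-9\-._~]/;
  let out = "";
  for (const byte of new TextEncoder().encode(text)) {
    const ch = String.fromCharCode(byte);
    out += unreserved.test(ch)
      ? ch
      : "%" + byte.toString(16).toUpperCase().padStart(2, "0");
  }
  return out;
}

console.log(percentEncode(" "));          // "%20"
console.log(percentEncode("blank test")); // "blank%20test"
```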

Since converting a space to %20 is correct, where does converting it to + come from? To answer that, we need some history of HTML forms.

In the early days of the web, before AJAX, all data was submitted through HTML forms, using either GET or POST. You can try this with the form examples on MDN.

Testing shows that in the content a form submits, every space is converted into a plus sign. This encoding type is application/x-www-form-urlencoded, and it is defined in the WHATWG specification.


At this point the case is basically solved: URLSearchParams encodes according to this specification. I found the [polyfill code](https://github.com/WebReflection/url-search-params/blob/814161e99f1dd4453f3c1dc594bc73da2bd61838/build/url-search-params.node.js#L88) of URLSearchParams, which maps %20 to +:

var replace = {
    '!': '%21',
    "'": '%27',
    '(': '%28',
    ')': '%29',
    '~': '%7E',
    '%20': '+', // <= this is it
    '%00': '\x00'
};
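How a table like this gets applied can be sketched as a second pass over the output of encodeURIComponent. This is an assumed reconstruction for illustration, not code copied from the polyfill: the names `formReplacements` and `formEncode` are hypothetical, and the %00 entry is left out to keep the sketch short.

```javascript
// Hypothetical replacement table modeled on the one above (without the %00 entry).
const formReplacements = {
  "!": "%21",
  "'": "%27",
  "(": "%28",
  ")": "%29",
  "~": "%7E",
  "%20": "+",
};

function formEncode(text) {
  // encodeURIComponent leaves ! ' ( ) ~ untouched and encodes a space as %20,
  // so a second pass rewrites exactly those tokens.
  return encodeURIComponent(text).replace(
    /[!'()~]|%20/g,
    (match) => formReplacements[match]
  );
}

console.log(formEncode("blank test")); // "blank+test"
console.log(formEncode("it's (ok)!")); // "it%27s+%28ok%29%21"
```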

The specification also explains this encoding type:

The application/x-www-form-urlencoded format is in many ways an aberrant monstrosity, the result of many years of implementation accidents and compromises leading to a set of requirements necessary for interoperability, but in no way representing good design practices. In particular, readers are cautioned to pay close attention to the twisted details involving repeated (and in some cases nested) conversions between character encodings and byte sequences. Unfortunately the format is in widespread use due to the prevalence of HTML forms.

In short, this encoding is not a good design, but unfortunately the popularity of HTML forms spread the format everywhere.

That long passage really says just one thing: this thing is badly designed, but it is too entrenched to undo, so bear with it.

3. One sentence summary

  • In the URI specification a space is encoded as %20; in the application/x-www-form-urlencoded format a space is encoded as +.
  • In real-world development, it is best to send requests through a mature HTTP request library; such libraries already take care of this tedious work.
  • If you must use native AJAX to submit data in application/x-www-form-urlencoded format, do not concatenate the parameters by hand. Use URLSearchParams to process the data, which avoids all sorts of nasty encoding conflicts.
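As a small illustration of the last point, the sketch below builds a form-encoded body with URLSearchParams and then decodes it again. The parameter names and values are made up for the example.

```javascript
// Build a form-encoded body without stitching strings together by hand.
const params = new URLSearchParams();
params.append("q", "blank test");
params.append("tag", "a&b=c");
const body = params.toString();
console.log(body); // "q=blank+test&tag=a%26b%3Dc"

// Decoding has to follow the same convention: URLSearchParams turns "+" back
// into a space, while decodeURIComponent leaves it alone.
console.log(new URLSearchParams(body).get("q")); // "blank test"
console.log(decodeURIComponent("blank+test"));   // "blank+test"
```

The decoding half is where the conflict usually bites: a value encoded as form data and then decoded with decodeURIComponent keeps its plus signs.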

3. Is the real IP obtained from X-Forwarded-For?

1. Story

Before this section starts, let me tell a short story from development, which should deepen your understanding of this field.

Some time ago I worked on a risk-control requirement that needed the user's IP. After development, we released it to a small number of gray-release users. QA found that the IPs of all those users in the backend logs looked abnormal; surely that could not be a coincidence. QA then sent over a few of the abnormal IPs:

10.148.2.122
10.135.2.38
10.149.12.33
...

From their pattern I could see that these IPs all start with 10 and belong to the Class A private range (10.0.0.0-10.255.255.255). The backend must have been recording the proxy server's IP, not the user's real IP.

2. Principle


Nowadays, websites of any real size are rarely served by a single server. To cope with higher traffic and allow a more flexible architecture, application services are generally hidden behind proxy servers such as Nginx.

With an access layer in place, load balancing and rolling upgrades across multiple servers become easy. There are other benefits too, such as better content caching and security protection, but those are not the focus of this article.

Besides these advantages, putting a proxy in front of the site also introduces new problems. With a single server, the server could read the user's IP directly from the connection. Once a proxy layer is added, the origin (application) server sees the proxy server's IP instead. That is exactly the problem in the story above.

In a field as mature as web development there is of course a ready-made solution: the X-Forwarded-For request header.

X-Forwarded-For is a de facto standard. Although it is not part of the core HTTP RFCs, it is so widely used that it can practically be treated as part of HTTP.

The convention works like this: each time a proxy forwards a request to the next server, it appends the IP address it received the request from to X-Forwarded-For, so the application service at the end of the chain receives a list of IPs:

X-Forwarded-For: client, proxy1, proxy2

Because the IPs are appended one at a time, the first IP in the list is the user's real IP, so we can just use that.
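The naive reading described above amounts to splitting on commas and taking the leftmost entry. A minimal sketch, with a made-up function name and documentation-range IP addresses:

```javascript
// Naive approach: trust the leftmost entry of X-Forwarded-For.
function firstForwardedIp(headerValue) {
  const ips = headerValue
    .split(",")
    .map((part) => part.trim())
    .filter(Boolean);
  return ips.length > 0 ? ips[0] : null;
}

console.log(firstForwardedIp("203.0.113.7, 10.148.2.122, 10.135.2.38")); // "203.0.113.7"
```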

However, is it so simple?

3. Attack

From a security point of view, the least trustworthy part of any system is people, and the client side is the easiest to attack and to forge. Some users set out to exploit a loophole in the protocol: X-Forwarded-For is supposed to be added by proxy servers, but if I put an X-Forwarded-For header in my own request, can't I deceive the server?

1. The client sends a request that already carries an X-Forwarded-For header containing a fake IP:

X-Forwarded-For: fakeIP

2. The first-layer proxy on the server side receives the request, finds an existing X-Forwarded-For, mistakes the sender for another proxy, and appends the client's real IP to the field:

X-Forwarded-For: fakeIP, client

3. After several layers of proxies, the header that reaches the final server looks like this:

X-Forwarded-For: fakeIP, client, proxy1, proxy2

If you simply take the first IP of X-Forwarded-For, the attacker wins: what you get is fakeIP, not the client's IP.

4. Countermeasures

How can the server defend against this? Look at the three steps above:

  • The first step is forgery on the client, which the server cannot prevent
  • The second step happens on the proxy servers, which we control and can defend
  • The third step happens on the application server, which we control and can defend

For the second step, I will use Nginx as the example.

On the outermost Nginx, X-Forwarded-For is configured as follows:

proxy_set_header X-Forwarded-For $remote_addr;

What does this mean? The outermost proxy does not trust the X-Forwarded-For sent by the client, and overwrites it instead of appending to it.

On the Nginx servers that are not outermost, we configure:

proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;

$proxy_add_x_forwarded_for means append the IP. With this trick, the client's forgery is defeated.
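To see why overwriting at the edge matters, the two settings can be modeled as plain functions. This is a simplified model rather than Nginx's actual code: the function names and the IP addresses are assumptions made for the sketch.

```javascript
// Edge proxy: `proxy_set_header X-Forwarded-For $remote_addr;` discards the incoming header.
function overwriteForwardedFor(_incomingHeader, remoteAddr) {
  return remoteAddr;
}

// Inner proxy: `$proxy_add_x_forwarded_for` appends $remote_addr to the incoming header.
function appendForwardedFor(incomingHeader, remoteAddr) {
  return incomingHeader ? `${incomingHeader}, ${remoteAddr}` : remoteAddr;
}

const spoofed = "1.2.3.4";      // header forged by the client
const clientIp = "203.0.113.7"; // address the edge proxy actually sees
const edgeProxyIp = "10.0.0.1"; // address the inner proxy sees

// Append everywhere: the forged value survives at the front of the list.
const unsafe = appendForwardedFor(appendForwardedFor(spoofed, clientIp), edgeProxyIp);
console.log(unsafe); // "1.2.3.4, 203.0.113.7, 10.0.0.1"

// Overwrite at the edge, append inside: the forged value is gone.
const safe = appendForwardedFor(overwriteForwardedFor(spoofed, clientIp), edgeProxyIp);
console.log(safe); // "203.0.113.7, 10.0.0.1"
```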

The countermeasure for the third step is also simple. The usual approach takes the leftmost IP of X-Forwarded-For. This time we do the opposite: count from the right, skip as many entries as there are proxy servers, and of the remaining IPs the rightmost one is the real IP.

X-Forwarded-For: fakeIP, client, proxy1, proxy2

For example, if we know the list ends with the entries of two proxy layers, we count from right to left, drop proxy1 and proxy2, and the rightmost entry of what remains is the real IP.
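Counting from the right can be sketched as follows. This is an illustrative sketch, not the Egg.js implementation; the function name and the `trustedEntries` parameter (how many trailing entries were written by our own proxies) are hypothetical.

```javascript
// Pick the client IP by skipping a known number of trusted entries from the right.
function clientIpFromForwardedFor(headerValue, trustedEntries) {
  const ips = headerValue
    .split(",")
    .map((part) => part.trim())
    .filter(Boolean);
  const index = ips.length - 1 - trustedEntries;
  // A negative index means the header is shorter than expected; report no result.
  return index >= 0 ? ips[index] : null;
}

console.log(clientIpFromForwardedFor("fakeIP, client, proxy1, proxy2", 2)); // "client"
```

Anything the client forged sits to the left of the chosen entry, so it never affects the result as long as the number of trusted entries is configured correctly.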

For the related ideas and a code implementation, see the pre-proxy mode of Egg.js.

5. One sentence summary

When obtaining the user's real IP through X-Forwarded-For, it is best not to simply take the first IP, because the user can forge it.

4. A slightly confusing separator

1. HTTP standard

When an HTTP header field carries multiple values, the values are generally separated by commas ",". Even X-Forwarded-For, which is not an RFC standard, separates its values with commas:

Accept-Encoding: gzip, deflate, br
Cache-Control: public, max-age=604800, s-maxage=43200
X-Forwarded-For: fakeIP, client, proxy1, proxy2

Because the comma was already used to separate values, when a value later needed a parameter to qualify it, the separator chosen for that was the semicolon ";". The most typical example is the Accept request header:

// q=0.9 qualifies application/xml, even though the two are separated by a semicolon
Accept: text/html, application/xml;q=0.9, */*;q=0.8

Although HTTP is easy to read, this use of separators goes against intuition. In ordinary writing a semicolon is a stronger break than a comma, but in the HTTP content-negotiation fields it is the other way around. The definition is in RFC 7231, which explains it fairly clearly.
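The two levels of separators become clearer when parsed in order: commas first, then semicolons. The sketch below is a simplified parser with a made-up name, `parseAccept`; it assumes no quoted strings containing commas or semicolons, which a full RFC 7231 parser would have to handle.

```javascript
// Split an Accept header: commas separate media ranges, semicolons attach parameters.
function parseAccept(headerValue) {
  return headerValue
    .split(",")
    .map((item) => {
      const [type, ...params] = item.split(";").map((s) => s.trim());
      let q = 1; // default weight when no q parameter is present
      for (const param of params) {
        const [name, value] = param.split("=").map((s) => s.trim());
        if (name === "q") q = Number(value);
      }
      return { type, q };
    })
    .filter((entry) => entry.type !== "");
}

console.log(JSON.stringify(parseAccept("text/html, application/xml;q=0.9, */*;q=0.8")));
// [{"type":"text/html","q":1},{"type":"application/xml","q":0.9},{"type":"*/*","q":0.8}]
```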

2. Cookie standard

Contrary to common belief, cookies are not actually part of the core HTTP specification. They are defined in a separate document, RFC 6265, so the separator rules differ. The cookie grammar defined in that specification looks like this:

cookie-header = "Cookie:" OWS cookie-string OWS
cookie-string = cookie-pair *(";" SP cookie-pair)

Multiple cookies are separated by semicolons ";", not commas ",". A cookie picked at random from any website shows the same semicolon-separated form, which deserves special attention.
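Following the grammar above, a Cookie request header can be split on semicolons. This is a minimal sketch with a hypothetical function name and made-up cookie values; it does not decode values or handle every edge case in RFC 6265.

```javascript
// Parse a Cookie request header: pairs are separated by "; " rather than ",".
function parseCookieHeader(headerValue) {
  const cookies = {};
  for (const pair of headerValue.split(";")) {
    const eq = pair.indexOf("=");
    if (eq === -1) continue; // skip fragments without a name=value shape
    const name = pair.slice(0, eq).trim();
    const value = pair.slice(eq + 1).trim();
    if (name) cookies[name] = value;
  }
  return cookies;
}

console.log(JSON.stringify(parseCookieHeader("sid=abc123; theme=dark; lang=en")));
// {"sid":"abc123","theme":"dark","lang":"en"}
```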


3. One sentence summary

  • The value separator for most HTTP fields is the comma ","
  • Cookies are defined outside the core HTTP specification, and their separator is the semicolon ";"

5. Article recommendations

Finally, I recommend my personal public account, "Halogen Egg Lab", where I usually share front-end technology and data analysis content. If you are interested, feel free to follow it.