reverse proxy encodes whitespace in the wrong way

reverse proxy converts %20 into %2520 resulting in truncated/shortened urls

         

yvesman

4:21 pm on Nov 18, 2009 (gmt 0)

10+ Year Member



Hey Everyone,

I would like to call upon your expertise on the subject of whitespace (a space) being converted into %20 and then, wrongly, into %2520, and how to convert either of those to a neutral character such as an underscore.

I've read every thread I could find about this subject, but didn't find a clear answer.

I work on an Apache 2.0/IBM HTTP Server environment acting as a reverse proxy (ProxyPass and ProxyPassReverse directives) in a virtual host. This reverse proxy serves an application that allows file downloads. When I hover over a link to download a file, it's full of %20's; when I inspect the properties of the link, it's full of %2520's; and finally, when I try to download it, the name of the file is only the first word and the rest of the string is lost.

This only happens when the app is accessed through the reverse proxy, and only in Firefox!

Any clues?

Thank you so much.

jdMorgan

10:10 pm on Nov 18, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Well the first thing to do is to review the HTTP/1.1 RFC, and stop allowing reserved characters to be used/generated for your links...

Spaces and reserved characters such as "%" are required to be encoded by Web agents.

Then look at the exact method used to implement the reverse proxy. If, by some chance, it's using mod_rewrite, then look at the [NE] flag for RewriteRule.

At any rate, there's no magic cure at the server config level -- The linking policy must be corrected first. After that, you can use a bit of mod_rewrite to call a script to issue 301 redirects to 'clean up' old links in the search engine results, but as stated, that's a 'clean-up' step.

Jim

yvesman

10:01 am on Nov 19, 2009 (gmt 0)

10+ Year Member



Hi Jim, thanks!
But I'm not sure I understand your first point.

For the reverse proxy I literally just have a separate machine with Apache running on it, with a virtual host that has simple ProxyPass and ProxyPassReverse directives.

I looked at NE, but I'm not sure how RewriteRule works. Can I use it in my virtual host? Why can't I do something like this in my virtual host?

RewriteRule ^(.*) $1 [NE]
-> for any given request, make no modification and don't escape signs such as % (because my application server already converts white space into %20, and this should not be transformed again into %2520)

jdMorgan

12:38 pm on Nov 19, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



If you're using ProxyPassReverse, then mod_rewrite is irrelevant. Mod_rewrite is just a different way to invoke a proxy through-put, rather than using ProxyPass. You'd use [NE,P] and specify the back-end machine's full URL (the code you posted above does nothing, as it rewrites the URL-path to itself, making no changes). You'd also need to "set up" mod_rewrite using Options and then enable the RewriteEngine using two additional directives.

My first point was that spaces are not valid characters in a URI and therefore must be encoded, which is the root cause of all of this trouble. Spaces must be encoded to "%20". But "%" must also be encoded, so you end up with a double-encoded string, "%2520", as soon as this request passes through any other HTTP/1.x-compliant "Web agent" such as a proxy. If you piped the request through yet another agent, you'd end up with "%252520", etc. See RFC 3986.
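To make the double-encoding chain concrete, here is a minimal Python sketch. It assumes each hop blindly percent-encodes whatever it receives, which is a simplification of what a real proxy does; the `reencode` helper and the sample file name are only illustrative.

```python
from urllib.parse import quote, unquote

def reencode(path: str, hops: int) -> str:
    """Percent-encode `path` once per hop, as a chain of naive agents would."""
    for _ in range(hops):
        # quote() turns " " into "%20" and "%" into "%25"
        path = quote(path, safe="/")
    return path

name = "some file pic.tif"
once = reencode(name, 1)    # the space becomes %20
twice = reencode(name, 2)   # the "%" of %20 becomes %25, giving %2520
thrice = reencode(name, 3)  # and again, giving %252520

print(once)
print(twice)
print(thrice)

# Decoding once per hop walks the chain back the other way.
assert unquote(twice) == once
```

Each pass only touches the "%" sign that the previous pass introduced, which is why the string keeps growing by "25" per hop.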

Jim

yvesman

3:03 pm on Nov 19, 2009 (gmt 0)

10+ Year Member



That explains it. We have to use the ProxyPass and ProxyPassReverse directives.
I just checked out the application from localhost, and it has exactly the same URLs with exactly the same %20 and %2520 instances! So it really is only the file name that is different: shortened. How strange, no? How is a download file name communicated from one server to another, and from one server to the client? Where should I sniff for it, in other words?

yvesman

3:04 pm on Nov 19, 2009 (gmt 0)

10+ Year Member



I just realized that my last note isn't very clear. When I compare the application's download links, whether I'm accessing the site through the reverse proxy URL or through localhost on the server itself, I get the same links with %2520 and %20 instances. The only difference is that the filename in my download dialog box is shortened.

yvesman

4:43 pm on Nov 19, 2009 (gmt 0)

10+ Year Member



One more quick note (I'm getting closer!)
If I try to download a file directly from the server:
HTTP/1.1 200 OK
Server: Apache-Coyote/1.1
Expires: Thu, 19 Nov 2009 16:27:07 GMT
Cache-Control: max-age=3
Content-Disposition: attachment; filename*=utf8''some%20file%20pic.tif
Content-Type: application/unknown;charset=UTF-8
Content-Length: 30834284
Date: Thu, 19 Nov 2009 16:27:04 GMT

If it goes through the proxy:
HTTP/1.1 200 OK
Date: Thu, 19 Nov 2009 16:22:55 GMT
Server: Apache-Coyote/1.1
Expires: Thu, 19 Nov 2009 16:22:58 GMT
Cache-Control: max-age=3
Content-Disposition: attachment; filename*=utf8''some file pic.tif
Content-Type: application/unknown;charset=UTF-8
Transfer-Encoding: chunked
Connection: Keep-Alive
Content-Encoding: gzip
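
The two dumps above differ in whether the spaces in filename* are percent-encoded. As a rough illustration of why that could shorten the name, here is a Python sketch of a strict parser that only accepts the characters RFC 5987 allows in an extended value. This is an assumption about how a client might behave, not Firefox's actual code; `parse_ext_filename` is a made-up helper, and it ignores the declared charset and always decodes as UTF-8.

```python
import re
from urllib.parse import unquote

# Characters RFC 5987 allows in an ext-value: percent-escapes or attr-char.
_VALUE_CHARS = re.compile(r"(?:%[0-9A-Fa-f]{2}|[A-Za-z0-9!#$&+.^_`|~-])*")

def parse_ext_filename(header: str) -> str:
    """Extract and decode filename* from a Content-Disposition value."""
    # Form is: filename*=charset'language'value
    match = re.search(r"filename\*=([^']*)'([^']*)'", header)
    if match is None:
        raise ValueError("no filename* parameter found")
    rest = header[match.end():]
    # Stop at the first character that is not legal in an ext-value.
    raw = _VALUE_CHARS.match(rest).group(0)
    return unquote(raw, encoding="utf-8")

direct = "attachment; filename*=utf8''some%20file%20pic.tif"
proxied = "attachment; filename*=utf8''some file pic.tif"

print(parse_ext_filename(direct))   # full name, spaces restored
print(parse_ext_filename(proxied))  # stops at the first raw space
```

With the encoded header the whole name survives; with the raw spaces the value ends at the first space, which matches the "only the first word" symptom.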

jdMorgan

5:36 pm on Nov 19, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



In the request phase, the "filename" is passed from the address bar (if a type-in) or from a link on a page into the browser, where it is URL-encoded in compliance with HTTP requirements. It is then sent to the network as a URI. In each active network node (e.g. proxy) through which this request passes, it is possible that the URI will be re-URL-encoded.

The encoding rules will differ based on whether the "filename" is passed directly (as the "GET" or "POST" URI) or whether it is passed as a query string appended to the URL-path. The RFC I cited above fully describes the encoding requirements for each case.

In the response phase, the "filename" isn't normally sent back (barring an error response). But in your case, your application is sending it back in the Content-Disposition header, in order to force a download, rather than allowing the browser to render the object on-screen. Interestingly, your application seems to be sending this filename in two formats -- one encoded and one not.

However, I can't tell (because the client request header was omitted in both entries you posted) whether this was due to the application receiving that filename in two different forms (spaces vs. %20) depending on whether it passed through your front-end proxy, or because the application was somehow aware of proxied versus non-proxied requests and changed the disposition response header format because of that. Although unlikely, it could be examining the X-Forwarded-For header, for example, but I don't know why it might need to do so.

The best bet here seems to be to modify the application to enforce the same encoding rules in the Content-Disposition response headers that it sends as are required in HTTP request lines.
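
As a sketch of what that could look like on the application side, assuming the application can build the header itself: the Python below percent-encodes the raw file name before placing it in filename*. The `content_disposition` function is hypothetical, and it writes the charset as "UTF-8" (the spelling RFC 5987 uses) rather than the "utf8" seen in the posted headers.

```python
from urllib.parse import quote

def content_disposition(filename: str) -> str:
    """Build an attachment header with a percent-encoded filename*.

    Expects the raw (unencoded) file name; passing an already
    encoded name would double-encode it.
    """
    # safe="" leaves only unreserved characters (letters, digits, "_.-~")
    # unencoded; everything else becomes UTF-8 percent-escapes.
    encoded = quote(filename, safe="")
    return f"attachment; filename*=UTF-8''{encoded}"

header = content_disposition("some file pic.tif")
print(header)
```

Encoding at the point where the header is written means the value contains no raw spaces, whichever path the request took to reach the application.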

Jim

jdMorgan

5:43 pm on Nov 19, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Adding to the above, it seems to me that it would be useful to examine the raw access logs for the application server to see what the "requested filename" looks like in a proxied request as compared to a non-proxied request.

However, even if they are different, indicating that the front-end is modifying the request, it'll still be down to the application to handle the differences, I think.

Jim

yvesman

9:45 am on Nov 20, 2009 (gmt 0)

10+ Year Member



Hi jdMorgan,

I looked at the RFC, esp. appendix C. Thanks. Do you think I wouldn't have this problem if the URI were properly delimited, with real double quotes or <> for example?
Because right now it's only preceded by two single quotes, and never closed. Or is it simply that even in that case, the %20 is problematic?

If mod_rewrite is irrelevant, is there nothing I can do to stop the reverse proxy from decoding the %20 in the filename field?

Thanks,
Yves