Hey Everyone,
I would like to call upon your expertise on the subject of whitespace (a space character) being converted into %20 and then, wrongly, into %2520, and how to convert either of those to a neutral character such as an underscore.
I've read every thread I could find about this subject, but didn't find a clear answer.
I work in an Apache 2.0/IBM HTTP Server environment acting as a reverse proxy (ProxyPass and ProxyPassReverse directives) in a virtual host. This reverse proxy serves up an application that allows file downloads. When I hover over a link to download a file, it's full of %20's; but when I inspect the properties of the link, it's full of %2520's; and finally, when I try to download it, the name of the file is only the first word; the rest of the string is lost.
This only happens when the app is accessed through the reverse proxy, and only in Firefox!
Any clues?
Thank you so much.
Spaces and reserved characters such as "%" are required to be encoded by Web agents.
Then look at the exact method used to implement the reverse proxy. If by some chance it's using mod_rewrite, then look at the [NE] flag for RewriteRule.
At any rate, there's no magic cure at the server config level -- The linking policy must be corrected first. After that, you can use a bit of mod_rewrite to call a script to issue 301 redirects to 'clean up' old links in the search engine results, but as stated, that's a 'clean-up' step.
Jim
For the reverse proxy I literally just have a separate machine with Apache running on it, with a virtual host that has simple ProxyPass and ProxyPassReverse directives.
I looked at NE, but I'm not sure how RewriteRule works. Can I use it in my virtual host? Why can't I do something like this in my virtual host?
RewriteRule ^(.*) $1 [NE]
-> For any given request, make no modification and don't escape characters such as % (because my application server already converts whitespace into %20, and this should not be transformed again into %2520).
My first point was that spaces are not valid characters in a URI and therefore must be encoded, which is the root cause of all of this trouble. Spaces must be encoded as "%20". But "%" must also be encoded, so you end up with a double-encoded string, "%2520", as soon as this request passes through any other HTTP/1.x-compliant "Web agent" such as a proxy. If you piped the request through yet another agent, you'd end up with "%252520", etc. See RFC 3986.
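To make the progression concrete, here is a minimal sketch in Python of what repeated percent-encoding does to a name. The filename is just an example, and `urllib.parse.quote` stands in for whatever encoding each agent in the chain applies; it is not a model of how Apache itself behaves.

```python
from urllib.parse import quote, unquote

# Example filename containing spaces (an assumption for illustration).
name = "some file pic.tif"

# First pass: each space becomes "%20".
once = quote(name)

# Second pass over the already-encoded string: "%" becomes "%25",
# so "%20" turns into "%2520".
twice = quote(once)

# A third pass repeats the effect on the new "%" characters.
thrice = quote(twice)

# Decoding once undoes only one layer.
one_layer_back = unquote(twice)

print(once)
print(twice)
print(thrice)
print(one_layer_back)
```

The point of the sketch is that an encoder has no way to tell an "already encoded" string from a raw one, so every extra pass adds another "25".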
Jim
If gone through the proxy:
HTTP/1.1 200 OK
Date: Thu, 19 Nov 2009 16:22:55 GMT
Server: Apache-Coyote/1.1
Expires: Thu, 19 Nov 2009 16:22:58 GMT
Cache-Control: max-age=3
Content-Disposition: attachment; filename*=utf8''some file pic.tif
Content-Type: application/unknown;charset=UTF-8
Transfer-Encoding: chunked
Connection: Keep-Alive
Content-Encoding: gzip
The encoding rules will differ based on whether the "filename" is passed directly (as the "GET" or "POST" URI) or whether it is passed as a query string appended to the URL-path. The RFC I cited above fully describes the encoding requirements for each case.
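As a rough illustration of the two cases, the Python sketch below encodes the same name once as a path segment and once as a query-string value. The `/download` path and the `file` parameter name are made up for the example; your application's URLs will differ. Note that `urlencode` defaults to the HTML form convention of "+" for a space, while `quote` produces "%20".

```python
from urllib.parse import quote, urlencode

# Example filename (an assumption for illustration).
name = "some file pic.tif"

# Case 1: the name is part of the URL-path. safe="" also encodes "/",
# so the name stays a single path segment.
as_path = "/download/" + quote(name, safe="")

# Case 2: the name is a query-string value, form-encoded ("+" for space).
as_query = "/download?" + urlencode({"file": name})

# Case 2, variant: query-string value with "%20" instead of "+".
as_query_pct = "/download?" + urlencode({"file": name}, quote_via=quote)

print(as_path)
print(as_query)
print(as_query_pct)
```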
In the response phase, the "filename" isn't normally sent back (barring an error response). But in your case, your application is sending it back in the Content-Disposition header, in order to force a download, rather than allowing the browser to render the object on-screen. Interestingly, your application seems to be sending this filename in two formats -- one encoded and one not.
However, I can't tell (because the client request header was omitted in both entries you posted) whether this was due to the application receiving that filename in two different forms (spaces vs. %20) depending on whether it passed through your front-end proxy, or because the application was somehow aware of proxied versus non-proxied requests and changed the disposition response header format because of that. Although unlikely, it could be examining the X-Forwarded-For header, for example, but I don't know why it might need to do so.
The best bet here seems to be to modify the application to enforce the same encoding rules in the Content-Disposition response headers that it sends as are required in HTTP request lines.
Jim
However, even if they are different, indicating that the front-end is modifying the request, it'll still be down to the application to handle the differences, I think.
Jim
I looked at the RFC, especially Appendix C. Thanks. Do you think I wouldn't have this problem if the URI were better delimited, with real double quotes or <> for example?
Because right now it's only preceded by two single quotes, and never closed. Or is it simply that even in that case, the %20 is problematic?
If mod_rewrite is irrelevant, is there nothing I can do to stop the reverse proxy from decoding the %20 in the filename field?
Thanks,
Yves