A huge amount of web use just uses linked resources in http: space. Other programs on a unix machine like GNU/Linux or Mac OS X operate on quite heavily linked data in the file system. It is actually pretty interesting to live on the edge, or more specifically on the intersection of these worlds, where you can address the same files both as local files and as resources on the web. Why do both? Well, different things work better in different worlds.
HTTP space | File space |
---|---|
Access to people not on the machine | Files you don't want to share |
Access control using web-wide or server-wide identities | Access control using unix users and groups |
Slower | Faster access for serious data munging |
Use web browser, web services out there | Use all kinds of apps, command line programs, scripts, etc |
 | git or your favorite DVCS |
 | make or your favorite project building system |
So there is a case to be made for having stuff which is in both spaces at the same time. So you have these things in a file system and you have a web server running over that file space, and sometimes you use one and sometimes the other.
So the question which arises is: in the web server, how should you map the requested HTTP URI into a filename? All basic off-the-shelf web servers do this out of the box. That sounds simple, right? Just make the URL the same as the filename, modulo the start path.
One problem with that is that most operating systems internally use file extensions (the part after the dot in "foo.html") to determine the type of the file. In HTTP, this information is communicated explicitly using the Content-Type: field. You can set your server up to serve the right content type by looking at the file extension. But what happens if, in the web world, you give write access to a space and someone writes to a URL with no file extension, or with a file extension which is wrong in that it does not match the actual content type of the file? This is a typical attack for a malicious person: to put a file called "read.exe", ostensibly of content-type text/html, which in file: space will leave something that looks to the operating system like a program, and which when you click on it will run and give someone else control of your machine.
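As a minimal sketch of the "look at the file extension" approach, the following uses Python's standard `mimetypes` module. The function name `content_type_for` and the fallback type are assumptions made for this illustration, not something any particular server does.

```python
import mimetypes


def content_type_for(filename: str,
                     default: str = "application/octet-stream") -> str:
    """Guess a Content-Type value from the file extension alone.

    Nothing here inspects the file's bytes, so a file whose extension
    does not match its real content is reported with the wrong type.
    """
    guessed, _encoding = mimetypes.guess_type(filename)
    return guessed or default


print(content_type_for("foo.html"))
print(content_type_for("foo"))  # no extension, so the default is used
```

Because the guess depends only on the name, a file saved as "read.exe" is never reported as text/html, whatever type the client declared when it wrote the file.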
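So it seems that there are three incompatible requirements: that URIs and filenames map directly onto each other; that a web-side user can PUT a file with any URI they like onto the server; and that within file: space the extensions match the actual types of the files.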
These are obviously incompatible. So we have to pick two of the three.
This is no fun, as each requirement is hard to do without.
So, which of these requirements should we sacrifice? Let's look at three systems, each of which sacrifices a different one.
Here is one way to keep the other two requirements if we drop direct mapping. Suppose a user saves a file to the server using an HTTP PUT as follows:
URI on the web | Content type | Filename on the system |
---|---|---|
http://localhost/space/foo.html | text/html | /users/bob/space/foo.html.html |
http://localhost/space/foo.bar | text/html | /users/bob/space/foo.bar.html |
http://localhost/space/foo.exe | text/html | /users/bob/space/foo.exe.html |
We have dropped the requirement that filenames and URIs should directly correspond. But the whole premise of this article is that being able to switch between spaces is a very powerful thing to do. The motivations are many, and some of them were discussed in the first section.
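The always-append mapping in the table above can be sketched as follows. The `EXTENSION_FOR` table, the function name and the `/users/bob` root are assumptions chosen to match the table; the sketch ignores percent-decoding and ".." path segments, which a real server would have to handle.

```python
import posixpath
from urllib.parse import urlsplit

# Hypothetical table of the one extension the server trusts per content type.
EXTENSION_FOR = {
    "text/html": ".html",
    "text/plain": ".txt",
    "image/png": ".png",
}


def filename_always_append(uri: str, content_type: str,
                           root: str = "/users/bob") -> str:
    """Map a PUT to a filename by always appending the type's extension."""
    path = urlsplit(uri).path
    # The URI path is kept whole and the extension is added on the end,
    # so the URI and the filename never coincide.
    return posixpath.join(root, path.lstrip("/")) + EXTENSION_FOR[content_type]


print(filename_always_append("http://localhost/space/foo.exe", "text/html"))
# prints /users/bob/space/foo.exe.html
```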
If we instead drop the freedom to write to any URI, a simple way to preserve the other two requirements is to map as follows:
URI on the web | Content type | Filename on the system |
---|---|---|
http://localhost/space/foo.html | text/html | /users/bob/space/foo.html |
http://localhost/space/foo.bar | text/html | BLOCKED - return error |
http://localhost/space/foo.exe | text/html | BLOCKED - return error |
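The blocking policy in the table can be sketched like this. The exception class, the `EXTENSION_FOR` table and the function name are hypothetical; a real server would turn the exception into an HTTP error response of its choosing.

```python
import posixpath
from urllib.parse import urlsplit

# Hypothetical table of the one extension the server trusts per content type.
EXTENSION_FOR = {"text/html": ".html", "text/plain": ".txt"}


class ExtensionMismatch(Exception):
    """Raised when the URI's extension does not match the declared type."""


def filename_or_reject(uri: str, content_type: str,
                       root: str = "/users/bob") -> str:
    """Map a PUT directly to a filename, refusing mismatched extensions."""
    path = urlsplit(uri).path
    expected = EXTENSION_FOR[content_type]
    if not path.endswith(expected):
        raise ExtensionMismatch(
            f"{path} does not end in {expected} for {content_type}")
    return posixpath.join(root, path.lstrip("/"))


print(filename_or_reject("http://localhost/space/foo.html", "text/html"))
# prints /users/bob/space/foo.html
```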
We have dropped the requirement that a web-side user should be able to PUT a file with any URI they like onto the server. By breaking this one, you turn your server from a general-purpose storage space which can be used out of the box by any read-write-web application into a space which can only be used by an application which has absorbed and adapted to the particular constraints you have on your server. For this to be practical for
If we instead drop the requirement that file extensions match content types, a simple way to preserve the other two requirements is to map as follows:
URI on the web | Content type | Filename on the system |
---|---|---|
http://localhost/space/foo.html | text/html | /users/bob/space/foo.html |
http://localhost/space/foo.bar | text/html | /users/bob/space/foo.bar |
http://localhost/space/foo.exe | text/html | /users/bob/space/foo.exe |
We have dropped the requirement that within file: space the extensions match the actual types of the files. This is both a convenience and a security issue. A classic attack on any system is to confuse it about the type of data, for example by storing a file which was received as text/html under a filename extension (such as ".exe") which means something else, so that when someone clicks on it they end up executing a program and giving away control of their machine. Control of the local file extensions of things received from elsewhere is a crucial part of web security 101 and is not negotiable.
None of these are very appealing.
It turns out that it is possible to get 2.5 out of three, if you accept a weaker version of the first requirement. Suppose instead of
To achieve this, you can have the convention that URLs are mapped directly to filenames if and only if that produces a filename with a correct extension, and in other cases they are mapped to a different filename which (a) can be round-tripped from the URI and (b) has the right extension. For example, one can add an otherwise forbidden character such as "$" to flag the difference between a URI which has been tweaked and a URI which is raw.
URI on the web | Content type | Filename on the system |
---|---|---|
http://localhost/space/foo.html | text/html | /users/bob/space/foo.html |
http://localhost/space/foo.bar | text/html | /users/bob/space/foo.bar$.html |
http://localhost/space/foo.exe | text/html | /users/bob/space/foo.exe$.html |
This means that if I am prepared to use generated URIs which have file-like extension parts, then I get the ability to map between the two spaces. If I don't, then I don't, but everything else works.
Note that the scenario above is for a PUT by a client: the client gives the URI and the Content-Type, and the server has to work out the filename. That is a simple mathematical function.
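As a sketch of that function, assuming a small hypothetical `EXTENSION_FOR` table and the `/users/bob` root used in the table above (the function name is also made up for this example):

```python
import posixpath
from urllib.parse import urlsplit

# Hypothetical table of the one extension the server trusts per content type.
EXTENSION_FOR = {"text/html": ".html", "text/plain": ".txt"}


def filename_for_put(uri: str, content_type: str,
                     root: str = "/users/bob") -> str:
    """Pure function from (URI, content type) to a filename.

    If the URI already ends in the right extension it is used as is;
    otherwise "$" plus the right extension is appended.
    """
    path = urlsplit(uri).path
    ext = EXTENSION_FOR[content_type]
    filename = posixpath.join(root, path.lstrip("/"))
    if path.endswith(ext):
        return filename
    return filename + "$" + ext


for name in ("foo.html", "foo.bar", "foo.exe"):
    print(filename_for_put("http://localhost/space/" + name, "text/html"))
```

No directory lookup is needed here: the result depends only on the two values the client supplied.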
When a client later does a GET of the same URI, the server has to search for the file which matches the URI, or which is the URI with a dollar suffix. So the mapping is not just a mathematical function: it needs to look at the directory (as does the content negotiation algorithm).
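A minimal sketch of that lookup, under the assumption that an exact filename match is preferred over a dollar-suffixed one and that ties between several dollar-suffixed candidates are broken alphabetically (neither choice is specified above; the function name is hypothetical):

```python
import os
from typing import Optional


def find_file_for_get(directory: str, name: str) -> Optional[str]:
    """Find the file that backs the last path segment `name` of a URI.

    Unlike the PUT mapping, this has to read the directory, because the
    client does not say which extension was appended when the file was
    stored.
    """
    entries = os.listdir(directory)
    if name in entries:
        # The URI already carried a usable filename.
        return os.path.join(directory, name)
    prefix = name + "$."
    matches = sorted(e for e in entries if e.startswith(prefix))
    return os.path.join(directory, matches[0]) if matches else None
```

For a URI ending in "foo.bar", a directory containing "foo.bar$.html" yields that file, and the server can then derive the Content-Type from the extension after the "$".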
When the server is listing a directory, then the generation of URIs from filenames is a simple function.
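The reverse direction can be sketched as below. The function name is hypothetical, and the sketch relies on the stated convention that "$" is otherwise forbidden, so a "$" in the last path segment always marks an appended extension.

```python
import posixpath


def uri_path_for(filename: str) -> str:
    """Recover the URI path from a filename relative to the server root."""
    directory, name = posixpath.split(filename)
    # Only the last segment is examined; everything from the final "$"
    # onwards is the extension the server added on PUT.
    stem, dollar, _ext = name.rpartition("$")
    return posixpath.join(directory, stem if dollar else name)


for f in ("/space/foo.html", "/space/foo.bar$.html", "/space/foo.exe$.html"):
    print(uri_path_for(f))
```

Applied to a directory listing, this gives back exactly the URIs that clients originally wrote to, which is the round-trip property the convention asks for.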
An interesting opportunity here is that if many servers end up using exactly the same mapping, then I can simultaneously run more than one HTTP server over the same file space. I can also tar up (or zip up) a chunk of file space from one server and dump it onto another server. So this is an interesting problem to solve for one server, but it is also an interesting one to solve in a common way for inter-server portability of datasets.