Mapping http: and file: spaces - Design Issues

Motivation

A huge amount of web use just uses linked resources in http: space. Other programs on a Unix machine, such as GNU/Linux or Mac OS X, operate on linked data of their own in the file system. It is actually pretty interesting to live on the edge, or more specifically on the intersection of these worlds, where you can address the same files both as local files and as resources on the web. Why do both? Well, different things work better in different worlds.

HTTP space	File space
Access to people not on the machine	Files you don't want to share
Access control using web-wide or server-wide identities	Access control using unix users and groups
Slower	Faster access for serious data munging
Use web browser, web services out there	Use all kinds of apps, command line programs, scripts, etc
	`git` or your favorite DVCS
	`make` or your favorite project building system

So there is a case to be made for having stuff which is in both spaces at the same time. So you have these things in a file system and you have a web server running over that file space, and sometimes you use one and sometimes the other.

So the question which arises is: in the web server, how should you map the requested HTTP URI into a filename? All basic off-the-shelf web servers do this out of the box. That sounds simple, right? Just make the URL the same as the filename, modulo the start path.

One problem with that is that most operating systems internally use file extensions -- the part after the dot in "foo.html" -- to determine the type of the file. In HTTP, this information is communicated explicitly using the Content-Type: field. You can set your server up to serve the right content-type by looking at the file extension. But what happens if, in the web world, you give write access to a space and someone writes to a URL with no file extension, or with a file extension which is wrong because it does not match the actual content type of the file? This is a typical attack for a malicious person: to put a file called "read.exe", ostensibly of content-type text/html, which in file: space will leave something that looks to the operating system like a program, and which when you click on it will run and give someone else control of your machine.
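To make the extension check concrete, here is a minimal Python sketch using the standard-library mimetypes module. It shows how a server might derive a Content-Type from an extension and notice that a declared type disagrees with it. The function names are illustrative only and are not taken from any particular server.

```python
import mimetypes


def content_type_for(filename, default="application/octet-stream"):
    """Guess a Content-Type value from the filename extension alone."""
    guessed, _encoding = mimetypes.guess_type(filename)
    return guessed or default


def extension_agrees(filename, declared_type):
    """True if the extension implies the same type the client declared."""
    return content_type_for(filename) == declared_type


print(extension_agrees("foo.html", "text/html"))  # True
print(extension_agrees("read.exe", "text/html"))  # False
```

The second call is the "read.exe" case described above: the declared type says HTML, but the extension says something else.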

So it seems that there are three incompatible requirements.

The filenames and URIs should directly correspond -- so that links work on both spaces
A web-side user should be able to PUT a file with any URI they like onto the server
Within file: space, the extensions should match the actual type of the files.

These are obviously incompatible. So we have to pick two of the three.

This is no fun, as each requirement is hard to do without.

So, which of these requirements should we sacrifice? Let's look at three systems, each of which sacrifices a different one.

Try 1: Drop direct mapping?

Here is one way to keep the other two requirements if we drop direct mapping. Suppose a user saves a file to the server using an HTTP PUT as follows:

URI on the web	Content type	Filename on the system
http://localhost/space/foo.html	text/html	/users/bob/space/foo.html.html
http://localhost/space/foo.bar	text/html	/users/bob/space/foo.bar.html
http://localhost/space/foo.exe	text/html	/users/bob/space/foo.exe.html

We have dropped the requirement that filenames and URIs should directly correspond. But the whole premise of this article is that being able to switch between spaces is a very powerful thing to do. The motivations are many; some of them were discussed in the first section.
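As a rough illustration, the Try 1 mapping in the table above could be sketched as follows. The prefix, root directory and type-to-extension table are hypothetical values chosen to match the table, and the sketch ignores path-traversal checks that a real server would need.

```python
# Hypothetical configuration, chosen to match the table above.
URI_PREFIX = "/space/"
FILE_ROOT = "/users/bob/space/"
EXTENSION_FOR_TYPE = {"text/html": ".html", "text/plain": ".txt"}


def try1_filename(uri_path, content_type):
    """Always append the extension implied by the content type."""
    if not uri_path.startswith(URI_PREFIX):
        raise ValueError("URI is outside the mapped space")
    relative = uri_path[len(URI_PREFIX):]
    return FILE_ROOT + relative + EXTENSION_FOR_TYPE[content_type]


print(try1_filename("/space/foo.html", "text/html"))
# /users/bob/space/foo.html.html
```

Even a URI that already ends in ".html" gets a second ".html", which is exactly why links stop working across the two spaces.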

Try 2: Drop arbitrary PUT URIs?

A simple way to preserve the other two requirements is to map as follows:

URI on the web	Content type	Filename on the system
http://localhost/space/foo.html	text/html	/users/bob/space/foo.html
http://localhost/space/foo.bar	text/html	BLOCKED - return error
http://localhost/space/foo.exe	text/html	BLOCKED - return error
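A minimal sketch of the Try 2 rule in the table above might look like this. The configuration values are hypothetical, and returning None stands in for whatever error response the server would actually send.

```python
# Hypothetical configuration, chosen to match the table above.
URI_PREFIX = "/space/"
FILE_ROOT = "/users/bob/space/"
EXTENSION_FOR_TYPE = {"text/html": ".html", "text/plain": ".txt"}


def try2_filename(uri_path, content_type):
    """Map directly, but refuse any PUT whose extension does not fit the type."""
    if not uri_path.startswith(URI_PREFIX):
        raise ValueError("URI is outside the mapped space")
    relative = uri_path[len(URI_PREFIX):]
    if not relative.endswith(EXTENSION_FOR_TYPE[content_type]):
        return None  # BLOCKED: the server would return an error here
    return FILE_ROOT + relative


print(try2_filename("/space/foo.html", "text/html"))  # /users/bob/space/foo.html
print(try2_filename("/space/foo.exe", "text/html"))   # None
```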

We have dropped the requirement that a web-side user should be able to PUT a file with any URI they like onto the server. By breaking this one, you turn your server from a general-purpose storage space, which can be used out of the box by any read-write-web application, into a space which can only be used by an application which has absorbed and adapted to the particular constraints you have on your server. For this to be practical for

Try 3: Drop correct extensions?

A simple way to preserve the other two requirements is to map as follows:

URI on the web	Content type	Filename on the system
http://localhost/space/foo.html	text/html	/users/bob/space/foo.html
http://localhost/space/foo.bar	text/html	/users/bob/space/foo.bar
http://localhost/space/foo.exe	text/html	/users/bob/space/foo.exe

We have dropped the requirement that within file: space the extensions match the actual type of the files. This is actually both a convenience and a security issue. A classic attack on any system is to confuse it about the type of data, for example by storing a file which was received as text/html under a filename extension (such as ".exe") which means something else, so that when someone clicks on it they end up executing a program and giving away control of their machine. Control of the local file extensions of things received from elsewhere is a crucial part of web security 101 and not negotiable.

None of these are very appealing.

Solution: The sweet spot

It turns out that it is possible to get 2.5 out of three, if you accept a weaker version of the first requirement. Suppose instead of

All filenames and URIs should directly correspond -- so that links work on both spaces

you have

Some filenames and URIs should directly correspond -- so that links work on both spaces for people who pick those names

To achieve this, you can have the convention that URLs are mapped directly to filenames if and only if that produces a filename with a correct extension, and in other cases they are mapped to a different filename which (a) can be round-tripped from the URI and (b) has the right extension. For example, one can add an otherwise forbidden character such as "$" to flag the difference between a URI which has been tweaked and a URI which is raw.

URI on the web	Content type	Filename on the system
http://localhost/space/foo.html	text/html	/users/bob/space/foo.html
http://localhost/space/foo.bar	text/html	/users/bob/space/foo.bar$.html
http://localhost/space/foo.exe	text/html	/users/bob/space/foo.exe$.html

This means that if I am prepared to use generated URIs which have file-like extension parts, then I get the ability to map between the two spaces. If I don't, then I don't, but everything else works.

Note the scenario above is for a PUT by a client: when the client gives the URI and the Content-Type, and the server has to work out the filename. That is a simple mathematical function.
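The PUT-side function can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the prefix, root and type-to-extension table are hypothetical values matching the table above, and "$" is assumed never to appear in a URI that clients are allowed to use.

```python
# Hypothetical configuration, chosen to match the table above.
URI_PREFIX = "/space/"
FILE_ROOT = "/users/bob/space/"
EXTENSION_FOR_TYPE = {"text/html": ".html", "text/plain": ".txt"}


def put_filename(uri_path, content_type):
    """Map a PUT URI and Content-Type to the filename to store under."""
    if not uri_path.startswith(URI_PREFIX):
        raise ValueError("URI is outside the mapped space")
    relative = uri_path[len(URI_PREFIX):]
    if "$" in relative:
        raise ValueError("'$' is reserved for tweaked filenames")
    extension = EXTENSION_FOR_TYPE[content_type]
    if relative.endswith(extension):
        return FILE_ROOT + relative                # direct mapping
    return FILE_ROOT + relative + "$" + extension  # tweaked mapping


print(put_filename("/space/foo.html", "text/html"))  # /users/bob/space/foo.html
print(put_filename("/space/foo.exe", "text/html"))   # /users/bob/space/foo.exe$.html
```

The function depends only on its two arguments, so it never needs to consult the directory.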

When a client later does a GET of the same URI, the server has to search for the file which matches the URI, or is the URI with a dollar suffix. So the mapping function is not just a mathematical function: it needs to look at the directory (as does the content negotiation, or "conneg", algorithm).
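A hedged sketch of the GET-side lookup follows. The function name is hypothetical, and two choices are assumptions rather than part of the convention described here: a requested name containing "$" is refused, and if several dollar-suffixed candidates exist, the first in sorted order wins.

```python
import os
import tempfile


def find_stored_name(directory, uri_name):
    """Return the directory entry that stores `uri_name`, or None."""
    if "$" in uri_name:
        return None  # assumption: "$" is never valid in a requested URI
    entries = set(os.listdir(directory))
    if uri_name in entries:
        return uri_name  # direct mapping
    tweaked = sorted(e for e in entries if e.startswith(uri_name + "$."))
    return tweaked[0] if tweaked else None  # dollar-suffixed mapping


with tempfile.TemporaryDirectory() as space:
    for name in ("foo.html", "foo.bar$.html"):
        open(os.path.join(space, name), "w").close()
    print(find_stored_name(space, "foo.html"))  # foo.html
    print(find_stored_name(space, "foo.bar"))   # foo.bar$.html
    print(find_stored_name(space, "foo.exe"))   # None
```

Once the file is found, the extension after the "$" tells the server which Content-Type to send back.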

When the server is listing a directory, then the generation of URIs from filenames is a simple function.
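Going the other way, from filenames to URIs, might be sketched like this. The base URI is the example one from the tables above, and the helper names are illustrative.

```python
URI_BASE = "http://localhost/space/"  # example base taken from the tables above


def uri_name_for(filename):
    """Strip a trailing "$.ext" marker, if present, to recover the URI name."""
    stem, dollar, suffix = filename.rpartition("$")
    if dollar and suffix.startswith("."):
        return stem
    return filename


def list_uris(filenames):
    """Produce the URIs a directory listing would expose."""
    return [URI_BASE + uri_name_for(name) for name in sorted(filenames)]


print(list_uris(["foo.html", "foo.bar$.html", "foo.exe$.html"]))
```

No directory lookup is needed in this direction, because each filename determines its URI on its own.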

An interesting opportunity arises if many servers end up using exactly the same mapping: I can then simultaneously run more than one HTTP server over the same file space. I can also tar up (or zip up) a chunk of file space from one server and dump it onto another server. So this is an interesting problem to solve for one server, but it is also an interesting one to solve in a common way, for inter-server portability of datasets.
