Content hosting for the modern web

Wednesday, August 29, 2012 9:45 AM

Our applications host a variety of web content on behalf of our users, and over the years we learned that even something as simple as serving a profile image can be surprisingly fraught with pitfalls. Today, we wanted to share some of our findings about content hosting, along with the approaches we developed to mitigate the risks.

Historically, all browsers and browser plugins were designed simply to excel at displaying several common types of web content, and to be tolerant of any mistakes made by website owners. In the days of static HTML and simple web applications, giving the owner of the domain authoritative control over how the content is displayed wasn’t of any importance.

It wasn’t until the mid-2000s that we started to notice a problem: a clever attacker could manipulate the browser into interpreting seemingly harmless images or text documents as HTML, Java, or Flash—thus gaining the ability to execute malicious scripts in the security context of the application displaying these documents (essentially, a cross-site scripting flaw). For all the increasingly sensitive web applications, this was very bad news.

During the past few years, modern browsers began to improve. For example, the browser vendors limited the amount of second-guessing performed on text documents, certain types of images, and unknown MIME types. However, there are many standards-enshrined design decisions (such as ignoring MIME information on any content loaded through <object>, <embed>, or <applet>) that are much more difficult to fix; these practices may lead to vulnerabilities similar to the GIFAR bug.

Google’s security team played an active role in investigating and remediating many content sniffing vulnerabilities during this period. In fact, many of the enforcement proposals were first prototyped in Chrome. Even so, overall progress is slow; for every resolved problem, researchers discover a previously unknown flaw in another browser mechanism. Two recent examples are the Byte Order Mark (BOM) vulnerability reported to us by Masato Kinugawa, and the MHTML attacks that we have seen happening in the wild.

For a while, we focused on content sanitization as a possible workaround, but in many cases we found it to be insufficient. For example, Aleksandr Dobkin managed to construct a purely alphanumeric Flash applet, and in our internal work the Google security team created images that can be forced to include a particular plaintext string in their body, after being scrubbed and recoded in a deterministic way.

In the end, we reacted to this raft of content hosting problems by placing some of the high-risk content in separate, isolated web origins—most commonly * There, the “sandboxed” files pose virtually no threat to the applications themselves, or to authentication cookies. For public content, that’s all we need: we may use random or user-specific subdomains, depending on the degree of isolation required between unrelated documents, but otherwise the solution just works.
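
As a rough illustration of the two subdomain options mentioned above, the sketch below derives either a random, single-use sandbox hostname or a stable per-user one. It is not the scheme described in this post: the domain usercontent.example, the function names, and the keyed-hash derivation are all assumptions made for the example.

```python
import hashlib
import hmac
import secrets

# Hypothetical sandbox domain, used only for illustration.
SANDBOX_DOMAIN = "usercontent.example"


def random_sandbox_host() -> str:
    """Return a fresh, unguessable hostname so a document gets its own origin."""
    label = secrets.token_hex(16)  # 32 hex characters, a valid DNS label
    return f"{label}.{SANDBOX_DOMAIN}"


def user_sandbox_host(user_id: str, key: bytes) -> str:
    """Return a stable per-user hostname.

    All files of one user share an origin, but stay isolated from the
    files of every other user. The keyed hash keeps the raw user id out
    of the hostname (an assumption of this sketch, not a requirement).
    """
    digest = hmac.new(key, user_id.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"{digest[:32]}.{SANDBOX_DOMAIN}"
```

Random hostnames give the strongest isolation between unrelated documents; per-user hostnames trade some of that isolation for stable URLs.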

The situation gets more interesting for non-public documents, however. Copying users’ normal authentication cookies to the “sandbox” domain would defeat the purpose. The natural alternative is to move the secret token used to confer access rights from the Cookie header to a value embedded in the URL, and make the token unique to every document instead of keeping it global.
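
A minimal sketch of the per-document token idea, under stated assumptions: the origin https://sandbox.usercontent.example, the /d/<doc_id>/<token> URL layout, and the in-memory token table are hypothetical stand-ins, not details from this post.

```python
import hmac
import secrets

# Hypothetical sandbox origin, used only for illustration.
SANDBOX_ORIGIN = "https://sandbox.usercontent.example"

# In-memory stand-in for a real datastore: document id -> secret token.
_tokens = {}


def issue_document_url(doc_id: str) -> str:
    """Create a secret token unique to this document and embed it in the URL."""
    token = secrets.token_urlsafe(32)  # URL-safe, unguessable
    _tokens[doc_id] = token
    return f"{SANDBOX_ORIGIN}/d/{doc_id}/{token}"


def is_authorized(doc_id: str, presented_token: str) -> bool:
    """Check the token from the URL against the one stored for this document."""
    expected = _tokens.get(doc_id)
    if expected is None:
        return False
    # Constant-time comparison avoids leaking the token through timing.
    return hmac.compare_digest(expected.encode("utf-8"),
                               presented_token.encode("utf-8"))
```

Because each token is valid for one document only, a leaked URL exposes that document and nothing else.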

While this solution eliminates many of the significant design flaws associated with HTTP cookies, it trades one imperfect authentication mechanism for another. In particular, it’s important to note there are more ways to accidentally leak a capability-bearing URL than there are to accidentally leak cookies; the most notable risk is disclosure through the Referer header for any document format capable of including external subresources or of linking to external sites.

In our applications, we take a risk-based approach. Generally speaking, we tend to use three strategies:
  • In higher risk situations (e.g. documents with elevated risk of URL disclosure), we may couple the URL token scheme with short-lived, document-specific cookies issued for specific subdomains of This mechanism, known within Google as FileComp, relies on a range of attack mitigation strategies that are too disruptive for Google applications at large, but work well in this highly constrained use case.
  • In cases where the risk of leaks is limited but responsive access controls are preferable (e.g., embedded images), we may issue URLs bound to a specific user, or ones that expire quickly.
  • In low-risk scenarios, where usability requirements necessitate a more balanced approach, we may opt for globally valid, longer-lived URLs.

Of course, the research into the security of web browsers continues, and the landscape of web applications is evolving rapidly. We are constantly tweaking our solutions to protect Google users even better, and even the solutions described here may change. Our commitment to making the Internet a safer place, however, will never waver.


oam said...

What about using a subdomain and having authentication cookies tied to * with the HTTPOnly flag set? It does sound risky but I can't think of any attack.

Thomas Skora said...

It doesn't just sound risky: hosting user content on subdomains is risky. I've seen several cases where this opened the way to exploitation of session fixation issues. There are further attack vectors for such setups, such as cross-domain policies, CORS, or document.domain.

So putting user-provided content in a separate domain is a very good idea.

Michal Zalewski said...

oam: it's an improvement, but there are at least two problems with just using something like http[s]://

1) If the attacker knows the URL of any interesting private document within, and can host his own malicious file in the same origin, it is fairly easy to steal sensitive data.

2) Although httponly cookies can't be read back by scripts (save for semi-frequent plugin bugs), they can typically be overwritten with some minimal effort - which will often have very serious consequences, especially for complex web apps.

oam said...

Yeah, it makes sense. Thanks!

Chris Weber said...

Was the "Byte Order Mark (BOM) vulnerability reported to us by Masato Kinugawa" described anywhere in more detail?

Michal Zalewski said...

Probably not in English :-) But the basic idea is that Internet Explorer would give precedence to BOM indicators in the file over the charset= value present in Content-Type or META, allowing many documents to suddenly become UTF-7 or so.

I believe that Microsoft folks changed this behavior earlier this year.

Will Sargent said...

To oam's question about subdomains, I believe that if you allow this and you have loose cookie rules, you are vulnerable to cookie tossing, aka "Same Origin Policy Abuse Techniques".

Nathan Belomy said...

The internet takes the path of Linux/Unix. All the design flaws will be changed in time. Changing the entire internet protocol suite is option 2. Think about writing a replacement for TCP/IP, it's a funny one.

enterprisemobilehub said...

Very informative post! Thanks a lot!


very informative point here.

Web Hosting India said...

Security is one of the major issues with my website. My site used only HTML, with no dynamic features except a few small things. After a good start, I began to see success online and decided to move to a WordPress website, but within a few weeks of the new site's launch I had a real setback: the site was showing errors and some kind of hack message. I don't know enough about these things, and my developer failed to handle the situation, so I had the website restored by my web host, but I am still worried that it could get much worse.

Rakesh Khuntia said...

Ideally, I think companies start out with shared hosting services and move up to VPS/dedicated hosting. A nice brief on all types of hosting!