Tainted Bugs (or, Automatically detecting XSS security holes)

With apologies to Gloria Jones and a variety of others...

Sometimes I feel there has to be a way
To improve securi-tay
To automatically prevent attacks
The bugs we fix seem not to help one bit
To make the exploit-tays
Not come back. They should stay away!
Oh! Tainted bugs!

As part of Acquia's security testing for Acquia Drupal, I've been experimenting with automated methods for detecting security vulnerabilities in Drupal and contributed modules. The time has come to report on my progress. If you want to learn more about this and are going to DrupalCon Hungary 2008, vote for my session proposal.

Data tainting is a kind of dynamic program analysis (it operates while the program is running) that can automatically detect one of the most frequent sources of security vulnerabilities: insufficiently validated user input. The idea behind data tainting is that when data is received from an untrusted source (such as GET request parameters or some parts of the database), it is marked as "tainted." Any data derived from tainted data (such as by string concatenation, passing function arguments, etc.) is also marked tainted. When tainted data is passed to a security-sensitive destination (such as the HTML response to a page request), a taint error is raised. Finally, when tainted data is validated in specific ways, the taint mark is removed so the data can be used freely.

What I am calling "Taint Drupal" is based on Taint PHP work by Wietse Venema along with some Drupal-specific customization particularly regarding the database. For more details, keep reading.

Tainting memory

Here is a simple example:

function message($arg) {
'Your name is ' . $arg;

$name = $_GET['name'];

All request parameters in $_GET come directly from the user and can contain anything; therefore, they are "born tainted." $_GET['name'] is assigned to $name so, along with the actual value, $_GET['name']'s "taintedness" is also assigned to $name. When $name is passed to the function message(), the $arg argument also inherits the taint value. In the return statement, the string literal 'Your name is ' is not tainted because it comes from a trusted source (the program source code itself); however, when the non-tainted string literal is concatenated with the tainted $arg, the resulting string is tainted by $arg's, so the function result is also tainted. Finally, the print() statement receives this tainted value and, because it is programmed to, generates a taint error:

print(): Argument contains data that is not converted with htmlspecialchars() or htmlentities() in yourscript.php on line 6

As the error message indicates, the problem is that we did not validate $_GET['name'] with htmlspecialchars(). If we change one line of the script to

= htmlspecialchars($_GET['name']);

the taint flag is removed from $name and print() no longer generates a warning.

Tainting database reads

This works great for data in memory during a single page request, but what about the database? Consider a simplified version of Drupal's node table, the basis of all locally stored content:

   id integer auto_increment,
   uid integer,
   title varchar(255),
   body text

When a user submits a node, Drupal stores the exact text provided for the title and body; it performs validation during output (there is a very good reason for this approach). Therefore, all data read from the title and body fields should be "born tainted" just like $_GET because the user has complete control over it. The the title or body are output before being run through htmlspecialchars() or some other similar validator, a taint error should result. The uid field, however, is the internally-generated user number for the author of the node. The user has no control over it. If the user is output in HTML (perhaps as part of a CSS style to allow per-user styling), no taint error should occur.

In Taint Drupal, I mark the title and body fields in the database schema as tainted (see, I told you it would come in handy! :-)). Whenever a SELECT query reads in these columns, my custom database driver marks them as tainted, just as if they had come directly from a user (which, ultimately, they did). Since the uid column is not marked tainted, the database driver leaves it alone.

Taint-checking database writes

Alert readers may have noticed I wrote "the user has no control over" the uid value. Well, what if a bug gives the user control? Taint Drupal prevents this by performing the reverse operation of SELECT-tainting: INSERT/UPDATE-taint-checking. Whenever it writes data to a column that is not marked tainted, it verifies that data is not tainted and logs an error if it is.


Here's a perfect example of Taint Drupal in action. Drupal 6.0 contained this code:

function node_page_edit($node) {
drupal_get_form($node->type . '_node_form', $node);

In Drupal 6, the drupal_set_title() function required that the caller validate the title value, typically with check_plain(), so this code represented a XSS vulnerability. A user could set a node's title to something like "<script>alert('XSS!')</script>" and, when some other user or administrator visited the node's Edit page, the script could execute. (This bug was fixed in Drupal 6.1.)

With Taint Drupal, visiting the Edit page for *any* node while this bug exists generates warnings:

  • warning: print() [function.print]: Argument contains data that is not converted with htmlspecialchars() or htmlentities() in .../themes/garland/page.tpl.php on line 7.
  • warning: print() [function.print]: Argument contains data that is not converted with htmlspecialchars() or htmlentities() in .../themes/garland/page.tpl.php on line 70.

page.tpl.php lines 7 and 70 are the two places where the node title is displayed in HTML. So, instead of waiting for someone to discover the vulnerability, Taint Drupal would have pointed it out the moment it was created.

Current status

Taint Drupal is still very much experimental. It depends on patches to Taint PHP, a custom-built version of the PHP interpreter, minimal Drupal core patches, and a custom module. It is also far from complete. The Taint PHP patches are excellent but incomplete; for example, I discovered that the internal PHP function str_replace() was not taint-enabled and as a result all $_POST data processed by the Form API was being de-tainted automatically! (I fixed that.) Not all of Drupal's core tables are yet taint-marked correctly. I am handling SELECT and INSERT queries, but not UPDATE or DELETE queries. There is a lot left to do.

If you'd like more details about this work, vote for my session proposal at DrupalCon 2008.