As part of Acquia's security testing for Acquia Drupal, I've been experimenting with automated methods for detecting security vulnerabilities in Drupal and contributed modules. The time has come to report on my progress. If you want to learn more about this and are going to DrupalCon Hungary 2008, vote for my session proposal.
Data tainting is a kind of dynamic program analysis (it operates while the program is running) that can automatically detect one of the most frequent sources of security vulnerabilities: insufficiently validated user input. The idea behind data tainting is that when data is received from an untrusted source (such as GET request parameters or some parts of the database), it is marked as "tainted." Any data derived from tainted data (such as by string concatenation, passing function arguments, etc.) is also marked tainted. When tainted data is passed to a security-sensitive destination (such as the HTML response to a page request), a taint error is raised. Finally, when tainted data is validated in specific ways, the taint mark is removed so the data can be used freely.
What I am calling "Taint Drupal" is based on Taint PHP work by Wietse Venema along with some Drupal-specific customization particularly regarding the database. For more details, keep reading.
Here is a simple example:
<?php
function message($arg) {
return 'Your name is ' . $arg;
}
$name = $_GET['name'];
print message($name);
?>All request parameters in $_GET come directly from the user and can contain anything; therefore, they are "born tainted." $_GET['name'] is assigned to $name so, along with the actual value, $_GET['name']'s "taintedness" is also assigned to $name. When $name is passed to the function message(), the $arg argument also inherits the taint value. In the return statement, the string literal 'Your name is ' is not tainted because it comes from a trusted source (the program source code itself); however, when the non-tainted string literal is concatenated with the tainted $arg, the resulting string is tainted by $arg's, so the function result is also tainted. Finally, the print() statement receives this tainted value and, because it is programmed to, generates a taint error:
As the error message indicates, the problem is that we did not validate $_GET['name'] with htmlspecialchars(). If we change one line of the script to
<?php
$name = htmlspecialchars($_GET['name']);
?>the taint flag is removed from $name and print() no longer generates a warning.
This works great for data in memory during a single page request, but what about the database? Consider a simplified version of Drupal's node table, the basis of all locally stored content:
CREATE TABLE node (
id integer auto_increment,
uid integer,
title varchar(255),
body text
);When a user submits a node, Drupal stores the exact text provided for the title and body; it performs validation during output (there is a very good reason for this approach). Therefore, all data read from the title and body fields should be "born tainted" just like $_GET because the user has complete control over it. The the title or body are output before being run through htmlspecialchars() or some other similar validator, a taint error should result. The uid field, however, is the internally-generated user number for the author of the node. The user has no control over it. If the user is output in HTML (perhaps as part of a CSS style to allow per-user styling), no taint error should occur.
In Taint Drupal, I mark the title and body fields in the database schema as tainted (see, I told you it would come in handy! :-)). Whenever a SELECT query reads in these columns, my custom database driver marks them as tainted, just as if they had come directly from a user (which, ultimately, they did). Since the uid column is not marked tainted, the database driver leaves it alone.
Alert readers may have noticed I wrote "the user has no control over" the uid value. Well, what if a bug gives the user control? Taint Drupal prevents this by performing the reverse operation of SELECT-tainting: INSERT/UPDATE-taint-checking. Whenever it writes data to a column that is not marked tainted, it verifies that data is not tainted and logs an error if it is.
Here's a perfect example of Taint Drupal in action. Drupal 6.0 contained this code:
<?php
function node_page_edit($node) {
drupal_set_title($node->title);
return drupal_get_form($node->type . '_node_form', $node);
}
?>In Drupal 6, the drupal_set_title() function required that the caller validate the title value, typically with check_plain(), so this code represented a XSS vulnerability. A user could set a node's title to something like "<script>alert('XSS!')</script>" and, when some other user or administrator visited the node's Edit page, the script could execute. (This bug was fixed in Drupal 6.1.)
With Taint Drupal, visiting the Edit page for *any* node while this bug exists generates warnings:
page.tpl.php lines 7 and 70 are the two places where the node title is displayed in HTML. So, instead of waiting for someone to discover the vulnerability, Taint Drupal would have pointed it out the moment it was created.
Taint Drupal is still very much experimental. It depends on patches to Taint PHP, a custom-built version of the PHP interpreter, minimal Drupal core patches, and a custom module. It is also far from complete. The Taint PHP patches are excellent but incomplete; for example, I discovered that the internal PHP function str_replace() was not taint-enabled and as a result all $_POST data processed by the Form API was being de-tainted automatically! (I fixed that.) Not all of Drupal's core tables are yet taint-marked correctly. I am handling SELECT and INSERT queries, but not UPDATE or DELETE queries. There is a lot left to do.
If you'd like more details about this work, vote for my session proposal at DrupalCon 2008.
Comments
Exciting stuff, Barry! I'll
Exciting stuff, Barry! I'll be voting and hopefully attending. This would simply be... marvelous. : )
Check
Check http://drupal.org/project/security_scanner too.
I always thought binary
I always thought binary taint flags were a bad idea because the approach ignores different output contexts.
Imagine code that builds a query string with user data, e.g.
<?php print l('link', 'path', array('query' => "foo=". $_GET['foo'] ."&bar=5")); ?>. Because l() passes the generated URL through check_plain(), this passes the taint test and the code is marked as secure. However, the code failed to notice that $_GET['foo'] needed to be urlencoded() before inserted into the query string: inserting an ampersand or hash in $_GET['foo']'s value will mess up the generated URLs semantics and lead to possible bugs or security holes.Furthermore, it naively assumes that there is a universal way of making strings safe (and hence untainting strings). This sort of thinking leads to silly pieces of code like this, where the wrong escaping function is used.
<?php// We escape the hostname because it can be modified by a visitor.
if (!empty($_SERVER['HTTP_HOST'])) {
$cookie_domain = check_plain($_SERVER['HTTP_HOST']);
}
...
ini_set('session.cookie_domain', $cookie_domain);
?>
(three guesses to where that snippet comes from)
I have an approach in mind that uses nested context tags to not only detect such errors, but also suggest appropriate conversions, but I never really got around to properly writing it down. Perhaps a follow up post to my string theory post is in order.
Steven, you make several
Steven, you make several good points, and I have several responses. :-)
1. Neither data tainting nor any other automated technique will ever catch all security bugs, but they will catch some. Automated techniques increase the quality of security testing but can not really reduce the time required to perform a security audit because manual inspection is still always required.
2. check_plain() validates that a string is safe to include in HTML. In your l() example, the validation still achieves that goal. The fact that $_GET['foo'] was not urlencoded first may result in other bugs or security holes, sure, but not a XSS attack via the output of l().
3. A single binary "tainted" bit does assume that there is a single way to make strings safe, which is untrue (another excellent example is that check_plain() is completely insufficient for prevent SQL injection attacks because it does not escape the correct characters for SQL). However, there is no requirement to stick with a single "tainted" bit. Taint PHP supports several: TC_HTML, TC_SQL, TC_PREG, etc. htmlspecialchars() clears the TC_HTML bit. mysql_real_escape_string() clears the TC_SQL bit. preg_quote() clears the TC_PREG bit.
This idea can handle your l() example in principle. We could define a TC_URL bit, and urlencode() could clear it. We could then define that l() raises a taint exception if any argument has the TC_URL bit (it would not need to raise a taint exception if any argument has the TC_HTML bit because l() performs its own check_plain()). In general, we'd like to be able to define any arbitrary taint bits, data from which sources are born with them, and which functions set, clear, or raise an exception on those bits. Taint PHP is not currently this flexible but in principle it could be.
Taint Drupal (and Taint PHP upon which it is based) is still very much experimental.
Barry: I guess I'm a bit
Barry: I guess I'm a bit more alarmed by the false negatives that can result from this technique than you are. If there is a security tool developed that audits code and finds XSS issues, there will be people who rely on it completely. While XSS is an important attack vector, you can do a lot of damage with e.g. HTTP header injection too.
Furthermore, a tool such as this could and should IMO be designed to solve the related problem of data integrity as well. I've observed countless times when coders encounter an encoding or escaping issue, that they will simply sprinkle random escapes around their code until it seems to work. Usually this ends up causing problems with edge cases, where some characters may disappear or multiply.
Try posting a backslash in a digg.com comment: unless you double it, it vanishes. When editing, your previously escaped html special characters and slashes are edited as shown on screen, requiring you to manually re-escape them. This is hardly a small, amateur site, but even they can't get it right.
In the Drupal world, most people still haven't figured out that menu paths are not URLs, and off-base statements about what check_plain does or doesn't do are still thrown around.
To solve these problems, I think a bitmask approach is insufficient. Taint flags are not orthogonal, because check_plain(urlencode($string)) is not the same as urlencode(check_plain($string)): though check_plain() will escape all htmlspecialchars, it does so by inserting ampersands, which affects GET URL semantics. So with a bitmask approach, each operation that affects taint flags would most likely have to unset only one taint flag and re-set all the others.
This means that you lose the ability to look back and see where a string came from, for example to validate that you can safely unescape a particular string, or to verify whether unescaping is even necessary at all: unescaping would end up setting all taint flags, making it impossible to audit any code after it without manual intervention.
I started writing up that blog post tho, so expect more on this in a day or so.
Post new comment