The Schema Project: Database abstraction, reflection, and migration

Almost a year after first discussing the idea on the Drupal developer’s mailing list I am finally ready to announce the Drupal Schema project. The goal of the project is to provide a database schema abstraction layer for Drupal. The benefits of the project will include more reliable compatibility with a greater range of database systems, reduced install-related code size, centralization of the growing schema information required by or for other modules (such as Views and CCK), and a wide variety of other advanced capabilities that will no doubt arise once the system is in place.

The killer application of the Schema Project’s capabilities, I believe, will be “incremental data migration”: the ability to move almost any subset of a Drupal system (nodes, Views, Workflows, CCK types, configuration settings, etc.) from one live, running system to another. Incremental data migration will make it possible to develop, test, stage, and promote site changes without having to manually re-create configuration and data at each step along the way.

I’ll be at the 2007 OSCMS Summit and hope to gather some momentum for this project while there.

What is database schema abstraction?

A database schema is a description of the structure of a database. For a relational database, the schema specifies the tables, columns, and relationships among them. In Drupal’s case, the schema specifies things like: “A node is identified by a nid and has a title, author, creation time, body, and …”.

A database schema abstraction is some kind of language or representation of a database schema that allows specifying, studying, or manipulating a database schema. Database schemas and their abstractions are a large topic in computer science and software development. There are schema modelling languages, entity relationship diagrams, and all manner of approaches and tools for working with them.

I am not a database systems expert and know very little about the science of database schema abstractions in general (though I imagine I’ll learn more than I wanted to know through this project). I do not want to and cannot move Drupal to the cutting edge database modelling. Instead, my goal is to take an incremental step forward that I believe will provide substantial benefit for relatively minor effort.

What is database schema reflection?

It would be possible (and quite useful) to define Drupal’s database schema in documentation that can only be read by humans. My goal, however, is to provide Drupal with “schema reflection”: the ability use and manipulate its own schema at run-time.

The Forms API provides a perfect example. Before Drupal 4.7, each form was separately constructed directly as HTML. For one module to modify another module’s form, it could either add things at the top or bottom or use string replacement to change things in the middle. This approach was error-prone and inflexible. Starting with Drupal 4.7, forms are represented as structured data until the very last minute. Now, for one module to modify another module’s form, it uses the data structure to access each element and attribute by name or insert new elements at any location. This change to data-driven forms provided an enormous increase in power and flexibility and, not incidentally, made form-generation code much easier to read and maintain.

In current versions of Drupal, database tables are created by explicit CREATE TABLE statements embedded in each module’s .install file. Since every SQL database engine uses different syntax for these statements, a separate copy of each one is required for each database. Each module has a “schema version number” stored in a central table but if a module’s code and schema get out of date (and they can) there is no way for Drupal to notice or recover from the inconsistency.

With schema reflection, Drupal’s current database schema as the modules expect it to be is represented in a single data structure. Each module only needs to define its tables once and the Schema module can create them in whatever DBMS engine is being used. The Schema module can inspect the actual current structure of the database and compare it to the module-generated schema, noticing and warning (or possibly acting) on any changes.

Finally, schema relfection will allow Drupal to act on semantic information about the schema itself. For example, the schema data structure can specify that “the comment table joins to the node table using the nid field as a many-to-one relationship.” The as-yet-undesigned Migration module could then use this information to determine that, in order to migrate a node from one system to another, it must also gather all the comments connected to that node and migrate them as well.

What will the code look like?

The code I have created so far is a very early prototype. I am certain it contains design errors and ommissions. Everything about it is subject to change. I am presenting it here just to give an idea of what I have in mind.

Every module’s tables are defined in a data structure (just like a form), indicating its columns and data types, keys, etc. Not surprisingly, it looks very much like a Views hook_views_tables structure. For example:

$schema['node'] = array(
    'cols' => array(
         'nid' => array('int(10) unsigned', NOT_NULL),
         'vid' => array('int(10) unsigned', NOT_NULL, 'default' => '0'),
         'type' => array('varchar(32)', NOT_NULL),
// and so forth
),
    'keys' => array(
         'PRIMARY' => array('nid, vid', 'UNIQUE'),
         'vid' => array('vid', 'UNIQUE'),
         'nid' => array('nid'),
// and so forth
));
  • The ‘cols’ array has an element for each column.

  • Each column specifier includes the mandatory information name, type, and whether or not it can be NULL as well as optional information such as default values.

    • I have the mandatory information as positional elements simply to reduce typing (to eliminate ‘name’ => $name, ‘type’ => $type, etc.). This may or may not be a good idea.

    • Optional information can include proprietary DBMS parameters, semantic information such as ‘join n-to-1’ => ‘tablename.column’, module-specific information such as what Views needs to know for fields, filters, arguments, and sorts, and so forth.

  • The ‘keys’ array has an element for each table index.

  • Each key specifier includes the name of the key/index, the columns included in the index, and whether or not it is a unique index.

Each module that wants to define tables in the schema implements hook_schema:

function mymodule_schema() {
  $schema['#version'] = 1;
  $schema['mymodule_table_1'] = array(/* table definition */);
  $schema['mymodule_table_2'] = array(/* table definition */);
  // etc
  return $schema;
}

The Schema module invokes hook_schema, collects the results, decide what tables need to be created, and hands those tables off to a DBMS engine driver that translates into appropriate CREATE TABLE statements.

An open question is how schema updates will occur. The simple solution is to continue using the present install approach:

function mymodule_schema_update_2() {
  // Explicitly update the schema from version 1 to version 2.
  db_add_column('mymodule_table_1', 'new_column');
  db_remove_column('mymodule_table_1', 'old_column');
}

Another possibility is to include schema version information in the table definitions:

$schema['#version'] = 2;
$schema['mymodule_table_1'] = array(
    'cols' => array(
         'new_column' => array(...,
       'version_added' => 2),
         'old_column' => array(...,
       'version_removed' => 2),
         // etc
));

When the Schema module sees that the version for mymodule had changed from 1 to 2, it can scan through the tables looking for things changed in version 2 and automatically add or remove them (or invoke an update callback specified in the array). One advantage of this approach is that it stores current and historical information about schema structure in one place. However, there will probably always be some kinds of schema updates that require custom code.

Core or contrib?

I do not see any reason that the Schema system needs to be part of Drupal core. Any module that wants to participate in Schema defines hook_schema; any module that does not just continues to call db_query(‘CREATE TABLE’) in hook_install. If Schema is successful, over time more and more modules will support it, just a over time more and more modules have added support for Views.

Schema would not be very useful if Drupal’s core tables were not included, of course. The Schema module will provide the schema information for core tables via its own hook_schema.

What does this buy us?

The Schema proeject will provide many benefits:

  • Several modules (e.g. Views) already need schema reflection; this just centralizes the information and makes it much easier to create new modules that need reflection. A module that wants to support Views or other similar modules can store any necessary data in its $schema array. In fact, many modules will probably not have to do anything more to support such reflection-requiring modules.

  • Code size reduction and elimination of duplicate code. The automatically-generated schema for core is only slightly larger than all the explicit CREATE TABLE statements for a single database engine, but the single schema array will support multiple engines instead of requiring a duplicate set of CREATE TABLE statements for each engine.

  • The Schema module can use database inspection to validate the database layout against the defined schema, reporting errors and/or disabling any modules whose tables are not what they expect.

  • It can provide hook_schema_alter, allowing modules to alter other modules’ tables in a clean way. For example, a node-enhancing module could add additional columns to the node table, eliminating the need to join against its own custom table.

  • As already described, it is a huge step towards enabling incremental data migration across Drupal systems.

  • The Forms API made all kinds of things possible that were not before, many of them not foreseen until after FAPI existed. Schema will, too.

Arguable, this approach is somewhat more complicated than writing a single MySQL CREATE TABLE statement. Supporting Views is extra effort, too, but has a big payoff for module developers; so will this. I assert that it is easier to create a structure like this once than it is to write and maintain separate SQL statements for MySQL, Postgres, Oracle, Sybase, MS SQL Server, and god knows what else.

Current status and call for contributors

The Schema module currently knows how to:

  • Inspect a MySQL database to generate a schema array for existing tables and indexes (and, quite helpfully, output the PHP code to build the $schema array for those tables).

  • Call hook_schema for all modules that provide it.

  • Identify which tables returned by hook_schema do not exist in the database.

  • Generate and execute MySQL CREATE TABLE statements for missing tables.

There is a lot to be done:

  • The structure of the $schema array needs to be worked out.

  • General data type names and other column and index fields need to be defined (right now the code assumes MySQL-compatible strings).

  • A Schema engine driver for (at least) Postgres needs to be implemented.

  • Much more, I’m sure.

Please let me know if you’d like to help out!