The Schema Project: Database abstraction, reflection, and migration
Almost a year after first discussing the idea on the Drupal developer’s mailing list I am finally ready to announce the Drupal Schema project. The goal of the project is to provide a database schema abstraction layer for Drupal. The benefits of the project will include more reliable compatibility with a greater range of database systems, reduced install-related code size, centralization of the growing schema information required by or for other modules (such as Views and CCK), and a wide variety of other advanced capabilities that will no doubt arise once the system is in place.
The killer application of the Schema Project’s capabilities, I believe, will be “incremental data migration”: the ability to move almost any subset of a Drupal system (nodes, Views, Workflows, CCK types, configuration settings, etc.) from one live, running system to another. Incremental data migration will make it possible to develop, test, stage, and promote site changes without having to manually re-create configuration and data at each step along the way.
I’ll be at the 2007 OSCMS Summit and hope to gather some momentum for this project while there.
What is database schema abstraction?
A database schema is a description of the structure of a database. For a relational database, the schema specifies the tables, columns, and relationships among them. In Drupal’s case, the schema specifies things like: “A node is identified by a nid and has a title, author, creation time, body, and …”.
A database schema abstraction is some kind of language or representation of a database schema that allows specifying, studying, or manipulating a database schema. Database schemas and their abstractions are a large topic in computer science and software development. There are schema modelling languages, entity relationship diagrams, and all manner of approaches and tools for working with them.
I am not a database systems expert and know very little about the science of database schema abstractions in general (though I imagine I’ll learn more than I wanted to know through this project). I do not want to and cannot move Drupal to the cutting edge database modelling. Instead, my goal is to take an incremental step forward that I believe will provide substantial benefit for relatively minor effort.
What is database schema reflection?
It would be possible (and quite useful) to define Drupal’s database schema in documentation that can only be read by humans. My goal, however, is to provide Drupal with “schema reflection”: the ability use and manipulate its own schema at run-time.
The Forms API provides a perfect example. Before Drupal 4.7, each form was separately constructed directly as HTML. For one module to modify another module’s form, it could either add things at the top or bottom or use string replacement to change things in the middle. This approach was error-prone and inflexible. Starting with Drupal 4.7, forms are represented as structured data until the very last minute. Now, for one module to modify another module’s form, it uses the data structure to access each element and attribute by name or insert new elements at any location. This change to data-driven forms provided an enormous increase in power and flexibility and, not incidentally, made form-generation code much easier to read and maintain.
In current versions of Drupal, database tables are created by explicit CREATE TABLE statements embedded in each module’s .install file. Since every SQL database engine uses different syntax for these statements, a separate copy of each one is required for each database. Each module has a “schema version number” stored in a central table but if a module’s code and schema get out of date (and they can) there is no way for Drupal to notice or recover from the inconsistency.
With schema reflection, Drupal’s current database schema as the modules expect it to be is represented in a single data structure. Each module only needs to define its tables once and the Schema module can create them in whatever DBMS engine is being used. The Schema module can inspect the actual current structure of the database and compare it to the module-generated schema, noticing and warning (or possibly acting) on any changes.
Finally, schema relfection will allow Drupal to act on semantic information about the schema itself. For example, the schema data structure can specify that “the comment table joins to the node table using the nid field as a many-to-one relationship.” The as-yet-undesigned Migration module could then use this information to determine that, in order to migrate a node from one system to another, it must also gather all the comments connected to that node and migrate them as well.
What will the code look like?
The code I have created so far is a very early prototype. I am certain it contains design errors and ommissions. Everything about it is subject to change. I am presenting it here just to give an idea of what I have in mind.
Every module’s tables are defined in a data structure (just like a form), indicating its columns and data types, keys, etc. Not surprisingly, it looks very much like a Views hook_views_tables structure. For example:
$schema['node'] = array(
'cols' => array(
'nid' => array('int(10) unsigned', NOT_NULL),
'vid' => array('int(10) unsigned', NOT_NULL, 'default' => '0'),
'type' => array('varchar(32)', NOT_NULL),
// and so forth
),
'keys' => array(
'PRIMARY' => array('nid, vid', 'UNIQUE'),
'vid' => array('vid', 'UNIQUE'),
'nid' => array('nid'),
// and so forth
));The ‘cols’ array has an element for each column.
Each column specifier includes the mandatory information name, type, and whether or not it can be NULL as well as optional information such as default values.
I have the mandatory information as positional elements simply to reduce typing (to eliminate ‘name’ => $name, ‘type’ => $type, etc.). This may or may not be a good idea.
Optional information can include proprietary DBMS parameters, semantic information such as ‘join n-to-1’ => ‘tablename.column’, module-specific information such as what Views needs to know for fields, filters, arguments, and sorts, and so forth.
The ‘keys’ array has an element for each table index.
Each key specifier includes the name of the key/index, the columns included in the index, and whether or not it is a unique index.
Each module that wants to define tables in the schema implements hook_schema:
function mymodule_schema() {
$schema['#version'] = 1;
$schema['mymodule_table_1'] = array(/* table definition */);
$schema['mymodule_table_2'] = array(/* table definition */);
// etc
return $schema;
}The Schema module invokes hook_schema, collects the results, decide what tables need to be created, and hands those tables off to a DBMS engine driver that translates into appropriate CREATE TABLE statements.
An open question is how schema updates will occur. The simple solution is to continue using the present install approach:
function mymodule_schema_update_2() {
// Explicitly update the schema from version 1 to version 2.
db_add_column('mymodule_table_1', 'new_column');
db_remove_column('mymodule_table_1', 'old_column');
}Another possibility is to include schema version information in the table definitions:
$schema['#version'] = 2;
$schema['mymodule_table_1'] = array(
'cols' => array(
'new_column' => array(...,
'version_added' => 2),
'old_column' => array(...,
'version_removed' => 2),
// etc
));When the Schema module sees that the version for mymodule had changed from 1 to 2, it can scan through the tables looking for things changed in version 2 and automatically add or remove them (or invoke an update callback specified in the array). One advantage of this approach is that it stores current and historical information about schema structure in one place. However, there will probably always be some kinds of schema updates that require custom code.
Core or contrib?
I do not see any reason that the Schema system needs to be part of Drupal core. Any module that wants to participate in Schema defines hook_schema; any module that does not just continues to call db_query(‘CREATE TABLE’) in hook_install. If Schema is successful, over time more and more modules will support it, just a over time more and more modules have added support for Views.
Schema would not be very useful if Drupal’s core tables were not included, of course. The Schema module will provide the schema information for core tables via its own hook_schema.
What does this buy us?
The Schema proeject will provide many benefits:
Several modules (e.g. Views) already need schema reflection; this just centralizes the information and makes it much easier to create new modules that need reflection. A module that wants to support Views or other similar modules can store any necessary data in its $schema array. In fact, many modules will probably not have to do anything more to support such reflection-requiring modules.
Code size reduction and elimination of duplicate code. The automatically-generated schema for core is only slightly larger than all the explicit CREATE TABLE statements for a single database engine, but the single schema array will support multiple engines instead of requiring a duplicate set of CREATE TABLE statements for each engine.
The Schema module can use database inspection to validate the database layout against the defined schema, reporting errors and/or disabling any modules whose tables are not what they expect.
It can provide hook_schema_alter, allowing modules to alter other modules’ tables in a clean way. For example, a node-enhancing module could add additional columns to the node table, eliminating the need to join against its own custom table.
As already described, it is a huge step towards enabling incremental data migration across Drupal systems.
The Forms API made all kinds of things possible that were not before, many of them not foreseen until after FAPI existed. Schema will, too.
Arguable, this approach is somewhat more complicated than writing a single MySQL CREATE TABLE statement. Supporting Views is extra effort, too, but has a big payoff for module developers; so will this. I assert that it is easier to create a structure like this once than it is to write and maintain separate SQL statements for MySQL, Postgres, Oracle, Sybase, MS SQL Server, and god knows what else.
Current status and call for contributors
The Schema module currently knows how to:
Inspect a MySQL database to generate a schema array for existing tables and indexes (and, quite helpfully, output the PHP code to build the $schema array for those tables).
Call hook_schema for all modules that provide it.
Identify which tables returned by hook_schema do not exist in the database.
Generate and execute MySQL CREATE TABLE statements for missing tables.
There is a lot to be done:
The structure of the $schema array needs to be worked out.
General data type names and other column and index fields need to be defined (right now the code assumes MySQL-compatible strings).
A Schema engine driver for (at least) Postgres needs to be implemented.
Much more, I’m sure.
Please let me know if you’d like to help out!
Comments
There is a lot of interest
There is a lot of interest in the "killer app" aspect of migration between test/staging/production etc.
Why wouldn't you define
Why wouldn't you define these meta-schemata in XML? It seems that this is precisely what XML was created for.
I have a few answers to your
I have a few answers to your question:
1. The goal of this project is to allow Drupal to access and use information about its own schema (what I'm calling "schema reflection"). Sure, we could define the schema in XML instead of PHP arrays, but then we would just need to write some code to translate it from XML into PHP to gain the advantages of reflection. That wouldn't be too hard, but why bother? (*)
2. Drupal programmers are used to coding in PHP. They write their modules in PHP and define all their forms in structured PHP arrays. Not all of them are familiar with XML. I think the value of sticking to existing languages and paradigms outweighs the benefit of using XML just for the sake of using it.
3. A schema defined in XML would logically go into a different file than the PHP code which uses the schema. The schema update mechanism, as I described, is likely/certain to require custom PHP code in at least some cases. Having the schema definition and schema update code in different files separates them where I think it would be better to keep them together.
All that said, the potential upside ((*) i.e. "why bother?") to using XML is that it will probably lead better towards higher-level schema-description systems (ERDs, UML, whatever). As I said in my article, however, those higher-level systems are not my short-term goal. I'd like to create a system that the Drupal community will embrace and use. Too much abstraction or required new learning upfront will prevent that.
Thanks for the feedback!
There's actually a new
There's actually a new PHP-DB-Abstraction mailing list that is discussing DB-agnostic XML-based schema definitions as we speak. You may want to look into getting involved with them, too.
sounds very promising,
sounds very promising, barry. you are right that partial data migration is a killer app.
in addition to Views, you'll want to reuse relevant code and ideas from these projects:
- importexportapi. especially their definitions.
- MDB2 is database abstraction layer that is not that helpful, but MDB2 schema is an accompanying DB independant schema definition layer. Thats pretty nice. It is XML, which is non optimal compared to PHP arrays as you noted. But still worth borrowing ideas and maybe code. There are lots of drivers and tools available for this.
Are you proposing this as
Are you proposing this as part of the Drupal core, or as a contrib module?
Contrib, at least initially.
Contrib, at least initially. From the article:
It is possible that Schema will require some patches, new hooks, or whatnot in core in order to work. Time will tell.
Schema necessary
Schema necessary precondition for migration?
I am very very interested in the incremental migration aspect of your work.
The Repligard tool (http://www.midgard-project.org/documentation/concepts-repligard/) of the Midgard CMS is the only such tool in open source CMSses. In a nutshell, this tool uses schema information in XML files and relies on a separate repligard table that maintains globally unique IDs (GUIDs) for content objects and replicates content through XML files.
Another, commercial CMS that I am familiar with, SixCMS (http://sixcms.de), uses a very similar aproach based on GUIDs but allows incremental replication to use SOAP requests between a source and a target CMS.
The use of GUIDs needs to be considered when doing application development:
- Updates to content objects need to be tracked in the repligard table. This may only be valid for deletes.
- No longer is it permissable to use numeric IDs which are only locally valid but will change once a piece of content has been migrated. Instead, handcrafted content references need to use GUIDs.
As Moshe mentioned above, Drupal already has the importexportapi. As far as I am aware however, this does not maintain GUIDs. Adding these to Drupal will be the most important step.
Looking forward to you next moves.
--
Olav
Olav, thanks for your
Olav, thanks for your feedback! Ironically, just today I started writing a post for this site explaining my ideas for the Migrate module (it isn't done yet) and, yes, I do think Schema is a necessary precondition for doing it well. I am aware of the issue with migrating IDs and have a plan for it.
I was unaware of Midgard's Repligard tool and will certainly look into it.
Post new comment