Friday, August 28, 2009

Maintaining Data Integrity

I promised several posts ago that I'd discuss some of the interesting internals of Antichrist Watch in more detail.

I've since gotten a bit off track, posting three entries about some of the Repoze stuff I've been doing lately. As I've been getting deeper and deeper into other projects I realize I'd better catch up and make good on my original promise or else I may never get back to it.

Today I figured I'd cover the principles behind Antichrist Watch's back-end database connection. It's using PostgreSQL and making heavy use of some of the features that many other database servers lack. In particular, PostgreSQL has great support for embedded logic (also known in database land as stored functions or stored procedures); not only does it have its own PL/pgSQL (similar to Oracle's PL/SQL) it also supports embedding code in other languages like Python, Perl, Tcl, and others besides. This embedded code can be tied into constraints that get executed (roughly) prior to a SQL command, rules that get executed (roughly) during a SQL command, and triggers that get executed after a SQL command. Additionally these stored procedures can be called directly allowing not only the effective batching of SQL commands but also multiple SQL commands bound together logically (that is, it's trivial to make a stored procedure execute one list of SQL commands under one set of conditions but execute another under a different set of conditions), or fashioned into views that behave like tables.

This combination gives the designer two big advantages over less capable database systems: first, it becomes possible to embed code to help guarantee data integrity within the database itself by careful use of constraints, rules, and triggers; and second it becomes possible to hide the database's implementation details from any client code by creating a public API via stored procedures. Thus we're basically using information hiding and restricting data access to certain controlled channels. I'm guessing that anyone who has read this far will understand in principle why these are good things. I'll carry it further though and list some specific benefits obtained by using this type of database design.

No raw tables are accessed in Antichrist Watch. It has a database access API fashioned from stored procedures and views. This level insulates the client from the raw tables; right now internally it is heavily normalized and even uses table inheritance to avoid repetition, but if at some point down the road it became necessary to duplicate some information for the sake of performance, the tables could be completely redesigned without requiring any change in client code whatsoever. Furthermore, with a clean interface it's easy to change clients. Antichrist Watch has had three completely different front-ends already. My original version was hacked together quickly from basic Python; my second version used PHP; the current version uses Twisted and Zope Page Templates (via Repoze Chameleon). At each stage making the transition was reasonably easy. It would have been a bear (possibly even unthinkable) if the interface were using raw tables and the client was responsible for data integrity.

On that note it must be said that data integrity is also improved. I'm a big believer in trying to enforce data constraints as close to the data itself as practically possible. The further away the constraint enforcement is from the data, the more space there is for loopholes, security holes, and logic errors to breed. Antichrist Watch has all data constraints and relationships enforced by the database itself. Even if someone with malicious intent were to somehow get beyond the client-side JavaScript and server-side Twisted Python (or some bugs were to crop up in my code, be it client-side or server-side), the allowed options on the database side are restricted limiting the potential damage that could be done. Basically it would be possible for fake data to get inserted, but it would not be possible for invalid data to get inserted, and this can mean the difference between a minor failure (accompanied by some embarrassment) and a critical failure (accompanied by a system crash). As an added bonus overall performance also often improves by moving such data constraint enforcement into the tightly optimized world of the database server itself.

Please note that I'm not advocating the moving of all logic out of things like Zope, Twisted, or PHP into the database server itself. This would be counterproductive. Tools for developing in languages like PL/pgSQL tend to be more primitive than tools for developing in languages like unembedded Python, and application logic definitely belongs outside the database. What I am advocating is making the database responsible for its own data checking while hiding its source internals.