I've been reading some interesting grumbling about the Prevayler project and the concept it espouses of "object prevalence", in particular from Ted Neward and Mike Spille (and the growing posse of commenters).
Rather than join in the comment stream, I thought I'd summarise my views here.
A lot of the arguments seem to bounce back and forth between "here are a bunch of reasons why Prevayler is worse than a database", and "here are a bunch of reasons why Prevayler is better than a database". Along the way there is a lot of nitpicking about how much memory might be needed by an application, whether you get more "future-expansion" with a database, and so on. In short, a lot of head-bumping.
It seems to me that a wiser route may be to take for granted that there are likely to be situations where a relational database is a good choice, situations where some sort of "prevalent" approach is a good choice, and also situations where other options (plain text files, XML files, object database, JavaSpaces, paper index cards, no persistence at all, etc.) are a good choice. With this in mind, let's look for some situations where "prevalence" might be a useful component of a system.
In any system with a "master copy" and one or more "shadow copies" issues arise about how to keep the shadows in sync with the master. These issues include things like how to transfer information, how the shadows get to know that the master has changed, how up-to-date the shadows need to be, what happens when a new shadow is created, and so on. One way of thinking of "prevalence" is as a system where the in-memory representation is the "master copy". This is in contrast to typical database persistence where the information on the filesystem, managed by the database server, is the "master copy".
Everyone who has created a database-backed application is aware of these issues. Tough choices between caching efficiently-retrieved large chunks of data (but risking it getting "stale" if the underlying database changes) and clogging the app/database channel by fetching each data item every time it's needed. Worrying about whether (and how often) to poll the database for changes to cached data. Struggling to reduce response times by tuning queries. And so on. Recommending database solutions for their large capacity and shareability but glossing over these problems is very dangerous, but nonetheless common. When did you last see serious discussion of these kinds of issues on the web site of a database vendor? The same kind of optimism can be found among advocates of "prevalence". They point out the speed of fetching data from memory, and the way it gets round some of the problems of databases, without stopping to highlight the potential "show-stoppers".
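To make the caching trade-off above concrete, here is a minimal sketch of a "shadow copy" that trusts what it has cached for a fixed time. Everything in it is made up for illustration: `MasterStore` stands in for the database, `ShadowCache` for the application's cache, and `max_age` for however long you are prepared to risk stale data. It is not how any particular database driver or cache library works.

```python
import time


class MasterStore:
    """Hypothetical stand-in for the external "master copy" (e.g. a database)."""

    def __init__(self):
        self.rows = {}
        self.reads = 0  # counts round trips to the master

    def read(self, key):
        self.reads += 1
        return self.rows[key]

    def write(self, key, value):
        self.rows[key] = value


class ShadowCache:
    """Hypothetical "shadow copy": reuses a cached value until it is max_age old."""

    def __init__(self, master, max_age, clock=time.monotonic):
        self.master = master
        self.max_age = max_age
        self.clock = clock
        self.entries = {}  # key -> (value, time fetched)

    def get(self, key):
        now = self.clock()
        entry = self.entries.get(key)
        if entry is not None and now - entry[1] < self.max_age:
            return entry[0]  # fast, but possibly stale
        value = self.master.read(key)  # slow, but current
        self.entries[key] = (value, now)
        return value
```

A small `max_age` means more trips to the master; a large one means the shadow can carry on serving an old value long after another application has changed it. With the master in memory and private to the application, neither knob is needed.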
There seems to be a certain amount of (conscious or unconscious) selection of battlegrounds between the "prevalence" and "database" camps, though.
One of the biggest benefits of something like a database is that it can be accessed in many different ways by different applications. In some ways this attribute is shared by other externally-specified data storage (such as CSV or XML files, JavaSpaces, WebDAV or CVS, and so on). Data persistence that needs to provide stable access to multiple, unrelated applications naturally suits making the stored data the "master copy". There are obviously many classes of application (such as typical once-a-week or once-a-day business reports) that don't need to concern themselves with the master/shadow issues mentioned above. Another common benefit of a database is the capacity. With the likes of RFID tracking and on-line catalogue browsing generating massive streams of data, holding all of this in a mere few gigabytes of memory would be impossible. Luckily, terabyte disk arrays are commonplace and (relatively) cheap, making external storage of such bulk data a natural choice, especially when reading and processing of such "logged" data is relatively infrequent.
So, if your problem implies multiple unrelated accesses, large data volumes, or a large proportion of writes compared with reads, choosing an "external master" system such as a database seems a good choice. Arguing the merits of an "internal master" system such as "prevalence" is unlikely to be successful or worthwhile in these cases.
One of the biggest benefits of "prevalence" is that the master data is private to the application. A "prevalent" system can safely assume that the only changes to its data will come via itself. The code does not have to defend against external applications tinkering with its data while in the middle of operations. "External master" data storage such as databases and flat files is by nature public, and making assumptions that what you put in a while ago is still the same now has caused many a hard-to-find application bug. When an application has complete control over the state of the data it can afford to run fast and loose.
So, for applications which completely manage their own data, don't have huge amounts of data to manage, and have a large proportion of reads compared with writes, an "internal master" system such as "prevalence" is a natural winner.
I would imagine that many hard-core database developers will find it hard to believe that there are many real-life cases that fit the niche for "prevalent" persistence. Every day they deal with large data repositories, each of which has many different processes accessing it. It's easy to assume this is the great majority of applications. I would like to suggest that this may not be the case. To see why, I suggest (ironically enough) considering the common uses of the very popular MySQL database.
MySQL is installed as standard with most Linux systems, and is the default persistent data storage system for many development languages. When you look at the data stored in these MySQL databases, it becomes apparent that a huge number of them have at most a few tens or hundreds of kilobytes of information stored in them. MySQL forms the data storage for wikis, blogs, to-do lists, bulletin boards, bug trackers, DNS servers, LDAP servers, message queues, mail delivery servers and an almost infinite array of similar small, self-contained, low-data-volume applications. These applications are not usually developed and maintained by database programmers and DBAs, but I bet the same DBAs and database programmers use them every day.
Typically, MySQL is used as the persistence layer for these applications because it is available and generally reliable. It's somewhere to put some stuff where it can be found again later. However, adding MySQL support to an application is tricky, even when it's built into a development language or environment. Mapping an arbitrary arrangement of data to a clean relational schema is not a trivial task. So there are lots more of the same class of application that either take a faster and simpler (but more risky) approach of accumulating changes in memory and occasionally writing state out to flat files, or take the performance hit of writing a complete updated flat file for every state change.
To me, these are the class of applications that would most benefit from a "prevalent" approach. They gain the fast response and programming flexibility of an in-memory master, and don't hit the problems of parallel unrelated access and large data volumes.
I think you have missed the point of "prevalence" a bit, Jeroen. In a "prevalent" system, the RAM remains the master copy, but each time anything changes a small note of the change is written to external storage. If the computer crashes (or whatever), all that is needed to recreate the state of the RAM at the crash is to re-run the saved change notes. If the change-note list grows too large, a complete "snapshot" of memory can be taken as a new starting point for a new set of recorded changes.
The classic analogy is that of an accounting system where you don't need to write out all the data after every transaction, just the details of that transaction (when, how much, from where to where).
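The change-note idea, applied to the accounting analogy, might look something like the sketch below. This is a toy under stated assumptions, not Prevayler's API: the names `PrevalentLedger`, `transfer`, `snapshot`, `journal.log` and `snapshot.json` are all invented here, change notes are stored as one JSON object per line, and the "when" part of each transaction is left out to keep it short.

```python
import json
import os
import tempfile


class PrevalentLedger:
    """Toy ledger: RAM is the master copy, disk only holds change notes."""

    def __init__(self, directory):
        # File names are hypothetical, chosen for this sketch only.
        self.journal_path = os.path.join(directory, "journal.log")
        self.snapshot_path = os.path.join(directory, "snapshot.json")
        self.balances = {}  # the in-memory master copy
        self._recover()

    def _recover(self):
        # Start from the last snapshot (if any), then re-run the change notes.
        if os.path.exists(self.snapshot_path):
            with open(self.snapshot_path) as f:
                self.balances = json.load(f)
        if os.path.exists(self.journal_path):
            with open(self.journal_path) as f:
                for line in f:
                    self._apply(json.loads(line))

    def _apply(self, note):
        amount = note["amount"]
        self.balances[note["from"]] = self.balances.get(note["from"], 0) - amount
        self.balances[note["to"]] = self.balances.get(note["to"], 0) + amount

    def transfer(self, src, dst, amount):
        note = {"from": src, "to": dst, "amount": amount}
        # Record the small change note on disk first, then change memory.
        with open(self.journal_path, "a") as f:
            f.write(json.dumps(note) + "\n")
            f.flush()
            os.fsync(f.fileno())
        self._apply(note)

    def snapshot(self):
        # Write the whole state out, then start a fresh, empty journal.
        tmp_path = self.snapshot_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(self.balances, f)
        os.replace(tmp_path, self.snapshot_path)
        open(self.journal_path, "w").close()


def demo(directory):
    ledger = PrevalentLedger(directory)
    ledger.transfer("cash", "rent", 500)
    ledger.transfer("sales", "cash", 1200)
    ledger.snapshot()
    ledger.transfer("cash", "supplies", 75)
    # Simulate a restart: a new object rebuilds its state from disk alone.
    recovered = PrevalentLedger(directory)
    return ledger.balances, recovered.balances


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        live, recovered = demo(d)
        print(live == recovered)
```

Reads never touch the disk at all, and each write costs one short appended line. The sketch ignores several things a real system would have to handle, such as a crash halfway through `snapshot` and more than one thread making changes at once.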