Re: Berkeley DB is a non-relational high-performance system/paradigm - anyone looked at it? from Keith Bostic on 2006-10-29 (public-semweb-lifesci@w3.org from October 2006)

From: Keith Bostic <keith.bostic@oracle.com>
Date: Sun, 29 Oct 2006 08:52:57 -0500
To: public-semweb-lifesci@w3.org
Message-Id: <91DF6521-E3FB-4174-8387-201DCD1A2813@oracle.com>
> ... and notorious for its instability and possible corruption whenever
> you have more than 1 thread/process/machine accessing the same  
> database.
> (For instance, the BDB based backends for Subversion repositories
> quite often got corrupted)

Folks, this is just wrong.

There are two issues here:

First, when people talk of Berkeley DB, they are sometimes referring
to the original Berkeley DB release, version 1.85, which was first  
distributed around 1992.  BDB version 1.85 is still in wide use in  
many, many applications because it's ubiquitous, has a tiny  
footprint, is easy to find, and works well for what it is: a simple  
database engine, supporting Hash and Btree access methods.  BDB 1.85  
doesn't support locking or transactions in any form, and if the  
application or system fails, data corruption or loss is possible.

The current release of Berkeley DB was made a month ago, and is  
Berkeley DB version 4.5.   BDB in 2006 is a fast, scalable,  
transactional database engine with industrial grade reliability and  
availability.

Second, when people talk of Berkeley DB stability, they are sometimes  
referring to the widely-known problems seen when BDB is used as the  
underlying engine for Subversion.  To simplify the issues, the  
problem was the Subversion use of BDB, caused by the fact that  
Subversion/BDB installations did not run transactional recovery after  
application or system failure.  This was an architectural issue: both  
BDB and Subversion are libraries, and there was no way in either  
piece of software to know when transactional recovery was  
necessary.   Since recovery wasn't being run after system or  
application failure, of course instability and data corruption  
resulted.  In February of this year, Collabnet and Berkeley DB  
engineers collaborated on a new set of APIs for the BDB library so  
that transactional recovery would be automatically run after  
application or system failure, which resolved this problem.  For  
details, you can see the Collabnet press release on the topic:

	http://www.collab.net/news/press/2006/sleepycat.html

To be absolutely clear -- the problems with Subversion were NOT  
problems or bugs in Berkeley DB, they were the result of incompatible  
interfaces between two software components.
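To make the recovery point concrete, here is a toy sketch in Python. It is NOT Berkeley DB code and not how BDB is implemented internally; the ToyStore class, the file names and the crash_before_apply switch are all made up for illustration. It only shows the general write-ahead logging idea: if a process dies after logging a commit but before the data file is updated, somebody has to replay the log on the next open, and if nobody does, the data file is out of date.

```python
import json
import os
import tempfile


class ToyStore:
    """Hypothetical key-value store: a JSON data file plus a write-ahead log."""

    def __init__(self, home, recover=True):
        self.data_path = os.path.join(home, "data.json")
        self.log_path = os.path.join(home, "wal.log")
        self.data = {}
        if os.path.exists(self.data_path):
            with open(self.data_path) as f:
                self.data = json.load(f)
        if recover:
            self.recover()

    def commit(self, updates, crash_before_apply=False):
        # Write-ahead: the log record reaches disk before the data file changes.
        with open(self.log_path, "a") as log:
            log.write(json.dumps(updates) + "\n")
            log.flush()
            os.fsync(log.fileno())
        if crash_before_apply:
            return  # simulate the process dying at this point
        self.data.update(updates)
        self._checkpoint()

    def recover(self):
        # Replay every complete log record, then checkpoint.
        if not os.path.exists(self.log_path):
            return
        with open(self.log_path) as log:
            for line in log:
                try:
                    updates = json.loads(line)
                except ValueError:
                    break  # torn final record: that commit never completed
                self.data.update(updates)
        self._checkpoint()

    def _checkpoint(self):
        # Replace the data file atomically, then empty the log.
        tmp = self.data_path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(self.data, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, self.data_path)
        open(self.log_path, "w").close()


def demo():
    with tempfile.TemporaryDirectory() as home:
        store = ToyStore(home)
        store.commit({"a": 1})
        store.commit({"b": 2}, crash_before_apply=True)
        without_recovery = dict(ToyStore(home, recover=False).data)
        with_recovery = dict(ToyStore(home, recover=True).data)
        return without_recovery, with_recovery
```

In demo(), the store opened without recovery does not see the committed update to "b", while the store opened with recovery does. The Subversion situation was the first case: nothing in the stack knew it was time to run recovery.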

I don't want to turn this into a marketing presentation, but given  
how this conversation started, I think it's fair for me to give you a  
couple of examples: Berkeley DB is the database engine behind Sun
Microsystems' LDAP directory server, Google's replicated Single Sign
On service, Openwave's Email Mx product and the Amazon web site.

Yes, that's right: when you log into Amazon, that customized page you  
see is built by roughly 1,000 accesses to Berkeley DB databases. And  
when you log into Google's gmail, your account information is stored  
in Berkeley DB.

And, I can promise you two things: first, that every one of those
products has a lot more than 1 thread or process accessing data at a
time, and second, that none of these companies would be using my
technology if there were better or more reliable technology available!

> But hey, that's to expect when you go for a plain file-store
> instead of something with a server backend. With a
> single-thread-process-host-architecture it could work great.

Yes, Berkeley DB runs on top of the filesystem; it doesn't require a
raw partition on which to run.  That said, that's a feature, not a
bug!  Because BDB has no raw partition, there is never a need to bring
the server down in order to increase the size of one, hot backups and
archival can be done with the standard system tools, and there are no
additional administration requirements.

On top of the filesystem, Berkeley DB provides a transactional engine  
that offers B+tree, Queue and Hash access methods.  The transactions  
are like everybody else's: write-ahead logging, cursors,
multi-version concurrency control, fine-granularity locking, multiple
degrees of isolation, high-availability and fault tolerance through  
replication, and so on.

> Remember also that this is not magic "paradigm", it is just a disk
> based hash table.

This is wrong.  Berkeley DB isn't just a Hash table (in fact, it
never was: even Berkeley DB 1.85 had a B+tree as well as a Hash table).

Berkeley DB offers a B+tree implementation, which is pretty  
standard.  But, Berkeley DB also offers a Queue access method with  
atomic consume operations, as well as an Extended Linear Hash access
method for data sets sufficiently large relative to the cache that  
Hash will out-perform a B+tree.
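For readers who haven't met linear hashing, here is a minimal in-memory sketch in Python of the plain textbook scheme. It is an assumption-laden illustration, not Berkeley DB's Extended Linear Hash implementation: the LinearHash class, its parameter names and the load-factor threshold are all hypothetical. The idea it shows is that the table grows one bucket at a time, splitting buckets in a fixed order, so no single insert ever has to rehash the whole table.

```python
class LinearHash:
    """Textbook linear hashing: grow by splitting one bucket at a time."""

    def __init__(self, initial_buckets=2, max_load=2.0):
        self.n0 = initial_buckets   # bucket count at level 0
        self.level = 0              # number of completed doubling rounds
        self.split = 0              # next bucket to be split in this round
        self.max_load = max_load    # average items per bucket before a split
        self.buckets = [[] for _ in range(initial_buckets)]
        self.count = 0

    def _index(self, key):
        h = hash(key)
        i = h % (self.n0 << self.level)
        if i < self.split:
            # This bucket was already split in the current round, so use
            # the next level's (twice as wide) hash function instead.
            i = h % (self.n0 << (self.level + 1))
        return i

    def get(self, key, default=None):
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        return default

    def put(self, key, value):
        bucket = self.buckets[self._index(key)]
        for j, (k, _) in enumerate(bucket):
            if k == key:
                bucket[j] = (key, value)
                return
        bucket.append((key, value))
        self.count += 1
        if self.count / len(self.buckets) > self.max_load:
            self._split_one()

    def _split_one(self):
        # Split the bucket at the split pointer; its items are divided
        # between the old slot and one new bucket appended at the end.
        old = self.buckets[self.split]
        self.buckets[self.split] = []
        self.buckets.append([])
        self.split += 1
        if self.split == self.n0 << self.level:
            # Every bucket of this round has been split: start the next round.
            self.level += 1
            self.split = 0
        for k, v in old:
            self.buckets[self._index(k)].append((k, v))


table = LinearHash()
for n in range(1000):
    table.put("key%d" % n, n)
```

A lookup touches exactly one bucket, which is why a hash can beat a tree when the working set is far larger than the cache: there are no interior pages to walk on the way down.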

In summary, the Berkeley DB of 2006 isn't your parents' Berkeley DB. :-)

Regards,
--keith

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Keith Bostic
+1-781-259-3139
keithbosticim (ymsgid)
keith.bostic@oracle.com
Received on Monday, 30 October 2006 02:35:08 UTC