It is very difficult to work with UNIX/Linux-based systems without having heard of the ubiquitous dbm family of routines, and most people know there are a number of them, including the newer NDBM, GNU's GDBM, SDBM, and the Berkeley DB abstraction functions, which also provide DBM compatibility.

There is a new kid on the block: Tokyo Cabinet. It has been designed to improve space efficiency (smaller databases), processing speed, and parallelism, and it supports 64-bit architectures, which allows enormous databases if required. Tokyo Cabinet really does appear to be fast, but it is probably better to conduct your own benchmarks, because others have obtained different numbers.

Tokyo Cabinet provides a library of routines to manage a database -- a simple data file containing records, each of which consists of an arbitrary key and value, both of which can be strings or binary data. Records in the database are organized in a hash table, a B+tree or a fixed-length array. A variant of the hash database called a "table database" is also possible. Here, each record is identified by a unique key, and it has a set of named columns. (No, this has nothing to do with SQL.)
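To make the difference between the hash and fixed-length layouts a little more concrete, here is a minimal sketch in C of the two lookup ideas. It is an illustration only: the names bucket_index() and fixed_offset() are made up for this example, FNV-1a merely stands in for whatever hash function Tokyo Cabinet really uses, and the real file format has headers and other details that are ignored here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hash database idea: hash the key bytes, then pick one of bnum buckets.
   The key is a (pointer, size) pair, so binary keys work as well as strings.
   FNV-1a is an assumption for illustration, not Tokyo Cabinet's function. */
uint64_t bucket_index(const void *kbuf, size_t ksiz, uint64_t bnum){
  assert(bnum > 0);
  const unsigned char *p = kbuf;
  uint32_t h = 2166136261u;          /* FNV-1a 32-bit offset basis */
  for(size_t i = 0; i < ksiz; i++){
    h ^= p[i];
    h *= 16777619u;                  /* FNV-1a 32-bit prime */
  }
  return h % bnum;
}

/* Fixed-length array idea: records are numbered from 1 and all have the
   same width, so the position of a record can be computed directly. */
uint64_t fixed_offset(uint64_t id, uint64_t width){
  assert(id >= 1);
  return (id - 1) * width;
}
```

The point of the sketch is the trade-off: a hash lookup accepts any key but has to compute a bucket and then search it, whereas a fixed-length array only accepts numeric record IDs and can jump straight to the right place.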

Documentation is very good, and I recommend you start off by reading a presentation before going on to the more in-depth documentation, including the specs and the Perl API.

An add-on to Tokyo Cabinet called Tokyo Tyrant provides a versatile network interface to a Tokyo Cabinet database. Although your typical DBM-type database is a local-only affair, there are occasions where you will want to provide multiple readers/writers to a single database (not possible with Tokyo Cabinet alone), or you'll want to access a database on a remote machine. Instead of making you design your own network protocol for doing so, Tokyo Tyrant provides a simple API to accomplish the task.

Together with the API (i.e. the library itself), Tokyo Tyrant supplies a set of utilities that surface its API on the command line. For example, I launch the ttserver program in one window, specifying the name of a database I want to manage. (This hash database is created automatically if it doesn't exist.)

$ ttserver mydatabase.tch
    2009-08-26T22:34:10+01:00       SYSTEM  --------- logging started [9274] --------
    2009-08-26T22:34:10+01:00       SYSTEM  server configuration: host=(any) port=1978
    2009-08-26T22:34:10+01:00       SYSTEM  opening the database: mydatabase.tch
    2009-08-26T22:34:10+01:00       SYSTEM  service started: 9274
    2009-08-26T22:34:10+01:00       INFO    timer thread 1 started
    2009-08-26T22:34:10+01:00       INFO    worker thread 1 started
    2009-08-26T22:34:10+01:00       INFO    worker thread 2 started
    2009-08-26T22:34:10+01:00       INFO    worker thread 3 started
    2009-08-26T22:34:10+01:00       INFO    worker thread 4 started
    2009-08-26T22:34:10+01:00       INFO    worker thread 5 started
    2009-08-26T22:34:10+01:00       INFO    worker thread 6 started
    2009-08-26T22:34:10+01:00       INFO    worker thread 7 started
    2009-08-26T22:34:10+01:00       INFO    worker thread 8 started
    2009-08-26T22:34:10+01:00       SYSTEM  listening started

ttserver listens on port 1978 on all host addresses by default; you can change this with command-line switches.

I now add two records to the database from another window:

$ tcrmgr put localhost jp "Jan-Piet"
    $ tcrmgr put localhost time "`date`"

The put adds a key, possibly overwriting an existing key. localhost is the name of the host on which ttserver is listening, and the third and fourth arguments are the key and value respectively.

I can list the keys on the remote database with:

$ tcrmgr list localhost 

retrieve a single value identified by a key

$ tcrmgr get localhost jp

delete individual keys, etc. and also list the content of the database (all keys and their values):

$ tcrmgr list -pv localhost
    jp      Jan-Piet
    time    Wed Aug 26 23:36:58 CEST 2009

As I mentioned above, these commands surface the API on the command line; what you see here (and much more) can obviously be done from within your own application.

And there is more. Let me try to connect to ttserver via HTTP, using curl (note how I'm using the default port 1978):

$ curl http://localhost:1978/a
    Not Found

Hmm. Disappointed? Don't be: I attempted to retrieve a key called a, which doesn't exist. If I do

$ curl http://localhost:1978/jp
    Jan-Piet

there is the value we stored above for the key jp. Wow? Yes: Wow!
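Because ttserver answers plain HTTP, a client does not even need the Tokyo Tyrant library to read a value. The following is a rough sketch in C using nothing but POSIX sockets; the function names build_get_request() and http_get() are my own inventions for this example, the key is assumed to be URL-safe already (no escaping is done), and the response is copied to standard output unparsed, headers and all.

```c
#include <assert.h>
#include <netdb.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Format a GET request for one key. Returns the request length,
   or -1 if buf is too small. The key must already be URL-safe. */
int build_get_request(char *buf, size_t size, const char *host,
                      const char *port, const char *key){
  assert(buf != NULL && host != NULL && port != NULL && key != NULL);
  int n = snprintf(buf, size,
                   "GET /%s HTTP/1.1\r\nHost: %s:%s\r\nConnection: close\r\n\r\n",
                   key, host, port);
  return (n < 0 || (size_t)n >= size) ? -1 : n;
}

/* Connect to host:port, send the request, copy the raw reply to stdout.
   Returns 0 on success and -1 on any error. */
int http_get(const char *host, const char *port, const char *key){
  char req[1024], resp[4096];
  struct addrinfo hints, *res, *ai;
  int fd = -1;
  int len = build_get_request(req, sizeof(req), host, port, key);
  if(len < 0) return -1;
  memset(&hints, 0, sizeof(hints));
  hints.ai_family = AF_UNSPEC;
  hints.ai_socktype = SOCK_STREAM;
  if(getaddrinfo(host, port, &hints, &res) != 0) return -1;
  /* try each resolved address until one accepts the connection */
  for(ai = res; ai != NULL; ai = ai->ai_next){
    fd = socket(ai->ai_family, ai->ai_socktype, ai->ai_protocol);
    if(fd < 0) continue;
    if(connect(fd, ai->ai_addr, ai->ai_addrlen) == 0) break;
    close(fd);
    fd = -1;
  }
  freeaddrinfo(res);
  if(fd < 0) return -1;
  if(write(fd, req, (size_t)len) != (ssize_t)len){
    close(fd);
    return -1;
  }
  /* read until the server closes the connection; a more careful
     client would parse the headers and honour Content-Length */
  ssize_t n;
  while((n = read(fd, resp, sizeof(resp))) > 0)
    fwrite(resp, 1, (size_t)n, stdout);
  close(fd);
  return 0;
}
```

With the ttserver from above still running, calling http_get("localhost", "1978", "jp") should print the HTTP response with the stored value in its body.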

Back to the Tokyo Cabinet API itself. The following program (a slightly modified bit of sample code) reads through our on-disk database, and enumerates key/value pairs:

    #include <tcutil.h>
    #include <tchdb.h>
    #include <stdlib.h>
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    int main(int argc, char **argv){
      TCHDB *hdb;
      int ecode;
      char *key, *value;
      /* create the object */
      hdb = tchdbnew();
      /* open the database */
      if(!tchdbopen(hdb, "/tmp/mydatabase.tch", HDBOREADER|HDBONOLCK)){
        ecode = tchdbecode(hdb);
        fprintf(stderr, "open error: %s\n", tchdberrmsg(ecode));
        return (-1);
      }
      /* traverse records; the iterator must be initialized first */
      tchdbiterinit(hdb);
      while((key = tchdbiternext2(hdb)) != NULL){
        value = tchdbget2(hdb, key);
        if(value){
          printf("%s:%s\n", key, value);
          free(value);   /* returned strings are malloc'ed by the library */
        }
        free(key);
      }
      /* close the database */
      if(!tchdbclose(hdb)){
        ecode = tchdbecode(hdb);
        fprintf(stderr, "close error: %s\n", tchdberrmsg(ecode));
      }
      /* delete the object */
      tchdbdel(hdb);
      return 0;
    }

Note how, when I open the database with tchdbopen(), I use the flags HDBOREADER | HDBONOLCK. The former declares my program a reader, and the latter opens the database without file locking. Tokyo Cabinet defines its locking strategy as follows: I can have many readers or one writer, but not both together. This is the reason for Tokyo Tyrant: I can circumvent the issue by using the client-server model it provides.
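The "many readers or one writer" rule is the classic shared/exclusive lock pattern. Here is a small sketch of that pattern in C using POSIX advisory locks via flock(); it is not Tokyo Cabinet's own locking code (which may well use a different mechanism), and the helper names try_lock(), release_lock() and lock_demo() are hypothetical.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>
#include <sys/file.h>
#include <unistd.h>

/* Try to take a shared (reader) or exclusive (writer) lock without
   blocking. Returns true if the lock was granted. */
bool try_lock(int fd, bool exclusive){
  assert(fd >= 0);
  return flock(fd, (exclusive ? LOCK_EX : LOCK_SH) | LOCK_NB) == 0;
}

/* Release whatever lock this descriptor holds. */
void release_lock(int fd){
  flock(fd, LOCK_UN);
}

/* Open the same file three times and play two readers and one writer.
   Returns 0 if the locks behaved as "many readers or one writer",
   1 if they did not, and -1 if the file could not be opened. */
int lock_demo(const char *path){
  int reader1 = open(path, O_RDWR | O_CREAT, 0600);
  int reader2 = open(path, O_RDONLY);
  int writer = open(path, O_RDWR);
  if(reader1 < 0 || reader2 < 0 || writer < 0){
    if(reader1 >= 0) close(reader1);
    if(reader2 >= 0) close(reader2);
    if(writer >= 0) close(writer);
    return -1;
  }
  bool ok = try_lock(reader1, false)      /* first reader gets in */
         && try_lock(reader2, false)      /* second reader shares the lock */
         && !try_lock(writer, true);      /* writer is refused meanwhile */
  release_lock(reader1);
  release_lock(reader2);
  ok = ok && try_lock(writer, true)       /* writer alone gets in */
          && !try_lock(reader1, false);   /* readers are refused meanwhile */
  close(reader1);
  close(reader2);
  close(writer);
  unlink(path);
  return ok ? 0 : 1;
}
```

A program that skips locking altogether, as my reader does with HDBONOLCK, never asks this question and is therefore never refused; the price is that nothing protects it from a writer changing the file underneath it.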

When the sample program runs, I see the list of key/values contained in the database:

    time:Wed Aug 26 23:36:58 CEST 2009

Tokyo Tyrant can also be built with embedded Lua. The Lua extension allows the database server to read a Lua script file, and clients can call functions defined in those scripts. User-defined functions can then access all of Lua's offerings, in addition to using routines exported by Tokyo Tyrant to log messages, store and retrieve records, etc.

Suppose I have the following Lua script, on the same database I used above:

    function say(key, value)
      _log("About to store " .. key .. "/" .. value, 1)
      _put(key, value)
      return "Thanks"
    end

and launch ttserver with that script

$ ttserver -ext jp.lua mydatabase.tch

I'll now use the ext command in the client to invoke my user-defined Lua function say(); Tokyo Tyrant automatically passes it the key/value pair we give it:

$ tcrmgr ext localhost say surname Mens
    Thanks

The Thanks is what my Lua function returns. Has the record been inserted? Yes, it has, because my say() function uses the built-in _put() routine.

$ tcrmgr list -pv localhost
    jp      Jan-Piet
    time    Wed Aug 26 23:36:58 CEST 2009
    surname Mens

That is Wow!

In addition to all this, Tokyo Tyrant can replicate a database onto an additional server, allowing me to easily create fault-tolerant remote databases. Check the documentation for more on this. Here again, I recommend you start with the presentation, which gives a good overview of Tokyo Cabinet and Tyrant capabilities.

(If you mainly use Lua and want to access Tokyo Cabinet databases, there is also a binding from Lua to Tokyo Cabinet.)

Tokyo Cabinet and Tokyo Tyrant, both written by Mikio Hirabayashi, are very lightweight, written in C, and they provide APIs in Perl, Lua, Ruby, and Java.

Linux, Database, MacOSX, CLI, C, DBM, Replication, and HTTP :: 06 Sep 2009