Build your own Google

In an earlier post I talked a bit about a start-up called Kickfire, which is releasing an accelerated database appliance for MySQL applications in the 1TB-3TB range. It offers performance comparable to an equivalent Oracle RDBMS setup for around a quarter of the price.

Kickfire sounds compelling for utility LAMP computing applications that need to be fast and cheap, but it doesn’t address ultra-scalable and cheap, at least not yet. If you want to build some big Web 2.0-type application, such as a Facebook, a Yahoo or a Google, or even something very storage and database intensive like a bioinformatics application or geophysical data modeling, you are going to need to store very, very large amounts of data, in the hundreds of terabytes.

[Image: Bee Movie. Credit: DreamWorks Animation LLC]

Another new company, Aster Data, formed by a group of computer science PhDs from Stanford University, is taking an open source-enabled appliance approach similar to Kickfire’s, but is applying it to horizontal database scalability for roll-your-own data warehousing applications.

Aster scales databases with the same philosophy that “Beowulf” clusters apply to compute-intensive applications such as CGI render farms, DNA sequencing, high-energy physics and weather simulation. A Beowulf cluster distributes an application’s memory and CPU work over a large number of nodes using a message-passing parallelism API like MPI-2: slave nodes boot over a high-speed network (such as Myrinet, multiple bonded gigabit interfaces or 10GigE) from a master node that issues instructions, and they use shared storage. Aster instead uses a massively parallel “Beehive” approach for database storage, where “Worker” nodes, each with their own storage, CPU and memory, are provisioned and controlled by “Queen” and “Loader” nodes.

The “Queen” is a Linux appliance with all the software needed to give birth to new Workers, which are tweaked-out PostgreSQL drones that share a distributed copy of your database, much like the way striped and parity data is stored on RAID drive arrays. All your IT staff has to do is add a bunch of totally virgin commodity servers to your network and boot them via PXE from the Queen, and she does all the work of installing new OSes on your Workers and building out your database schema. The result is a highly distributed and highly available database that can scale easily into the hundreds of terabytes.
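
To make the RAID analogy a bit more concrete, here is a minimal Python sketch of how rows might be spread over Workers with a second copy kept on a neighboring node. This is not Aster's actual placement algorithm, which isn't described here; the function names (worker_for, place_row, partition), the MD5-based hash and the "primary plus next neighbor" replica rule are all assumptions made up for illustration.

```python
import hashlib

def worker_for(key, num_workers):
    """Map a row key to a worker index using a stable hash (illustrative choice)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_workers

def place_row(key, num_workers, replicas=2):
    """Return the workers that hold a row: the primary plus the following neighbors."""
    primary = worker_for(key, num_workers)
    copies = min(replicas, num_workers)  # cannot have more copies than workers
    return [(primary + i) % num_workers for i in range(copies)]

def partition(rows, num_workers, replicas=2):
    """Distribute (key, value) rows over workers, keeping `replicas` copies of each."""
    workers = {i: [] for i in range(num_workers)}
    for key, value in rows:
        for w in place_row(key, num_workers, replicas):
            workers[w].append((key, value))
    return workers

if __name__ == "__main__":
    rows = [("user%d" % i, i) for i in range(10)]
    for worker, held in sorted(partition(rows, num_workers=4).items()):
        print(worker, [key for key, _ in held])
```

With two copies of every row on different nodes, losing a single Worker in this toy model loses no data, which is roughly the property the striping-and-parity comparison is getting at.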


Google and other large Web 2.0 operations have had to design totally proprietary, specialized systems in order to scale their databases that large. With Aster, all that scalability is completely transparent, because your front-end apps and your middleware business logic are the same as they always were on monolithic systems: applications communicate with the Queen via standard ODBC and JDBC interfaces, so you don’t have to do a ton of re-coding to stand up your own Hive. The “Loader” nodes are responsible for partitioning and loading datasets onto the Workers, and they communicate with data federation services.
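
The division of labor between Loader, Workers and Queen can be sketched as a toy scatter-gather in Python. To be clear about the assumptions: in-memory SQLite databases stand in for the PostgreSQL Workers, the page_views table and the modulo partitioning rule are invented for the example, and the real product's query planning is certainly more involved than summing partial results.

```python
import sqlite3

def make_worker(rows):
    """Create an in-memory SQLite database standing in for one Worker node."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE page_views (user_id INTEGER, views INTEGER)")
    conn.executemany("INSERT INTO page_views VALUES (?, ?)", rows)
    return conn

def load(rows, num_workers):
    """Loader role: partition rows by user_id modulo the worker count, then load them."""
    shards = [[] for _ in range(num_workers)]
    for user_id, views in rows:
        shards[user_id % num_workers].append((user_id, views))
    return [make_worker(shard) for shard in shards]

def total_views(workers):
    """Queen role: run the same aggregate on every worker and merge the partial sums."""
    partials = []
    for conn in workers:
        # COALESCE makes an empty shard report 0 instead of NULL.
        (partial,) = conn.execute(
            "SELECT COALESCE(SUM(views), 0) FROM page_views"
        ).fetchone()
        partials.append(partial)
    return sum(partials)

if __name__ == "__main__":
    rows = [(1, 10), (2, 5), (3, 7), (4, 1), (5, 2)]
    hive = load(rows, num_workers=3)
    print(total_views(hive))  # prints 25
```

The caller only ever talks to total_views and never learns how many shards sit behind it, which is the sense in which the scaling is transparent to the application.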

Aster Beehives are such a compelling technology that they have already been noticed by the big guys. MySpace already uses them for storing large amounts of data and issuing huge numbers of transactions, and Sequoia Capital, one of the original financial backers of Google, has financed the company through its A-round of venture capital financing.

Now, if only Kickfire and Aster could join forces. Then you’d really have something.