RainStor: Making Hadoop Fly

by Robin Bloor on February 21, 2012

RainStor claims to offer the most effective method of compression of all the “Big Data” databases, and in all probability it does. At the heart of any database is the engine that gets you data. Its role is to get you (or to be exact, your query) the data it wants as fast as possible. At the same time, because a database is a busy little beaver, it has to effectively manage every other query that has been thrown at it.

The strategy that the traditional relational database management system (RDBMS) product adopted was to store data on disk in indexed structures (usually B-trees) and then pull the data in as needed, often reading it from the index. Additionally, it cached data in memory and tried to make the most efficient use possible of the available CPU power. That kind of database architecture is fast becoming outdated. It worked fine for transactions, and fine for data warehousing, up to a point. The point where it broke was when data volumes grew too large for that arrangement of data to deliver adequate response times. In short, it didn’t scale.

So now there is a plethora of new database products emerging with distinctly different engines that scale far better than the old RDBMS products do. While a few of them are similar in their approach to achieving scalability, many of the new kids on the block are distinctly different. RainStor is one such product.

At the heart of RainStor lies a unique approach to data compression. To appreciate RainStor’s architecture, you need to have some understanding of how it compresses data. This is illustrated in a simple way in the diagram below.

RainStor Compression: A Simplified Diagram

Instead of storing database tables as tables, RainStor converts the tables into a kind of tree structure. The top diagram shows the first record in a table being stored; First Name – Peter, Last Name – Smith, Company Classification – Automotive, Salary – $40,000. Not only are the values stored but also the links between these values, which preserve the fact that it is a distinct row. The second diagram shows the way the second record is stored; Paul, Smith, Finance, $35,000. Because the value Smith already exists it doesn’t need to be stored; it just needs to be linked to. The values are colored green in the diagram for the sake of clarity. The bottom diagram shows the addition of the third record, in yellow; John, Brown, Pharmaceutical, $40,000. This process continues with the addition of every new record. If a completely new table is added to the database, a new tree is built. RainStor employs additional techniques to further compress the data, by applying byte level compression within field values, but the fundamental approach is as we have described.
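The value-sharing idea can be illustrated with a minimal Python sketch. It is not RainStor's actual storage format: it swaps in plain per-column dictionary encoding, and the class name `DedupTable` and its methods are hypothetical, chosen only for illustration. Each distinct value is stored once per column, and each row is kept as a tuple of links (integer ids) to those values, using the three records from the diagram.

```python
class DedupTable:
    """Toy value-deduplicated table: each distinct column value is stored once.

    Illustrative only; this is per-column dictionary encoding, not RainStor's format.
    """

    def __init__(self, columns):
        self.columns = list(columns)
        # Per column: value -> integer id.
        self.value_ids = {c: {} for c in self.columns}
        # Per column: id -> value (the single stored copy of each value).
        self.values = {c: [] for c in self.columns}
        # Each row is a tuple of ids, i.e. the links that keep the row distinct.
        self.rows = []

    def insert(self, record):
        links = []
        for col, value in zip(self.columns, record):
            ids = self.value_ids[col]
            if value not in ids:
                # First time this value is seen in this column: store it.
                ids[value] = len(self.values[col])
                self.values[col].append(value)
            links.append(ids[value])
        self.rows.append(tuple(links))

    def stored_values(self):
        """Number of values physically stored (sum of column cardinalities)."""
        return sum(len(v) for v in self.values.values())

    def select(self, column, value):
        """Return full records whose `column` equals `value`, by following links."""
        vid = self.value_ids[column].get(value)
        if vid is None:
            return []
        idx = self.columns.index(column)
        return [
            tuple(self.values[c][i] for c, i in zip(self.columns, row))
            for row in self.rows
            if row[idx] == vid
        ]


table = DedupTable(["First Name", "Last Name", "Classification", "Salary"])
table.insert(("Peter", "Smith", "Automotive", 40000))
table.insert(("Paul", "Smith", "Finance", 35000))
table.insert(("John", "Brown", "Pharmaceutical", 40000))

smiths = table.select("Last Name", "Smith")
```

In this sketch the three records contain twelve field values, but only ten are stored, because Smith and 40000 each appear twice. The saving grows as more rows repeat values that are already present.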

By using this method of organizing the data RainStor reduces the space it requires considerably, because you only store as many values for any column as the cardinality of the column. This compression technique has the additional advantage that data ingest is extremely fast.

As far as we are aware, RainStor’s data compression scheme is at least four times as economical as, for example, the typical column store database. And compression is not the only advantage of storing the data in this way. RainStor can retrieve the results of any query on the data simply by “walking the tree.” So it compresses the data considerably without damaging its ability to query it. There are many ways to compress data, and they all carry a penalty when data is decompressed. With RainStor the penalty is very small.

RainStor and Data Archiving

After its initial release, RainStor quickly established a market for itself in the area of data archiving/data retention, where it offered a unique capability: what can be thought of as on-line archiving. Data was stored in RainStor, where it was held economically on disk, but was still available for query through RainStor’s SQL interface. Used in this mode, data that was archived from a typical relational database could be compressed to occupy (roughly) one fortieth of the disk space it previously consumed. It was an attractive alternative to relegating data to tape back-ups where it was no longer accessible quickly, if at all.

RainStor could have been used, and indeed can be used, as an in-memory database, but at the time of its first release (several years ago) there was little enthusiasm for in-memory database products, so the company pursued a niche market where no other product could match its capabilities. The opportunity for RainStor broadened significantly with the advent of Hadoop – allowing it to enter a market where its underlying architecture and method of storage can pay huge dividends.

RainStor and Hadoop

RainStor was quick to take advantage of Hadoop. Because of its tree-based data structures, RainStor can easily distribute data across multiple servers. RainStor’s processing kernel is very light-weight and thus it is possible to run it on every server. So as long as you break down SQL requests and any associated processing to the individual server level, you have a highly scalable configuration.

In its Hadoop-based implementation, RainStor simply distributes its data trees through the Hadoop file system (HDFS), which has, as you may be aware, built-in redundancy and hence can be recovered quickly in the event of any node failure. RainStor maintains a common metadata map that it uses to break down requests written either in SQL or Apache PIG (an analytics capability) or both. It distributes subqueries to the nodes in a Hadoop cluster or grid, processes them locally at each node and then aggregates the partial results to produce the answer. In such a configuration, RainStor’s benchmarks suggest that it will outperform a MapReduce configuration by a wide margin.
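The distribute, process locally, then aggregate pattern described above can be sketched in a few lines of Python. This is a simulation under stated assumptions, not RainStor or Hadoop code: the node names, the `partitions` data, and the functions `local_subquery` and `average_salary` are all hypothetical, and a thread pool stands in for the cluster nodes. Each node returns a partial (sum, count) for an average-salary query, and the partials are combined at the end.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data partitions, one per node: (classification, salary) rows.
partitions = {
    "node-1": [("Automotive", 40000), ("Finance", 35000)],
    "node-2": [("Pharmaceutical", 40000), ("Finance", 45000)],
    "node-3": [("Automotive", 50000)],
}


def local_subquery(rows, classification):
    """Runs against one node's rows: partial (sum, count) of matching salaries."""
    salaries = [salary for cls, salary in rows if cls == classification]
    return sum(salaries), len(salaries)


def average_salary(classification):
    """Scatter the subquery to every partition, then aggregate the partials."""
    with ThreadPoolExecutor() as pool:
        partials = list(
            pool.map(
                lambda rows: local_subquery(rows, classification),
                partitions.values(),
            )
        )
    total = sum(partial_sum for partial_sum, _ in partials)
    count = sum(partial_count for _, partial_count in partials)
    # No matching rows anywhere in the cluster: there is no average to report.
    return total / count if count else None


finance_avg = average_salary("Finance")
automotive_avg = average_salary("Automotive")
```

Returning (sum, count) pairs rather than per-node averages is what makes the final aggregation correct when nodes hold different numbers of matching rows.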

The speed that RainStor is capable of varies according to how much of the data any given workload needs to access. For example, if it is a batch query that touches all data, RainStor outperforms MapReduce by a factor of about three. However, for lighter ad-hoc queries against the same data, it is much faster, possibly by two orders of magnitude (i.e., a factor of 100).

There are two other advantages that RainStor provides in such a context, aside from blistering speed.

  • First, far fewer servers are needed because RainStor is so economical in its use of disk space.
  • Secondly, there is no need for a programmer to write Java code or grapple with the complexities of the MapReduce approach. Queries can all be presented in PIG and SQL.

It is also possible to use RainStor in a distributed in-memory manner (backing data off to disk) in such a configuration, if you’re willing to invest in the memory and you desperately need the speed.

Those companies that have a genuine need for the large scale processing that Hadoop is capable of would do well, in our opinion, to take a close look at RainStor.

