MarkLogic: A Big Data DBMS of a Different Kind

by Robin Bloor on February 29, 2012

We can think in terms of there being two strands of database technology: relational databases and object databases. In the area of transaction processing, relational databases have dominated over the past few decades, and they have also dominated the world of the data warehouse. However, it has not been entirely one-sided. Object databases did quite well for transaction processing in niche areas where extreme performance was called for, and they have also done well in the area of awkwardly structured data – by which we mean data that doesn’t fit easily into relational table-based structures.

It’s also the case that some of the “new NoSQL” databases are document stores (object databases by another name) suited to storing documents, web pages and other such nested structures. This is undoubtedly a consequence of some organizations accumulating Big Data heaps that embody awkwardly structured data.

What Does MarkLogic Do?

MarkLogic could be classified as a NoSQL database, but the reality is a little more complicated than that. MarkLogic is a reasonably mature product, now in its sixth release, rather than one that recently appeared by magic from nowhere. It has many established customers. For example, the Federal Aviation Administration deploys MarkLogic for its Emergency Operations Network, McGraw-Hill uses the product for document processing and JPMorgan Chase runs MarkLogic for derivatives processing. None of these are straightforward applications that could be handled by a scale-out Big Table kind of database.

MarkLogic can accurately be described as a rich media database, in the sense that it handles textual data and a whole host of rich media formats (about 200 according to the company) as well as the structured data that normally inhabits relational databases. MarkLogic uses XML-encoded documents to establish its core data model, and as such it is truly an XML database. In practice this means that it can happily store financial contracts and other complex contracts, medical records, legal filings, PowerPoint presentations, blogs, tweets, press releases, user manuals, whole books, articles, web pages, message traffic, sensor data, and emails. These are all considered to be documents from MarkLogic’s perspective, and they can be ingested directly if they are already formatted as XML documents. And if they are not, they can be quickly transformed into XML documents during the ingest process. Because of the XML mark-up, data can be extracted directly by the database from any kind of object.
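The transform-to-XML-on-ingest idea can be sketched in a few lines of plain Python. This is purely illustrative, not MarkLogic’s actual ingest API; the field names and the `record_to_xml` helper are our own invention.

```python
# Illustrative sketch only: wrapping a plain record in an XML envelope at
# ingest time, in the spirit of "transform to XML during the ingest process".
import xml.etree.ElementTree as ET

def record_to_xml(record, root_tag="document"):
    """Wrap a flat dict of fields in an XML document."""
    root = ET.Element(root_tag)
    for field, value in record.items():
        child = ET.SubElement(root, field)
        child.text = str(value)
    return ET.tostring(root, encoding="unicode")

email = {"sender": "jonathan.jones@example.com",
         "subject": "Q3 figures",
         "body": "Numbers attached."}
print(record_to_xml(email, root_tag="email"))
```

Once the record carries XML mark-up like this, every field becomes addressable by name, which is what makes the structural queries discussed below possible.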

MarkLogic is also a true transactional database. While some so-called NoSQL databases have compromised the ACID (Atomicity, Consistency, Isolation and Durability) properties that are important for transaction processing, MarkLogic has not. MarkLogic is fully equipped to be a transactional database, and if you simply wanted to use it for, say, order processing, there would be no problem in doing so.

Finally, MarkLogic is a search-oriented database. This is, in our view, the most important aspect of the product and the one that defines its role. The database has been built to enable rapid search of its content in a similar manner to the way that Google’s search capabilities have been built to enable rapid search of the Internet. As some of MarkLogic’s implementations have grown beyond the single-petabyte level, fast search of massive amounts of data is one of its most important features.

To enable its search capability MarkLogic indexes everything on ingest; not just the data, but also the XML metadata. This provides it with the ability to search both text and structure. For example, you might want to quickly find someone’s phone number from a collection of emails. With MarkLogic you could pose a query such as:

“Find all emails sent by Jonathan Jones, sort in reverse order by time, and locate the latest email that contains a phone number in its signature block.”

You may be able to deduce from this that MarkLogic knows what an email is, knows how to determine who the sender is, knows what a signature block is and knows how to identify a phone number from within the signature block. If you were looking for a mobile phone number then you would simply add the word “mobile” in front of “phone number.”

It should be clear from this that very few databases could handle such a query, because most databases are straitjacketed by whatever version of SQL they implement. Even if it were possible to bend SQL in such a way as to formulate this kind of query, most databases cannot dig into the data structures they hold in the way that MarkLogic can.
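A toy version of the email query above shows why combined text-and-structure search matters. The sketch below is plain Python over hand-built XML, not MarkLogic; the element names (`sender`, `sent`, `signature`) and the phone-number regex are our own assumptions.

```python
# Find the latest email from a given sender whose signature block contains
# a phone number -- a structural query plus a textual pattern match.
import re
import xml.etree.ElementTree as ET

PHONE = re.compile(r"\+?\d[\d\s\-]{6,}\d")   # crude phone-number heuristic

emails = [
    '<email><sender>Jonathan Jones</sender><sent>2012-01-03</sent>'
    '<signature>Jonathan Jones</signature></email>',
    '<email><sender>Jonathan Jones</sender><sent>2012-02-10</sent>'
    '<signature>Jonathan Jones, +1 212 555 0100</signature></email>',
]

def latest_phone(docs, sender):
    """Return the phone number from the newest matching email, if any."""
    hits = []
    for doc in docs:
        root = ET.fromstring(doc)
        if root.findtext("sender") != sender:     # structural condition
            continue
        match = PHONE.search(root.findtext("signature") or "")
        if match:                                 # textual condition
            hits.append((root.findtext("sent"), match.group()))
    hits.sort(reverse=True)                       # newest first
    return hits[0][1] if hits else None

print(latest_phone(emails, "Jonathan Jones"))  # → +1 212 555 0100
```

The point of the illustration is that the query reasons over both the document structure (sender, signature block) and the text inside it, which is exactly the combination a pure SQL interface struggles to express.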

How Does It Work?

Figure 1. MarkLogic's Scale-Out Architecture

MarkLogic’s architecture is illustrated simply in Figure 1. There are evaluator nodes (e-nodes) which process queries and data nodes (d-nodes) that provide access to data. E-nodes wait for requests, parse them and assemble responses from data passed to them by the d-nodes. The d-nodes hold all the data and indexes, receive part of a query from an e-node and provide their response. If the workload increases, you add more e-nodes, and as the data grows you add more d-nodes. The load balancer distributes the workload according to resource availability. Scale-out is achieved simply by adding nodes as needed, and as we already noted, in practice it has been used up to the petabyte level.
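The e-node/d-node division of labor is a classic scatter-gather pattern, which can be sketched minimally as follows. This is plain Python with invented class names; MarkLogic’s real node protocol is of course far more involved.

```python
# Minimal scatter-gather sketch of the e-node / d-node pattern: the e-node
# fans a query out to every d-node, each d-node answers from its own data
# partition, and the e-node merges the partial results into one answer.

class DNode:
    def __init__(self, docs):
        self.docs = docs                    # this node's data partition
    def search(self, term):
        return [d for d in self.docs if term in d]

class ENode:
    def __init__(self, dnodes):
        self.dnodes = dnodes
    def query(self, term):
        results = []
        for dnode in self.dnodes:           # scatter the query
            results.extend(dnode.search(term))  # gather partial results
        return sorted(results)

cluster = ENode([DNode(["alpha report", "beta memo"]),
                 DNode(["alpha invoice", "gamma filing"])])
print(cluster.query("alpha"))  # → ['alpha invoice', 'alpha report']
```

Because each d-node only ever searches its own partition, adding d-nodes grows capacity, and adding e-nodes grows query throughput, which is the scale-out property the article describes.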

Failover is achieved by either replicating data across multiple nodes or by using a SAN. If a d-node fails, work can be assigned to another node holding the relevant data, or in the SAN configuration it can simply be assigned to another node as a standby d-node is brought online.

Schema and Indexing

As regards schemas, MarkLogic is “schema agnostic.” It could, for example, ingest all the data from a relational database along with its schema information, but it would hold the schema information in XML. When ingesting data that is not so conveniently structured it can use heuristics to deduce the nature of some items, such as a phone number, name, title and so on, possibly making use of context. When data of a specific kind is ingested you can provide XML tags to indicate meaning. For instance, in the above email example, you might indicate a signature block with a dedicated tag. In summary, MarkLogic will try to automate the ingest of new data, but you can provide markup guidelines to assist it. MarkLogic provides “entity enrichment,” whose capabilities you can enhance, and in specialized situations (such as face recognition in photographs) you can use third-party software to identify recognizable items.
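A bare-bones illustration of entity enrichment at ingest: a heuristic (here a regex, purely our assumption, and not MarkLogic’s actual enrichment machinery) recognizes phone numbers in raw text and wraps them in tags so they become searchable as structure.

```python
# Toy "entity enrichment": spot anything that looks like a phone number in
# plain text and wrap it in a <phone> tag during ingest.
import re

PHONE = re.compile(r"\+?\d[\d\s\-]{6,}\d")

def enrich(text):
    """Wrap phone-number-like substrings in a <phone> tag."""
    return PHONE.sub(lambda m: "<phone>%s</phone>" % m.group(), text)

print(enrich("Call me on +1 212 555 0100 tomorrow."))
# → Call me on <phone>+1 212 555 0100</phone> tomorrow.
```

Once enriched, the number is no longer just a substring of the body text; it is an addressable element that structural queries and scalar indexes can work with.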

MarkLogic’s query language is the W3C-standard XQuery, which is fairly easy to learn as it is purpose-designed to query, retrieve, and manipulate XML-formatted data. MarkLogic recently added support for XSLT, which is also a W3C-standard language. This can be used for transforming data during data ingest and for data output. It can directly produce XHTML documents – making it possible for MarkLogic to be deployed directly as a web server. Either language, or a mixture of the two, can be used to access and transform MarkLogic data.
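Running actual XQuery requires an XQuery engine, but the flavor of path-based retrieval it offers can be mimicked with the XPath subset in Python’s standard library. The document below and its element names are our own invention, used only to convey the idea.

```python
# Path-based retrieval over XML, roughly analogous in spirit to the
# XQuery/XPath expression //order/customer/text().
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<orders>"
    "<order id='1'><customer>Acme</customer><total>120</total></order>"
    "<order id='2'><customer>Globex</customer><total>80</total></order>"
    "</orders>")

# Select every customer name by walking the element path order/customer.
customers = [c.text for c in doc.findall("./order/customer")]
print(customers)  # → ['Acme', 'Globex']
```

Real XQuery goes much further than this (joins, FLWOR expressions, constructed output such as XHTML), but the core idea is the same: queries address the document’s structure directly rather than going through a relational schema.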

The database embodies a very sophisticated set of index structures which collectively support its fast retrieval capability. How these are used depends to some extent on the data that is stored. There are, or can be, full-text indexes, XML indexes, scalar indexes (for values) and geospatial indexes. MarkLogic also provides reverse indexes, in effect indexing queries so that answers to them can be retrieved quickly. This is particularly useful for generating alerts on specified conditions as new data (new documents, for example) is being ingested. The database has been built specifically to deliver very fast answers to queries of any kind, including Google-type queries and queries more typical of a relational database environment, such as those involving ranges (e.g., all surnames from “A*” to “B*”).
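The reverse-index idea inverts the usual arrangement: instead of indexing documents to answer queries, you store the queries and evaluate each incoming document against them, firing an alert on a match. A minimal sketch, with query names and predicates entirely of our own design:

```python
# Toy reverse index for alerting: stored queries are evaluated against each
# newly ingested document, and matching queries trigger their alerts.

stored_queries = {
    "outage-alert": lambda doc: "outage" in doc,
    "legal-alert":  lambda doc: "subpoena" in doc,
}

def ingest(doc):
    """Return the names of all stored queries the new document satisfies."""
    return sorted(name for name, query in stored_queries.items() if query(doc))

print(ingest("Network outage reported in region 3"))  # → ['outage-alert']
```

A real reverse index would organize the stored queries so that only plausible candidates are evaluated per document, rather than scanning all of them, but the inversion of roles is the essential point.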

In Summary

MarkLogic is clearly a product worth investigating if your problem is defined by very large amounts of data that embody complex data structures, and especially if geospatial or text data is part of the mix. There is no reason why it couldn’t be used in more standard environments, depending on what you wish to achieve. It has no formal SQL support and it provides no relational schema for SQL to get its teeth into. However, you can easily get at SQL databases and transfer the data to the MarkLogic environment using MLSQL, an open source XQuery library. What you cannot do is directly use SQL-based BI tools to access MarkLogic data.


Jefferson Braswell March 14, 2012 at 10:47 pm

This is an attractive scalable architecture for an XML database with robust indexing capabilities.

One caveat: even though MarkLogic can store and index the document metadata of any document without schema foreknowledge (by virtue of the metadata being embedded in an XML document), the XML documents must conform to equivalent XML Schemas and namespaces in order to relate different documents using common semantic attributes shared across multiple documents.

This is largely a taxonomy issue of course, and the quick-and-dirty approach — i.e., one that would relieve the burden of addressing XML Schema conformance among documents — could include:

1. Take the tag identifier name at face value. For example, “lname” in one document would be equivalent to “lname” in another document of a different type. Of course, in one document “lname” may be ‘last name’, and in another document following a different XML Schema, “lname” could be ‘link name’.

2. Apply additional XML Schema full-path equivalence in order to consider embedded metadata tags to be equivalent. For example, “order/shipping address/lname” would not match “region/gateway/lname”.

How does MarkLogic address the semantic equivalence issue when dealing with different XML namespaces?


Jefferson Braswell March 14, 2012 at 10:54 pm

The comment posting removed (or rather ‘processed’) part of my comment that I expressed using HTML tag notation! My bad.

In the previous post, in the paragraph for point 1, the comment posted as follows:

1. Take the tag identifier name at face value. For example, in one document would be equivalent to in another document of a different type. Of course, in one document may be ‘last name’, and in another document following a different XML Schema, could be ‘link name’

It should have read

1. Take the tag identifier name at face value. For example, “lname” in one document would be equivalent to “lname” in another document of a different type. Of course, in one document “lname” may be ‘last name’, and in another document following a different XML Schema, “lname” could be ‘link name’.


