NoSQL and Big Data Analytics Roundtable Takeaways

by Steve Dine on April 11, 2012

Author Steve Dine participated in a roundtable discussion on NoSQL and Big Data Analytics held live on March 14. Listen to the archive here.

There is little doubt that increasing data volumes are presenting challenges for organizations with regard to both analytics and data management. Organizations are not only capturing more data today, but also looking to perform more advanced analysis on more sources and types of data. This has presented a challenge for those responsible for meeting the analytic requirements of users and managing the data assets of an organization. In a recent Bloor Group roundtable, we discussed these new big data challenges and how the new breed of NoSQL solutions fits into the Business Intelligence (BI) landscape. A number of themes emerged:

  • NoSQL is really more about being “non-relational” than its support of a structured query language.
  • The current NoSQL data stores are relatively immature compared to relational database management systems.
  • There is a skills gap within BI organizations that will need to be addressed before they can effectively leverage the existing NoSQL solutions in the market.
  • Organizations need to understand their data workloads and analytic requirements before choosing any database solution. 

NoSQL, or Is That “Non-Relational”?

There has been a great deal of hype around the “NoSQL” approach to data storage and access. While the name may lead some to believe that those behind the effort are anti-SQL, many feel that it is more accurate to refer to NoSQL as “non-relational.” Many industry analysts feel that it’s not about eliminating SQL from the database, but rather about increasing system scalability. NoSQL data stores are also concerned with storing and accessing data that may be unstructured or semi-structured, be document-oriented, and/or contain complex multipath relationships. An advantage of these solutions is that they allow for schema tolerance, which means that you don’t necessarily have to conform to a predefined schema when loading or analyzing the data. If an attribute doesn’t exist as part of the tuple, then it simply isn’t stored. The NoSQL data stores also don’t support predefined relational constraints, such as foreign keys. Instead, the relationships are denormalized into the structure of the data set itself or resolved at runtime via map-reduce programs. This is why many have led a push to rename these solutions from NoSQL to non-relational.
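To make schema tolerance and runtime-resolved relationships a little more concrete, here is a minimal sketch in plain Python. It does not use any particular NoSQL product; the record layouts, field names, and function names are all hypothetical and chosen only for illustration. The second customer has no "region" attribute, so nothing is stored for it, and the join between orders and customers is resolved at runtime with a simple map, shuffle, and reduce sequence rather than a foreign key.

```python
from collections import defaultdict

# Hypothetical records: no predefined schema, so attributes may be absent.
customers = [
    {"id": 1, "name": "Acme", "region": "West"},
    {"id": 2, "name": "Globex"},  # no "region": the attribute simply isn't stored
]
orders = [
    {"order_id": 100, "customer_id": 1, "total": 250.0},
    {"order_id": 101, "customer_id": 2, "total": 75.0},
    {"order_id": 102, "customer_id": 1, "total": 40.0},
]


def map_phase(customers, orders):
    """Emit (join key, tagged record) pairs for both data sets."""
    for c in customers:
        yield c["id"], ("customer", c)
    for o in orders:
        yield o["customer_id"], ("order", o)


def shuffle(pairs):
    """Group the emitted values by key, as a map-reduce framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Pair each order with its customer; nothing enforces that one exists."""
    joined = []
    for key in sorted(groups):
        tagged = groups[key]
        matches = [rec for tag, rec in tagged if tag == "customer"]
        if not matches:
            continue  # orphaned orders are skipped, not rejected by a constraint
        customer = matches[0]
        for tag, rec in tagged:
            if tag == "order":
                joined.append({
                    "order_id": rec["order_id"],
                    "customer": customer["name"],
                    "region": customer.get("region"),  # None when not stored
                })
    return joined


def join_orders_to_customers(customers, orders):
    return reduce_phase(shuffle(map_phase(customers, orders)))
```

Calling join_orders_to_customers(customers, orders) returns one row per order, with the region left as None for the customer that never stored one. The point is only that the application code, not the data store, is responsible for the relationship.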

Who are You Calling Immature?

While it’s true that maturity is relative, in consulting circles we tend to measure the maturity of software based on the level of risk it presents to our ability to meet both the customer requirements and the timeline of a project. Others often measure the maturity of software based on the version number of the latest release. With NoSQL solutions, both measures might deem them immature. Solutions such as Cloudera’s Hadoop are still in version 0.x; other projects, like MongoDB, are only up to version 2.x. While a version number in itself is somewhat meaningless, what it often translates to is fewer capabilities, more “undocumented” features, and greater risk. During a recent customer proof of concept, we worked with one of the NoSQL data stores, whose name has been withheld to protect the guilty. The first challenge we ran into was that the product did not generate error messages when an incorrect MapReduce program was run to query the data store. It simply returned zero records, which made it difficult to determine whether there were errors in the programs or whether there were actually no records in the result set. Another major challenge was that an upgrade was required to resolve stability issues later in the project, and the upgrade caused a number of the MapReduce programs to return incorrect results. While each of these issues was handled by our team, the net result was delays in the project and a loss of confidence in the NoSQL solution by the customer.
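The "zero records" problem is easy to reproduce in miniature. The sketch below is plain Python, not the product we used, and the function and field names are hypothetical. It shows how a filter-style map step that silently skips records lacking a field returns an empty result for a misspelled field name, and how counting those skipped records lets a caller tell a broken program apart from a genuinely empty result set.

```python
def run_map(records, field, predicate):
    """Filter records on a field; count the records that lack the field."""
    matched, missing = [], 0
    for rec in records:
        if field not in rec:
            missing += 1  # silently skipped, as a schema-tolerant store would
            continue
        if predicate(rec[field]):
            matched.append(rec)
    return matched, missing


def checked_query(records, field, predicate):
    """Raise if the field is absent from every record; otherwise return matches.

    An empty list from this function means "no records matched", not
    "the program referenced a field that does not exist".
    """
    matched, missing = run_map(records, field, predicate)
    if records and missing == len(records):
        raise KeyError(
            f"field {field!r} not found in any of {len(records)} records"
        )
    return matched


# Hypothetical event records used only for illustration.
events = [{"status": "error"}, {"status": "ok"}]
```

With these sample records, checked_query(events, "staus", ...) raises a KeyError because of the typo, while a correctly spelled query that matches nothing returns an empty list. A guard of this kind is an assumption about how one might work around the issue, not a feature of any specific data store.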

Put Me in, Coach, I’m Ready to Learn NoSQL

With the maturation of enterprise BI suites, ETL tools, and relational database management systems, the vast majority of development can be accomplished via point-and-click interfaces. We’re often lucky to find ETL developers who can develop a stored procedure or report developers who can write correlated sub-queries. It’s not that our resources are any less intelligent; it’s that the BI tools require less technical interaction, and so many BI organizations hire less technically skilled resources. In addition, even for the more technically skilled BI teams, the core technical skills required to work with NoSQL solutions are fundamentally different from what they possess today. Aside from the need to interact in a command-line-only environment, the NoSQL solutions require programs to be written in order to load, manage, and query the data. A strong knowledge of a programming language, such as Java, JavaScript, Python, Ruby, C++, and/or Erlang, is required. These skills are vastly different from SQL, stored procedures, and vendor-implemented functions. While languages such as Pig and Hive aim to make it easier to interact with NoSQL data stores, there is still a significant skills gap that will need to be closed before most BI organizations can effectively leverage NoSQL solutions.
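To give a feel for the difference in skills, here is the same aggregation written two ways. This is a rough sketch in Python: the sample rows and function names are hypothetical, the SQL version runs against an in-memory SQLite database from the standard library, and the second version imitates the map, shuffle, and reduce steps in ordinary code rather than calling a real MapReduce framework.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample data: (region, sale amount).
sales = [("West", 250.0), ("East", 75.0), ("West", 40.0)]


def totals_with_sql(rows):
    """Declarative version: one GROUP BY statement does the work."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    cursor = conn.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region"
    )
    result = dict(cursor)  # each row is a (region, total) pair
    conn.close()
    return result


def totals_with_map_reduce(rows):
    """Programmatic version: the developer writes each step by hand."""
    # Map: emit a (key, value) pair per input row.
    pairs = ((region, amount) for region, amount in rows)
    # Shuffle: group values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # Reduce: collapse each group to a single total.
    return {key: sum(values) for key, values in groups.items()}
```

Both functions return the same totals per region for the sample data. The SQL developer states what result is wanted; the map-reduce developer has to spell out how to compute it, which is the shift in mindset described above.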

Don’t Choose a NoSQL Solution Based on Hype Alone

Along with cloud computing, “big data” was one of the most talked about data-related subjects in 2011. Every organization was made to feel as though it had a big data problem and NoSQL was the answer. Those pushing these solutions proclaimed that relational databases weren’t scalable enough to handle growing data volumes and the advanced analysis required on large data sets. While this is true today for a minority of organizations and certain types of data and workloads, the truth is that NoSQL is often a solution looking for a problem, and only one of many solutions in those organizations that utilize these technologies. Companies such as Yahoo and Facebook, which have been leaders in open source NoSQL projects, also leverage relational databases for structured, ad hoc, user-facing analytics. Like most of what we find in the BI world, there is no one-size-fits-all solution, since analytic workloads tend to vary depending on the type of analysis, data, and data volumes involved. What all the roundtable participants agreed on was that you can’t decide on a database solution until you understand your true data volumes and analytics workload requirements. At this stage, it’s best to start out with a proof of concept before deciding to go down this path.

There is no doubt that data volumes are increasing, more semi-structured and unstructured data are being analyzed, and analytic requirements are changing faster than ever. Our ability to meet these requirements is altering the way we approach the solution. A new breed of data storage and retrieval solutions has been developed that adds options for our data warehousing architectures. The Bloor Group roundtable of analysts agreed that before jumping in with both feet, organizations should evaluate their analytic requirements, data workloads, willingness to implement relatively immature software, and available skill sets to support these solutions. While the non-relational solutions can provide extreme scalability and flexibility for organizations struggling with big data challenges, they are not a one-size-fits-all solution for all data management and analytic needs.

About the Author: Steve Dine is the managing partner and founder of Datasource Consulting, LLC. He has extensive experience delivering and managing successful, highly scalable and maintainable data integration and business intelligence solutions. Steve combines hands-on technical experience across the entire BI project lifecycle with strong business acumen. He is the former Director of Global Data Warehousing for a major durable medical equipment manufacturing company and currently works as a consultant for Fortune 500 companies. Steve is a faculty member at The Data Warehouse Institute and a judge for the Annual TDWI Best Practices Awards. He teaches courses and presents on the topics of Lean BI, BI in the Cloud and Enabling BI for the 21st Century. For more information, Steve can be reached via email at [email protected].
