All posts by Kim Loughead

Concurrent Tool Optimizes Hadoop Big Data App Performance

Christopher Tozzi, The VAR Guy
February 4, 2014
http://thevarguy.com/big-data-technology-solutions-and-information/020414/concurrent-tool-opti

Another Big Data application designed to simplify data analysis using the open source Hadoop platform has hit the channel. Concurrent has announced the release of Driven, a free cloud service that it describes as “the industry’s first application performance management product for Big Data applications.”

Driven is designed to help developers and data analysts optimize the performance of Big Data applications by providing application metrics in real time. It also offers visualization features for diagnosing application problems as they occur. Insights such as these enable administrators to find inefficiencies and tweak performance, with the end result of shortening the time required to reap the results of Big Data operations.

Driven runs on top of Concurrent’s Big Data application framework, Cascading, which has 6,000 deployments and sees more than 130,000 downloads each month, according to the company. Cascading users can download the Driven plug-in for testing now from the platform’s website. Concurrent promises general availability of the product, along with “additional operations features and deployment options,” in Q2 of 2014.

The product is another example of both the maturation of the open source Hadoop Big Data platform, which now enjoys widespread adoption, and demand for solutions that simplify Hadoop deployment. It’s a natural addition to Concurrent’s suite of Hadoop products, which also includes a variety of programming interfaces that allow data analysts to use data languages they probably already know well—such as SQL in the case of Cascading Lingual, which Concurrent released in November 2013—to connect to Hadoop.

Concurrent is currently offering Driven, which is a cloud-based service, at no cost for development purposes. It will require a license for production operations.

Concurrent Offers Performance Management for Big Data Applications

Thor Olavsrud, CIO Magazine
February 4, 2014
http://www.cio.com/article/747673/Concurrent_Offers_Performance_Management_for_Big_Data_Applications

CIO — As big data applications move from development and proof of concept to production, the need for management and operations tools is becoming ever more pronounced. Enter big data application platform company, Concurrent, primary sponsor of the open source Cascading application framework.

Concurrent today announced Driven, which the company says is the first application performance management product for big data applications. Chris Wensel, author of the Cascading project and founder and CTO of Concurrent, says Driven is purpose-built to address the pain points of enterprise application development and application performance management on Apache Hadoop.

“Driven is a powerful step forward in delivering the full promise of connecting business with big data,” Wensel says. “Gone are the days when developers must dig through log files for clues to slow performance or failures of their data processing applications. The release of Driven further enables enterprise users to develop data-oriented applications on Apache Hadoop in a more collaborative, streamlined fashion. Driven is the key to unlock enterprises’ ability to drive differentiation through data. There’s a lot more to come—this is only the beginning.”

Governance and compliance tools are among the features on tap for the future, Wensel says.

The idea behind Driven is to give developers, data analysts, data scientists and operations the capability to see key application metrics in real-time and thus allow them to isolate and resolve problems quickly.

Driven Helps Visualize, Diagnose and Resolve Big Data Failures
Driven is essentially a plug-in for Cascading, which is an application that sits atop Hadoop and works with all the major Hadoop distributions. Once installed, Driven immediately begins collecting telemetry data from your running Cascading applications. That includes users of the popular domain-specific languages (DSL) built on Cascading, including Scalding (Scala DSL on Cascading), Cascalog (Clojure DSL on Cascading), Lingual (ANSI SQL on Cascading) and Pattern (Predictive Model Scoring on Cascading). The telemetry data gives Cascading users the ability to visualize their data applications, diagnose and quickly resolve application failures and performance problems.

Concurrent says Driven will allow developers and enterprises to achieve the following:

  • Accelerate time to market. Driven reduces the time to application production with process visualization and monitoring capabilities, allowing you to quickly understand complex applications and data flows by drilling down into each application at runtime using a rich user interface.
  • Build reliable applications. With detailed insight into your data processing logic and algorithms, you can ensure they are executing properly. Driven surfaces key application metrics around each data process to provide insights around data accuracy.
  • Optimize application performance. Driven allows you to understand the performance and capacity of the applications running on your infrastructure by providing key application behavior metrics like data skew and runtime parallelization. You can also compare this information with historical data to trend application performance, both in development and in production.
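To make the "data skew" metric concrete, here is a minimal sketch in Python. It is not Driven's implementation; the article does not say how Driven computes skew. The function name `skew_ratio` and the definition used here (largest task input divided by the mean task input) are assumptions chosen for illustration.

```python
def skew_ratio(records_per_task):
    """Return max/mean of the per-task record counts (hypothetical skew measure).

    A value near 1.0 means work is spread evenly across tasks; a large value
    means one task received far more data than the average task.
    """
    if not records_per_task:
        raise ValueError("need at least one task")
    mean = sum(records_per_task) / len(records_per_task)
    if mean == 0:
        return 1.0  # no data anywhere, so nothing is skewed
    return max(records_per_task) / mean


# Three tasks read 100 records each while a fourth reads 700.
uneven = skew_ratio([100, 100, 100, 700])
even = skew_ratio([50, 50])
```

A ratio well above 1.0, as in the uneven case, is the kind of signal that would prompt a closer look at how keys are distributed across tasks.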

Driven represents a first for Concurrent in other ways as well. Until now, all of Concurrent’s output has been open source. But Driven is a proprietary product on top of Cascading, Wensel explains. It comes in two flavors: Driven and Driven Enterprise.

Driven, available now in public beta, is a free cloud service for development environments only. Concurrent will provide online support for Driven. Driven Enterprise, which Wensel says will be generally available in the second quarter of 2014, will require an annual subscription. It is intended for both development and production environments and will support both developers and operations. Driven Enterprise will be available both on-premise and in the cloud, and Concurrent will provide enterprise support for the product.

“The product itself will be closed,” Wensel explains. “Driven is an open service and freely available for use, but you don’t get the source code. We feel this is the most natural, non-conflicting way to monetize open source.”

“We want the initial offering to be online so we can iterate and learn,” he adds. “We’re putting out a basic initial set of features. We have a long list of things we’re going to want to add.”

Concurrent Driven: Big data application performance management

Dan Kusnetzky, ZDNet
February 4, 2014
http://www.zdnet.com/concurrent-driven-big-data-application-performance-management-7000025936/

Concurrent’s founder and CTO, Chris Wensel, and CEO, Gary Nakamura, stopped by to introduce their company; their leading product, Cascading; and their new application performance management product, Driven, which is designed to take the guesswork out of finding performance problems for Hadoop-based big data applications.

What is Cascading?
I have to admit that I was not aware of Cascading until this conversation. After the discussion, I spent some time familiarizing myself with the open source project and what it can do. Here’s how Concurrent describes Cascading:

Cascading is a Java application framework that enables typical developers to quickly and easily develop rich Data Analytics and Data Management applications that can be deployed and managed across a variety of computing environments. Cascading works seamlessly with Apache Hadoop 1.0 and API compatible distributions.

I was impressed by the list of companies that use Cascading as part of their Big Data development efforts and by the fact that 75,000 copies of the software are downloaded monthly. Concurrent points out that over 6,000 data driven businesses, including Twitter, eBay, The Climate Corp and Etsy, use Cascading to develop Hadoop-based applications.

What is Driven?
Driven is an application performance management tool designed to help Cascading developers accelerate development, diagnose performance problems, and both manage and monitor Hadoop-based Big Data applications.

Concurrent’s focus
Concurrent points out that they are addressing three things:

  1. Making Hadoop development straightforward enough that it can be a tool enterprises of all sizes can use to become data-driven companies. Without Cascading, many would have to resort to programming Big Data applications in assembler.
  2. Offering tools that provide alerting and notifications for Big Data applications so that they can be folded into production environments.
  3. Providing low-cost tools (free for development, modest cost for production environments) that make it possible for companies to use Big Data.

Snapshot Analysis
Companies have accumulated huge amounts of operational, telemetric and point of sale data that could be the basis for a better and deeper understanding of their own operations and customers. Hadoop, while a very popular tool, can be challenging to a newcomer. Tools such as Cascading and Driven could certainly shorten the time it takes for developers to come up to speed and be productive with Hadoop.

Application Performance Management (APM) For Big Data

Mike Matchett, Taneja Blog
February 4, 2014
http://tanejagroup.com/news/blog/blog-systems-and-technology/application-performance-management-apm-comes-to-hadoop

Concurrent, the folks behind Cascading, have today announced the beta of “Driven” – an Application Performance Management (APM) solution for Hadoop. APM has been sorely missing from the Hadoop ecosystem at a level in which developers, IT ops, and even end users can quickly get to the bottom of any issues.

First, if you don’t know about Cascading – one of the big impediments to using Hadoop on a wider scale is the need for programmers to work MapReduce algorithms at a low, detailed, and often mind-bending level. Cascading is a popular open source package that provides a higher level of abstraction on top of Hadoop. Instead of working with mappers and reducers, you can work with more commonly understood higher-level objects like sources and sinks, functions, filters and joins. There are others of course, like Pig, but Cascading is designed for super reliability in production at scale.
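To give a feel for what "sources and sinks, functions, filters" means in practice, here is a small sketch in plain Python. It is not Cascading's API (Cascading is a Java framework); every function name below is hypothetical and only illustrates the idea of wiring named stages together instead of hand-writing mappers and reducers.

```python
from collections import defaultdict


def source(lines):
    """Source: yield raw records one at a time."""
    for line in lines:
        yield line


def keep_non_empty(records):
    """Filter: drop blank records."""
    return (r for r in records if r.strip())


def split_words(records):
    """Function: turn each record into zero or more lowercase words."""
    for r in records:
        for word in r.lower().split():
            yield word


def count_by_key(words):
    """Group and aggregate: count how often each word occurs."""
    counts = defaultdict(int)
    for w in words:
        counts[w] += 1
    return dict(counts)


def run_pipeline(lines):
    """Wire the stages together; the returned dict plays the role of the sink."""
    return count_by_key(split_words(keep_non_empty(source(lines))))


result = run_pipeline(["big data", "", "Big apps"])
```

The developer describes what happens to the data stage by stage; in a framework of this kind, translating those stages into cluster jobs is left to the planner rather than the programmer.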

Yet when anything goes wrong in Hadoop at scale, it can be hard to figure out (to say the least!). Especially in production when there are service level agreements in play which start the clock ticking on resolution. Downtime or degradation for big data “batch” apps and even more for the newer big data “streaming” apps, can cost big bucks. So Concurrent saw an opportunity to further leverage their app platform with a plug-in to feed detailed instrumentation into a management service.

Driven is at first a free-for-development cloud service that deploys quickly into any Cascading implementation and will easily help track and gain immediate insight into “enterprise-grade” apps. Commercial production and on-prem versions should be available once it gets into GA.

As a service, Driven monitors all the running apps and processes, and tracks successes and failures with all the expected alerts and notifications. Visualizations are interesting maps with a detailed “high fidelity” view into the app components, highlighted of course where failures occur, with direct drill down into exceptions and stack traces. This is just a first cut, though; lots of advanced analysis could easily be layered on later.

According to Concurrent, the community collaboration will be a key value for folks too, with Driven being integrated into the Cascading community web site.

While “free-for-development” Driven seems a no-brainer for Cascading developers, and should help to drive Cascading adoption even faster, we expect the big impact of Driven’s APM to really be with IT and Dev operations folks who have to manage large data processing solutions in production. APM is sorely needed for big data in enterprise production, and we think Driven has a good chance of expanding over time to give Concurrent a real foothold in the enterprise management space. And on the backend Concurrent will gain instant broad and deep visibility into how the platform is doing in the field, helping improve Cascading.

Concurrent Releases Driven, First Big Data Application Performance Management Solution

Arnal Dayaratna, Ph.D., Cloud Computing Today
February 4, 2014
http://cloud-computing-today.com/2014/02/04/845844/

Concurrent Inc., the primary sponsor behind Cascading, today announces the release of Driven, an application performance management solution for Big Data applications. Driven enables developers to quickly identify and remediate application failures and performance issues specific to applications built using Hadoop. Available as a plug-in for the Cascading infrastructure, Driven solves a key problem in the Hadoop industry related to the management of Hadoop-based applications. The use of Driven allows developers to confirm the successful execution of application jobs and data processing algorithms, in addition to facilitating the optimization of application performance. Developers can monitor and trend application metrics such as runtime parallelization for both operational and R&D purposes. Moreover, because Driven is part of the Java-based Cascading framework for building analytics and data management applications on Apache Hadoop, Driven users can take advantage of Cascading’s collaboration functionality to communicate with Driven communities all over the world.

Chris Wensel, founder and CTO, Concurrent, Inc., remarked on the significance of Driven as follows:

Driven is a powerful step forward in delivering on the full promise of connecting business with Big Data. Gone are the days when developers must dig through log files for clues to slow performance or failures of their data processing applications. The release of Driven further enables enterprise users to develop data oriented applications on Apache Hadoop in a more collaborative, streamlined fashion. Driven is the key to unlock enterprises’ ability to drive differentiation through data. There’s a lot more to come – this is only the beginning.

Here, Wensel notes the way in which Driven responds to the opacity of Hadoop by providing developers with an alternative to slogging through volumes of log files to understand the performance of their applications. Concurrent CEO Gary Nakamura elaborated on Wensel’s remarks by noting that “One of the big problems in Hadoop today is it’s just a black box,” and that Driven provides a way to expeditiously navigate to lines of code that are responsible for application failure. Because of its positioning as part of the Cascading infrastructure, Driven stands to significantly enhance the value of Cascading by providing developers with an extra layer of insight into application performance that complements Cascading’s indigenous framework for big data analytics and data management. Expect Driven to vault the status of Cascading within the Big Data industry even further and ultimately confirm its place as the go-to application for Hadoop analytics, data and application management. Driven is currently available in public Beta whereas its commercial variant, Driven Enterprise, will be available in Q2 via an annual subscription.

Shining a Light on Hadoop’s ‘Black Box’ Runtime

Alex Woodie, Datanami
February 4, 2014
http://www.datanami.com/datanami/2014-02-04/shining_a_light_on_hadoop_s_black_box_runtime.html

Let’s face it: Writing MapReduce processes is not very fun. That’s the main reason that the Cascading framework is gaining such a big following–because it abstracts away the difficult part of MapReduce with an easy-to-use Java API and library. With today’s launch of a new product called Driven, the company behind Cascading is enabling users to instrument the data analytic apps developed with Cascading, in pursuit of faster troubleshooting and higher performance.

There is some serious momentum building up behind Cascading. According to Concurrent–the commercial open source company founded by Cascading creator Chris Wensel to sell support for Cascading–the open source framework is being downloaded 130,000 times per month. What’s more, 6,000+ companies have deployed Cascading-built applications on production Hadoop clusters, including big names like Twitter, Kohl’s, and Nokia.

The way that Cascading allows mortal Java developers with average skills to build MapReduce-based applications that would normally require a super Java coder to construct has made Cascading a staple component of many Hadoop projects. “Being a Java API, the average Java developer can use it,” Wensel tells Datanami. “They can write tests and use their IDE. But also more importantly, they can think about the problem at hand and they don’t have to think in terms of MapReduce, MapReduce, MapReduce.”

While Cascading has helped many organizations build data analytic apps that run on Hadoop, the framework doesn’t address the overall lack of visibility into the inner-workings of Hadoop apps once they’re placed into production.

“One of the big problems in Hadoop today is it’s just a black box,” says Concurrent CEO Gary Nakamura. “Most people today deploy their applications and pray. What we’re doing [with Driven] is providing the visibility so you can actually see what’s going on, and if there’s a failure, we’ll take you to the exact spot that failure happened, so a developer can try and figure out what to do.”

From its GUI, Driven will show users exceptions and stack traces in their Hadoop app, and track all the filters, joins, and other functions that are taking place within the software. “You’ll be able to see all of the details in your data application, the units of work and how it all ties together,” Nakamura says. “You’ll be able to see them in real-time, running on Hadoop, and see how your application is progressing.”

Nakamura says the software will help users, operators, and developers collaborate on improving their Hadoop applications–not only with broken apps, but with the working apps that could use a little optimization.

“We expect Driven to provide the capability to build more reliable applications,” he says. “Developers and operators will be able to look at those things and say, ‘Hmmm, we should have somebody take a look at this because everything else takes 5 minutes and this takes 25 minutes. We ought to be able to optimize that down to something more reasonable.'”
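The "5 minutes versus 25 minutes" comparison Nakamura describes can be sketched as a simple outlier check. The Python below is only an illustration under stated assumptions: the function name `flag_slow_steps`, the step names, and the rule "more than three times the median" are all hypothetical, and nothing here reflects how Driven itself detects slow steps.

```python
from statistics import median


def flag_slow_steps(durations, factor=3.0):
    """Return the names of steps that ran longer than `factor` times the median.

    `durations` maps a step name to its runtime in minutes.
    """
    typical = median(durations.values())
    return sorted(name for name, minutes in durations.items()
                  if minutes > factor * typical)


# Most steps take 5 minutes; one join takes 25.
slow = flag_slow_steps({"parse": 5, "filter": 5, "join": 25, "write": 5})
```

With these numbers only the `join` step is flagged, which mirrors the kind of conversation the quote imagines between developers and operators.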

As Cascading grows in use, so will Driven, Nakamura says. “The next version [3.0] of Cascading will support other fabrics like Spark, Storm, and Tez,” he says. “So that means applications that have been built using Cascading will be portable across the supported frameworks, across the supported fabrics.” As these Cascading-developed applications start moving to different fabrics, Driven will follow and provide the same type of troubleshooting and optimization capabilities.

The first release of Driven will focus on helping developers monitor, debug, and set alerts on their Hadoop apps. Later, Concurrent will add more operational capabilities to their Hadoop jobs, including setting service level agreements (SLAs), enforcing quality of service (QoS), and ensuring the integrity of data lineage.

The first beta release of Driven is available now as a cloud service. The service is free for development use. A paid enterprise version is in the works that will support production use and be installable on-premise; it’s expected in the second quarter.

Driven supports Cascading version 2.08 (it’s currently at version 2.5) and includes popular domain specific languages like Lingual (ANSI SQL), Pattern (PMML), Scalding (Scala), and Cascalog (Clojure).

Concurrent, Inc. Delivers the First Application Performance Management Product for Big Data Applications

A Free Cloud Service, Driven Delivers Unmatched Visibility and Control to Big Data Applications

SAN FRANCISCO – Feb. 4, 2014 – Concurrent, Inc., the enterprise Big Data application platform company, today announced Driven, the industry’s first application performance management product for Big Data applications. Driven is purpose-built to address the pain points of enterprise application development and application performance management on Apache Hadoop™.

There is strong demand for a solution that helps Hadoop users quickly get to the bottom of their data application failures and performance problems. Driven enables developers, data analysts, data scientists and operations to see key application metrics in real-time, so that they can isolate and resolve problems immediately. The end results are stable, highly reliable applications that enterprises can depend on to deliver against their Big Data strategies.

Driven provides unprecedented visibility and control to Big Data applications developed, managed and deployed on Hadoop. Driven is a free cloud service and is an integral part of the Cascading community, where users can collaborate across organizations and get help from community experts.

Driven enables developers and enterprises to:

  • Accelerate Time to Market
    Dramatically reduce the time to application production with process visualization and monitoring capabilities. Quickly understand complex applications and data flows by drilling-down into each application at runtime through a rich user interface to accelerate your test-driven development cycle.
  • Build Reliable Applications
    Gain detailed insight into your data processing logic and algorithms, and ensure that they are executing properly. Key application metrics around each data process are surfaced to provide insights around data accuracy.
  • Optimize Application Performance
    Understand the performance and capacity of the applications running on your infrastructure. Key application behavior metrics, such as data skew and runtime parallelization, provide insights to application behavior. Also, compare with historical data to trend application performance whether in development or in production.

Cascading Users Rejoice

With more than 130,000 monthly downloads and 6,000 deployments, Cascading is the platform of choice for the development and deployment of Big Data applications. All Cascading users can leverage Driven to optimize the development and deployment of their Cascading applications, including users of popular domain-specific languages (DSL) built on Cascading, i.e. Scalding (Scala DSL on Cascading), Cascalog (Clojure DSL on Cascading), Lingual (ANSI SQL on Cascading) and Pattern (Predictive Model Scoring on Cascading).

Cascading users can drop in the free plug-in in minutes by visiting http://www.cascading.io. Once installed, Driven immediately begins collecting telemetry data from your running applications, enabling Cascading users to visualize their data applications, diagnose and quickly resolve application failures and performance problems. Cascading and Driven together deliver a one-two punch that knocks out the complexity in Big Data application development.

Supporting Quotes

“We build Big Data applications on Cascading and rely on these applications to run our service. Driven is a much-needed addition to deploying and managing our Cascading applications on a day-to-day basis.”

-David Amusin, CTO, Copilot

“Given the rapid adoption of Hadoop, we fully embrace partners that are focused on helping enterprises develop Big Data applications on Apache Hadoop in a more collaborative, streamlined fashion. We look forward to the success, control and visibility that Driven will bring to Hadoop deployments in the enterprise.”

-John Kreisa, vice president strategic marketing, Hortonworks

“Driven is a powerful step forward in delivering on the full promise of connecting business with Big Data. Gone are the days when developers must dig through log files for clues to slow performance or failures of their data processing applications. The release of Driven further enables enterprise users to develop data oriented applications on Apache Hadoop in a more collaborative, streamlined fashion. Driven is the key to unlock enterprises’ ability to drive differentiation through data. There’s a lot more to come – this is only the beginning.”

-Chris Wensel, founder and CTO, Concurrent, Inc.

Availability and Pricing

Driven will be available in public beta this week at http://www.cascading.io/driven and is offered as a free cloud service as part of the Cascading community.

Driven is free for development. To request a quote for a production license or to evaluate Driven in your data center, contact sales@concurrentinc.com. Driven will be generally available in Q2 2014, with additional operations features and deployment options.


About Concurrent, Inc.

Concurrent, Inc. delivers the #1 application development platform for Big Data applications. Concurrent builds application infrastructure products that are designed to help enterprises create, deploy, run and manage data applications at scale on Apache Hadoop™.

Concurrent is the team behind Cascading™, the most widely used and deployed technology for Big Data applications with more than 130,000 user downloads a month. Used by thousands of businesses including Twitter, eBay, The Climate Corp and Etsy, Cascading is the de-facto standard in open source application infrastructure technology.

Concurrent is headquartered in San Francisco and online at http://concurrentinc.com.

Media Contact

Danielle Salvato-Earl
Kulesa Faul for Concurrent, Inc.
(650) 922-7287
concurrent@kulesafaul.com

MeetUp | Etsy’s journey: JRuby to Scalding – Feb 25, 2014

Sign-up here: http://meetu.ps/28L6C0

When:
Tuesday, Feb 25, 2014
6:00 PM (PT)

Where:
Etsy
20 California St, Floor 3
San Francisco, CA 94111

Etsy’s journey: JRuby to Scalding. What happens when a technology chooses you

Come hear Dan McKinley, Principal Engineer at Etsy, talk about his journey from JRuby to Scalding.

After 3 years building features and analytics infrastructure in cascading.jruby, the framework (similar in most ways to Pig) was entrenched. Etsy had just finished migrating their EMR pipeline onto new internal hardware, and standardizing the development environment for their 150+ person product development team. It was at this moment that the Scalding grenade hit. Introduced using guerrilla tactics, within a few months Scalding had been widely adopted. Within a year, cascading.jruby was deprecated. The talk will cover Etsy’s story, the technical problems that precipitated it, and the general unease implied when your technology chooses you.

Concurrent, Inc. Announces Entry Into the Rackspace Partner Network

Collaboration Combines Power of Cascading with the Rackspace Hybrid Cloud to Deliver Flexibility and Simplicity for Hadoop Deployments

SAN FRANCISCO – Jan. 28, 2014 – Concurrent, Inc., the enterprise Big Data application platform company, today announced that it has joined the Partner Network for Rackspace® Hosting, (NYSE: RAX), the open cloud company. By joining the Rackspace partner network, Concurrent allows users to easily deploy Big Data applications at scale on the Rackspace Hybrid Cloud, powered by OpenStack. This channel partnership combines the power of Cascading, the most widely used and deployed application platform for building robust enterprise Big Data applications, with Rackspace’s managed cloud hosting expertise to drive enterprise Big Data application development on Apache Hadoop™.

Cascading is the enterprise development framework of choice for processing data sets on Hadoop, with more than 130,000 downloads a month. Cascading simplifies programming, workflow and data processing, allowing enterprises to cut time to production by 50 percent. As a result, enterprises can more easily execute on their Big Data strategies by leveraging existing skillsets, systems and tools – all while reducing the overall complexity of their technology infrastructure. Cascading is compatible with all popular Hadoop distributions, including Hadoop 2 and YARN.

As more businesses seek to leverage their data assets, there is an ever-growing need for cost-effective, scalable Big Data platforms. Enter Rackspace hybrid cloud platform, which provides developers, architects and data scientists with a flexible solution for easily developing, deploying and managing Big Data applications at scale. Enterprises can now tap into the power of Cascading along with Rackspace’s high-performance and cost-effective hosted cloud to further drive business differentiation through data.

Supporting Quotes

“Rackspace is proud to welcome Concurrent as a select technology partner, especially given the incredible popularity of Cascading. We look forward to furthering enterprise Big Data adoption by jointly providing the easiest, most cost-effective and reliable means for Hadoop deployment.”

-Chris Rallo, Senior Manager, Partner Strategy and Programs, Rackspace

“Using the right tools is critical for ensuring the success of any Big Data project. By partnering with Rackspace, Concurrent continues our dedication to easy Big Data application development for the masses. Rackspace’s hybrid cloud platform provides enterprises with best-in-breed capabilities to make the most of their Cascading Big Data application projects on a scalable, high-performance framework.”

-Gary Nakamura, CEO, Concurrent, Inc.

“Combining the power of Cascading with Rackspace’s hybrid cloud platform provides the right recipe for developing solutions fast while ensuring easy deployment at scale. AdMobius is committed to providing our customers with the very best solutions to make sense of their data, and this partnership helps ensure that we do just that.”

-Ray Duong, CTO, AdMobius


About Concurrent, Inc.

Concurrent, Inc. is the enterprise Big Data application platform company. Founded in 2008, Concurrent simplifies Big Data application development, deployment and management on Apache Hadoop. We are the company behind Cascading, the most widely used and deployed technology for building Big Data applications with more than 130,000 user downloads a month. Enterprises including Twitter, eBay, The Climate Corporation, Square and Etsy all rely on Concurrent’s technology to drive their Big Data deployments. Concurrent is headquartered in San Francisco. Visit Concurrent online at http://concurrentinc.com.


Media Contacts
Danielle Salvato-Earl

Kulesa Faul for Concurrent, Inc.
(650) 922-7287
concurrent@kulesafaul.com

Why Hadoop Only Solves a Third of the Growing Pains for Big Data

Gary Nakamura, CEO, Concurrent, Inc.
January 27, 2014
http://insights.wired.com/profiles/blogs/why-hadoop-only-solves-a-third-of-the-growing-pains-for-big-data

Apache Hadoop has addressed two of the growing pains organizations face as they attempt to make sense of larger and larger sets of data in order to out-innovate their competition, but four more remain unaddressed. The lesson: Just because you can avoid designing an end-to-end data supply chain when you start storing data doesn’t mean you should. Architecture matters. Having a plan to reduce the cost of getting answers and simultaneously scale its utility to the broader organization means adding new elements to Hadoop. Fortunately, the market is addressing this need.

The Need for a Single Repository for Big Data

Traditional databases were not the repository we needed for big data, 80 percent of which is unstructured. Hadoop offered us, for the first time, the ability to keep all the data in a single repository, addressing the first big data growing pain. Hadoop is becoming a bit bucket that can store absolutely everything: tabular data, machine data, documents, whatever. In most ways, this is a great thing because data becomes more valuable when it is combined with other data, just like an alloy of two metals can create a substance that is stronger and more resilient. Having lots of different types of data in one repository is a huge long-term win.

No Standard Way to Create Applications to Leverage Big Data

Once we have all the big data in a single repository, the next trick is to create applications that leverage that data. Here again Hadoop has done a fine, if complicated, job. MapReduce provides the plumbing for applications that can benefit from a parallel divide-and-conquer strategy to analyze and distill data spread over the massive Hadoop Distributed File System (HDFS). YARN, a result of the refactoring of Hadoop 1 introduced in Hadoop 2, formalizes resource management APIs and allows more parallelization models to be used against a given cluster. This is all great, but it leaves one huge problem: Most of the people who have questions have no way to access the data in Hadoop. A financial analyst at Thomson Reuters or a buyer at Bloomingdale’s wants answers that their data can provide, but in the main these folks don’t know how to write programs that access Hadoop; it’s not their expertise.
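
To make the divide-and-conquer idea concrete, here is a minimal single-process sketch in Python. It is not Hadoop code: the function names (map_phase, shuffle, reduce_phase, word_count) and the sample records are assumptions chosen for illustration, and a real cluster would run the map and reduce steps in parallel over data stored in HDFS.

```python
from collections import defaultdict


def map_phase(record):
    """Map step: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs):
    """Group values by key, the job a framework performs between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(records):
    """Run map, shuffle, and reduce in sequence over a list of text records."""
    pairs = [pair for record in records for pair in map_phase(record)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "Big answers"]))
```

Even for a task this small, the author has to think in terms of keys, grouping, and aggregation, which is the kind of expertise most analysts do not have.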

What this leaves us with is a situation that is mighty familiar to people who have worked in large organizations that have implemented traditional Business Intelligence stacks. Questions are plentiful and programs that can help answer those questions are scarce. Pretty soon, there’s a huge backlog that only programmers can help with. All that wonderful stored data is longing to answer questions. All those analysts and businesspeople want answers. The complexity of creating applications hinders progress.

Most companies get stuck at this point, which I call the “dark valley of Hadoop.” The way out is to find a way to address the remaining growing pains.

The Gap between Analysts and Big Data

The third growing pain is to close the gap between the analysts and the data. Splunk’s Hunk is perhaps the most promising technology to deliver a true interactive experience. Especially powerful are Splunk’s capabilities for discovering the structure of machine data and other unstructured data on the fly. My view is that Drill and Impala are addressing a similar need. But it is a mistake to think of this interactivity as closing the entire gap because it’s really the missing productivity that must be addressed.

To accelerate the pace of creating Hadoop applications, comfortable and powerful abstractions must be delivered to power users, analysts, and developers that allow them to express the work of the application at as high a level as possible. For that abstraction to work, it must mean that you can create applications without being a Hadoop expert. For power users and analysts, domain-specific languages like Cascalog allow an application to be built by expressing a series of constraints on the data. Concurrent’s Lingual project allows applications to be expressed as SQL and then translated into Hadoop jobs. For developers, Concurrent’s Cascading library allows the application to be expressed in terms of APIs that hide the details of Hadoop from power users. In addition to Hunk, Pentaho, Alpine Data, Teradata Aster, and Pivotal all offer higher-level ways to design and build applications. Effective higher-level abstractions are crucial to productivity.

In other words, the way to address this growing pain is to hide the complexity of Hadoop so that analysts can get work done without having to become Hadoop experts.

Problems in Processing Big Data

Big data is almost always turned from a raw metal to a valuable alloy through a process that involves many steps. To continue the metaphor, raw metal must be refined before it is forged. This is costly and it is where Hadoop excels. Analysts can ask and answer certain questions using an interactive system, but the data must be cleansed beforehand, resulting in a complex upstream workflow. Typically these workflows are built in a brittle fashion and are difficult to test and debug. Most frequently, the workflow itself is applied to a new dataset where the results are a machine-learning model or a set of analytics. Any benefit to loading and subsequent indexing in interactive tools is lost in these cases.

The fourth growing pain is the ability to manage the execution of a cascading chain of workloads or applications. This can be tricky for many reasons. If you cobble together lots of applications in a complex chain using a wildly varying set of tools within arm’s reach, something data scientists love to do, you end up with a complicated workflow or, to use a better phrase, a business data process. If something goes wrong in one of the intermediate applications, it is often impossible to figure out how to fix the process. You have to start the whole thing over. Tools such as Pig, Sqoop, and Oozie offer alternative ways to express the problem, but ultimately do not fix the underlying issue.

The right way to manage the cascading chain of data-oriented applications is, again, to hide the problem through abstraction. The higher-level abstractions mentioned in growing pain 3 should be generating the complex set of Hadoop jobs and managing their execution. Then through indirection it’s okay to combine a number of Hadoop applications or workloads in a chain, but when your business process depends on dozens of steps, all knit together by hand, something is bound to go wrong.
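
One way to picture what a managed chain buys you is the small Python sketch below. It is a toy under stated assumptions, not a description of how Cascading, Oozie, or any other product works internally: run_chain, the in-memory checkpoints dictionary, and the step names are all hypothetical. The point is only that when a layer above the individual steps owns their execution, a failure in an intermediate step does not have to mean starting the whole thing over.

```python
def run_chain(steps, data, checkpoints=None):
    """Run named steps in order, reusing any result that is already checkpointed.

    steps: list of (name, function) pairs; each function takes the previous output.
    checkpoints: dict of step name -> saved output (hypothetical in-memory store).
    """
    checkpoints = {} if checkpoints is None else checkpoints
    for name, step in steps:
        if name in checkpoints:
            # This step finished in an earlier run, so reuse its saved output.
            data = checkpoints[name]
            continue
        data = step(data)  # may raise; checkpoints from earlier steps survive
        checkpoints[name] = data
    return data


if __name__ == "__main__":
    attempts = {"count": 0}

    def flaky_aggregate(rows):
        """Fail on the first call to simulate a transient intermediate failure."""
        attempts["count"] += 1
        if attempts["count"] == 1:
            raise RuntimeError("transient failure")
        return sum(rows)

    steps = [
        ("cleanse", lambda rows: [r for r in rows if r is not None]),
        ("aggregate", flaky_aggregate),
    ]
    saved = {}
    try:
        run_chain(steps, [1, None, 2], saved)
    except RuntimeError:
        pass  # the cleanse result is still held in `saved`
    print(run_chain(steps, [1, None, 2], saved))
```

On the second call the cleanse step is skipped and only the failed aggregate step is retried. A chain knit together by hand from separate tools has no such shared record of what already succeeded.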

One Size Does Not Fit All

Today Hadoop is virtually synonymous with big data. In reality, Hadoop is not all things to all people nor is it all things to big data. Business requirements for more timely insight have introduced lower latency requirements as well as newer, more immature technologies to solve these problems and glean insight from sources such as real-time streaming data. As they face these requirements, users are cast back into the darkness, trying to once again make sense of when they should use these new tools as well as how they should glue them together in a comprehensive way to form a coherent big data architecture.

From Data for Business to the Business of Data

Looking down the road a bit, there is a progression in maturity for using data. Many organizations derive business insights from their data. Once you see that your data is actually transformative to your business, you have hit a new level of maturity. More and more organizations are realizing that they are actually in the business of data. At this point, the growing pain becomes acute, particularly if you see in horror that the tools you have chosen to assemble your data workflows are nothing but a house of cards.

The most sophisticated companies using big data, and you know who I mean, have been aggressive about solving all these growing pains. These companies, like yours, don’t want talented people struggling with Hadoop. Instead, they want them focused on the alchemy of data: seeing what the data has to say, putting those insights to work, and inserting them into their business processes.

The masters of big data have created a coherent architecture that allows the integration of new tools. They have created robust big data workflows and applications that support the transition from deriving insights from big data to being in the business of data. That’s the end game, the real value from big data that makes all of the growing pains worthwhile.