Category Archives: News

Cascading Now Supports Tez; Spark and Storm Up Next

Alex Woodie, Datanami
May 13, 2014
http://www.datanami.com/2014/05/13/cascading-now-supports-tez-spark-storm-next

Concurrent, the company behind the open source Cascading framework, today unveiled a major update that will allow its customers to migrate their Hadoop applications from MapReduce to the new Apache Tez engine without rewriting any business logic. Spark and Storm are next up on Cascading’s radar, Concurrent CTO Chris Wensel tells Datanami.

Analysts have billed 2014 as the year that Hadoop grows up and takes on the enterprise. Anecdotal evidence suggests that big companies, indeed, are moving past the tire-kicking phase and investing in production systems.

However, while Hadoop may have traded in its open source apparel for a suit and tie, that doesn’t mean that all technology questions have been settled. Everybody in the Hadoop world seems to agree that the batch-oriented nature of MapReduce is on its way out. But what’s going to replace it? Apache Tez? Apache Spark? Apache Storm? Apache next? Nobody knows.

“You gotta pick your poison,” Wensel says. “A lot of those technologies overlap with each other. But there are also tradeoffs. And this game is all about tradeoffs.”

With Cascading, Concurrent is ideally situated to help customers minimize the risk of making the wrong tradeoff. The Cascading product does this by presenting a layer of abstraction between the application developer and the complex Hadoop APIs. The product, which is free and downloaded 150,000 times per month, allows developers to write their business logic once using the simpler Cascading APIs (available for Java, Python, Scala, and other languages), and to deploy the application on whichever Hadoop data fabric meets their needs.

For the last six months, Wensel has been working on the heart of Cascading, the customizable query planner, to enable it to support Apache Tez. It was a very big job, and Wensel did much of this work in collaboration with Hortonworks, which is particularly bullish on the prospects of Tez as a replacement for MapReduce. The result of that collaboration is now available in Cascading 3.0.

According to Wensel, it’s all about giving customers the flexibility to pick the Hadoop fabric that best fits their needs. “We’re seeing Tez and other technologies that are slightly more complex [than MapReduce], but give you more degrees of freedom to do more interesting things at the computation level,” he says.

Tez represents a “massive improvement” in the Hadoop model, and Wensel is excited to see how users will respond to support for Tez, which should provide an immediate performance boost upwards of 50 percent compared to MapReduce.

What’s more, Cascading will also allow users to dial up the performance even higher if they want, but perhaps take on more risk of the code failing and breaking. “We can give you a conservative rule engine, for Tez, but as Tez matures, we can give you a more aggressive rule engine,” Wensel says. “If you want to turn it up to 11, go for it, but it might blow out your speakers.”

Next up on Wensel’s plate are Apache Spark and Storm. The company today also announced a partnership with Databricks, the company behind Apache Spark. Wensel and company will set out this summer to enable Cascading apps to utilize Spark within Hadoop. Some of the work he did on supporting Tez will carry over to Spark, or at least make it somewhat easier to support, he says.

While Spark seems to have a lot of momentum at the moment, Wensel still sees a bit of risk with it, and doesn’t seem entirely sold. “People want to try Spark. We get it,” he says. “People want us to port Cascading to Spark so they can see if it’s better. I don’t know anybody in production with Spark, but I don’t know anybody in production on Tez either.” The timeframe? “We’re definitely going to get to that as quickly as we can,” he says. “We hope to get to it this summer.”

The way Wensel sees it, nobody can predict which technology is going to win in the end. It could be Tez, or it could be Apache Spark. “What you don’t want is the risk of learning a new API, or a language on a new API, just to get the tradeoff, [only] to realize the tradeoff was a bad one,” he says. “What did you do? Spend six months figuring out that was a huge mistake.”

It’s all about weighing the tradeoffs, and allowing people to experiment with the various Hadoop fabrics to find out what works best for them and avoid those million-dollar mistakes. By allowing people to experiment with Tez, Spark, and MapReduce, Cascading will let developers make apples to apples comparisons among the various “Baby Bear, Mama Bear, and Papa Bear technologies,” as the colorful Wensel puts it.

“If Spark doesn’t scale, then they’ll go to Tez. But Tez might be slower,” Wensel says. “If they have a smaller application that doesn’t [need to] scale, maybe they could leave that on Spark. But they can make these decisions without having to rewrite their applications. If they’re okay with 2 percent of their jobs failing, then maybe they’ll pick a different technology that’s faster, but maybe it will fail more frequently. If they never can have it ever fail and they just need predictability, they may stick with [Papa Bear] MapReduce because it’s extremely stable and mature. People want to be able to make these choices. They don’t want just one technology.” Amen to that.

Big Data: How to “Write Once and Deploy” across Big Data Fabrics

Dick Weisinger, Formtek
May 13, 2014
http://formtek.com/blog/big-data-how-to-write-once-and-deploy-across-big-data-fabrics

Cascading is a Java-based framework for big-data/data-centric application development created and supported by Concurrent. The framework abstracts and hides complex implementation details involved in writing big data applications. Cascading is used by companies like eBay, LinkedIn, The Climate Corporation and Twitter. The framework has seen broad adoption across many different industry segments, including Finance, Telecom, Marketing, Entertainment, and Enterprise IT. We covered the 2.5 release of the framework last November.

Concurrent announced the release of the Cascading 3.0 framework today. This newest release expands support for the underlying data engines that can be plugged into the framework. Out of the box, Cascading 3.0 will support the local in-memory fabric, which has been in Cascading since the beginning. Apache MapReduce, which has been the default engine and the foundation on which Hadoop has been built, will be available too. Beyond that, the 3.0 release now also supports Apache Tez, and in the very near future support will be added for Apache Spark, followed by support for Apache Storm.

Gary Nakamura, Concurrent CEO, said that “what we’ve done in Cascading 3.0 is that we’ve allowed for data applications to execute on different fabrics. So essentially we’ve made Hadoop and MapReduce, and Spark and Tez an implementation detail of the framework. The enterprise can now develop these applications unencumbered without having to think about latency or scale, and then simply pick the modality on which they want to deploy their application.”

Cascading has become a top choice for building data-centric applications. Since mid-2012, the Cascading framework has seen 10 percent month-over-month growth. Software downloads have gone from roughly 20,000 per month to more than 150,000 per month, and there are now more than 7,000 deployments of the Cascading framework.

By using Cascading, some of the risk of data-centric application development can be taken out of the equation. Cascading provides a stable API and framework on which to build data applications and, once it’s in place, it’s possible to swap in or out whichever underlying data engine or fabric you want to deploy on.

Nakamura commented that “for anyone developing on Cascading today, it will become very easy [for] them to migrate their data-centric applications to new computation fabrics when the enterprise is ready to upgrade their Hadoop distribution. They could standardize on one API to solve a variety of business problems. ISVs can now leverage Cascading as the interface between their value-added solution and Hadoop or Spark or Storm without having to write directly to each of the different fabrics for the different modalities that they want to offer to their end customers. This translates to other data apps that are built on top of Cascading, and they will benefit from this portability.”

What types of organizations are benefiting from using Big Data? Nakamura thinks that businesses which can see the value in Big Data and then deeply architect the use of it in their enterprise applications will benefit the most. “End users that are just stopping at ad-hoc have a hard time having conversations with their bosses about budgets going forward. The ones that are building and operationalizing data inside of Hadoop and are standing up enterprise applications and consistently delivering data products to their end users” are seeing the rewards that can be derived with Big Data. “It’s not necessarily a conversation about how much money we are making off of this data product or how much money are we saving because of it. It’s more transformative.”

Cascading 3.0 Adds Multiple Framework Support, Concurrent Driven Manages Big Data Apps

Boris Lublinsky, InfoQ
May 13, 2014
http://www.infoq.com/news/2014/05/driven

When it comes to implementing Big Data applications, companies today can choose from multiple frameworks ranging from Apache MapReduce to Apache Tez to Apache Spark to Apache Storm. Each one of these frameworks has its own advantages and drawbacks, and can be most appropriate for certain applications. Although it is possible to run all frameworks on a single Apache YARN cluster, each one of them has a slightly different programming model and a different set of APIs. This means that porting a given application from one framework to another might prove to be non-trivial.

Cascading 3.0, a new release of Concurrent’s flagship product, solves many of these issues. Cascading is one of the most popular Java Domain Specific Languages (DSLs), initially introduced in late 2007 as a DSL to define and implement functional programming for large scale data workflows on top of the low-level MapReduce APIs. Cascading is based on a “plumbing” metaphor for assembling data pipelines: its high-level constructs allow users to split, merge, and join streams of data, and to perform operations on the streams.

Such an approach allows users to represent their Cascading applications as a directed acyclic graph (DAG), which is mapped by Cascading’s planner to the underlying framework, originally MapReduce.
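
To make the metaphor concrete, here is a minimal sketch in the spirit of Cascading’s canonical word-count example, written against the Java API of the Hadoop platform jar; the taps, field names and regex are illustrative rather than drawn from any particular application.

    import cascading.flow.FlowDef;
    import cascading.flow.hadoop.HadoopFlowConnector;
    import cascading.operation.aggregator.Count;
    import cascading.operation.regex.RegexSplitGenerator;
    import cascading.pipe.Each;
    import cascading.pipe.Every;
    import cascading.pipe.GroupBy;
    import cascading.pipe.Pipe;
    import cascading.scheme.hadoop.TextDelimited;
    import cascading.scheme.hadoop.TextLine;
    import cascading.tap.Tap;
    import cascading.tap.hadoop.Hfs;
    import cascading.tuple.Fields;

    public class WordCountFlow {
      public static void main(String[] args) {
        // taps bind the pipe assembly to concrete storage (here, HDFS paths)
        Tap docTap = new Hfs(new TextLine(new Fields("line")), args[0]);
        Tap wcTap = new Hfs(new TextDelimited(new Fields("word", "count"), "\t"), args[1]);

        // the "plumbing": split each line into words, group by word, count
        Pipe docPipe = new Each("wordcount", new Fields("line"),
            new RegexSplitGenerator(new Fields("word"), "\\s+"), Fields.RESULTS);
        Pipe wcPipe = new GroupBy(docPipe, new Fields("word"));
        wcPipe = new Every(wcPipe, Fields.ALL, new Count(new Fields("count")), Fields.ALL);

        // the planner turns this DAG into jobs on the chosen fabric (MapReduce here)
        FlowDef flowDef = FlowDef.flowDef().setName("wc")
            .addSource(docPipe, docTap)
            .addTailSink(wcPipe, wcTap);
        new HadoopFlowConnector().connect(flowDef).complete();
      }
    }

Nothing in the assembly itself mentions MapReduce; only the connector on the last line ties the flow to a fabric, which is what makes the planner swap described below possible.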

Cascading 3.0 goes beyond MapReduce by allowing enterprise developers to build their data applications once, then run those applications on the framework that best meets their business needs. Cascading 3.0 will initially ship with support for local in-memory, Apache MapReduce (support for both Hadoop 1 and 2 is provided), and Apache Tez. Soon thereafter, with community support, Apache Spark™, Apache Storm and others will be supported through its new pluggable and customizable planner.

A new planner introduced in Cascading 3.0 allows users to create rules that assert the correctness of the graph and annotate the graph nodes with metadata used at runtime, based on local topology. The planner also allows for transformation of the graph in order to balance, insert, remove, or reorder nodes. It also partitions the graph, recursively finding smaller sub-graphs that map to compute units (like a Map or a Reduce node, or a Tez process).

Once the appropriate compute units are defined, Cascading builds the execution configuration plan, which leverages a framework-specific jar (and Maven POM) that insulates applications from the framework’s APIs. Both the jar and POM are provided by Cascading.
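
From the application’s side, swapping fabrics then comes down to choosing a different connector from a different platform jar, while the pipe assembly stays untouched. A hedged sketch (the connector class names follow Cascading’s naming conventions and should be verified against the actual 3.0 artifacts):

    import java.util.Properties;

    import cascading.flow.FlowConnector;
    import cascading.flow.FlowDef;
    import cascading.flow.hadoop2.Hadoop2MR1FlowConnector; // cascading-hadoop2-mr1 jar
    import cascading.flow.local.LocalFlowConnector;        // cascading-local jar
    import cascading.flow.tez.Hadoop2TezFlowConnector;     // cascading-hadoop2-tez jar

    public class FabricSelector {
      // pick a planner/runtime per fabric; the FlowDef itself is fabric-agnostic
      public static FlowConnector forFabric(String fabric, Properties props) {
        switch (fabric) {
          case "local": return new LocalFlowConnector(props);
          case "tez":   return new Hadoop2TezFlowConnector(props);
          default:      return new Hadoop2MR1FlowConnector(props);
        }
      }

      public static void run(FlowDef flowDef, String fabric) {
        forFabric(fabric, new Properties()).connect(flowDef).complete();
      }
    }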

The open pluggable architecture implemented by Cascading 3.0 makes it easy to extend the product to support additional frameworks. This can be done by implementing a new set of rules for the given framework, along with a framework-specific jar and POM.

In addition to open source Cascading 3.0, Concurrent also recently announced its commercial product Driven, which provides real-time monitoring, operational control and performance management for Cascading applications. Driven provides a set of screens to support the following features:

  • Understand – Seeing a data app executing in real time and visually drilling down into each unit of work.
  • Diagnose – Quickly identifying failed applications (including failure reasons) and poorly performing ones.
  • Optimize – Visually breaking down vital application metrics to spot performance issues and anomalies.
  • Track – Viewing and comparing the history of an application’s run-time performance.

The new products released by Concurrent will ease application migration to new computation frameworks like Apache Tez and other best-of-breed technologies. This allows enterprises to standardize on a single API to meet business challenges, solve a variety of business problems, and introduce new, more suitable Big Data frameworks without massive application rewrites. Driven provides more operational visibility into new and existing Big Data applications, from development to production.

Concurrent Announces Upgrade to Cascading Big Data Development Framework

David Ramel, Application Development Trends
May 13, 2014
http://adtmag.com/Articles/2014/05/13/cascading-upgrade.aspx

Concurrent Inc. today announced an upgrade of its Cascading development framework for building Big Data applications, offering more choices for working with the Apache Hadoop ecosystem.

Cascading 3.0, due early this summer, features a new pluggable query planner that can be customized for working with different technologies, basically offering alternatives to the problematic MapReduce programming model, itself sometimes referred to as an execution fabric.

MapReduce was originally an integral part of the Hadoop system of Big Data programming, handling the compute function for working with data stored in the Hadoop Distributed File System (HDFS). Notoriously complex and inflexible, MapReduce has limitations that reportedly helped prompt Chris Wensel to found Concurrent to provide more options and simplify Big Data programming with Hadoop.

One of those options in Cascading 3.0 is the ability to work with the emerging Apache Tez project, an alternative application framework that takes advantage of improvements to the Hadoop ecosystem such as Apache Hadoop YARN that came with Hadoop’s upgrade to version 2. YARN stands for Yet Another Resource Negotiator and is described as a tool to separate resource management from processing components.

“For existing users, they will be able to migrate their existing applications from Hadoop 1 or 2 MapReduce to Apache Tez trivially,” Wensel told this site. “Or any other new fabric that the community adds support for.

“And, as Tez matures and gains features, users can create or use new rules in our query planner to experiment with different features or optimizations in Tez,” said Wensel, the company’s CTO. “The same will be true with other fabrics.”

Cascading 3.0 will ship with support for both MapReduce and Tez and will feature local in-memory computing. Concurrent said that soon after release, community support for the open source software is expected to enable the new pluggable and customizable query planner to work with alternative technologies such as Apache Spark and Apache Storm.

Concurrent said it designed Cascading 3.0 to help developers build data applications once and then run those applications on the most appropriate fabric, providing flexibility to solve business problems of varying complexity, regardless of latency or scale.

“In the same way people have adopted Spring and J2EE for container and service-oriented application development, Cascading is used by enterprises to develop data-oriented applications,” Wensel told this site.

Concurrent, which just a few weeks ago announced a partnership with major Big Data player Hortonworks, today also announced a partnership with Databricks that will let Cascading work with the Apache Spark processing engine.

Databricks, founded by the creators of Apache Spark — recently upgraded to a top-level Apache project — builds software for analyzing data and extracting value. Spark is an open source processing engine — or data analytics cluster computing framework — that speeds up Big Data analytics through the use of in-memory computing and other means.

“One of our primary goals is to drive broad adoption of Spark and ensure a great experience for users,” said Databricks CEO Ion Stoica in a statement. “By partnering with Concurrent, all the developers who already use Cascading will be able to deploy their applications on Spark, while Spark users benefit from direct access to all of the benefits of Cascading and Driven. We are committed to open source and partnering with proven market leaders like Concurrent to drive new growth and innovation in the Big Data community.”

Driven is Concurrent’s flagship commercial offering designed to enhance the development and management of data applications for enterprises.

Cascading 3.0 Future-Proofs Data-Centric Application Development on Hadoop

Joyce Wells, Database Trends and Applications
May 13, 2014
http://www.dbta.com/Editorial/News-Flashes/Cascading-30-Future-Proofs-Data-Centric-Application-Development-on-Hadoop-96955.aspx

Concurrent, Inc., the company behind Cascading, an open source application development framework for building data applications on Hadoop, has announced Cascading 3.0, which CEO Gary Nakamura says will give enterprises the flexibility to build their data-oriented applications on Hadoop once, and then run the applications on the platform that best meets their business needs.

Cascading is focused on reliable and reusable tools for building data products, but it also gives users with varying skill sets the freedom to solve problems, said Nakamura.

“Cascading 3.0 will allow applications to execute on whatever fabric that we support, and that end users want to run on, through our new query planner – that means that an application that was written 2 years ago and that solves a particular business problem for an end user can very quickly be migrated over to a newer fabric like Apache Tez or Apache Spark,” Nakamura said. “Enterprise users can write an application once and deploy on whatever fabric they would like depending on what the business problem is.”

The added migration flexibility is critical for the Hadoop community, says Nakamura. For existing customers, it means ease of migration to new computation platforms with very little effort. Longer term, it is important for mainstream adoption: the rapid innovation happening inside of Hadoop is causing some enterprises to sit on the sidelines, concerned that it is too complex and that, if they build an application now and the platform changes, they will have to do a complete rebuild.

“What we are providing is a standard way to develop data-centric applications without the risk of having to rewrite those applications when distributions or the providers of the computation engines underneath it change direction one day.”

Cascading 3.0 will ship with support for local in-memory, Apache MapReduce, and Apache Tez. Shortly after, support will be added for Apache Spark, Apache Storm and others through the new pluggable and customizable query planner.

Third-party products, data applications, frameworks and dynamic programming languages built on Cascading will benefit from this portability, according to the company. Cascading offers compatibility with all major Hadoop vendors and service providers, including Altiscale, Amazon EMR, Cloudera, Hortonworks, Intel, MapR and Qubole, among others.

Cascading is used by enterprise Java developers first and foremost, but Concurrent offers interfaces that allow users working with R, MicroStrategy, or SAS to take their predictive models and deploy them on Hadoop. “We also have a SQL interface so SQL end users and anyone who knows how to program with SQL can leverage Cascading,” said Nakamura.

Concurrent also recently announced strategic industry partnerships with Hortonworks and Databricks, and new product innovation with the introduction of Driven, its flagship product that provides application performance management for data-centric applications from development through production.

Cascading version 3.0 will be available in early summer and freely licensable under the Apache 2.0 License Agreement. Concurrent also offers standard and premium support subscriptions for enterprise use.

With Cascading 3.0, application developers can operationalize the Hadoop ecosystem

Maria Deutscher, Silicon Angle
May 13, 2014
http://siliconangle.com/blog/2014/05/13/with-cascading-3-0-application-developers-can-operationalize-the-hadoop-ecosystem

Concurrent, an up-and-coming startup working to simplify the creation of data-driven applications, has pulled the curtains back on a revamped version of its flagship development framework that facilitates integration across the full spectrum of technologies in the Hadoop ecosystem to enable an entirely new set of use cases.

Cascading, as the San Francisco-based company’s software is called, is available under an Apache license and serves as an abstraction layer between Hadoop and the applications using it, shielding developers from the inherent complexity of MapReduce. The third release, introduced this morning, extends that simplicity to the dozens of complementary open source components available for the batch processing platform in order to make the capabilities of those tools more accessible to enterprise applications.

The new functionality represents a major step toward Concurrent’s vision of democratizing analytics, which is founded on the classical notion that business logic must be decoupled from the code that handles information.

“Building applications on top of Hadoop was very difficult. That’s why our founder Chris Wensel created a framework so you could have a separate business logic layer from the data layer, and it’s written in Java so any Java programmer can pick it up,” Gary Nakamura, the CEO of Concurrent, explained to SiliconANGLE in an exclusive interview on theCUBE during O’Reilly Fluent Conference 2013.

“The requirement for the enterprise is not to learn new skills for Hadoop but to leverage existing skills, existing systems and existing investments they already made in their infrastructure,” Nakamura added. Cascading now delivers that abstraction for the various specialized tools in the Hadoop ecosystem as well through both direct and indirect support.

Out of the box, the framework is compatible with Tez, a distributed execution engine that offers superior performance to MapReduce with lower latency, a combination that is especially useful for fast-moving streaming workloads such as sensory data. Other technologies can be plugged into Cascading using a new built-in query planner that Concurrent said will be used to add support for two additional Apache projects in the near future.

One of the items on the list is Spark, a separate implementation of the concepts detailed in the 2007 Microsoft research paper (on Dryad) that Tez is based on, and one that is generally considered more mature and better suited for production as a result. The firm said that Cascading will eventually also work with Storm, a third real-time processing framework that was open sourced after Twitter acquired original developer BackType.

Of particular note is that, through the integrations and a new local caching function, the latest version of the framework allows for in-memory processing. That is significant because, as Wikibon co-founder and chief analyst Dave Vellante explained in a recent segment on theCUBE, eliminating the overhead associated with retrieving information from disk can improve application performance by several orders of magnitude, removing the I/O limitations that have historically prevented developers from taking full advantage of their data.

Cascading Allows Apps to Execute on All Big Data Fabrics

Thor Olavsrud, CIO Magazine
May 13, 2014
http://www.cio.com/article/752747/Cascading_Allows_Apps_to_Execute_on_All_Big_Data_Fabrics

Concurrent says Cascading 3.0 will support local in-memory, Apache MapReduce and Apache Tez out of the gate with support for Apache Spark and Apache Storm soon to follow.

CIO — Organizations are increasingly focusing on building enterprise data applications on top of their Hadoop and NoSQL infrastructure. But even as that’s happening, Hadoop itself is becoming much more diverse and complex. That’s a potential headache for developers seeking to build applications on top of that data infrastructure, but data application platform specialist Concurrent, primary sponsor of the open source Cascading application framework, sees it as an opportunity.

While Apache Hadoop began as a combination of Hadoop Distributed File System (HDFS) for file storage and MapReduce for compute, there are now a growing number of options for compute in Hadoop, including Apache Tez (a framework for near real-time big data processing), and the soon-to-be-released Apache Spark (a framework for in-memory cluster computing) and Apache Storm (a distributed computation framework for stream processing). Hadoop distribution vendor MapR even offers an alternative to HDFS in its distribution.

“Thinking in MapReduce is one thing, but then having to think in Tez is something else,” says Chris Wensel, founder and CTO of Concurrent and original author of Cascading. “It’s a huge challenge.”

“Hadoop is balkanizing and fracturing,” he adds. “There is no more Hadoop. There’s HDFS and whatever runs on top of it.”

Cascading Is a Software Abstraction Layer for Hadoop

Cascading is a software abstraction layer for Apache Hadoop that is intended to allow developers to write their data applications once and then deploy those applications on any big data infrastructure, regardless of the components in use. That’s what has allowed Concurrent to win big Web 2.0 customers like eBay, LinkedIn, Twitter and Pinterest (as well as a slew of others) and what now contributes to more than 150,000 user downloads a month. Customers use it to build applications ranging from enterprise IT uses like ETL and operational analysis, to corporate apps like HR analytics, telecom apps like location-based services, marketing apps like funnel analysis and ad optimization, consumer/entertainment apps like music recommendations, finance apps like fraud and anomaly detection, and health/biotech apps like veterinary diagnostics and next-generation genomics.

Wensel says he originally wrote Cascading in anger — after using MapReduce once, he was determined that no one would have to use it directly again. Now, with Cascading 3.0, announced today, the framework will go even farther — it’s not just about MapReduce anymore.

Cascading 3.0 Will Support Emerging Big Data Fabrics

Cascading 3.0 will allow data apps to execute on existing and emerging fabrics through its new customizable query planner, says Wensel. When released it will support local in-memory, Apache MapReduce and Apache Tez out of the gate, with support for Apache Spark and Apache Storm soon to follow. The idea is to allow enterprises to standardize on one API that will allow them to build data applications to solve a variety of business problems ranging from simple to complex, regardless of latency or scale. In addition, Wensel says third-party products, data applications, frameworks and dynamic programming languages built on Cascading (like Scalding or Cascalog) will immediately benefit from the portability.

Concurrent has also forged close strategic partnerships with Hortonworks (one of the primary sponsors of Apache Hadoop) and Databricks (the primary sponsor of Apache Spark). Hortonworks will now integrate the Cascading SDK with its Hortonworks Data Platform (HDP) distribution of Hadoop, and will certify and support the SDK with HDP. Cascading will also support Apache Spark in a future release, and Concurrent notes that companies using Cascading will be able to seamlessly run their applications on Spark.

Concurrent says Cascading 3.0 will be available early this summer and freely licensable under the Apache 2.0 License Agreement.

Creating a Big Data Factory

Gary Nakamura, CEO, Concurrent, Inc.
May 5, 2014
http://insights.wired.com/profiles/blogs/creating-a-big-data-factory

It is time to retire the myth of the data science hero – the virtuoso who slays dragons and emerges with a treasure of an amazing app based on insights from big data. If we examine leading companies, we find not only lots of smart people, but also entire processes and teams that are focused on doing great work over and over again. In successful organizations, big data applications are not the virtuoso effort of a lone data scientist. Rather, these applications are built by teams of analysts, data scientists, developers and operations staff working together to rapidly build applications that yield high business value, so organizations can systematically operationalize their data. The reason to move toward repeatable victories and away from the idea of virtuosity, as this article will explain, is that virtuosity is expensive, risky and doesn’t scale.

The Big Data Factory: Less Complexity, Reproducible Victory

In the early days at almost every one of the big data pioneers, application development ran more like a virtuoso process than a factory of teams. When most companies first start experimenting with big data, this pattern usually holds. But when they want to scale fast with reproducible results, well, they quickly find they need to run more like a factory.

Here’s what happens. Excitement about big data leads to experiments and sometimes even to transformative insights. Data scientists partner with developers or just hack on their own to create an innovative application—but frankly, a brittle one, with no process to recreate or maintain it. However sweet that victory was, companies quickly learn that it probably isn’t repeatable when pursuing 10 or 15 or 20 other apps at the same time. You want victory after victory, not one brittle application after another.

In turn, companies moved away from this virtuoso process to a more methodical “Big Data Factory.” These factories exist already. For example, Twitter is not starting from scratch every time it recognizes a new opportunity to monetize Tweets; it’s building on past success. And LinkedIn applications, such as “People You May Know” and “Groups You May Like,” started out as virtuoso products but then, due to their success, became repeatable platforms to support other applications.

What’s Wrong with Virtuosos?

Businesses can’t afford the virtuoso approach to application development, relying on a single data scientist or developer for their victories. Many companies have learned lessons the hard way, finding themselves with a steep learning curve trying to maintain an application created by a virtuoso who flew the coop. Besides that, for the most important apps, no single data scientist (or developer) knows enough to create the whole thing on his or her own.

Businesses can’t afford complexity in application design, as complexity creates risk. You can’t afford to lose the one person who understands how a project all fits together; otherwise you’ll find yourself unable to maintain or iterate the application – and you must, because data is organic and changes with user behavior. Today major companies like Twitter, LinkedIn and others are entirely dependent upon adapting applications to new data and to new patterns emerging in the data.

But with big data apps, whether created by a single person or a team, complexity is the norm as developers are still using the equivalent of Hadoop assembly language (raw MapReduce) to build applications in place of more efficient tools or techniques (for example, languages such as Scala with development frameworks like Cascading). Big data companies like LinkedIn and Twitter were among the first to figure this out, as they understood that while Apache Hadoop projects were crucial for creating an infrastructure, they are not optimal for creating and deploying numerous applications. The end goal, therefore, is to build enterprise applications powered by Hadoop without having to become an expert in its intricacies.

The difference between using an inferior tool that sort of solves the problem and a tool that solves it completely should be obvious: better tools overcome complexity. Compare an application written in Cascading versus an incumbent approach. To stand up the same application, you’ll hand off one file to operations, versus 17 or 18 files with 20 different scripts across various incongruous projects.

In order to remain sustainable, businesses need repeatable, transparent development processes that can generate maintainable products—like a factory.

What Does a Big Data Factory Look Like?

Let’s compare a Big Data Factory to an automotive manufacturer. They’re alike in that an entire team designs and produces the product. The data scientist is like an automotive design engineer; developers are like the mechanical and electrical engineers who build a car prototype; operations creates and runs the factory that makes the cars; and early users who provide feedback are like test drivers. From this team comes a high-quality product—be it a new-model Chevrolet or a business application. Some applications will be more successful than others, but all of them are drivable and maintainable—and, importantly, were created using a repeatable process.

For auto manufacturing, computer-aided design (CAD) was a tremendous advance over the drafting table, and I believe application framework tools are a tremendous advance over Hadoop assembly language. Today, teams don’t need to know an assembly language like MapReduce; instead, they can focus on marrying the business problems to data. Similar to an automotive assembly line, teams can develop and iterate an application very quickly, and once they feel it’s production ready, they can launch the application.

I mentioned quick iteration, and the key is collaboration, which a user-friendly application framework enables. No one person, not even the most brilliant data scientist, can decipher exactly what is going on with ever-changing organic data and then translate that into a full-blown solution. The team as a whole needs to decipher the results of its last test run and tweak the data application as needed.

Starting Your Big Data Factory

A company that’s just entered, by desire or market pressure, into the big data business doesn’t have to go through the trenches that Twitter, eBay and LinkedIn have already dug. Most companies can’t afford it nor do they have the in-house skills or resources to navigate and survive such complexity. And why should they? We’ve got a host of big data giants today showing us how to build big data factories that turn out perfect product in repeatable processes. And just like modern auto manufacturing, it all comes down to teamwork and using the right tools.

So how does a company go about creating its own big data factory? First, start by doing your research to identify the right big data tools. As I recently told Software Magazine, I recommend selecting tools that are widely deployed, hardened and easily incorporated into standard development processes.

Next, think teamwork. Once you know what tools you want to use, assess the skills gap you face. You may have thought you needed someone with MapReduce skills, but after doing due diligence about available options, you will find that you can leverage existing Java skills, data analysts and data warehouse ETL experts as well as data scientists. Make sure your team includes people with deep business knowledge along with an understanding of data and its business implications.

With the right tools and with a realistic assessment of the skills you have versus those you need, you will be ready to create your own big data factory. The benefit is being able to achieve the repeatable victories that deliver real business value, month over month and year over year.

I’ll take that over virtuosity any day.

Gary Nakamura is the CEO of Concurrent.

Invest More in Metadata to Make More of Your Data

Gary Nakamura, CEO, Concurrent, Inc.
May 1, 2014
http://siliconangle.com/blog/2014/05/01/invest-more-in-metadata-to-make-more-of-your-data/

Companies spend millions of dollars to get an edge from the data they own. However, all too often their efforts are out of balance. Data is hoarded without a clear purpose, and not nearly enough time is invested in capturing and analyzing the data about the data: the metadata. The fact of the matter is that metadata is a valuable tool that can answer as many questions as the data itself.

Metadata is essentially the lever that amplifies the value of data. It provides context around factors including the amount of data processed, the amount of data read and written, the data’s source, destination, and algorithms used to analyze it, the number of data versions in existence, and those versions that are used most often.

As enterprises move from ad-hoc development to operationalizing data to building teams to create and maintain a continual flow of big data applications, recognizing the potential of metadata is crucial. Metadata can provide valuable business insights to constituents of a team, an organization, and a CEO so that each player can do their job better. This is particularly true as metadata is often derived from the context or manner in which the data is used, shedding further light on who or what used the data and in what way the data provided value.

With a rich set of metadata, you can zoom into the details of your data or zoom out to see the bigger picture—all to gain insight into how your business is running. Whether your role is in compliance, operations, or application development, metadata is critical to leveraging your data. Anyone looking to find the business value in data can refer to real-world examples of metadata’s value, such as those described below, and consider how to collect and exploit it in their own organizations.

Metadata: More Valuable than the Data Itself

In almost all conspicuous data victories (both popular and mundane), data has been used in conjunction with metadata.

Facebook is a great example of a data company that is deriving billions in revenue from use of metadata. While the company receives terabytes of data per minute, it certainly isn’t reading posts to find out if you like Coca-Cola or if you’re in the market to switch car insurance. Rather, Facebook leverages its data on a deeper level, looking at what you like (or stop liking), the brands you engage with, the quizzes you take, the social games you play and the apps you use. In turn, Facebook can create a profile based on user behavior—a metadata profile that it monetizes through eerily optimized ads.

Data is organic; it ebbs, flows and oozes through an organization. Capturing its navigation points, the details of its every stop, as well as details about the people and systems that manipulated it, will tell you as much or more about your business as the data itself. People navigate themselves to what’s most useful (or in the case of Facebook, what they consider most valuable or interesting). Metadata can capture that ebb and flow, and by analyzing it, you can gain insight into how your data is used, which is often more interesting than the data itself.

In turn, the imperative is to collect that metadata gold and use it to supplement other sources of user research, such as focus groups or polls. Metadata brings into focus how your data is being leveraged, where it’s being leveraged, whether or not your resources are being used efficiently, and ultimately what’s important.

Big data almost always represents some micro level of action—the phone call, the Facebook Like, the download, the click—but that micro-level data alone offers only an incomplete story. There’s nothing compelling about a record of four or five credit card transactions in isolation, but there’s something enormously telling about metadata when it shows that these transactions took place in five different states within the same two-hour period. Metadata moves data from a micro to a summary level, which can then become the raw materials for building a model to extract meaning.

Simply put, metadata enables you to gain broader and deeper insights by looking at the usage and summary of your data. As metadata surrounds raw data, it sheds light on a wider sphere of activity, thereby expanding the context of analysis. The result is that the model of the customer, the process or other interactions becomes richer and can tell us more about the past and the future.

One example of metadata in action is Amazon.com’s anticipatory shipping. By watching how customers interact with items in their carts, Amazon has a pretty good idea when someone is going to make a purchase. The signals in the metadata (viewing the item, reading reviews, going back to the page, interacting with the shopping cart) provide enough assurance to support moving the item in question to a warehouse near the customer. That practice is not exclusive to Amazon, and given web logs, ecommerce metadata is certainly there for the taking.

We’re All in the Metadata Business

In “Using Metadata to Find Paul Revere,” Kieran Healy, a professor of sociology at Duke University, showed how the British Crown could have used metadata available at the time to identify Paul Revere as a revolutionary. On a more amorous note, UCLA math student Chris McKinlay used metadata to find a compatible woman through OkCupid.

In financial services, there’s a governance, risk and compliance angle to metadata. At a recent banking tech conference, one of the speakers voiced the need for granular details about metadata. Banks are under tremendous pressure to comply with new and ever-changing regulations. In some cases, banks are required to explain exactly how they derived their analytical results, answering questions that include: Where did the source data come from? Did the processing use a join, a filter or a merge? What algorithm was used? Which predictive model? How many versions of the data exist? Which data set was ultimately used to derive the result?

Here in Silicon Valley, new tech startups are building products to help organizations and consumers make sense of their metadata, and improve their businesses and lives through their use of data. The potential for metadata to support better operations, better personal well-being and better fidelity is unlimited.

Fitbit, Jawbone and Nike Fuel all track what we do, and also when and where we do it—expanding raw data from an accelerometer to generate reminders to exercise and offering analysis on the quality of our sleep. The Nest acquisition by Google and other investments in the Internet of Things movement are motivated not only by the value of such core businesses, but more importantly by the value and ability of sensors to provide a better understanding of how people live. In the manufacturing realm, ThingWorx was just acquired for its ability to create larger models and advanced automation systems out of metadata provided by sensors and industrial equipment.

As I’ve said before, we’re all in the data business. Of course Facebook and Twitter are in the data business. However, if you’re using data to gain insights and drive decisions, then you too are monetizing your data, and you too are in the business of data. Never mind about the elegance and effectiveness of your data repositories; the way to fully exploit and monetize that data is to build a smarter organization, to inform the models used to run your business and to increase the scope of what you know. That potential is powered by metadata.

About the Author

Gary Nakamura is the CEO of Concurrent, Inc. He joined Concurrent in January 2013 to lead Concurrent through its next phase of growth. Gary has a highly successful track record including significant contributions to the explosive growth of Terracotta where he was SVP & General Manager.

Do you speak Hadoop? What you need to know to get started.

Neil A. Chaudhuri, GCN
May 1, 2014
http://gcn.com/Articles/2014/05/01/Hadoop-basics

It’s a funny word. You have only a vague notion of what it is. You’ve heard that it takes a lot of work but is potentially beneficial. Maybe if you learned more about it you too could enjoy its benefits. But even if you wanted to try it, you wouldn’t even know where to start.

If it isn’t obvious, I’m talking about Hadoop. Not Zumba.

Google first described the ideas behind Hadoop 10 years ago—an eternity in technology—but it is only recently that the rest of us have begun to explore its potential. Even as businesses and government agencies have built Hadoop solutions, it remains a source of confusion to many. In order to assess its potential, IT managers must first understand what Hadoop is.

Why Hadoop?

Hadoop is not just one thing. It is a combination of components that work together to facilitate cloud-scale analytics.

Hadoop provides an abstraction for running analytics on a cluster of commodity hardware when there is too much data for a single machine. The analytics program need not know about the cluster, how work is divided across it, or the vagaries of cluster management. If a machine fails, Hadoop handles that.

HDFS stands for Hadoop Distributed File System. It’s optimized for storing lots of data across a computing cluster. Users simply load files into HDFS, and it figures out how to distribute the data. Virtually all interactions with Hadoop involve HDFS directly or indirectly.
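
As a minimal illustration, loading a file into HDFS from Java takes only a few lines; the paths below are placeholders, and the same operation is commonly done from the shell with hdfs dfs -put.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsPut {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // reads core-site.xml for the cluster address
        FileSystem fs = FileSystem.get(conf);

        // HDFS decides how the file is split into blocks and replicated
        // across the cluster; the caller never sees that detail
        fs.copyFromLocalFile(new Path("transactions.csv"),
                             new Path("/data/transactions.csv"));
        fs.close();
      }
    }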

MapReduce is often mistaken for Hadoop itself, but in fact it is Hadoop’s programming model (commonly in Java) for analytics on data in HDFS. To understand the conceptual foundations of MapReduce, imagine two relational database tables—one for bank accounts and the second for account transactions. To find the average transaction amount for each account, a user would “map” (or transform) the two original tables to a single dataset via a join.

Then all the individual transaction amounts with the same account number would be “reduced” (or aggregated) to a single amount via a “GROUP BY” clause. MapReduce allows users to apply precisely these same concepts to a large data set distributed across a cluster, but the operations can be quite slow as files are continually read and written.
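
Here is a rough sketch of the transaction half of that example as a Hadoop MapReduce job in Java. The one-record-per-line input format ("accountId,amount") is an assumption made for illustration, and the join against the accounts table is omitted for brevity.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class AvgTransaction {

      // "map": transform each raw line into an (account, amount) pair
      public static class TxnMapper
          extends Mapper<LongWritable, Text, Text, DoubleWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
            throws IOException, InterruptedException {
          String[] fields = value.toString().split(",");
          ctx.write(new Text(fields[0]),
                    new DoubleWritable(Double.parseDouble(fields[1])));
        }
      }

      // "reduce": aggregate all amounts for one account into an average,
      // the moral equivalent of SQL's GROUP BY plus AVG()
      public static class AvgReducer
          extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
        @Override
        protected void reduce(Text account, Iterable<DoubleWritable> amounts, Context ctx)
            throws IOException, InterruptedException {
          double sum = 0;
          long n = 0;
          for (DoubleWritable amount : amounts) { sum += amount.get(); n++; }
          ctx.write(account, new DoubleWritable(sum / n));
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "avg-transaction");
        job.setJarByClass(AvgTransaction.class);
        job.setMapperClass(TxnMapper.class);
        job.setReducerClass(AvgReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DoubleWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }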

Hive allows users to project a tabular structure on the data so they can eschew the MapReduce API in favor of a SQL-like abstraction called HiveQL. Anyone used to SQL staples like “CREATE TABLE,” “SELECT,” and “GROUP BY” will find HiveQL eases the transition to Hadoop. Familiar abstraction aside, Hive queries run as MapReduce jobs under the hood.

There are other notable components in Hadoop:

HBase. The Hadoop database and an example of the so-called NoSQL databases described in a previous column.

Zookeeper. A centralized service for coordinating activities among the machines in the cluster.

Hadoop Streaming. A MapReduce API that lets developers use popular scripting languages (e.g. Ruby or Python).

Pig. An analytic abstraction similar to Hive but with a query syntax called Pig Latin (yes, seriously), which prefers a scripting pipeline approach to the SQL-like HiveQL.

Beyond Hadoop

As powerful as Hadoop is, many of its aspects remain too low-level, error-prone and slow for developers who need higher levels of abstraction. Cascading enables simpler and more testable workflows for multiple MapReduce jobs. Apache Spark lets developers treat data sets like simple lists and uses cluster memory to make jobs run faster. A companion project to Spark, Shark, similarly uses memory to make Hive queries run faster.
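
To get a feel for the “simple lists” style, here is a minimal word count using Spark’s Java API as it stood in the 1.x releases contemporary with this column; the paths are placeholders.

    import java.util.Arrays;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    public class SparkWordCount {
      public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("wordcount"));

        // an RDD behaves much like a list you can map, filter and reduce
        JavaRDD<String> lines = sc.textFile("hdfs:///data/input.txt");
        JavaPairRDD<String, Integer> counts = lines
            .flatMap(line -> Arrays.asList(line.split("\\s+"))) // Spark 1.x flatMap returns an Iterable
            .mapToPair(word -> new Tuple2<>(word, 1))
            .reduceByKey((a, b) -> a + b);

        counts.cache(); // keep the result in cluster memory for follow-on queries
        counts.saveAsTextFile("hdfs:///data/output");
        sc.stop();
      }
    }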

Apache Accumulo was originally built by the National Security Agency to supplement HBase with cell-level security. Numerous projects, including the analytics and visualization tool Lumify, are built on Accumulo.

In 2010, Google wrote a paper on Dremel, which facilitates fast queries on cloud-scale data. Dremel supports Google’s BigQuery product and has never been released, but Cloudera’s Impala and MapR’s Apache Drill are open-source implementations.

With network data (e.g. SIGINT or financial transactions) requiring graph analytics, MapReduce can be especially slow. A popular alternative programming model is Bulk-Synchronous Parallel (BSP), which abandons disk I/O for messages sent along the network. Apache Hama and Apache Giraph both use BSP to support graph analytics, and Titan is a graph database that uses HBase and other backends to store cloud-scale graphs.

While these tools excel at batch analytics, what about analytics on data streaming from a feed like a message queue or Twitter? Apache Storm and Spark Streaming can help there.

All these tools leverage existing Hadoop artifacts like HDFS files and Hive queries. For example, I have written analytics with Spark against an existing HBase table.

Hopefully you can now assess Hadoop to see if it has a place in your enterprise. As for me, I am still trying to wrap my head around Zumba.