All posts by Kim Loughead

Making big data easier yields $10M for Hadoop startup Concurrent

Jordan Novet, VentureBeat
June 2, 2014
http://venturebeat.com/2014/06/02/making-big-data-easier-yields-10m-for-hadoop-startup-concurrent

Big data startup Concurrent proves that it pays to make hard computing simpler.

Concurrent founder Chris Wensel devised the open-source Cascading framework for abstracting away the complexities of running MapReduce jobs on the open-source Hadoop big data software for storing and processing lots of different kinds of data.

Now Concurrent has landed $10 million in new funding.

The startup aims to help more companies achieve “pure innovation through data” just as Twitter, the Climate Corp., and others have, chief executive Gary Nakamura said in an interview with VentureBeat.

It ordinarily takes special training to use Hadoop, but Concurrent makes Hadoop more accessible, and thus it becomes easier to write applications that use the magic data sitting in Hadoop.

“There’s a ton of innovation around this notion of data and how to create data products that end users will consume,” Nakamura said.

The company’s new backing is the latest evidence of a trend to make Hadoop more intuitive. Previously, we’ve seen Trifacta do well for itself by cleaning up data in Hadoop; we’ve seen Platfora make strides by constructing full-featured business-intelligence software for data in Hadoop; and we recently saw Splice Machine pull in more funding in its quest to make Hadoop better suited for real-time workloads.

And, of course, Hadoop distribution vendors Cloudera and Hortonworks have brought in big funding rounds recently, the kind of money that could help Hadoop become even more widely used at the largest companies in the world.

Bain Capital Ventures led Concurrent’s new funding round. Rembrandt Ventures and True Ventures also participated.

San Francisco-based Concurrent started in 2008. It has raised $14.95 million to date, including the $4 million round from last year. That’s when Nakamura came aboard.

The startup employs 20 people now, and that figure should double in a year, Nakamura said.

Most of the new money will go toward research and development, although some is being kept aside to pay for people who can win Concurrent new customers — likely the kinds of businesses that already use Cascading heavily.

As of now, Nakamura said, Concurrent has fewer than 10 paying customers. Then again, the company’s first commercial product, the Driven tool for managing and monitoring Cascading applications, came out just four months ago. And more than 7,000 companies use Cascading, Nakamura said. Big opportunities could lie ahead, then.

That could be especially true if Concurrent makes Driven compatible with big data technologies other than MapReduce — like the Tez framework for Hadoop, the open-source Spark engine, and the Storm stream-processing system.

“There’s a lot on the roadmap along the lines of Driven that we have yet to build,” Nakamura said.

Concurrent, Inc. Closes $10 Million in Series B Funding for Big Data App Infrastructure

Market Leader Continues Customer and Corporate Momentum, Adds Board Member and Additional Funding to Drive Product Development and Operational Capabilities

SAN FRANCISCO – June 2, 2014 – Concurrent, Inc., the enterprise data application platform company, today announced $10 million in Series B funding, further validating the company’s leadership in enterprise application development. New investor Bain Capital Ventures led the funding round, with the participation of existing investors Rembrandt Ventures and True Ventures. Concurrent will use the financing to drive research and development of its flagship product Driven, the industry’s first application performance management product for data-centric applications, and Cascading, the most widely used application development framework for building data-oriented applications. Concurrent will also scale its operations to meet growing customer demand.

Enterprises are rapidly adopting Hadoop. With that adoption comes increasing demand to quickly operationalize data and deliver data products to customers. Concurrent enables organizations to quickly build, deploy and manage data-centric applications to meet the growing demands of the business.

Concurrent leads the market in Big Data application infrastructure with Cascading and Driven. With Cascading at the core, Concurrent continues to meet customer demand for advanced enterprise application development by supporting emerging fabrics and technologies, forging important industry partnerships and making data-driven application development simpler, faster and smarter.

Concurrent’s recent corporate, customer and product milestones include:

  • Driven, the company’s flagship enterprise product, purpose built to address the pain points of enterprise data application development and data application performance management by providing unprecedented visibility and control to organizations who need to deliver operational excellence.
  • Cascading 3.0 extends the de facto standard in data application infrastructure technology. Cascading sets the standard for enterprise application development, and delivers a framework that allows enterprises to build their data applications once and provides the flexibility to deploy them on the data processing fabric that best meets their business needs.
  • Cascading surpassed 150,000 user downloads per month.
  • New and expanded strategic partnerships with Hortonworks, Rackspace, EPAM and Databricks.
  • Leadership expansion with the appointment of Supreet Oberoi as vice president of field engineering to support accelerated customer and product growth.
  • Inclusion in CRN’s Big Data 100 in the “Big Data Infrastructure Tools and Services” and “Emerging Big Data Vendors” categories.
  • Recognition by the SD Times editors for the publication’s annual SD Times 100 in the “Big Data and Business Intelligence” category.

In conjunction with today’s funding news, Concurrent also announced that Salil Deshpande has joined its board of directors. Salil is managing director at Bain Capital Ventures and focuses on software infrastructure, open source and enterprise software. He also co-heads the new Palo Alto office, which has made 20 new investments in the last two years. Prior to Bain Capital Ventures, Salil spent seven years as general partner at Bay Partners, and over the last eight years has invested $120 million into 26 companies, including Typesafe, Aria Systems, MuleSoft, Buddy Media, SpringSource, Groovy and Grails, DataStax, Lending Club, Engine Yard, Dynatrace, ZeroTurnaround, Hazelcast, Iron.io and Redis Labs. Salil is on the Forbes Midas List for 2013 and 2014.

Supporting Quotes

“Concurrent continues to propel innovation forward and exemplifies just the type of company Bain Capital Ventures looks to invest in – a business at the forefront of a hot market, with a uniquely developed, defined and widely adopted technology, and a highly seasoned executive team that understands how to best serve the enterprise application infrastructure market. The growth opportunity is huge, and I look forward to serving on the company’s board to contribute to Concurrent’s future direction and growth.”
-Salil Deshpande, managing director, Bain Capital Ventures

“Our investors’ confidence in Concurrent and this latest round of funding supports our strategy and proven leadership in providing Big Data application infrastructure to enterprises. Recognizing the maturing needs of the enterprise and emergence of new technologies, we are giving organizations the application development tools and management products they need to deliver today and in the future. This funding will not only enable us to drive our R&D execution, but will also allow us to expand operational capabilities to support our rapidly expanding user and customer base.”
-Gary Nakamura, CEO, Concurrent, Inc.

About Concurrent, Inc.

Concurrent, Inc. is the leader in Big Data application infrastructure, delivering products that help enterprises create, deploy, run and manage data applications at scale. The company’s flagship enterprise solution, Driven, was designed to accelerate the development and management of enterprise data applications. Concurrent is the team behind Cascading, the most widely deployed technology for data applications with more than 150,000 user downloads a month. Used by thousands of businesses including eBay, Etsy, The Climate Corp and Twitter, Cascading is the de facto standard in open source application infrastructure technology. Concurrent is headquartered in San Francisco and online at http://concurrentinc.com.

###

All trademarks are the property of their respective owners.

Media Contact
Danielle Salvato-Earl
Kulesa Faul for Concurrent, Inc.
(650) 922-7287
concurrent@kulesafaul.com

Concurrent, Inc. to Present at Upcoming Leading Big Data and Hadoop Industry Events

Alexis Roos to Deliver Session on Cascading Pattern at Hadoop Summit North America, Cascading at Big Data Expo 2014

SAN FRANCISCO – May 28, 2014 – Concurrent, Inc., the enterprise data application platform company, today announced that Alexis Roos, senior solutions architect, will present at two upcoming industry events: Hadoop Summit North America 2014, taking place June 3-5 in San Jose, Calif. and Big Data Expo 2014, taking place June 10-12 in New York City.

At Hadoop Summit North America, Alexis will deliver a talk on Cascading Pattern, a standards-based scoring engine that leverages the power of Cascading and enables analysts and data scientists to quickly deploy machine-scoring applications on Hadoop. Additionally, a week later at Big Data Expo, Alexis will provide an introduction to Cascading, the most widely used and deployed application development framework for building data-oriented applications that enables organizations to operationalize their data and solve business problems.

Concurrent Presentations At-A-Glance

Hadoop Summit North America 2014
What: Pattern: An Open Source Project for Migrating Predictive Models from SAS, etc., onto Hadoop
Who: Alexis Roos, senior solutions architect, Concurrent, Inc.
When: Tuesday, June 3 at 1:45 p.m. PT
How: Register at http://hadoopsummit.org/san-jose/register/

Session Description
Cascading Pattern is a free, open source project that takes models trained in popular analytics frameworks, such as SAS, Microstrategy, SQL Server, etc., and runs them at scale on Hadoop. Based on the popular Cascading framework, Cascading Pattern effectively lowers the barrier to adoption on Hadoop for developers, who can use a Java API to create complex machine-scoring applications. Alexis will provide sample code that will show applications using predictive models built in SAS and R, such as anti-fraud classifiers. Additionally, Alexis will compare variations of models for enterprise-class customer experiments.
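
To make the idea of machine scoring concrete, here is a minimal Python sketch. It is not Cascading Pattern's Java API. It assumes a model trained elsewhere has been exported as a plain data structure; the decision tree, its field names, and its thresholds below are all invented for illustration. The point it shows is that scoring is a function applied to each record independently.

```python
# Hypothetical exported model: a tiny decision tree as nested dicts.
# Field names and thresholds are invented for illustration only.
FRAUD_TREE = {
    "field": "amount",
    "threshold": 1000.0,
    "below": {"label": "ok"},
    "at_or_above": {
        "field": "account_age_days",
        "threshold": 30,
        "below": {"label": "fraud"},
        "at_or_above": {"label": "ok"},
    },
}


def score(record, node=FRAUD_TREE):
    """Walk the tree for one record and return the predicted label."""
    while "label" not in node:
        branch = "below" if record[node["field"]] < node["threshold"] else "at_or_above"
        node = node[branch]
    return node["label"]


def score_all(records):
    """Score every record independently; no record depends on another."""
    return [dict(r, prediction=score(r)) for r in records]


if __name__ == "__main__":
    sample = [
        {"amount": 50.0, "account_age_days": 400},
        {"amount": 5000.0, "account_age_days": 3},
    ]
    for row in score_all(sample):
        print(row)
```

Because no record depends on another, the same scoring function could in principle be applied to separate partitions of a large dataset in parallel, which is the property a cluster-based scoring engine relies on.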

Big Data Expo 2014
What: Using Cascading to Build Enterprise Data Applications that Drive Innovation and Advantage
Who: Alexis Roos, senior solutions architect, Concurrent, Inc.
When: Tuesday, June 10 at 1:55 p.m. ET
How: Register at http://registration.sys-con.com/

Session Description
Cascading is the most popular application development framework for building enterprise-grade data applications on Hadoop. This open source development framework allows developers to leverage their existing skillsets, such as Java and SQL, to create reliable applications without having to think in MapReduce. Alexis will provide an introduction to Cascading and then dive into using it to build applications. Attendees will learn what types of use cases exist for data-driven businesses, how to approach them with Cascading and its vast ecosystems, and the best practices for Cascading application development.

About the Speaker
Alexis Roos is a senior solutions architect focusing on Big Data solutions at Concurrent, Inc. He has more than 18 years of experience in software and sales engineering, helping both Fortune 500 firms and startups build new products that leverage Big Data, application infrastructure, security, databases and mobile technologies. Prior, Alexis worked for Sun Microsystems and Oracle for more than 13 years, as well as Couchbase and several large systems integrators in Europe.

MeetUp | Accelerate Big Data Application Development with Cascading and HDP – Jun 2, 2014

Sign-up here: http://meetu.ps/2kX222

When:
Monday, June 2, 2014
6:30 PM (Pacific Time)

Where:
Hilton Hotel, Almaden Room
300 Almaden Blvd
San Jose, CA

Accelerate Big Data Application Development with Cascading and HDP

Join Concurrent’s VP of Field Engineering, Supreet Oberoi, to learn how to accelerate your big data application development with the popular Cascading framework and Hortonworks Data Platform.

We will:

  • Describe how developers can create future proof, data-driven applications built on Apache Hadoop
  • Take advantage of the latest Hadoop processing frameworks like YARN and Tez
  • Learn more about Cascading – the application development platform that allows Java developers to focus on the data-manipulation logic while abstracting the platforms underneath
  • Demonstrate an application developed with Java-based Cascading middleware on the Hortonworks Sandbox

Concurrent relieves big data app developers of Hadoop ‘fabric anxiety’

Mary Shacklett, TechRepublic
May 16, 2014
http://www.techrepublic.com/article/concurrent-relieves-big-data-app-developers-of-hadoop-fabric-anxiety

Big data application developers need to navigate between different Hadoop fabrics to meet business requirements. Learn how one company is helping developers meet this need.

Software development using big data is no different than any other kind of software development. Organizations expect quick turnarounds; business requirements are rapidly changing; and IT must find ways to negotiate over multiple networks and operating systems for the plethora of different software and hardware platforms that enterprise applications traverse.

In one sense, this is initially easier in the big data world where enterprises are simply running on Hadoop, and not trying to reach out to other enterprise systems across the present “divide” that separates big data from other types of data processing. Despite this, there are still interoperability issues in this more constricted big data universe.

A fabric softener for big data?

These issues begin with the fact that there is more than one distribution of Hadoop. Hadoop service providers include Cloudera, Hortonworks, MapR Technologies, Amazon, Microsoft, Rackspace, Intel, IBM, Altiscale, Qubole, and others. Depending on which one you select, the underpinning of any application you develop will be slightly different. This won’t matter much if an organization remains focused on a query-only approach to big data that sticks with languages like Hive or Pig. But if the organization is intent on developing enterprise-strength applications that run off big data, having to move between different infrastructure Hadoop fabrics matters.

“Our goal is to make it easy for developers to build data applications on top of Hadoop,” said Gary Nakamura, CEO of Concurrent, which provides big data application infrastructure solutions. “The underlying structures of Hadoop can be highly complex, but if you construct an application development framework on top of it that can map to any underlying Hadoop fabric with the use of APIs (application programming interfaces), this frees the developer to focus on the layer of the application that contains the business logic.”

Relieving big data application developers of underlying “fabric anxiety” gives IT flexibility in moving from one big data computational fabric to another because it no longer has to consider the tedium of application migration in its plans. In the future, this means that depending on the business need, you will be able to run a big data application in-memory, or on Apache MapReduce, or on other big data computational fabrics. Concurrent calls this, “Write once — and deploy on your fabric of choice.”
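
The "write once" idea can be sketched in a few lines of Python. This is an illustration of the separation between business logic and execution fabric, not Cascading's API: the `InMemoryFabric` and `PartitionedFabric` classes, the `merge` helper, and the sales data are all hypothetical. The same `sales_by_product` function runs unchanged on both fabrics.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def sales_by_product(records):
    """Business logic, written once: total sales per product."""
    totals = defaultdict(float)
    for product, amount in records:
        totals[product] += amount
    return dict(totals)


def merge(partials):
    """Combine per-partition totals into one result."""
    merged = defaultdict(float)
    for part in partials:
        for product, amount in part.items():
            merged[product] += amount
    return dict(merged)


class InMemoryFabric:
    """Hypothetical fabric: run the logic in a single pass."""

    def run(self, logic, records):
        return logic(records)


class PartitionedFabric:
    """Hypothetical fabric: split the input, run the logic on each
    partition in a worker thread, then merge the partial results."""

    def __init__(self, partitions=4):
        self.partitions = partitions

    def run(self, logic, records):
        chunks = [records[i::self.partitions] for i in range(self.partitions)]
        with ThreadPoolExecutor(max_workers=self.partitions) as pool:
            partials = list(pool.map(logic, chunks))
        return merge(partials)


if __name__ == "__main__":
    sales = [("widget", 10.0), ("gadget", 4.0), ("widget", 2.5), ("gadget", 6.0)]
    for fabric in (InMemoryFabric(), PartitionedFabric(partitions=2)):
        print(type(fabric).__name__, fabric.run(sales_by_product, sales))
```

The `merge` step here works only because summing is associative; a real framework would have to handle the general case, which is where most of the complexity hidden by such an abstraction lives.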

Big data applications can also be adapted to changing business service level agreements (SLAs). Nakamura cites the example of an online retailer whose marketing department wants information on product sales performance every five hours, but then comes back to IT with a new request to see this information every 30 minutes. “Because big data historically only runs at one speed, this is a major challenge when it comes to writing big data applications,” said Nakamura. “But with the ‘write once’ capability that products like Concurrent’s Cascading 3.0 deliver, the application developer can focus on the intellectual property that the company wants to develop and on the data products he produces, without worrying about the underlying infrastructure.”

“I’m proud to see how Cascading has enabled thousands of developers and businesses to be successful at what they do,” added Chris Wensel, Concurrent’s Founder and CTO. “Cascading 3.0 will enable our users even further by simplifying application development, accelerating time to market, and allowing enterprises to leverage existing, and more importantly, new and emerging data infrastructure and programming skills.”

Products like this couldn’t be more timely, because enterprises are expecting more from big data than they were six months ago — and full-blown application development beyond simple query capabilities is just around the corner.

MeetUp | How LinkedIn uses Scalding for Data Science – May 29, 2014

Sign-up here: http://meetu.ps/2jKjZd

When:
Thursday, May 29, 2014
6:00 PM (Pacific Time)

Where:
LinkedIn
2025 Stierlin Ct.
Mountain View, CA

How LinkedIn uses Scalding for Data Science

Data Science has a reputation for being complicated, but with the help of Scala, Scalding & Cascading, most patterns can be significantly simplified. This talk shows some common patterns within data science that can be redesigned & simplified, in many cases, to almost a single line!

Join Vitaly Gordon, Senior Data Scientist at LinkedIn, to learn more about how Vitaly and his team at LinkedIn use the popular Scalding dynamic programming language on top of Cascading to answer a variety of data science questions.

This talk will be hosted at LinkedIn’s offices in Mountain View. Food and drink will be provided.

Note: This is an update to Vitaly’s talk in January at the Climate Corporation. Most of the concepts and discussions will be similar.

MeetUp | Simplifying Application Development on Hadoop – May 26, 2014

Sign-up here: http://meetu.ps/2kSwC4

When:
Monday, May 26, 2014
7:00 PM (Central European Time Zone)

Where:
SoundCloud
Greifswalder Str. 2112-213
Berlin, Germany

Simplifying Application Development on Hadoop

Hadoop is an integral part of the big data ecosystem and can process huge datasets in a reliable, scalable and distributed manner. But the MapReduce programming model is cumbersome, complex and messy to develop and maintain. Is there a way to harness the power of Hadoop in an easy to use, clean framework and extend it seamlessly for predictive modeling?

This session is about simplifying application development on Hadoop by using Cascading, an open source Java application framework that provides higher level data processing abstractions. This talk focuses on our experience in building complex data flow applications, that are enterprise-grade, through test driven development. The talk will also be showcasing, through live demo and code, how an application developed with Cascading can be extended to integrate with predictive models built with analytics tools, like R, and scaled-out on a Hadoop cluster.

Our Speaker: Vinoth Kannan is a Software Developer and Big Data Engineer at WidasConcepts.

Our Hosts: SoundCloud, based in Berlin, is leveraging Scalding to create the world’s leading social sound platform.

For more information about Cascading: Visit cascading.org

Interview: Concurrent Leads the Way in Application Building on Hadoop

William Wallace, insideBIGDATA
May 14, 2014
http://inside-bigdata.com/2014/05/14/interview-concurrent-leads-way-application-building-hadoop

As enterprise adoption of Hadoop gains momentum, the need for fabric-agnostic applications continues to rise as well. To meet this demand, Concurrent has recently announced the latest version of its application building platform, Cascading 3.0. We sat down with Gary Nakamura, CEO of Concurrent, to learn about this new product as well as other solutions from his company.

insideBIGDATA: What is the primary mission of Concurrent?

Gary Nakamura: Concurrent, Inc. is the leader in Big Data application infrastructure, delivering products that help enterprises create, deploy, run and manage data applications at scale. The company’s flagship enterprise solution, Driven, was designed to accelerate the development and management of enterprise data applications. Concurrent is also the team behind Cascading, the most widely-deployed technology for data applications with more than 150,000 user downloads a month.

insideBIGDATA: What does your company offer for enterprises seeking insight from Big Data?

Gary Nakamura: Concurrent is the team behind Cascading, the proven application development framework that makes it possible for enterprises to leverage their existing skill sets for building data-oriented applications on Hadoop. Cascading has built-in attributes that make data application development a reliable and repeatable process. Companies that standardize on Cascading can build data applications at any scale, integrate them with existing systems, employ test-driven development practices and simplify their applications’ operational complexity.

Additionally, there is strong demand for a solution that helps enterprises to understand what their data applications are doing. Concurrent is also the team behind Driven, the industry’s first application management product for data applications. Driven significantly accelerates developer productivity and provides unprecedented visibility into developing Big Data applications on Hadoop.

insideBIGDATA: Who does this technology help?

Gary Nakamura: Cascading helps enterprises solve business problems by connecting their business strategy to their technology and data with their data applications. The technology lowers the barrier for data-oriented application development so that enterprises can leverage their existing bench skills and infrastructure to build this class of applications.

Cascading is used by thousands of businesses including eBay, Etsy, The Climate Corp and Twitter, and is considered the de facto standard in open source application infrastructure technology.

insideBIGDATA: What does the Cascading 3.0 platform offer customers?

Gary Nakamura: Enterprises today are rapidly adopting Hadoop and other computation engines to process, manage and make sense of growing volumes of both unstructured and semi-structured data. At the same time, the need to rapidly and reliably build enterprise-class applications without deep knowledge of these technologies is the greatest it has ever been.

Cascading fulfills this need by allowing businesses to leverage their existing skill sets, investments and systems to build enterprise-class applications on Hadoop. With the family of Cascading applications, enterprises can apply Java, legacy SQL and predictive modeling investments, and combine the respective outputs of multiple departments into a single data processing application.

What’s new in Cascading 3.0:

  • Allows enterprises to build their data applications once, with the flexibility to run applications on the fabric that best meets their business needs.
  • Support for: local in-memory, Apache MapReduce, and Apache Tez.
  • Future support for Apache Spark™, Apache Storm and others through its new pluggable and customizable query planner.
  • Third party products, data applications, frameworks and dynamic programming languages built on Cascading will immediately benefit from this portability.
  • Compatibility with all major Hadoop vendors and service providers: Altiscale, Amazon EMR, Cloudera, Hortonworks, Intel, MapR and Qubole, among others.

insideBIGDATA: Big Data and Hadoop go hand-in-hand obviously. How did Concurrent’s relationship with the Hadoop community come about?

Gary Nakamura: Concurrent’s open source project, Cascading, was created in response to the difficulties of application development on Hadoop. After realizing how difficult it was to create applications on raw MapReduce, our founder, Chris Wensel, decided to create an application development framework that would make it possible for enterprises to simply and reliably build data applications with their existing skill sets and infrastructure.

Over the years, Cascading has become the proven enterprise application development framework for organizations to standardize on. The community around Cascading has evolved to the point where there are now several well-known dynamic programming languages built on top of Cascading (e.g., Scalding, Cascalog) as well as integrations produced by the community that extend Cascading’s capabilities. Also, based on customer demand, Concurrent has strong partner relationships in the ecosystem. All major Hadoop distribution partners make sure that their distribution is compatible with Cascading.

insideBIGDATA: Is Cascading 3.0 fabric-specific or is MapReduce the model of choice?

Gary Nakamura: Cascading 3.0 is fabric agnostic. Enterprises can make their choice of execution fabric based on the needs of their business. Cascading 3.0 is designed to work with various fabrics, whether it’s MapReduce or new and emerging fabrics such as Tez. Support is also planned for Spark and Storm. Cascading gives its users the choice on which fabric is best to use in order to meet business requirements. This means you can develop once and port to various fabrics, without the need to rewrite.

insideBIGDATA: As more and more organizations adopt Hadoop, what is Concurrent doing to keep pace? In other words, what does the future hold?

Gary Nakamura: Our roadmaps are heavily influenced by customer demand and our aim is to be a few steps ahead of our users. With Cascading, we’re focused on solving the problem of enterprises operationalizing their data. At this point, we are seeing that organizations require emerging execution fabrics to meet a variety of business requirements (e.g., latency, scale, service level agreements). To meet customer demand, we are adding support for Apache Tez in Cascading 3.0. In future releases, Cascading will support Spark and Storm as well.

The next critical problem our customers are seeing is operational visibility for their data applications. Hadoop is becoming the operational center for enterprises and their data applications, the place where they’re looking to build hundreds of mission-critical data products. Driven is the first product in the industry that provides the operational visibility enterprises need. With Driven, enterprises will be able to immediately understand what their data applications are doing in real time. Driven accelerates the time to market for data products by providing capabilities for developers to visualize their data application, immediately diagnose failures, and optimize for application performance.

Cascading 3.0 Adds Support For Wide Range Of Computational Frameworks And Data Fabrics

Arnal Dayaratna, Ph.D., Cloud Computing Today
May 13, 2014
http://cloud-computing-today.com/2014/05/13/1070349

Today, Concurrent announces the release of Cascading 3.0, the latest version of the popular open source framework for developing and managing Big Data applications. Widely recognized as the de facto framework for the development of Big Data applications on platforms such as Apache Hadoop, Cascading simplifies application development by means of an abstraction framework that facilitates the execution and orchestration of jobs and processes. Compatible with all major Hadoop distributions, Cascading sits squarely at the heart of the Big Data revolution by streamlining the operationalization of Big Data applications in conjunction with Driven, a commercial product from Concurrent that provides visibility regarding application performance within a Hadoop cluster.

Today’s announcement extends Cascading to platforms and computational frameworks such as local in-memory, Apache MapReduce and Apache Tez. Going forward, Concurrent plans for Cascading 3.0 to ship with support for Apache Spark, Apache Storm and other computational frameworks by means of its customizable query planner, which allows customers to extend the operation of Cascading to compatible computational fabrics.

The breakthrough represented by today’s announcement is that it renders Cascading extensible to a variety of computational frameworks and data fabrics and thereby expands the range of use cases and environments in which Cascading can be optimally used. Moreover, the customizable query planner featured in today’s release allows customers to configure their Cascading deployment to operate in conjunction with emerging technologies and data fabrics that can now be integrated into a Cascading deployment by means of the functionality represented in Cascading 3.0.
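
A rough Python sketch can show why a pluggable planner matters. It is a deliberately simplified model, not Cascading's planner: the `Step` type, the two planner functions, and the example flow are hypothetical. It relies on one well-known difference between fabrics: a MapReduce job contains a single shuffle, so a flow with several grouping steps has to be split into several jobs, while a DAG engine such as Tez can run the whole flow as one job.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    name: str
    needs_shuffle: bool = False  # True for grouping or joining steps


def plan_mapreduce(steps):
    """Simplified MapReduce-style plan: each job holds at most one shuffle,
    so a new job starts whenever a second shuffle step would be added."""
    jobs, current, has_shuffle = [], [], False
    for step in steps:
        if step.needs_shuffle and has_shuffle:
            jobs.append(current)
            current, has_shuffle = [], False
        current.append(step.name)
        has_shuffle = has_shuffle or step.needs_shuffle
    if current:
        jobs.append(current)
    return jobs


def plan_dag(steps):
    """Simplified DAG-style plan: all steps are submitted as one job."""
    return [[step.name for step in steps]] if steps else []


# Swapping the planner changes the physical plan, not the logical flow.
PLANNERS = {"mapreduce": plan_mapreduce, "dag": plan_dag}

FLOW = [
    Step("parse"),
    Step("group_by_user", needs_shuffle=True),
    Step("count"),
    Step("group_by_count", needs_shuffle=True),
    Step("top_n"),
]

if __name__ == "__main__":
    for fabric, planner in PLANNERS.items():
        print(fabric, planner(FLOW))
```

The logical flow is declared once in `FLOW`; only the planner chosen from `PLANNERS` decides how it is cut into jobs for a given fabric.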

Used by companies such as Twitter, eBay, FourSquare, Etsy and The Climate Corporation, Cascading boasts over 150,000 downloads a month, more than 7,000 deployments and 10% month over month growth in downloads. The release of Cascading 3.0 builds on Concurrent’s recent partnership with Hortonworks whereby Cascading will be integrated into the Hortonworks Data Platform and Hortonworks will certify and support the delivery of Cascading in conjunction with its Hadoop distribution. Concurrent also recently revealed details of a strategic partnership with Databricks, the principal steward behind the Apache Spark project, that allows it to “operate over Spark…[the] next generation Big Data processing engine that supports batch, interactive and streaming workloads at scale.” In an interview with Cloud Computing Today, Concurrent CEO Gary Nakamura confirmed that Concurrent plans to negotiate partnerships analogous to the agreement with Hortonworks with other Hadoop distribution vendors in order to ensure that Cascading consolidates its positioning as the framework of choice for the development of Big Data applications.

Overall, the release of Cascading 3.0 represents a critical product enhancement that positions Cascading to operate over a broader pasture of computational frameworks and consequently assert its relevance for Big Data application development in a variety of data and computational frameworks. More importantly, however, the product enhancement in Cascading 3.0, in conjunction with the partnership with Databricks regarding Apache Spark, suggests that Cascading is well on its way to becoming the universal framework of choice for developing and managing applications in a Big Data environment, particularly given its compatibility with a wide range of Hadoop distributions and data and computational frameworks.

Cascading Now Supports Tez–Spark and Storm Up Next

Alex Woodie, Datanami
May 13, 2014
http://www.datanami.com/2014/05/13/cascading-now-supports-tez-spark-storm-next

Concurrent, the company behind the open source Cascading framework, today unveiled a major update that will allow its customers to migrate their Hadoop applications from MapReduce to the new Apache Tez engine without rewriting any business logic. Spark and Storm are next up on Cascading’s radar, Concurrent CTO Chris Wensel tells Datanami.

Analysts have billed 2014 as the year that Hadoop grows up and takes on the enterprise. Anecdotal evidence suggests that big companies are indeed moving away from the tire-kicking phase and investing in production systems.

However, while Hadoop may have traded in its open source apparel for a suit and tie, that doesn’t mean that all technology questions have been settled. Everybody in the Hadoop world seems to agree that the batch-oriented nature of MapReduce is on its way out. But what’s going to replace it? Apache Tez? Apache Spark? Apache Storm? Apache next? Nobody knows.

“You gotta pick your poison,” Wensel says. “A lot of those technologies overlap with each other. But there are also tradeoffs. And this game is all about tradeoffs.”

With Cascading, Concurrent is ideally situated to help customers minimize the risk of making the wrong tradeoff. The Cascading product does this by presenting a layer of abstraction between the application developer and the complex Hadoop APIs. The product–which is free and being downloaded 150,000 times per month–allows developers to write their business logic once using the simpler Cascading APIs (available for Java, Python, Scala, and other languages), and deploy the application using whichever Hadoop data fabric meets their needs.
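To make the "write once, deploy on whichever fabric" idea concrete, here is a minimal sketch in Python. It is not Cascading's API. Every name in it (`PIPELINE`, `run_staged`, `run_streamed`, `BACKENDS`, `run`) is hypothetical, and the two "engines" are toy stand-ins: one materializes each intermediate result, loosely like batch stages, and the other chains the steps lazily. The business logic is defined once and is the same under either strategy.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, List

# Hypothetical names throughout: this is not Cascading's API, only the
# general "define the logic once, choose the engine later" pattern.
Step = Callable[[Iterable], Iterable]


def tokenize(lines: Iterable[str]) -> Iterable[str]:
    """Business logic: split each line into lowercase words."""
    for line in lines:
        yield from line.lower().split()


def drop_short(words: Iterable[str]) -> Iterable[str]:
    """Business logic: keep only words longer than two characters."""
    return (w for w in words if len(w) > 2)


# The application is just an ordered list of steps, independent of any engine.
PIPELINE: List[Step] = [tokenize, drop_short]


def run_staged(steps: List[Step], data: Iterable) -> Counter:
    """Toy batch-style engine: materialize every intermediate result."""
    for step in steps:
        data = list(step(data))
    return Counter(data)


def run_streamed(steps: List[Step], data: Iterable) -> Counter:
    """Toy pipelined engine: chain the steps lazily, no intermediate lists."""
    for step in steps:
        data = step(data)
    return Counter(data)


BACKENDS: Dict[str, Callable[[List[Step], Iterable], Counter]] = {
    "staged": run_staged,
    "streamed": run_streamed,
}


def run(backend: str, lines: Iterable[str]) -> Counter:
    """Select the execution strategy by name; the pipeline is unchanged."""
    return BACKENDS[backend](PIPELINE, lines)


LINES = ["Tez or Spark", "Spark or MapReduce or Tez"]
```

Calling `run("staged", LINES)` and `run("streamed", LINES)` gives the same word counts, so switching engines in this toy setup is a one-word change rather than a rewrite. That is the property the article attributes to Cascading, shown here only by analogy.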

For the last six months, Wensel has been working on the heart of Cascading, the customizable query planner, to enable it to support Apache Tez. It was a very big job, and Wensel did much of this work in collaboration with Hortonworks, which is particularly bullish on the prospects of Tez as a replacement for MapReduce. The result of that collaboration is now available in Cascading 3.0.

According to Wensel, it’s all about giving customers the flexibility to pick the Hadoop fabric that best fits their needs. “We’re seeing Tez and other technologies that are slightly more complex [than MapReduce], but give you more degrees of freedom to do more interesting things at the computation level,” he says.

Tez represents a “massive improvement” in the Hadoop model, and Wensel is excited to see how users will respond to support for Tez, which should provide an immediate performance boost upwards of 50 percent compared to MapReduce.

What’s more, Cascading will also allow users to dial the performance up even higher if they want, at the cost of a greater risk of the code failing. “We can give you a conservative rule engine for Tez, but as Tez matures, we can give you a more aggressive rule engine,” Wensel says. “If you want to turn it up to 11, go for it, but it might blow out your speakers.”

Next up on Wensel’s plate are Apache Spark and Storm. The company today also announced a partnership with Databricks, the company behind Apache Spark. Wensel and company will set out this summer to enable Cascading apps to utilize Spark within Hadoop. Some of the work he did on supporting Tez will carry over to Spark, or at least make it somewhat easier to support, he says.

While Spark seems to have a lot of momentum at the moment, Wensel still sees a bit of risk with it, and doesn’t seem entirely sold on it. “People want to try Spark. We get it,” he says. “People want us to port Cascading to Spark so they can see if it’s better. I don’t know anybody in production with Spark, but I don’t know anybody in production on Tez either.” The timeframe? “We’re definitely going to get to that as quickly as we can,” he says. “We hope to get to it this summer.”

The way Wensel sees it, nobody can predict what technology is going to win in the end. It could be Tez, or it could be Apache Spark. “What you don’t want is the risk of learning a new API or a language on a new API just to get the tradeoff to realize the tradeoff was a bad one,” he says. “What did you do? Spend six months figuring out that was a huge mistake.”

It’s all about weighing the tradeoffs, and allowing people to experiment with the various Hadoop fabrics to find out what works best for them and avoid those million-dollar mistakes. By allowing people to experiment with Tez, Spark, and MapReduce, Cascading will let developers make apples-to-apples comparisons among the various “Baby Bear, Mama Bear, and Papa Bear technologies,” as the colorful Wensel puts it.

“If Spark doesn’t scale, then they’ll go to Tez. But Tez might be slower,” Wensel says. “If they have a smaller application that doesn’t [need to] scale, maybe they could leave that on Spark. But they can make these decisions without having to rewrite their applications. If they’re okay with 2 percent of their jobs failing, then maybe they’ll pick a different technology that’s faster, but maybe it will fail more frequently. If they never can have it ever fail and they just need predictability, they may stick with [Papa Bear] MapReduce because it’s extremely stable and mature. People want to be able to make these choices. They don’t want just one technology.” Amen to that.