
Big Data: How to “Write Once and Deploy” across Big Data Fabrics

Dick Weisinger, Formtek
May 13, 2014
http://formtek.com/blog/big-data-how-to-write-once-and-deploy-across-big-data-fabrics

Cascading is a Java-based framework for big data, data-centric application development, created and supported by Concurrent. The framework abstracts away the complex implementation details involved in writing big data applications. Cascading is used by companies like eBay, LinkedIn, The Climate Corp and Twitter, and has seen broad adoption across industry segments including finance, telecom, marketing, entertainment and enterprise IT. We covered the 2.5 release of the framework last November.

Concurrent announced the release of the Cascading 3.0 framework today. This newest release expands support for the underlying data engines that can be plugged into the framework. Out of the box, Cascading 3.0 will immediately support the local in-memory fabric, which has been in Cascading since the beginning, as well as Apache MapReduce, the default engine and the foundation on which Hadoop has been built. Beyond that, the 3.0 release also supports Apache Tez, and in the near future support will be added for Apache Spark, followed by Apache Storm.

Gary Nakamura, Concurrent CEO, said that “what we’ve done in Cascading 3.0 is that we’ve allowed for data applications to execute on different fabrics. So essentially we’ve made Hadoop and MapReduce, and Spark and Tez an implementation detail of the framework. The enterprise can now develop these applications unencumbered without having to think about latency or scale, and then simply pick the modality on which they want to deploy their application.”

Cascading has become a top choice for building data-centric applications. Since mid-2012, the framework has seen 10 percent month-over-month growth: software downloads have gone from roughly 20,000 per month to more than 150,000 per month, and there are now more than 7,000 deployments of the Cascading framework.

By using Cascading, some of the risk of data-centric application development can be taken out of the equation. Cascading provides a stable API and framework on which to build data applications and, once it is in place, makes it possible to swap in or out whichever underlying data engine or fabric you want to deploy on.
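The swap-in/swap-out idea can be sketched with a tiny, purely illustrative interface. Note that `Fabric`, `LocalFabric` and `WriteOnceDemo` are hypothetical names for this sketch and are not part of Cascading's actual API; they only illustrate the shape of "write once, pick the engine at deploy time":

```java
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical sketch (NOT the Cascading API): the application is written once
// against a stable interface, and the execution "fabric" is chosen at deploy time.
interface Fabric {
    // Runs the same pipeline definition on some particular engine.
    List<String> run(List<String> input, UnaryOperator<String> pipeline);
}

// A trivial local in-memory fabric, loosely analogous to Cascading's built-in local mode.
class LocalFabric implements Fabric {
    public List<String> run(List<String> input, UnaryOperator<String> pipeline) {
        return input.stream().map(pipeline).toList();
    }
}

public class WriteOnceDemo {
    // The application logic is defined once, independent of any engine.
    static final UnaryOperator<String> toUpper = String::toUpperCase;

    public static void main(String[] args) {
        Fabric fabric = new LocalFabric(); // swap in a MapReduce- or Tez-backed fabric here
        System.out.println(fabric.run(List.of("a", "b"), toUpper)); // prints [A, B]
    }
}
```

In the real framework the "pipeline" is a full pipe assembly rather than a single function, but the deployment choice works the same way: the application code does not change when the fabric does.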

Nakamura commented that “for anyone developing on Cascading today, it will become very easy for them to migrate their data-centric applications to new computation fabrics when the enterprise is ready to upgrade their Hadoop distribution. They could standardize on one API to solve a variety of business problems. ISVs can now leverage Cascading as the interface between their value-added solution and Hadoop or Spark or Storm without having to write directly to each of the different fabrics for the different modalities that they want to offer to their end customers. This translates to other data apps that are built on top of Cascading, and they will benefit from this portability.”

What types of organizations are benefiting from using Big Data? Nakamura thinks that businesses which can see the value in Big Data and then deeply architect its use into their enterprise applications will benefit the most. “End users that are just stopping at ad-hoc have a hard time having conversations with their bosses about budgets going forward. The ones that are building and operationalizing data inside of Hadoop and are standing up enterprise applications and consistently delivering data products to their end users” are seeing the rewards that Big Data can deliver. “It’s not necessarily a conversation about how much money we are making off of this data product or how much money we are saving because of it. It’s more transformative.”

Cascading 3.0 Adds Multiple Framework Support. Concurrent Driven Manages Big Data Apps

Boris Lublinsky, InfoQ
May 13, 2014
http://www.infoq.com/news/2014/05/driven

When it comes to implementing Big Data applications, companies today can choose from multiple frameworks ranging from Apache MapReduce to Apache Tez to Apache Spark to Apache Storm. Each one of these frameworks has its own advantages and drawbacks, and can be most appropriate for certain applications. Although it is possible to run all frameworks on a single Apache YARN cluster, each one of them has a slightly different programming model and a different set of APIs. This means that porting a given application from one framework to another might prove to be non-trivial.

Cascading 3.0, a new release of Concurrent’s flagship product, solves many of these issues. Cascading is one of the most popular Java Domain-Specific Languages (DSLs), initially introduced in late 2007 as a DSL for defining and implementing functional-style, large-scale data workflows on top of the low-level MapReduce APIs. Cascading is based on a “plumbing” metaphor for assembling data pipelines: its high-level constructs allow users to split, merge, and join streams of data, and to perform operations on those streams.

Such an approach allows users to represent their Cascading applications as a directed acyclic graph (DAG), which is mapped by Cascading’s planner to the underlying framework, originally MapReduce.
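The pipe-assembly idea can be illustrated with a toy sketch. The `Pipe` class below is hypothetical and only mimics the overall shape (named nodes wired into a DAG whose topological order a planner can then map to an engine); Cascading's real primitives are richer (`Pipe`, `Each`, `GroupBy`, `CoGroup` and friends):

```java
import java.util.ArrayList;
import java.util.List;

// Toy "pipe": a named node with zero or more upstream pipes, forming a DAG.
// (Illustrative sketch only; not Cascading's actual Pipe class.)
class Pipe {
    final String name;
    final List<Pipe> upstream = new ArrayList<>();

    Pipe(String name, Pipe... heads) {
        this.name = name;
        this.upstream.addAll(List.of(heads));
    }

    // Lists the DAG's nodes in topological order, sources first --
    // the kind of graph a planner maps onto an execution fabric.
    List<String> plan() {
        List<String> order = new ArrayList<>();
        visit(this, order);
        return order;
    }

    private static void visit(Pipe p, List<String> order) {
        for (Pipe head : p.upstream) visit(head, order);
        if (!order.contains(p.name)) order.add(p.name);
    }
}

public class PipeAssemblyDemo {
    public static void main(String[] args) {
        Pipe source = new Pipe("source");
        Pipe clean  = new Pipe("clean", source);
        Pipe upper  = new Pipe("upper", clean);        // the stream splits here...
        Pipe lower  = new Pipe("lower", clean);
        Pipe merged = new Pipe("merge", upper, lower); // ...and merges back here
        System.out.println(merged.plan()); // [source, clean, upper, lower, merge]
    }
}
```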

Cascading 3.0 goes beyond MapReduce by allowing enterprise developers to build their data applications once, then run those applications on the framework that best meets their business needs. Cascading 3.0 will initially ship with support for local in-memory execution, Apache MapReduce (support for both Hadoop 1 and 2 is provided), and Apache Tez. Soon thereafter, with community support, Apache Spark™, Apache Storm and others will be supported through its new pluggable and customizable planner.

A new planner introduced in Cascading 3.0 allows users to create rules that assert the correctness of the graph and annotate graph nodes with metadata used at runtime, based on the local topology. The planner also allows the graph to be transformed in order to balance, insert, remove, or reorder nodes, and it partitions the graph by recursively finding smaller sub-graphs that map to compute units (such as a Map or Reduce node, or a Tez process).
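The partitioning step can be sketched in miniature: cut a chain of operations into sub-graphs at "barrier" operations (a grouping step, say), with each sub-graph becoming one compute unit. The `Op` and `PlannerSketch` names are hypothetical, and the real planner works on full DAGs with rule sets rather than a simple chain; this only illustrates the idea:

```java
import java.util.ArrayList;
import java.util.List;

// An operation in a linear pipeline; a "barrier" op forces a new compute unit
// (roughly the way a grouping step separates map-side from reduce-side work).
record Op(String name, boolean barrier) {}

public class PlannerSketch {
    // Splits the op chain so each barrier begins a new compute unit.
    // (Hypothetical sketch; not the actual Cascading planner.)
    static List<List<String>> partition(List<Op> ops) {
        List<List<String>> units = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (Op op : ops) {
            if (op.barrier() && !current.isEmpty()) {
                units.add(current);            // close the previous unit
                current = new ArrayList<>();
            }
            current.add(op.name());
        }
        if (!current.isEmpty()) units.add(current);
        return units;
    }

    public static void main(String[] args) {
        List<Op> ops = List.of(
            new Op("parse", false), new Op("filter", false),
            new Op("groupBy", true), new Op("count", false));
        // One map-side unit, one reduce-side unit:
        System.out.println(partition(ops)); // [[parse, filter], [groupBy, count]]
    }
}
```

Under MapReduce each unit would become a map or reduce phase; under Tez the same units could become vertices in a single DAG, which is where the efficiency gain comes from.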

Once the appropriate compute units are defined, Cascading builds the execution configuration plan, which leverages a framework-specific jar (and Maven POM) that insulates the application from the framework’s APIs. Both the jar and the POM are provided by Cascading.

The open, pluggable architecture implemented by Cascading 3.0 makes the product easy to extend to additional frameworks: one implements a new set of planner rules for a given framework, along with a framework-specific jar and POM.

In addition to open source Cascading 3.0, Concurrent also recently announced its commercial product Driven, which provides real-time monitoring, operational control and performance management for Cascading applications. Driven provides a set of screens to support the following features:

  • Understand – Seeing a data app executing in real time and visually drilling down into each unit of work.
  • Diagnose – Quickly identifying failed (including failure reasons) and poorly performing applications.
  • Optimize – Visually breaking down vital application metrics to spot performance issues and anomalies.
  • Track – Viewing and comparing the history of an application’s run-time performance.

The new products released by Concurrent will ease application migration to new computation frameworks such as Apache Tez and other best-of-breed technologies. This allows enterprises to standardize on a single API to solve a variety of business problems, and to introduce new, more suitable Big Data frameworks without massive application rewrites. Driven adds operational visibility into new and existing Big Data applications, from development to production.

Concurrent Announces Upgrade to Cascading Big Data Development Framework

David Ramel, Application Development Trends
May 13, 2014
http://adtmag.com/Articles/2014/05/13/cascading-upgrade.aspx

Concurrent Inc. today announced an upgrade of its Cascading development framework for building Big Data applications, offering more choices for working with the Apache Hadoop ecosystem.

Cascading 3.0, due early this summer, features a new pluggable query planner that can be customized for working with different technologies, basically offering alternatives to using the problematic MapReduce programming model, sometimes referred to as an execution fabric.

MapReduce was an integral part of the original Hadoop system for Big Data programming, handling the compute function for data stored in the Hadoop Distributed File System (HDFS). MapReduce’s notorious complexity and inflexibility reportedly helped prompt Chris Wensel to found Concurrent to provide more options and simplify Big Data programming with Hadoop.

One of those options in Cascading 3.0 is the ability to work with the emerging Apache Tez project, an alternative application framework that takes advantage of improvements to the Hadoop ecosystem, such as Apache Hadoop YARN, that came with Hadoop’s upgrade to version 2. YARN stands for Yet Another Resource Negotiator and is described as a tool that separates resource management from processing components.

“For existing users, they will be able to migrate their existing applications from Hadoop 1 or 2 MapReduce to Apache Tez trivially,” Wensel told this site. “Or any other new fabric that the community adds support for.

“And, as Tez matures and gains features, users can create or use new rules in our query planner to experiment with different features or optimizations in Tez,” said Wensel, the company’s CTO. “The same will be true with other fabrics.”

Cascading 3.0 will ship with support for both MapReduce and Tez and will feature local in-memory computing. Concurrent said soon after release, community support for the open source software is expected to add the capability for the new pluggable and customizable query planner to work with other alternative technologies such as Apache Spark and Apache Storm.

Concurrent said it designed Cascading 3.0 to help developers build data applications once and then run those applications on the most appropriate fabric, providing flexibility to solve business problems of varying complexity, regardless of latency or scale.

“In the same way people have adopted Spring and J2EE for container and service-oriented application development, Cascading is used by enterprises to develop data-oriented applications,” Wensel told this site.

Concurrent, which just a few weeks ago announced a partnership with major Big Data player Hortonworks, today also announced a partnership with Databricks that will let Cascading work with the Apache Spark processing engine.

Databricks, founded by the creators of Apache Spark — recently upgraded to a top-level Apache project — builds software for analyzing data and extracting value. Spark is an open source processing engine — or data analytics cluster computing framework — that speeds up Big Data analytics through the use of in-memory computing and other means.

“One of our primary goals is to drive broad adoption of Spark and ensure a great experience for users,” said Databricks CEO Ion Stoica in a statement. “By partnering with Concurrent, all the developers who already use Cascading will be able to deploy their applications on Spark, while Spark users benefit from direct access to all of the benefits of Cascading and Driven. We are committed to open source and partnering with proven market leaders like Concurrent to drive new growth and innovation in the Big Data community.”

Driven is Concurrent’s flagship commercial offering designed to enhance the development and management of data applications for enterprises.

Cascading 3.0 Future-Proofs Data-Centric Application Development on Hadoop

Joyce Wells, Database Trends and Applications
May 13, 2014
http://www.dbta.com/Editorial/News-Flashes/Cascading-30-Future-Proofs-Data-Centric-Application-Development-on-Hadoop-96955.aspx

Concurrent, Inc., the company behind Cascading, an open source application development framework for building data applications on Hadoop, has announced Cascading 3.0, which CEO Gary Nakamura says will give enterprises the flexibility to build their data-oriented applications on Hadoop once, and then run the applications on the platform that best meets their business needs.

Cascading is focused on providing reliable and reusable tools for building data products, while also giving users with varying skill sets the freedom to solve problems, said Nakamura.

“Cascading 3.0 will allow applications to execute on whatever fabric that we support, and that end users want to run on, through our new query planner – that means that an application that was written 2 years ago and that solves a particular business problem for an end user can very quickly be migrated over to a newer fabric like Apache Tez or Apache Spark,” Nakamura said. “Enterprise users can write an application once and deploy on whatever fabric they would like depending on what the business problem is.”

The added migration flexibility is critical for the Hadoop community, says Nakamura. For existing customers, it means they can move to new computation platforms with very little effort. Longer term, it is important for mainstream adoption: the rapid innovation happening inside Hadoop is causing some enterprises to sit on the sidelines, concerned that it is too complex and that, if they build an application now and the platform changes, they will have to do a complete rebuild.

“What we are providing is a standard way to develop data-centric applications without the risk of having to rewrite those applications when distributions or the providers of the computation engines underneath it change direction one day.”

Cascading 3.0 will ship with support for local in-memory, Apache MapReduce, and Apache Tez. Shortly after, support will be added for Apache Spark, Apache Storm and others through the new pluggable and customizable query planner.

Third-party products, data applications, frameworks and dynamic programming languages built on Cascading will benefit from this portability, according to the company. Cascading offers compatibility with all major Hadoop vendors and service providers including Altiscale, Amazon EMR, Cloudera, Hortonworks, Intel, MapR and Qubole, as well as others.

Cascading is used by enterprise Java developers first and foremost, but Concurrent offers interfaces that allow users working with R, MicroStrategy, or SAS to take their predictive models and deploy them on Hadoop. “We also have a SQL interface so SQL end users and anyone who knows how to program with SQL can leverage Cascading,” said Nakamura.

Concurrent also recently announced strategic industry partnerships with Hortonworks and Databricks, and new product innovation with the introduction of Driven, its flagship product that provides application performance management for data-centric applications from development through production.

Cascading version 3.0 will be available in early summer and freely licensable under the Apache 2.0 License Agreement. Concurrent also offers standard and premium support subscriptions for enterprise use.

With Cascading 3.0, application developers can operationalize the Hadoop ecosystem

Maria Deutscher, Silicon Angle
May 13, 2014
http://siliconangle.com/blog/2014/05/13/with-cascading-3-0-application-developers-can-operationalize-the-hadoop-ecosystem

Concurrent, an up-and-coming startup working to simplify the creation of data-driven applications, has pulled the curtains back on a revamped version of its flagship development framework that facilitates integration across the full spectrum of technologies in the Hadoop ecosystem to enable an entirely new set of use cases.

Cascading, as the San Francisco-based company’s software is called, is available under an Apache license and serves as an abstraction layer between Hadoop and the applications using it, shielding developers from the inherent complexity of MapReduce. The third release, introduced this morning, extends that simplicity to the dozens of complementary open source components available for the batch processing platform in order to make the capabilities of those tools more accessible to enterprise applications.

The new functionality represents a major step forward toward Concurrent’s vision of democratizing analytics, which is founded on the classical notion that business logic must be decoupled from the code that handles information.

“Building applications on top of Hadoop was very difficult. That’s why our founder Chris Wensel created a framework so you could have a separate business logic layer from the data layer, and it’s written in Java so any Java programmer can pick it up,” Gary Nakamura, the CEO of Concurrent, explained to SiliconANGLE in an exclusive interview on theCUBE during O’Reilly Fluent Conference 2013.

“The requirement for the enterprise is not to learn new skills for Hadoop but to leverage existing skills, existing systems and existing investments they already made in their infrastructure,” Nakamura added. Cascading now delivers that abstraction for the various specialized tools in the Hadoop ecosystem as well through both direct and indirect support.

Out of the box, the framework is compatible with Tez, a distributed execution engine that offers superior performance to MapReduce with lower latency, a combination that is especially useful for fast-moving streaming workloads such as sensor data. Other technologies can be plugged into Cascading using a new built-in query planner that Concurrent said will be used to add support for two additional Apache projects in the near future.

One of the items on the list is Spark, a separate implementation of the concepts detailed in the 2007 Microsoft Research paper that Tez is based on, and one generally considered more mature and better suited for production as a result. The firm said that Cascading will eventually also work with Storm, a third real-time processing framework that was open sourced after Twitter acquired its original developer, BackType.

Of particular note is that, through the integrations and a new local caching function, the latest version of the framework allows for in-memory processing. That is significant because, as Wikibon co-founder and chief analyst Dave Vellante explained in a recent segment on theCUBE, eliminating the overhead associated with retrieving information from disk can improve application performance by several orders of magnitude, removing the I/O limitations that have historically prevented developers from taking full advantage of their data.

Cascading Allows Apps to Execute on All Big Data Fabrics

Thor Olavsrud, CIO Magazine
May 13, 2014
http://www.cio.com/article/752747/Cascading_Allows_Apps_to_Execute_on_All_Big_Data_Fabrics

Concurrent says Cascading 3.0 will support local in-memory, Apache MapReduce and Apache Tez out of the gate with support for Apache Spark and Apache Storm soon to follow.

CIO — Organizations are increasingly focusing on building enterprise data applications on top of their Hadoop and NoSQL infrastructure. But even as that’s happening, Hadoop itself is becoming much more diverse and complex. That’s a potential headache for developers seeking to build applications on top of that data infrastructure, but data application platform specialist Concurrent, primary sponsor of the open source Cascading application framework, sees it as an opportunity.

While Apache Hadoop began as a combination of Hadoop Distributed File System (HDFS) for file storage and MapReduce for compute, there are now a growing number of options for compute in Hadoop, including Apache Tez (a framework for near real-time big data processing), and the soon-to-be-released Apache Spark (a framework for in-memory cluster computing) and Apache Storm (a distributed computation framework for stream processing). Hadoop distribution vendor MapR even offers an alternative to HDFS in its distribution.

“Thinking in MapReduce is one thing, but then having to think in Tez is something else,” says Chris Wensel, founder and CTO of Concurrent and original author of Cascading. “It’s a huge challenge.”

“Hadoop is balkanizing and fracturing,” he adds. “There is no more Hadoop. There’s HDFS and whatever runs on top of it.”

Cascading Is a Software Abstraction Layer for Hadoop

Cascading is a software abstraction layer for Apache Hadoop that is intended to allow developers to write their data applications once and then deploy those applications on any big data infrastructure, regardless of the components in use. That is what has allowed Concurrent to win big Web 2.0 customers like eBay, LinkedIn, Twitter and Pinterest (as well as a slew of others), and what now contributes to more than 150,000 user downloads a month. Customers use it to build applications ranging from enterprise IT uses like ETL and operational analysis, to corporate apps like HR analytics, telecom apps like location-based services, marketing apps like funnel analysis and ad optimization, consumer/entertainment apps like music recommendations, finance apps like fraud and anomaly detection, and health/biotech apps like veterinary diagnostics and next-generation genomics.

Wensel says he originally wrote Cascading in anger — after using MapReduce once, he was determined that no one would have to use it directly again. Now, with Cascading 3.0, announced today, the framework goes even further — it’s not just about MapReduce anymore.

Cascading 3.0 Will Support Emerging Big Data Fabrics

Cascading 3.0 will allow data apps to execute on existing and emerging fabrics through its new customizable query planner, says Wensel. When released it will support local in-memory, Apache MapReduce and Apache Tez out of the gate, with support for Apache Spark and Apache Storm soon to follow. The idea is to allow enterprises to standardize on one API that will allow them to build data applications to solve a variety of business problems ranging from simple to complex, regardless of latency or scale. In addition, Wensel says third-party products, data applications, frameworks and dynamic programming languages built on Cascading (like Scalding or Cascalog) will immediately benefit from the portability.

Concurrent has also forged close strategic partnerships with Hortonworks (one of the primary sponsors of Apache Hadoop) and Databricks (the primary sponsor of Apache Spark). Hortonworks will now integrate the Cascading SDK with its Hortonworks Data Platform (HDP) distribution of Hadoop, and will certify and support the SDK with HDP. Cascading will also support Apache Spark in a future release, and Concurrent notes that companies using Cascading will be able to seamlessly run their applications on Spark.

Concurrent says Cascading 3.0 will be available early this summer and freely licensable under the Apache 2.0 License Agreement.

Concurrent, Inc. and Databricks Partner to Make Enterprise Data Application Development Simpler, Faster, Smarter

Strategic Partnership Answers Customer Demand by Bringing Together Leading Open Source Platforms – Cascading and Apache Spark

SAN FRANCISCO and BERKELEY – May 13, 2014 – Concurrent, Inc., the enterprise data application platform company, and Databricks, the company founded by the creators of Apache Spark, today announced a strategic partnership to enable Cascading to seamlessly operate over Spark, a next-generation Big Data processing engine that supports batch, interactive and streaming workloads at scale. This partnership will enable both companies to meet customer demand for simpler and more flexible enterprise application development, and give the thousands of enterprises using Cascading the ability to leverage Spark, which is now a part of all major Hadoop distributions and also available from enterprise database and NoSQL vendors.

While enterprises are heavily investing in building data-centric applications to operationalize their data, these data applications must be able to meet business requirements that vary in latency, scale and service levels. To meet these requirements, enterprises are leveraging Spark’s unique in-memory computing capabilities and full breadth of functionality to deliver the necessary speed and sophistication required for data processing at scale.

Enterprises looking to run their data applications on Spark will be able to leverage Cascading, the proven framework that simplifies enterprise application development. With more than 150,000 downloads a month, Cascading is the enterprise development framework of choice for building data-centric applications, and can soon be utilized to build robust data applications and deploy them at scale on Spark. This ability provides enterprises the flexibility to easily adapt their data applications to meet business challenges and solve a variety of business problems ranging from simple to complex, regardless of latency or scale.

Concurrent and Databricks are empowering enterprises to simplify their data application development, while providing the flexibility and performance benefits of Spark. Developers can leverage Cascading’s framework for robust, easy and seamless application development with Spark’s execution engine for maximum performance and versatility. These benefits extend to any Cascading-based dynamic programming language, including Scalding, Cascalog, Lingual, Pattern and Driven. With Driven, enterprises will gain operational visibility to their data applications running on Spark, accelerating the time to market for their applications.

Cascading will add Spark support in the near future and both are freely licensable under the Apache 2.0 License Agreement. For more information and notification of availability please visit us at http://www.cascading.org/spark-support.

Supporting Quotes

“As the open source Hadoop community has evolved, Spark has emerged as a key addition to the burgeoning ecosystem of frameworks that provide application builders new ways to generate insight from the large volumes of data in the enterprise. Cloudera, the first commercial distribution to provide support for Spark, is excited to see the addition of Spark to Cascading and Driven, key technologies used by enterprise application developers. With the partnership of Concurrent and Cloudera, and certification of Cascading on CDH 5, Cloudera customers can rely on building highly scalable applications on top of the Concurrent platform.”
-Jairam Ranganathan, Director of Product, Cloudera

“One of our primary goals is to drive broad adoption of Spark and ensure a great experience for users. By partnering with Concurrent, all the developers who already use Cascading will be able to deploy their applications on Spark, while Spark users benefit from direct access to all of the benefits of Cascading and Driven. We are committed to open source and partnering with proven market leaders like Concurrent to drive new growth and innovation in the Big Data community.”
-Ion Stoica, CEO, Databricks

“New business problems demand applications that can connect business and data. As the community gets serious about building data applications, we’re supporting the emergence of new fabrics to give users maximum choice and flexibility in building and deploying their Big Data apps. Concurrent continues to set the industry standard for building data applications and Databricks is an ideal partner to extend the adoption and contribute to the core functionality of Cascading and Driven.”
-Gary Nakamura, CEO, Concurrent, Inc.

Supporting Resources

About Databricks
Databricks is using cutting-edge technology based on years of research to build next-generation software for analyzing and extracting value from data. Its founders created Apache Spark and Shark, and are deeply committed to open source. Based in Berkeley, California, Databricks is venture-backed by Andreessen Horowitz.

About Concurrent, Inc.
Concurrent, Inc. is the leader in Big Data application infrastructure, delivering products that help enterprises create, deploy, run and manage data applications at scale. The company’s flagship enterprise solution, Driven, was designed to accelerate the development and management of enterprise data applications. Concurrent is the team behind Cascading™, the most widely deployed technology for data applications with more than 150,000 user downloads a month. Used by thousands of businesses including eBay, Etsy, The Climate Corp and Twitter, Cascading is the de facto standard in open source application infrastructure technology. Concurrent is headquartered in San Francisco and online at http://concurrentinc.com.

###

All trademarks are the property of their respective owners.

Media Contact
Danielle Salvato-Earl
Kulesa Faul for Concurrent, Inc.
(650) 922-7287
concurrent@kulesafaul.com

Concurrent, Inc. Leads the Market for Data-Driven Enterprise Application Development

New Cascading 3.0 supports multiple data processing fabrics for running enterprise data applications at the speed of business

SAN FRANCISCO – May 13, 2014 – Concurrent, Inc., the enterprise data application platform company, today announced product and corporate momentum securing the company’s leadership in enterprise application development. The company recently announced strategic industry partnerships with Hortonworks and Databricks, as well as new product innovation with the introduction of Driven, the industry’s first application performance management product for data-centric applications. Today Concurrent also introduced the next version of Cascading, the most widely used application development framework for building data applications on technologies like Apache Hadoop.

Enterprises have always operationalized their data. But as business needs change and new technologies – such as Apache Hadoop and now Apache Tez – emerge, organizations need a reliable way to quickly build and consistently deliver data products. This requires leveraging existing skill sets while meeting the new requirements (e.g., latency, scale, service-level agreements) supported by these emerging technologies.

Cascading 3.0 Sets the Standard for Enterprise Application Development

With more than 150,000 user downloads a month, Cascading is the de facto standard in open source application infrastructure technology. Supported by key strategic partnerships with Hortonworks and Databricks, and broad support with all major Hadoop distributions, Cascading is the enterprise development framework of choice for data-centric applications. Cascading accelerates and simplifies enterprise application development, and meets a variety of enterprise use cases, from simple to complex.

Cascading 3.0 is a major leap forward in enterprise data-centric application development. Features and benefits include:

  • Cascading 3.0 provides the most comprehensive data application framework to meet business challenges and solve a variety of business problems ranging from simple to complex, regardless of latency or scale.
  • Cascading 3.0 allows enterprises to build their data applications once, while providing the flexibility to run applications on the fabric that best meets their business needs.
  • Cascading 3.0 will ship with support for: local in-memory, Apache MapReduce, and Apache Tez.
  • Soon thereafter, with community support, Apache Spark™, Apache Storm and others will be supported through its new pluggable and customizable query planner.
  • Third party products, data applications, frameworks and dynamic programming languages built on Cascading will immediately benefit from this portability.
  • Cascading offers compatibility with all major Hadoop vendors and service providers: Altiscale, Amazon EMR, Cloudera, Hortonworks, Intel, MapR and Qubole, among others.

Concurrent is leading the market in Big Data application infrastructure with Cascading and Driven. With Cascading at the core, Concurrent continues to meet customer demand for advanced enterprise application development by supporting emerging fabrics and technologies, forging important industry partnerships and making data-driven application development simpler, faster and smarter.

Supporting Quotes

“As a partner, we welcome Concurrent’s contributions to further expand the Hadoop ecosystem by enabling faster development and deployment of data-centric applications on 100-percent open source Hadoop. Together we can help our customers adapt to evolving business needs and derive even more value from their Big Data solutions.”
-John Kreisa, vice president of strategic marketing, Hortonworks

“I’m proud to see how Cascading has enabled thousands of developers and businesses to be successful at what they do. Cascading 3.0 will enable our users even further by simplifying application development, accelerating time to market and allowing enterprises to leverage existing, and more importantly, new and emerging data infrastructure and programming skills.”
-Chris Wensel, founder and CTO, Concurrent, Inc.

Availability and Pricing

Cascading version 3.0 will be available in early summer, freely licensed under the Apache License 2.0. To learn more about Cascading, visit http://cascading.org. Concurrent also offers standard and premium support subscriptions for enterprise use.

About Concurrent, Inc.

Concurrent, Inc. is the leader in Big Data application infrastructure, delivering products that help enterprises create, deploy, run and manage data applications at scale. The company’s flagship enterprise solution, Driven, was designed to accelerate the development and management of enterprise data applications. Concurrent is the team behind Cascading, the most widely deployed technology for data applications with more than 150,000 user downloads a month. Used by thousands of businesses including eBay, Etsy, The Climate Corp and Twitter, Cascading is the de facto standard in open source application infrastructure technology. Concurrent is headquartered in San Francisco and online at http://concurrentinc.com.

###

All trademarks are the property of their respective owners.

Media Contact
Danielle Salvato-Earl
Kulesa Faul for Concurrent, Inc.
(650) 922-7287
concurrent@kulesafaul.com

Creating a Big Data Factory

Gary Nakamura, CEO, Concurrent, Inc.
May 5, 2014
http://insights.wired.com/profiles/blogs/creating-a-big-data-factory

It is time to retire the myth of the data science hero – the virtuoso who slays dragons and emerges with an amazing app built on big data insights. If we examine leading companies, we find not only lots of smart people, but also entire processes and teams focused on doing great work over and over again. In successful organizations, big data applications are not the virtuoso effort of a lone data scientist. Rather, they are built by teams of analysts, data scientists, developers and operations staff working together to rapidly build high-value applications, so organizations can systematically operationalize their data. The reason to move toward repeatable victories and away from virtuosity, as this article will explain, is that virtuosity is expensive, risky and doesn't scale.

The Big Data Factory: Less Complexity, Reproducible Victory

In the early days at almost every one of the big data pioneers, application development ran more like a virtuoso process than a factory of teams. When most companies first start experimenting with big data, this pattern usually holds. But when they want to scale fast with reproducible results, well, they quickly find they need to run more like a factory.

Here’s what happens. Excitement about big data leads to experiments and sometimes even to transformative insights. Data scientists partner with developers or just hack on their own to create an innovative application—but frankly, a brittle one, with no process to recreate or maintain it. However sweet that victory was, companies quickly learn that it probably isn’t repeatable when pursuing 10 or 15 or 20 other apps at the same time. You want victory after victory, not one brittle application after another.

In turn, companies moved away from this virtuoso process to a more methodical “Big Data Factory.” These factories exist already. For example, Twitter is not starting from scratch every time it recognizes a new opportunity to monetize Tweets; it’s building on past success. And LinkedIn applications, such as “People You May Know” and “Groups You May Like,” started out as virtuoso products but then, due to their success, became repeatable platforms to support other applications.

What’s Wrong with Virtuosos?

Businesses can’t afford the virtuoso approach to application development, relying on a single data scientist or developer for their victories. Many companies have learned lessons the hard way, finding themselves with a steep learning curve trying to maintain an application created by a virtuoso who flew the coop. Besides that, for the most important apps, no single data scientist (or developer) knows enough to create the whole thing on his or her own.

Businesses can’t afford complexity in application design, as complexity creates risk. You can’t afford to lose the one person who understands how a project all fits together; otherwise you’ll find yourself unable to maintain or iterate the application – and you must, because data is organic and changes with user behavior. Today major companies like Twitter, LinkedIn and others are entirely dependent upon adapting applications to new data and to new patterns emerging in the data.

But with big data apps, whether created by a single person or a team, complexity is the norm as developers are still using the equivalent of Hadoop assembly language (raw MapReduce) to build applications in place of more efficient tools or techniques (for example, languages such as Scala with development frameworks like Cascading). Big data companies like LinkedIn and Twitter were among the first to figure this out, as they understood that while Apache Hadoop projects were crucial for creating an infrastructure, they are not optimal for creating and deploying numerous applications. The end goal, therefore, is to build enterprise applications powered by Hadoop without having to become an expert in its intricacies.

The difference between using an inferior tool that sort of solves the problem and a tool that solves it completely should be obvious: better tools overcome complexity. Compare an application written in Cascading versus an incumbent approach. To stand up the same application, you’ll hand off one file to operations, versus 17 or 18 files with 20 different scripts across various incongruous projects.

In order to remain sustainable, businesses need repeatable, transparent development processes that can generate maintainable products—like a factory.

What Does a Big Data Factory Look Like?

Let’s compare a Big Data Factory to an automotive manufacturer. They’re alike in that an entire team designs and produces the product. The data scientist is like an automotive design engineer; developers are like the mechanical and electrical engineers who build a car prototype; operations creates and runs the factory that makes the cars; and early users who provide feedback are like test drivers. From this team comes a high-quality product—be it a new-model Chevrolet or a business application. Some applications will be more successful than others, but all of them are drivable and maintainable—and, importantly, were created using a repeatable process.

For auto manufacturing, computer-aided design (CAD) was a tremendous advance over the drafting table, and I believe application framework tools are a tremendous advance over Hadoop assembly language. Today, teams don’t need to know an assembly language like MapReduce; instead, they can focus on marrying the business problems to data. Similar to an automotive assembly line, teams can develop and iterate an application very quickly, and once they feel it’s production ready, they can launch the application.

I mentioned quick iteration, and the key is collaboration, which a user-friendly application framework enables. No one person, not even the most brilliant data scientist, can decipher exactly what is going on with ever-changing organic data and then translate that into a full-blown solution. The team as a whole needs to decipher the results of its last test run and tweak the data application as needed.

Starting Your Big Data Factory

A company that has just entered the big data business, whether by desire or market pressure, doesn’t have to go through the trenches that Twitter, eBay and LinkedIn have already dug. Most companies can’t afford to, nor do they have the in-house skills or resources to navigate and survive such complexity. And why should they? A host of big data giants today are showing us how to build big data factories that turn out quality products through repeatable processes. And just like modern auto manufacturing, it all comes down to teamwork and using the right tools.

So how does a company go about creating its own big data factory? First, do your research to identify the right big data tools. As I recently told Software Magazine, I recommend selecting tools that are widely deployed, hardened and easily incorporated into standard development processes.

Next, think teamwork. Once you know what tools you want to use, assess the skills gap you face. You may have thought you needed someone with MapReduce skills, but after doing due diligence about available options, you will find that you can leverage existing Java skills, data analysts and data warehouse ETL experts as well as data scientists. Make sure your team includes people with deep business knowledge along with an understanding of data and its business implications.

With the right tools and with a realistic assessment of the skills you have versus those you need, you will be ready to create your own big data factory. The benefit is being able to achieve the repeatable victories that deliver real business value, month over month and year over year.

I’ll take that over virtuosity any day.

Gary Nakamura is the CEO of Concurrent.

Invest More in Metadata to Make More of Your Data

Gary Nakamura, CEO, Concurrent, Inc.
May 1, 2014
http://siliconangle.com/blog/2014/05/01/invest-more-in-metadata-to-make-more-of-your-data/

Companies spend millions of dollars to get an edge from the data they own. However, all too often their efforts are out of balance. Data is hoarded without a clear purpose, and not nearly enough time is invested in capturing and analyzing the data about the data: the metadata. The fact of the matter is that metadata is a valuable tool that can answer as many questions as the data itself.

Metadata is essentially the lever that amplifies the value of data. It provides context: how much data was processed, how much was read and written, the data’s source and destination, the algorithms used to analyze it, how many versions of the data exist, and which versions are used most often.

As enterprises move from ad-hoc development to operationalizing data to building teams to create and maintain a continual flow of big data applications, recognizing the potential of metadata is crucial. Metadata can provide valuable business insights to constituents of a team, an organization, and a CEO so that each player can do their job better. This is particularly true as metadata is often derived from the context or manner in which the data is used, shedding further light on who or what used the data and in what way the data provided value.

With a rich set of metadata, you can zoom into the details of your data or zoom out to see the bigger picture—all to gain insight into how your business is running. Whether your role is in compliance, operations, or application development, metadata is critical to leveraging your data. Anyone looking to find the business value in data can refer to real-world examples of metadata’s value, such as those described below, and consider how to collect and exploit it in their own organizations.

Metadata: More Valuable than the Data Itself

In almost all conspicuous data victories (both popular and mundane), data has been used in conjunction with metadata.

Facebook is a great example of a data company that is deriving billions in revenue from use of metadata. While the company receives terabytes of data per minute, it certainly isn’t reading posts to find out if you like Coca-Cola or if you’re in the market to switch car insurance. Rather, Facebook leverages its data on a deeper level, looking at what you like (or stop liking), the brands you engage with, the quizzes you take, the social games you play and the apps you use. In turn, Facebook can create a profile based on user behavior—a metadata profile that it monetizes through eerily optimized ads.

Data is organic; it ebbs, flows and oozes through an organization. Capturing its navigation points, the details of its every stop, as well as details about the people and systems that manipulated it, will tell you as much or more about your business as the data itself. People navigate themselves to what’s most useful (or in the case of Facebook, what they consider most valuable or interesting). Metadata can capture that ebb and flow, and by analyzing it, you can gain insight into how your data is used, which is often more interesting than the data itself.

In turn, the imperative is to collect that metadata gold and use it to supplement other sources of user research, such as focus groups or polls. Metadata brings into focus how your data is being leveraged, where it’s being leveraged, whether or not your resources are being used efficiently, and ultimately what’s important.

Big data almost always represents some micro level of action—the phone call, the Facebook Like, the download, the click—but that micro-level data alone offers only an incomplete story. There’s nothing compelling about a record of four or five credit card transactions in isolation, but there’s something enormously telling about metadata when it shows that these transactions took place in five different states within the same two-hour period. Metadata moves data from a micro to a summary level, which can then become the raw materials for building a model to extract meaning.

Simply put, metadata enables you to gain broader and deeper insights by looking at the usage and summary of your data. As metadata surrounds raw data, it sheds light on a wider sphere of activity, thereby expanding the context of analysis. The result is that the model of the customer, the process or other interactions becomes richer and can tell us more about the past and the future.

One example of metadata in action is Amazon.com’s anticipatory shipping. By watching how customers interact with items in their carts, Amazon has a pretty good idea when someone is going to make a purchase. The signals in the metadata (viewing the item, reading reviews, going back to the page, interacting with the shopping cart) provide enough assurance to support moving the item in question to a warehouse near the customer. That practice is not exclusive to Amazon, and given web logs, ecommerce metadata is certainly there for the taking.

We’re All in the Metadata Business

In “Using Metadata to Find Paul Revere,” Kieran Healy, a professor of sociology at Duke University, showed how the British Crown could have used metadata available at the time to identify Paul Revere as a revolutionary. On a more amorous note, UCLA math student Chris McKinlay used metadata to find a compatible woman through OkCupid.

In financial services, there’s a governance, risk and compliance angle to metadata. At a recent banking tech conference, one of the speakers voiced the need for granular metadata. Banks are under tremendous pressure to comply with new and ever-changing regulations. In some cases, banks are required to explain exactly how they derived their analytical results, answering questions that include: Where did the source data come from? Did the processing use a join, a filter or a merge? What algorithm was used? Which predictive model? How many versions of the data exist? Which data set was ultimately used to derive the result?
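Answering those compliance questions means capturing lineage metadata at processing time. Here is a hypothetical Java sketch of what such a record might look like (the field names are invented for illustration, not drawn from any particular product):

```java
import java.time.Instant;

// Hypothetical sketch: a minimal lineage record capturing the kinds of
// metadata regulators ask about -- source, destination, the operation and
// algorithm applied, and which version of the data was used.
public class LineageRecord {
    final String source;       // where the input data came from
    final String destination;  // where the result was written
    final String operation;    // e.g. "join", "filter", "merge"
    final String algorithm;    // e.g. the predictive model applied
    final int dataVersion;     // which version of the data set was used
    final Instant processedAt; // when the derivation ran

    LineageRecord(String source, String destination, String operation,
                  String algorithm, int dataVersion) {
        this.source = source;
        this.destination = destination;
        this.operation = operation;
        this.algorithm = algorithm;
        this.dataVersion = dataVersion;
        this.processedAt = Instant.now();
    }

    // Answers the compliance question: how was this result derived?
    String provenance() {
        return String.format("%s -> %s via %s (%s, data v%d)",
                source, destination, operation, algorithm, dataVersion);
    }

    public static void main(String[] args) {
        LineageRecord r = new LineageRecord(
                "warehouse.transactions", "reports.risk_q1",
                "join", "logistic-regression", 3);
        System.out.println(r.provenance());
    }
}
```

Emitting one such record per processing step, alongside the data itself, is what lets a bank later reconstruct exactly how a result was derived.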

Here in Silicon Valley, new tech startups are building products to help organizations and consumers make sense of their metadata, and improve their businesses and lives through their use of data. The potential for metadata to support better operations, better personal well-being and better fidelity is unlimited.

Fitbit, Jawbone and Nike Fuel all track what we do, and also when and where we do it—expanding raw data from an accelerometer to generate reminders to exercise and offering analysis on the quality of our sleep. The Nest acquisition by Google and other investments in the Internet of Things movement are motivated not only by the value of such core businesses, but more importantly by the value and ability of sensors to provide a better understanding of how people live. In the manufacturing realm, ThingWorx was just acquired for its ability to create larger models and advanced automation systems out of metadata provided by sensors and industrial equipment.

As I’ve said before, we’re all in the data business. Of course Facebook and Twitter are in the data business. However, if you’re using data to gain insights and drive decisions, then you too are monetizing your data, and you too are in the business of data. Never mind about the elegance and effectiveness of your data repositories; the way to fully exploit and monetize that data is to build a smarter organization, to inform the models used to run your business and to increase the scope of what you know. That potential is powered by metadata.

About the Author

Gary Nakamura is the CEO of Concurrent, Inc. He joined Concurrent in January 2013 to lead Concurrent through its next phase of growth. Gary has a highly successful track record including significant contributions to the explosive growth of Terracotta where he was SVP & General Manager.