Category Archives: News

Cascading: Open Source Java App Framework for Big Data

John K. Waters, Application Development Trends
December 4, 2013
http://adtmag.com/blogs/watersworks/2013/12/cascading-big-data.aspx

Enterprise interest in Big Data and associated analytics software has sparked intense interest in Apache Hadoop, the open source framework for running applications on large data clusters built on commodity hardware, and something of a flood of tools for developers working with it. But as an applications market emerges in this space, the next Big Thing for Big Data is likely to be app-oriented middleware.

That’s an insight Tony Baer, principal analyst at Ovum, shared with me when I talked with him recently about Continuuity’s recent Reactor 2.0 release, which the Java toolmaker billed as the first scale-out application server for Apache Hadoop.

“It is inevitable that applications will be developed that run against Big Data,” Baer said, “and as that occurs, it will be necessary to have an application layer that allows developers with Java and other languages to develop apps that run against it.”

Baer’s prediction makes perfect sense, and it’s one reason Java jocks might want to keep an eye on Concurrent, the company behind the open source Cascading project. Cascading is a Java application development framework for rich data analytics and data management apps running across “a variety of computing environments,” with an emphasis on Hadoop and API-compatible distributions.

“Big Data is moving to the next phase of maturity and it’s all about the applications,” the company says on its Web site. “The applications process the data and extract the value at scale and we believe that there must be a simple, reliable and consistent way to build, deploy, run and manage these data driven applications.”

Great minds.

Concurrent characterizes Cascading as “a rich Java API for defining complex data flows and creating sophisticated data oriented frameworks,” and it claims more than 110,000 user downloads a month. Its published user list includes Twitter, eBay, Square and Etsy, among others.

The San Francisco-based company recently announced Cascading 2.5 with new support for Hadoop 2 and YARN, the next-gen Hadoop data processing framework (sometimes called MapReduce 2.0).

Chris Wensel, Concurrent’s founder and CTO, has argued that developing and building applications on Hadoop has proven to be difficult, despite the framework’s rapid enterprise adoption. “With Hadoop 2, the community has addressed many concerns, paving a clearer path for enterprise users,” he said in a statement. “At Concurrent, we’re dedicated to forging a simpler path to mass Hadoop adoption by delivering a framework for building powerful and reliable data-oriented applications supporting data driven business models — quickly and easily. Our support for Hadoop 2 was an easy decision, as we continue to be an integral part of the Hadoop and Big Data ecosystem, providing solutions that simplify application development and management for the enterprise.”

As a Java-based framework, Cascading fits naturally into JVM-based languages, including Scala, Clojure, JRuby, Jython and Groovy. And the Cascading community has created scripting and query languages for many of these languages. The company’s extensions page offers a growing list of user contributed code.

Cascading 2.5 is publicly available and freely licensable under the Apache 2.0 License Agreement.

Big Data: Interacting with Hadoop 2 Data Using Standard ANSI SQL

Dick Weisinger, formtek
November 19, 2013
http://formtek.com/blog/big-data-interacting-with-hadoop-2-data-using-standard-ansi-sql/

Cascading is an Apache-licensed application framework for building rich data processing and machine learning applications that run on Hadoop. Cascading applications are built using a simple API that can be called from any JVM-based language. Over the past five years, it has been developed and supported commercially by Concurrent, Inc. A Cascading flow first captures data from ‘sources’, then passes that data through ‘pipes’ where it is processed, and finally pushes the results into output files or ‘sinks’. This flow of data is known as the ‘source-pipe-sink’ paradigm. Cascading is an abstraction at a higher level than MapReduce: Cascading applications ultimately execute as MapReduce jobs, but the developer writing them never has to program against MapReduce directly.
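The source-pipe-sink style described above can be sketched in a few lines of Java. This is a minimal, illustrative sketch against the Cascading 2.x core API (it requires the Cascading jars and a Hadoop environment to actually run); the HDFS paths and the “keep only error lines” filter are hypothetical stand-ins, not part of the article:

```java
import java.util.Properties;

import cascading.flow.Flow;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.operation.regex.RegexFilter;
import cascading.pipe.Each;
import cascading.pipe.Pipe;
import cascading.scheme.hadoop.TextLine;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;

public class SourcePipeSink {
  public static void main(String[] args) {
    // Source: a tap reading lines from an HDFS path (path is hypothetical)
    Tap source = new Hfs(new TextLine(), "hdfs:///input/logs");
    // Sink: a tap writing results to another HDFS path
    Tap sink = new Hfs(new TextLine(), "hdfs:///output/errors");

    // Pipe: the processing step; here, keep only lines containing "ERROR"
    Pipe pipe = new Pipe("errors");
    pipe = new Each(pipe, new RegexFilter(".*ERROR.*"));

    // Connect source -> pipe -> sink into a Flow; the connector plans
    // and runs the equivalent MapReduce job(s) without the developer
    // ever touching the MapReduce API directly.
    Flow flow = new HadoopFlowConnector(new Properties())
        .connect(source, sink, pipe);
    flow.complete();
  }
}
```

The business logic lives entirely in the pipe assembly; the taps and the flow connector decide where the data comes from and where the work runs.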

Chris Wensel, Founder and CTO of Concurrent, said that “building applications on Hadoop, despite its growing adoption in the enterprise, is notoriously difficult. We are driving the future of application development and management on Hadoop, by allowing enterprises to quickly extract meaningful information from large amounts of distributed data and better understand the business implications. We make it easy for developers to build powerful data processing applications for Hadoop, without requiring months spent learning about the intricacies of MapReduce.”

More than 110,000 user downloads of Cascading are made every month. Cascading is used by businesses like Twitter, eBay, The Climate Corporation, Square and Etsy for managing some or all of their Big Data requirements. In fact, all of Twitter’s revenue-generating applications have been built with Cascading.

Today, Cascading 2.5 is being introduced, a version-number jump from the previously available 2.2 point release. The jump in numbering was intended to emphasize the significance of some of the new features in the release. Most significantly, the 2.5 release includes support for Hadoop 2 and YARN. Other highlights of the new 2.5 Cascading release include:

  • Performance improvements for complex join operations and optimizations to dynamically partition and store processed data more efficiently on HDFS.
  • Broad compatibility with other Hadoop vendors and Hadoop as a service providers, including Cloudera, Hortonworks, MapR, Intel, Altiscale, Qubole and Amazon EMR.

Coincident with the release of Cascading 2.5, Concurrent is also making another product, Cascading Lingual, generally available. The Lingual product is an add-on to Cascading that enables a complete ANSI-SQL interface for interacting with Hadoop data. Compatibility with standard SQL means that SQL developed in traditional relational databases can be brought over and used as-is within Lingual. Like Cascading, Lingual comes with the Apache 2.0 license.

Concurrent described the benefits of the Lingual product by saying that “Cascading Lingual provides out-of-the-box support for JDBC. Enterprises that have invested millions of dollars in business intelligence (BI) tools, such as Pentaho, Jaspersoft and Cognos, and training can now also access their data on Hadoop through standard SQL interface.”
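Because Lingual exposes a JDBC driver, querying Hadoop data looks like querying any other JDBC source. A hedged sketch using the standard java.sql API: the connection URL follows Lingual’s documented “jdbc:lingual:&lt;platform&gt;” scheme, and the schema and table names here are hypothetical examples, not from the article. Running it requires the Lingual driver on the classpath and a configured Hadoop cluster:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class LingualQuery {
  public static void main(String[] args) throws Exception {
    // Connect through Lingual's JDBC driver; "hadoop" selects the
    // Hadoop platform planner. (URL form assumed from Lingual's docs.)
    try (Connection conn =
             DriverManager.getConnection("jdbc:lingual:hadoop");
         Statement stmt = conn.createStatement();
         // Plain ANSI SQL; Lingual plans it as Cascading flows that
         // execute as MapReduce jobs on the cluster.
         ResultSet rs = stmt.executeQuery(
             "SELECT customer_id, SUM(total) "
           + "FROM sales.orders "
           + "GROUP BY customer_id")) {
      while (rs.next())
        System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
    }
  }
}
```

This is what lets existing BI tools, which already speak JDBC, point at Hadoop without any Hadoop-specific code.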

André Kelpe, software engineer for Concurrent, summarized three of the design goals for the Cascading Lingual product:

  • Enable immediate ANSI SQL query access to data
  • Simplified system and data integration with reads/writes from HDFS, JDBC, Memcached, HBase and Redshift
  • Simplified migration of existing SQL within Cascading

An online demo of Cascading Lingual is available on YouTube.

Concurrent Seeks to Lower Hadoop Adoption Barrier for SQL Coders

David Ramel, Application Development Trends Magazine
November 21, 2013
http://adtmag.com/articles/2013/11/21/concurrent.aspx

Concurrent Inc. this week released Cascading Lingual, an open source project designed to give developers a standards-based SQL interface for creating and working with Big Data applications on Apache Hadoop. It works with the company’s Cascading application framework used by Java developers for building Hadoop analytics and data management apps.

Concurrent also announced an upcoming upgrade of that framework, Cascading 2.5, which includes compatibility with the recent Hadoop 2 upgrade featuring long-awaited support for YARN.

Cascading Lingual joins the growing list of projects aimed at lowering the Hadoop adoption barrier by making Hadoop Big Data application development more accessible to SQL-savvy developers and integrating existing legacy systems with Hadoop applications. “Cascading Lingual enables virtually anyone familiar with SQL to instantly work with data stored on Hadoop using their JDBC-compliant BI [business intelligence] or desktop tool of choice,” the company said. “Enterprises benefit as they can execute on Big Data strategies using existing in-house resources, skills sets and product investments.”

Developers have the option of using standard Java JDBC interfaces to build apps or Concurrent’s own Cascading APIs that facilitate solutions built with ANSI-standard SQL and custom code written in Java, Scala or Clojure.

Along with the standards-based SQL support and JDBC driver, Cascading Lingual features an interactive SQL Shell command interface for interacting with Hadoop, for example through the execution of SQL commands. A Catalog command-line tool is provided for curating database tables that map to Hadoop files and other resources. Another key feature of Cascading Lingual is a Data Provider mechanism that allows for simultaneous data queries from multiple external data stores with just one SQL command, Concurrent said. Developers can just “cut and paste” existing SQL code to work with Hadoop data or migrate applications to Hadoop clusters.

The company said Cascading Lingual was created via collaboration between the developers of the Cascading Java API and developers of Optiq, a dynamic data management framework and SQL parser originally written by Julian Hyde, who also authored the Java-based Mondrian Online Analytical Processing (OLAP) engine.

The upcoming 2.5 release of the Cascading application framework upon which Lingual is based provides developers with the new capabilities of Hadoop 2. Released last month, Hadoop 2 supports YARN (“Yet Another Resource Negotiator”), a long-anticipated upgrade to the MapReduce job batch processing framework and a core component of Hadoop. Concurrent said Cascading users will now be able to seamlessly upgrade their applications to Hadoop 2. “Furthermore, Big Data applications using domain specific languages (DSLs), such as the widely used Scalding (Scala on Cascading), Cascalog (Clojure on Cascading) and PyCascading (Jython on Cascading) languages, will also seamlessly migrate to Hadoop 2,” the company said.

Concurrent said the framework is used by companies such as Twitter, eBay and Trulia “to streamline data processing, data filtering and workflow optimization for large volumes of unstructured and semi-structured data.”

Cascading 2.5 reportedly will be available for download soon under the open source Apache 2.0 License Agreement. Cascading Lingual is available for download now.

Cascading 2.5 Supports Hadoop 2

Boris Lublinsky, InfoQ
November 19, 2013
http://www.infoq.com/news/2013/11/cascading

Despite a wide and growing adoption of Hadoop, enterprises are still facing the problem of finding the right approach for fast and cost-effective development of Hadoop-based applications. One of the ways to achieve this goal is using Domain Specific Languages (DSLs), which often allow for significant simplification of Hadoop implementations.

One of the most popular Java DSLs on top of the low-level MapReduce API is Cascading. It was introduced in late 2007 as a DSL implementing functional programming for large-scale data workflows. It is based on a “plumbing” metaphor to define data processing as a workflow built out of familiar elements: Pipes, Taps, Tuple Rows, Filters, Joins, Traps, etc.

This week, Concurrent introduced a new version of the product, Cascading 2.5, delivering support for Hadoop 2, including YARN. According to the company’s press release, this new release features:

  • Support for Hadoop 2 and its new features, including YARN. Cascading users looking to upgrade to Hadoop 2 will now be able to seamlessly migrate their applications and take advantage of new advanced features like YARN.
  • Added performance improvements for complex join operations and optimizations to dynamically partition and store processed data more efficiently on HDFS.
  • Additional broad compatibility with other Hadoop vendors and Hadoop as a service providers, including Cloudera, Hortonworks, MapR, Intel, Altiscale, Qubole and Amazon EMR, among others, giving Cascading users a richer set of deployment options and services, whether on-premises or in the cloud.

Simultaneously, Concurrent announced general availability of Cascading Lingual, an open source project that provides a comprehensive ANSI SQL interface for accessing Hadoop-based data. This project covers more than 7,000 SQL-99 statements derived from sophisticated industry standard OLAP tools, and according to Concurrent:

delivers the broadest SQL coverage for any tool in the Hadoop ecosystem. It’s innovative by making Hadoop simple and accessible, and by providing easy systems integration for multiple data stores into Hadoop by using just one SQL statement.

InfoQ had a chance to discuss the latest Cascading release with Chris K Wensel, CTO and Founder of Concurrent, Inc.

InfoQ: When you are talking about supporting YARN in Cascading 2.5, what do you mean exactly – the fact that MapReduce code uses the YARN resource manager or are you actually leveraging YARN for creating a new application manager, one specific to Cascading?

Wensel: Cascading 2.5 implicitly supports YARN, meaning that because Cascading 2.5 supports Hadoop 2, YARN functionality will also be supported. Cascading does not actually leverage YARN for application development.

InfoQ: Do you have any plans for leveraging Apache Tez to further improve the performance of Cascading applications?

Wensel: Yes, we do. We have plans for Tez in our roadmap and will communicate those updates at the appropriate time.

InfoQ: Can you elaborate on optimizations and performance improvements for complex joins?

Wensel: We have updated the API to allow for more complex and custom join types. Cascalog, for example, would leverage this feature under some circumstances.

InfoQ: In your opinion, is the current emphasis on SQL-based processing limiting the spectrum of applications developed in Hadoop? In spite of all its power, SQL is good for solving only a certain class of problems, representing a limited subset of applications for which enterprises can leverage Hadoop.

Wensel: SQL allows the other 99% of developers, analysts, and legacy systems to leverage Hadoop. Yes, you may hit a wall with SQL, but 90% of most problems are reasonably expressed as SQL. Cascading allows you to choose your battles. A good question to ask is: who wants to write a bunch of Java to do something that is one line of SQL? And also, who wants to write hundreds of lines of SQL to do something best written and tested in Java? Cascading gives developers the flexibility.

Concurrent Announces Release Of Cascading 2.5 and Lingual 1.0 To Simplify Application Development Using Hadoop

Arnal Dayaratna, Ph.D., Cloud Computing Today
November 19, 2013
http://cloud-computing-today.com/2013/11/19/58784/

Today, Concurrent elaborates on the release of Cascading 2.5, the open source framework for facilitating the development of applications on Apache Hadoop. Cascading 2.5 supports the recently released Hadoop 2.0 distribution, including YARN and its other features. Cascading users that are interested in upgrading to Hadoop 2.0 can do so by means of Cascading 2.5. Similarly, applications that leverage the Scalding, Cascalog and PyCascading languages can migrate to Hadoop 2.0 as well by means of the Cascading 2.5 framework. The latest release of Cascading also features “complex join operations and optimizations to dynamically partition and store processed data more efficiently on HDFS,” according to Concurrent’s press release. Finally, the release deepens its compatibility with other Hadoop distributions and Hadoop as a Service vendors such as Cloudera, Hortonworks, MapR, Intel, Altiscale, Qubole and Amazon EMR.

Cascading 2.5 represents one of the few products in either the commercial or open source ecosystem for simplifying the development of Hadoop applications while integrating with a rich and varied ecosystem of products as illustrated below:

[Image: concurrent-cascading ecosystem diagram]

The graphic shows how Cascading 2.5 supports all major Hadoop distributions in addition to an impressive list of development languages, database platforms and cloud platforms. In an interview with Cloud Computing Today, Concurrent CEO Gary Nakamura and CTO Chris Wensel noted the uniqueness of Cascading in the Big Data landscape, particularly given its iterative refinement in collaboration with the likes of Twitter, eBay and The Climate Corporation over a period of more than five years.

Today’s announcement regarding the general availability of Cascading 2.5 is accompanied by news of the general availability of Lingual, an ANSI-compliant SQL interface that allows developers to use SQL commands to query data stored in Hadoop clusters. Unlike Apache’s Hive project, Lingual’s ANSI-standard SQL interface enables developers to deploy authentic SQL commands as opposed to Hive’s SQL-like syntax. Cascading Lingual also allows for the migration of legacy SQL workloads onto Hadoop clusters, the export of Hadoop data onto BI tools such as Jaspersoft, Pentaho and Talend, and the ability to leverage the power of Cascading in conjunction with SQL to orchestrate the execution of multiple SQL queries instead of several discrete, disparate queries. The Big Data space should expect more from Concurrent as it continues to build out tools for simplifying application development on Hadoop, particularly as more and more Hadoop developers come to terms with Cascading’s advantages over MapReduce.

Concurrent Builds New Tools for Hadoop Big Data Access

Christopher Tozzi, The VAR Guy
November 19, 2013
http://thevarguy.com/big-data-technology-solutions-and-information/concurrent-builds-new-tools-hadoop-big-data-access

The channel is continuing to bring the power of Hadoop, the open source platform for Big Data, up to speed with the database and storage tools necessary to leverage that power. This week, Concurrent is unveiling two major software releases with important implications for Big Data analytics and application programming.

Concurrent’s professed goal is to build “the most advanced software platform for Enterprise Big Data applications.” Its most recent new software platform, called Cascading Lingual and introduced Nov. 19, works toward that end by simplifying the interface between Hadoop data and the applications, such as business intelligence (BI) tools, that rely on that data. Concurrent is billing Cascading Lingual as “Hadoop for Everyone Else,” and says the platform:

enables virtually anyone familiar with SQL to instantly work with data stored on Hadoop using their JDBC compliant BI or desktop tool of choice. Enterprises benefit as they can execute on Big Data strategies using existing in-house resources, skills sets and product investments. Cascading Lingual drives improved enterprise productivity, time-to-market benefits and the deployment of a sane and maintainable Big Data strategy.

Also on Nov. 19, Concurrent unveiled version 2.5 of its Cascading platform, an application framework for building Java programs that leverage Hadoop. In addition to performance improvements, the update brings support for Hadoop 2, and with it YARN, a major innovation in Hadoop that completely rewrites the resource-management layer formerly bundled into MapReduce. Cascading 2.5 also strengthens Concurrent’s channel presence by making the platform more compatible with the Hadoop tools offered by other vendors, including Cloudera, Hortonworks, MapR, Intel (INTC), Altiscale, Qubole and Amazon (AMZN) Elastic MapReduce, the company said.

Viewed broadly, Concurrent’s recent innovations, including similar advances in next-generation storage technology, reflect the drive to make Hadoop more usable. In some senses, it’s surprising that this remains an ongoing effort, since Hadoop has been around now for some time. But with Hadoop 2 out as of October, the channel has a new imperative to build solutions in this space. Stay tuned.

Concurrent Tools Up For Hadoop 2, Hadoop for Everyone Else

Steve Wexler, IT Trends & Analysis
November 19, 2013
http://it-tna.com/2013/11/19/concurrent-tools-up-for-hadoop-2-hadoop-for-everyone-else/

Big Data and Apache Hadoop are racing towards respectability, and Concurrent wants to both accelerate the journey and make it easier with a couple of announcements, Cascading 2.5 and Cascading Lingual. Despite its rapid enterprise adoption, developing and building applications on Hadoop has proven to be difficult, stated Chris Wensel, Concurrent founder and CTO, who added the company’s focus is forging a simpler path to mass Hadoop adoption by delivering a framework for building powerful and reliable data-oriented applications supporting data driven business models – quickly and easily.

“Hadoop is a mess, a total mess,” said CEO Gary Nakamura. “It’s whack-a-mole as far as solving problems.” Hence the company’s focus to make it easier for mainstream enterprises to build these applications and continue on with their Big Data strategy, he said.

Concurrent, which bills itself as the enterprise Big Data application platform company, said the new release of Cascading features support for the recently announced Hadoop 2 (October GA), including YARN (MapReduce 2.0, MRv2). With Cascading Lingual (Hadoop for Everyone Else), an open source project that provides ANSI-compatible SQL, enterprises with business intelligence (BI) tools such as Pentaho, Jaspersoft and Cognos can now access their data on Hadoop in a matter of hours, rather than weeks. The new offerings provide a number of benefits, said Wensel, including making it easier to migrate applications to Hadoop, and the ability to create Cascading applications on the fly.

According to Gartner, Big Data is still making its way up to the Peak of Inflated Expectations, but Hadoop is well on its way to the bottom of the Trough of Disillusionment. That means that while there is still a lot of hype and confusion about Big Data, Hadoop is quickly moving toward being a useful and more widely adopted technology.

It’s still early days for Big Data and analytics, but the future looks very bright. With organizations drowning in data, and missing out on opportunities, real-time operational intelligence systems are moving from ‘nice to have’ to ‘must have for survival’, said Gartner. It predicts that analytics will reach 50% of potential users by 2014, and by 2020, that figure will be 75%. Post 2020 we’ll be heading toward 100% of potential users and into the realms of the Internet of Everything.

Hadoop may not be the only way to address Big Data, but it is growing strongly, with a compound annual growth rate of 55.63% between 2012 and 2016. Hadoop-as-a-Service is growing even faster, with a CAGR of 95.16% during this period. The Hadoop market is expected to reach $13.9 billion by 2017, with North America, which accounted for 53.85% of the overall market in 2012, leading the way.

A new IDC study, commissioned by Red Hat, found that virtually all of the companies (99%) surveyed have either deployed or plan to deploy Hadoop. A third (32%) have already made a Hadoop deployment, 31% intend to deploy Hadoop in the next 12 months, and 36% plan to use a Hadoop deployment in more than a year.

“We’re very cognizant that Hadoop is not the be all and end all of Big Data,” said Nakamura. Going forward, Concurrent plans to support the different computational platforms, with another platform offering already in the works, he said.

Under The Hood

Cascading 2.5, soon to be publicly available and freely licensable under the Apache 2.0 License Agreement, offers features and benefits that include:

  • support for Hadoop 2 and its new features, including YARN, so Cascading users can upgrade to Hadoop 2 and seamlessly migrate their applications; in addition, Big Data applications using domain specific languages (DSLs), such as Scalding (Scala on Cascading), Cascalog (Clojure on Cascading) and PyCascading (Jython on Cascading) languages, will also seamlessly migrate to Hadoop 2;
  • added performance improvements for complex join operations and optimizations to dynamically partition and store processed data more efficiently on HDFS; and,
  • additional broad compatibility with other Hadoop vendors and Hadoop as a service providers, including Cloudera, Hortonworks, MapR, Intel, Altiscale, Qubole and Amazon EMR.

Called Hadoop for Everyone Else, Cascading Lingual enables virtually anyone familiar with SQL to instantly work with data stored on Hadoop using their JDBC compliant BI or desktop tool of choice, according to Concurrent. Offering a true ANSI-standard SQL interface, it is compatible with all major Hadoop distributions whether on-premise or in the cloud. Use-case examples include:

  • data analysts, scientists and developers can now simply ‘cut and paste’ existing ANSI SQL code to instantly access data locked on or migrate applications to a Hadoop cluster;
  • developers can use a standard Java JDBC interface to create new Hadoop applications, or use the Cascading APIs to build applications with a mix of SQL and custom Java, Scala or Clojure code; and,
  • companies can now query and export data from Hadoop directly into traditional BI tools.

Cascading 2.5 gets Lingual

Alex Handy, SD Times
November 19, 2013
http://www.sdtimes.com/content/article.aspx?ArticleID=66386&page=1

Concurrent has updated its flagship open source project, Cascading. This Hadoop development library gives developers a way to separate their business logic from the rest of their Hadoop code. The result is that Cascading 2.5, released today, is now able to interface with multiple versions of Hadoop, and to export data from a Hadoop cluster using a SQL query.

Chris Wensel, creator of Cascading and CTO of Concurrent, said that Cascading brings a more familiar development model to the Hadoop world. “Cascading is a Java library that adds two key core components. It allows you to isolate your business logic and do tests in a model you’re familiar with. And it has an alternative API of MapReduce, though it uses MapReduce under the hood. You can focus on business logic by simply reading and writing files,” he said.

In order to expand the capabilities of Cascading, version 2.5 adds Lingual. Lingual executes ANSI SQL as Cascading applications across a Hadoop cluster. While these queries don’t return as fast as a SQL query into a relational database, they do allow developers to use SQL to pull data out of Hadoop.

Wensel is clear that this SQL support for Hadoop is not intended to be competition for Greenplum or other Big Data analysis systems that allow large-scale SQL queries across Big Data sets. Instead, he said, “We’re being honest and saying Hadoop is great for migrating workloads without low-latency SLAs. Hadoop is glue: It’s good at integrating systems and working reliably.”

Lingual also comes with a JDBC driver, allowing developers to treat Hadoop as a standard Java-accessible SQL-addressable data source.

Cascading 2.5 also uncouples the Hadoop-specific code from the actual Cascading functionality. In version 2.0, this manifested as the ability to run Cascading code in memory, without Hadoop. In version 2.5, this capability means that Cascading code can run on Hadoop 1.x and Hadoop 2.x without modification.
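That decoupling shows up in the API as a choice of flow connector: the same pipe assembly can be handed to an in-memory local planner or to a Hadoop planner. A hedged sketch against the Cascading 2.x API (the file names are hypothetical, and running it requires the Cascading jars):

```java
import java.util.Properties;

import cascading.flow.FlowConnector;
import cascading.flow.local.LocalFlowConnector;
import cascading.pipe.Pipe;
import cascading.scheme.local.TextLine;
import cascading.tap.Tap;
import cascading.tap.local.FileTap;

public class PlatformChoice {
  public static void main(String[] args) {
    // The pipe assembly (the business logic) is platform-neutral.
    Pipe pipe = new Pipe("copy");

    // Local taps read and write ordinary files, no cluster needed.
    Tap in  = new FileTap(new TextLine(), "input.txt");
    Tap out = new FileTap(new TextLine(), "output.txt");

    // Swapping this connector for a HadoopFlowConnector (with
    // Hfs taps) runs the same logic as MapReduce jobs on a
    // Hadoop 1.x or 2.x cluster, without modifying the pipe.
    FlowConnector connector = new LocalFlowConnector(new Properties());
    connector.connect(in, out, pipe).complete();
  }
}
```

Only the taps and the connector know about the execution platform, which is what allows unmodified Cascading code to move between in-memory, Hadoop 1.x and Hadoop 2.x runtimes.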

Big Data Application Framework Gets Update, SQL Interface

Thor Olavsrud, CIO Magazine
November 19, 2013
http://www.cio.com/article/743423/Big_Data_Application_Framework_Gets_Update_SQL_Interface

Open source big data application platform specialist Concurrent has released a new version of the Cascading application framework and simultaneously released Cascading Lingual 1.0, an ANSI SQL interface for Hadoop.

Building on last month’s release of Apache Hadoop 2.2, big data application platform specialist Concurrent today released a new version of Cascading, its big data application framework.

Concurrent also announced the general availability of Cascading Lingual 1.0, an open source project that provides a comprehensive ANSI SQL interface.

Cascading is a stand-alone open source Java application framework designed as an alternative API to MapReduce. Cascading gives Java developers the capability to build big data applications on Hadoop using their existing skillset.

“I created Cascading in anger after having used MapReduce once in my life and vowing never to use it again,” says Chris Wensel, creator of Cascading and founder and CTO of Concurrent.

The latest release, Cascading 2.5, adds support for Hadoop 2.2, including the new YARN architecture introduced in that version of Hadoop. Apache Hadoop YARN (Yet Another Resource Negotiator) serves as the Hadoop operating system, taking what was a single-use data platform for batch processing and evolving it into a multi-use platform that enables batch, interactive, online and stream processing.

YARN acts as the primary resource manager and mediator of access to data stored in Hadoop Distributed File System (HDFS), giving enterprises the capability to store data in a single place and then interact with it in multiple ways, simultaneously, with consistent levels of service.

Enterprises can now use Cascading to leverage Java, legacy SQL and predictive modeling investments for a single big data processing application.

Migration Path to Hadoop 2

Gary Nakamura, CEO of Concurrent, says that Cascading doesn’t leverage YARN specifically, but does enable users to seamlessly migrate their applications to Hadoop 2 and take advantage of YARN. Domain specific languages (DSLs) like Scalding, Cascalog and PyCascading also seamlessly migrate to Hadoop 2. Similarly, Cascading will support Apache Tez when it takes its place in the Hadoop stack.

Concurrent has also added performance improvements for complex join operations and optimizations to dynamically partition and store processed data more efficiently on HDFS.

In addition to Cascading, Concurrent announced the immediate availability of Cascading Lingual 1.0, intended to help enterprises that have already invested heavily in business intelligence (BI) tools like Pentaho, Jaspersoft and Cognos—and the training to go with them—to quickly access their data on Hadoop. Lingual allows users to utilize their existing SQL skills and systems to create and run applications on Hadoop.

Concurrent’s Wensel says Lingual empowers just about anyone familiar with SQL to instantly work with data stored on Hadoop using their JDBC-compliant BI or desktop tool of choice.

“Cascading is an important component to the big data application development ecosystem, and Lingual is another step forward in making it significantly easier to build big data apps,” says Steve McPherson, group manager, Amazon Elastic MapReduce (EMR) at Amazon Web Services (AWS).

“Now, Amazon Elastic MapReduce customers can leverage Lingual to integrate disparate data stores on Amazon Web Services with services such as Amazon S3 and Amazon Redshift, and they can process the data and store it in Amazon EMR through one standard ANSI SQL statement,” McPherson says. “This makes it easier for customers to query data with their favorite BI tool.”

Dealing with Big Data

By Amber E. Watson
November 15, 2013
http://www.softwaremag.com/content/ContentCT.asp?P=3457

Collecting, managing, processing, and analyzing big data is critical to business growth.

Big data holds the key to the critical information an organization needs to attract and retain customers, grow revenue, cut costs, and transform business.

According to SAS, big data describes the exponential growth, availability, and use of structured and unstructured information. Big data is growing at a rapid rate, and it comes from sources as diverse as sensors, social media posts, mobile users, advertising, cybersecurity, and medical images.

Organizations must determine the best way to collect massive amounts of data from various sources, but the challenge lies in how that data is managed and analyzed for the best return on investment (ROI). The ultimate goal for organizations is to harness relevant data and use it to make the best decisions for growth.

Technologies today not only support the collection and storage of large amounts of data, but they also provide the ability to understand and take advantage of its full value.

Challenges of Collecting Big Data
A common description of big data involves “the four V’s,” including “the growing volume, which is the scale of the data; velocity—how fast the data is moving; variety—the many sources of the data; and veracity—how accurate and truthful is the data,” explains James Kobielus, big data evangelist, IBM.

Organizations strive to take advantage of big data to better understand the customer experience and behavior, and to tailor content to maximize opportunities.

Steve Wooledge, VP of marketing, Teradata Unified Data Architecture, notes that gaining insights from big, diverse data typically involves four steps: data acquisition, data preparation, analysis, and visualization.

“One of the key challenges in collecting big data is the ability to capture it without too much up-front modeling and preparation,” he says. “This becomes especially important when data comes in at a high velocity and variety due to different types of data—for example, Web logs generated from visitor activity on high-traffic ecommerce Web sites, or text data from social media sites.”

Separate, disconnected silos of data also increase the time and effort required for data analysis, which is why integrated and unified interfaces are preferred.

Webster Mudge, senior director of technology solutions, Cloudera, agrees that the variety of big data presents a particular issue because a disproportionate amount of data growth experienced by organizations is with unstructured or variable structured information, such as sensor and social media content.

“It is difficult for business and IT teams to fully capture these forms of data in traditional systems because they do not typically fit cleanly and efficiently with traditional modeling approaches,” he points out. “Thus, organizations have trouble extracting value from these forms due to model constraints and impedance.”

With a large volume of incoming data, transferring it reliably from sources to destinations over a wide-area network is also challenging. “Once data arrives at the destination, it is often logged or backed up for recoverability. The next step is ingesting large volumes of data in the data warehouse. Because loading tends to be continuous, the biggest challenge is ingesting and concurrently allowing reporting/analytics against the data,” states Jay Desai, co-founder, XtremeData Inc.
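The pattern Desai describes, continuous loading while reporting and analytics run against the same data, can be illustrated with a minimal, stdlib-only Java sketch. The class and method names here are hypothetical illustrations, not any vendor's API; the point is simply that a lock-free structure lets ingest and query paths proceed concurrently.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Toy sketch: a thread-safe counter store that accepts continuous ingest
// while readers run aggregate queries concurrently, without a global lock.
public class ConcurrentIngestSketch {
    private final Map<String, LongAdder> countsByKey = new ConcurrentHashMap<>();

    // Ingest path: called continuously by loader threads.
    public void ingest(String key) {
        countsByKey.computeIfAbsent(key, k -> new LongAdder()).increment();
    }

    // Query path: runs concurrently with ingest; LongAdder.sum() gives a
    // point-in-time total without blocking writers.
    public long count(String key) {
        LongAdder adder = countsByKey.get(key);
        return adder == null ? 0 : adder.sum();
    }

    public static void main(String[] args) throws InterruptedException {
        ConcurrentIngestSketch store = new ConcurrentIngestSketch();
        Thread loader = new Thread(() -> {
            for (int i = 0; i < 10_000; i++) store.ingest("clicks");
        });
        loader.start();
        loader.join();
        System.out.println(store.count("clicks"));
    }
}
```

A real warehouse solves the same contention problem at far larger scale (with MVCC or append-only storage rather than in-memory counters), but the ingest/query separation is the same shape.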

Mike Dickey, founder/CEO, Cloudmeter, explains that traditionally, network data was captured with intrusive solutions that modified the source code of Web applications (apps). “Companies either add extra code that runs within Web browsers, known as ‘page tags,’ or modify the server code to generate extra log files,” he notes.

“But both approaches restrict companies to capturing only a subset of the information available, because the excess code introduces significant latency, harming performance and degrading customer experiences. It is also a nightmare on the back end for IT to stay on top of dynamic, ever-changing sites.”

To get the best use of one’s data, George Lumpkin, VP of product management, big data and data warehousing, Oracle, advises an organization to first understand what data it needs and how to capture it.

“In some cases, an organization must investigate new technologies like Apache Hadoop or NoSQL databases. In other cases, it may be required to add a new kind of cluster to run the software, increase networking capacity, or increase speed of analytics. If the use case involves real-time responses on streaming data—as opposed to more batch-oriented use—there are tools around event processing, caching, and decision automation,” he says.

These considerations, of course, must be integrated with the existing information infrastructure. “Big data should not exist in a vacuum,” says Lumpkin. “It is the combination of new data and existing enterprise data that yields the best new insight and business value.”

With new technology comes the need for new skills. “There is an assumption that since Hadoop is open source and available for free, somebody should just download it and get started. While the early adopters of Hadoop did take this approach, they have significantly developed their skill sets over time to install, tune, use, update, and support the software internally, efficiently, and effectively,” shares Lumpkin.

The key to realizing value with big data is through analytics, so organizations are advised to invest in new data science skills, adopting and growing statistical analysis capabilities.

Big Data Management
When it comes time to analyze big data, the analytics are only as good as the data. It is important to think carefully about how data is collected and captured ahead of time. Sometimes businesses are left with information that doesn’t correlate, and are unable to pull the analytics needed from the information gathered.

Data preparation and analysis is a major undertaking. “Data acquisition and preparation can itself take up 80 percent of the effort spent on big data if done using suboptimal approaches and manual work,” cautions Teradata’s Wooledge.

“One key challenge of managing data is preparing different types of data for analysis, especially for multistructured data. For example, Web logs must first be parsed and sessionized to identify the actions customers performed and the discrete sessions during which customers interacted with the Web site. Given the volume, variety, and velocity of big data, it is impossible to prepare this data manually,” he adds.
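Sessionizing, as Wooledge describes it, means splitting a visitor's event stream into discrete sessions. A minimal Java sketch of the idea follows; the 30-minute inactivity gap is a common convention assumed here for illustration, not a figure from the article, and the class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of sessionization: split one visitor's sorted
// event timestamps into sessions wherever the inactivity gap exceeds
// a threshold (30 minutes is a common convention, assumed here).
public class Sessionizer {
    static final long GAP_MS = 30L * 60 * 1000;

    // Input: event timestamps (ms) for a single visitor, sorted ascending.
    // Output: one list of timestamps per discrete session.
    public static List<List<Long>> sessionize(List<Long> timestamps) {
        List<List<Long>> sessions = new ArrayList<>();
        List<Long> current = null;
        long last = Long.MIN_VALUE;
        for (long t : timestamps) {
            // Start a new session on the first event or after a long gap.
            if (current == null || t - last > GAP_MS) {
                current = new ArrayList<>();
                sessions.add(current);
            }
            current.add(t);
            last = t;
        }
        return sessions;
    }
}
```

At web-log scale this per-visitor grouping is what a parallel framework distributes: group events by visitor key, sort by time, then apply logic like the above to each group.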

To solve a business problem effectively, Wooledge advises organizations to use an iterative discovery process and combine different analytic techniques to solve business problems. “Discovery is an interactive process that requires rapid processing of large data volumes in a scalable manner. Once a user changes one or more parameters of the analysis and incorporates additional data using a discovery platform, they are able to see results in minutes versus waiting hours or days before they are able to do the next iteration.”

In addition, many big data problems require more than one analytic technique to be applied at the same time to produce the right insight to solve the problem most effectively. The ability to easily use different analytic techniques on data collected during the discovery process amplifies the effectiveness of the effort.

Cloudera’s Mudge shares that traditional approaches to modeling and transformation assume that a business understands what questions need to be answered by the data. However, if the value of the data is unknown, organizations need latitude within their modeling processes to easily change the questions in order to find the answers.

“Traditional approaches are often rigid in model definition and modification, so changes may take significant effort and time. Moreover, traditional approaches are often confined to a single computing framework, such as SQL, to discover and ask questions of the data. And in the era of big data, organizations require options for how they compute and interpret data and cannot be limited to one approach,” he notes.

After collecting data, an organization is sitting on a large quantity of new data that is unfamiliar in content and format. “For example, a company may have a large volume of social media comments coming in at high velocity; some of those comments are made by important customers. The question becomes how to disambiguate this data or match the incoming data with existing enterprise data. Of all the data collected, what is potentially relevant and useful?” asks Oracle’s Lumpkin.

Next, the company must try to understand how the new data interrelates to determine hidden relationships and help transform business. “A combination of social media comments, geographic location, past response to offers, buying history, and several other factors may predict what it needs. The team just has to figure out this winning combination,” shares Lumpkin.

Data governance and security are also important considerations. Does a company have the right to use the data it captures? Is the data sensitive or private? Does it have a suitable security policy to control and audit access?

Harnessing Big Data
To harness diverse data, outsmart the competition, and achieve the highest ROI, Wooledge advises organizations to include all types of data across customer interactions and transactions.

“Traditional approaches of buying hardware/software do not often work for big data because of scale, complexities, and costs,” adds XtremeData’s Desai. “Cloud-based, big data solutions help companies avoid up-front commitments and have infinite on-demand capacity.”

Paco Nathan, director of data science, Concurrent, Inc., adds, “Companies today recognize the importance of collecting data, aggregating into the cloud, and sharing with corporate IT for centralized analysis and decisions. The ecommerce sector, including Amazon, Apple, and Google, for example, has fared well with big data in terms of marketing funnel analytics, online advertising, customer recommendation systems, and anti-fraud classifiers.”

Another facet to consider is productivity, since some of the biggest costs associated with big data are labor and lost opportunity. To mitigate this, Desai recommends that companies look for “load-and-go” solutions/services, meaning they do not require heavy engineering or optimization and can be deployed quickly for business use.

According to IBM’s Kobielus, a differentiated big data analytics approach is needed to keep up with the ever-changing requirements of a business. “Alternative, one-size-fits-all approaches of applying the same analytics to different challenges are flawed in that they do not address the fact that on any given day, a business may need to quickly understand consumer purchases, more effectively manage large financial data sets, and more seamlessly detect fraud in real time,” he says.

A comprehensive, big data platform strategy ensures reliable access, storage, and analysis of data regardless of how fast it is moving, what type it is, or where it resides.

Mudge recommends that data maintain its original, fine-grain format. “By storing data in full fidelity, future interpretation of the data is not constrained by earlier questions and decisions, and associated modeling and transformation efforts. By keeping data in its native format, organizations are free to repeatedly change its interpretation and examination of the original data according to the demands of the business,” he explains.
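Mudge's full-fidelity point is the "schema-on-read" idea: keep raw records untouched and apply an interpretation only at query time, so later questions can re-parse the same data differently. The sketch below is illustrative only; the record layout and method names are hypothetical, not Cloudera's implementation.

```java
import java.util.List;
import java.util.stream.Collectors;

// Sketch of schema-on-read: raw records are stored in their native format,
// and each question imposes its own parse at read time. Field positions
// ("timestamp,userId,url,bytes") are a hypothetical layout for illustration.
public class SchemaOnRead {
    static final List<String> RAW = List.of(
        "1700000000,alice,/home,512",
        "1700000060,bob,/cart,2048");

    // First question asked of the data: which URLs were visited?
    static List<String> urls(List<String> raw) {
        return raw.stream()
                  .map(r -> r.split(",")[2])
                  .collect(Collectors.toList());
    }

    // A later, different question over the same untouched records:
    // total bytes transferred.
    static long totalBytes(List<String> raw) {
        return raw.stream()
                  .mapToLong(r -> Long.parseLong(r.split(",")[3]))
                  .sum();
    }
}
```

Because neither query mutated or re-modeled the stored records, a third question next quarter needs no migration, only a new read-time parse.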

Cloudmeter’s Dickey points out the importance of having an analytics solution that processes and feeds data in real time. “Since today’s online business operates in real time, it is critical to be able to respond in a timely manner to trends and events that may impact an online business and the customer experience,” he says.

As Kobielus points out, petabytes of fast-moving, accurate data from a variety of sources have little value unless they provide actionable information. It is not just the fact that there is big data, it is what you do with it that counts.

Products to Help
Several key technologies are available to help organizations get a handle on big data and to extract meaningful value from it.

Cloudera – Cloudera’s Distribution Including Apache Hadoop (CDH) is a 100 percent open-source big data platform for enterprises and organizations. It offers scalable storage, batch processing, interactive SQL, and interactive search, along with other enterprise-grade features such as continuous availability of data. CDH is a collection of Apache projects; Cloudera develops, maintains, and certifies this configuration of open-source projects to provide stability, coordination, ease of use, and support for Hadoop.

The company offers additional products for CDH, including Cloudera Standard—which combines CDH with Cloudera Manager for cluster management capabilities like automated deployment, centralized administration and monitoring, and diagnostic tools—and Cloudera Enterprise, a comprehensive maintenance and support subscription offering.

Cloudera also offers a number of computing frameworks and system tools to accelerate and govern work on Hadoop. Cloudera Navigator provides a centralized data management application for data auditing and access management, and extends governance, risk, and compliance policies to data in Hadoop. Cloudera Impala is a low-latency, SQL query engine that runs natively in Hadoop and operates on shared, open-standard formats. Cloudera Search, which is powered by Apache Solr, is a natural language, full-text, interactive search engine for Hadoop and comes with several scalable indexing options.

Cloudmeter – Cloudmeter captures and processes data available in companies’ network traffic to provide marketing and IT with complete insight into their users’ experiences and behavior. Cloudmeter works with companies whose Web presence is critical to their business, including Netflix, SAP, Saks Fifth Avenue, and Skinit.

Cloudmeter Stream and Cloudmeter Insight use ultralight agents that passively mine big data streams generated by Web apps, enabling customers to gain real-time access into the wealth of business and IT information available without connecting to a physical network infrastructure.

Now generally available, Cloudmeter Stream can be used with other big data tools, such as Splunk, GoodData, and Hadoop. Currently in private beta, Cloudmeter Insight is the first software-as-a-service (SaaS) application performance management product to include visual session replay capabilities, so users can see individual visitor sessions and the technical information behind them.

Concurrent, Inc. – Concurrent, Inc. is a big data company that builds application infrastructure products designed to help enterprises create, deploy, run, and manage data processing applications at scale on Hadoop.

Cascading is used by companies like Twitter, eBay, The Climate Corporation, and Etsy to streamline data processing, data filtering, and workflow optimization for large volumes of unstructured and semi-structured data. By leveraging the Cascading framework, enterprises can apply Java, SQL, and predictive modeling investments—and combine the respective outputs of multiple departments into a single application on Hadoop.

Most recently, Concurrent announced Pattern—a standards-based scoring engine that enables analysts and data scientists to quickly deploy machine-learning applications on Hadoop—and Lingual, a project that enables fast and simple big data application development on Hadoop. All tools are open source.

IBM – IBM’s big data portfolio of products includes InfoSphere BigInsights—an enterprise-ready, Hadoop-based solution for managing and analyzing large volumes of structured and unstructured data.

According to IBM, InfoSphere Streams enables continuous analysis of massive volumes of streaming data with submillisecond response times. InfoSphere Data Explorer is discovery and navigation software that provides real-time access and fusion of big data with rich and varied data from enterprise applications for greater insight and ROI. IBM PureData for Analytics simplifies and optimizes performance of data services for analytic applications, enabling complex algorithms to run in minutes.

Finally, DB2 with BLU Acceleration provides fast analytics to help organizations understand and tackle large amounts of big data. BLU provides in-memory technologies, parallel vector processing, actionable compression, and data skipping, which speeds up analytics for the business user.

LexisNexis – High Performance Computing Cluster (HPCC) Systems from LexisNexis is an open-source big data processing platform designed to solve big data problems for the enterprise. HPCC Systems provides HPCC technology with a single architecture and a consistent data-centric programming language. HPCC Systems offers a Community Edition, which includes free platform software with community support, and an Enterprise Edition, which includes platform software with enterprise-class support.

Oracle – Oracle’s big data solutions are grouped into three main categories—big data analytics, data management, and infrastructure. Big data analytics are covered by Oracle Endeca Information Discovery—a data discovery platform that enables visualization and exploration of information, and uncovers hidden relationships between data, whether it be new, unstructured, or trusted data kept in data warehouses.

Oracle Database, including Oracle Database 12c, contains a number of in-database analytics options. Oracle Business Intelligence Foundation Suite provides comprehensive capabilities for business intelligence, including enterprise reporting, dashboards, ad hoc analysis, multidimensional online analytical processing, scorecards, and predictive analytics on an integrated platform. Oracle Real-Time Decisions automates decision management and includes business rules and self-learning to improve offer acceptance over time.

Oracle’s data management tools are used to acquire and organize big data across heterogeneous systems. Oracle NoSQL Database is a scalable key/value database, providing ACID (atomicity, consistency, isolation, durability) transactions and predictable latency. MySQL and MySQL Cluster are widely used for large-scale Web applications. Oracle Database is a foundation for successful online transaction processing and data warehouse implementations. Oracle Big Data Connectors link Oracle Database and Hadoop. Oracle Data Integrator is used to design and implement extract, transform, and load processes that move and integrate new data and existing enterprise data. Oracle also distributes and supports Cloudera’s Distribution Including Apache Hadoop (CDH).

Lastly, for big data infrastructure, Oracle says its Engineered Systems ship pre-integrated to reduce the cost and complexity of IT infrastructures while increasing the productivity and performance of a data center to better manage big data. Oracle Big Data Appliance is a platform for Hadoop to acquire and organize big data.

SAS – SAS offers information management, which provides strategies and solutions that enable big data to be managed and used effectively through high-performance and visual analytics, and flexible deployment options.

Flexible deployment models bring choice. High-performance analytics from SAS are able to analyze billions of variables, and these solutions can be deployed in the cloud with SAS or another provider, on a dedicated high-performance analytics appliance, or within existing IT infrastructure—whichever best suits an organization’s requirements.

Teradata – Within the Unified Data Architecture, Teradata integrates data and technology to maximize the value of all data, providing new insights to thousands of users and applications across the enterprise.

According to Teradata, its Aster Discovery Platform—which includes the Teradata Aster Database and Teradata Aster Discovery Portfolio—offers a single unified platform that enables organizations to access, join, and analyze multistructured data from a variety of sources through a single SQL interface. Its visual SQL-MapReduce functions and out-of-the-box functionality for integrated data acquisition, data preparation, analysis, and visualization in a single SQL statement generate business insights from big, diverse data.

The recently announced Teradata Portfolio for Hadoop provides open, flexible, and comprehensive options to deploy and manage Hadoop.

XtremeData – XtremeData offers a petabyte-scale SQL data warehouse for big data analytics on private and public clouds. XtremeData’s “load-and-go” structure means that users implement a schema, load data, and start running queries using industry-standard SQL tools. XtremeData allows users to freely run queries against multiple large tables and continuously add new data sets without the need for performance engineering.

XtremeData is designed for big data applications that need continuous, real-time ingests and interactive analytics with high availability, such as digital/mobile advertising, gaming, social, Web, networking, telecom, and cybersecurity.

More Data, More Success
While the sheer volume and variety of big data available today presents a challenge for organizations to capture, manage, and analyze, it also holds the key to business growth and success. “Leveraging and analyzing big data enables organizations and industries as a whole to generate powerful, disruptive, even radical transformations,” explains IBM’s Kobielus.

Organizations with access to large data collections strive to harness the most relevant data and use it for optimized decision-making. Big data technologies not only support the ability to collect large amounts of data, but they also provide the ability to understand it and take advantage of its value for better ROI. The more information an organization is able to collect and analyze successfully, the more it has the ability to change, adapt, and succeed long term.

November 2013, Software Magazine