Cascading 1.2 Released

We are happy to announce that Cascading 1.2 is now publicly available for download.

This release features many performance and usability enhancements while remaining backwards compatible with 1.0 and 1.1.

Specifically:

  • Performance optimizations during grouping (StreamComparator)
  • Composable map-side partial aggregations (AggregateBy)
  • Native Riffle support for non-Cascading (or nested iterative Cascading) processes (ProcessFlow and Riffle)
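To give a feel for what "map-side partial aggregation" means, here is a minimal conceptual sketch in plain Java. It does not use the Cascading API and is not Cascading's implementation; the class name `PartialAggregationSketch`, the methods `mapSide` and `reduceSide`, and the `threshold` parameter are all hypothetical names chosen for illustration. The idea is that each map task keeps a small bounded cache of partial counts and emits a partial result only when a key is evicted or the input ends, so fewer records need to be grouped and shipped to the reducers.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Conceptual sketch only: plain Java, no Cascading classes involved.
public class PartialAggregationSketch {

    // "Map side": count keys in a cache holding at most `threshold` distinct keys.
    // When the cache is full, the oldest entry is emitted as a partial count.
    public static List<Map.Entry<String, Long>> mapSide(List<String> keys, int threshold) {
        List<Map.Entry<String, Long>> emitted = new ArrayList<>();
        LinkedHashMap<String, Long> cache = new LinkedHashMap<>();
        for (String key : keys) {
            if (!cache.containsKey(key) && cache.size() >= threshold) {
                // Evict the entry that was inserted first and emit its partial count.
                Iterator<Map.Entry<String, Long>> it = cache.entrySet().iterator();
                Map.Entry<String, Long> eldest = it.next();
                emitted.add(new AbstractMap.SimpleImmutableEntry<>(eldest.getKey(), eldest.getValue()));
                it.remove();
            }
            cache.merge(key, 1L, Long::sum);
        }
        // End of input: flush whatever partial counts remain in the cache.
        for (Map.Entry<String, Long> e : cache.entrySet()) {
            emitted.add(new AbstractMap.SimpleImmutableEntry<>(e.getKey(), e.getValue()));
        }
        return emitted;
    }

    // "Reduce side": merge the partial counts into final totals per key.
    public static Map<String, Long> reduceSide(List<Map.Entry<String, Long>> partials) {
        Map<String, Long> totals = new TreeMap<>();
        for (Map.Entry<String, Long> p : partials) {
            totals.merge(p.getKey(), p.getValue(), Long::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("a", "a", "b", "a", "b", "c", "a");
        List<Map.Entry<String, Long>> partials = mapSide(keys, 2);
        System.out.println(partials.size() + " partial records emitted for " + keys.size() + " inputs");
        System.out.println(reduceSide(partials));
    }
}
```

Because the cache is bounded, a key can be emitted more than once (here "a" is evicted and later seen again), which is why the reduce side still has to merge the partial results.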

For a detailed list of changes see:
CHANGES.txt

We are also happy to announce that Cascading and its extensions have their own Maven/Ivy Jar repository, Conjars. Conjars is a public repository: any developer wishing to publish Cascading libraries and extensions can register their public key and push artifacts. Conjars is a simple fork of the Clojars repo code.

Along with this release are a number of extensions created by the Cascading user community.

Among these extensions are:

  • Cascading.Avro – Cascading Scheme for the Apache Avro data serialization format.
  • Cascading.Memcached – Integration with Memcached, Membase, and ElasticSearch.
  • Bixo – a web mining toolkit
  • DBMigrate – a tool for migrating data to/from RDBMSs into Hadoop
  • Apache HBase, Amazon SimpleDB, and JDBC integration
  • JRuby- and Clojure-based scripting languages for Cascading
  • Cascalog – a robust interactive extensible query language

This release will run against Hadoop 0.19.x and 0.20.x, including Amazon Elastic MapReduce.

Cascading 1.1.0 Now Available

We are happy to announce that Cascading 1.1.0 is now publicly available for download.

This release features many performance and usability enhancements while remaining backwards compatible with 1.0.

Specifically:

  • Performance optimizations with all join types
  • Numerous job planner optimizations
  • Dynamic optimizations when running in Amazon Elastic MapReduce and S3
  • API usability improvements
  • Support for TSV, CSV, and custom delimited text files
  • Support for manipulating and serializing non-Comparable custom Java types
  • Debug levels supported by the job planner

For a detailed list of changes see:
CHANGES.txt

Along with this release are a number of extensions created by the Cascading user community.

Among these extensions are:

  • Bixo – a data mining toolkit
  • DBMigrate – a tool for migrating data to/from RDBMSs into Hadoop
  • Apache HBase, Amazon SimpleDB, and JDBC integration
  • JRuby- and Clojure-based scripting languages for Cascading
  • Cascalog – a robust interactive extensible query language

This release will run against Hadoop 0.18.3, 0.19.x, and 0.20.x, including Amazon Elastic MapReduce.

Note that the tests will not compile or run against Hadoop 0.18.3 due to package changes since that version.

Karmasphere Studio Ships with Cascading Support

The recently released Karmasphere Studio 1.2 now includes support for Cascading 1.0 in the free community download.

Karmasphere Studio is an IDE and Debugger for Hadoop MapReduce application developers that also includes integration with the Amazon Web Services platform.

With Cascading support built directly into the Debugger and IDE, developers can develop and debug complex Hadoop jobs even more quickly.

Also worthy of note: Karmasphere recently received $5M in Series A funding.

Case Study: RazorFish User Segmentation with Cascading and Amazon Elastic MapReduce

Amazon recently published a case study on how RazorFish “segments users and customers based on the collection and analysis of non-personally identifiable data from browsing sessions”.

From the case study:

Mark Taylor, Program Director at Razorfish, said, “With our implementation of Amazon Elastic MapReduce and Cascading, there was no upfront investment in hardware, no hardware procurement delay, and no additional operations staff was hired. We completed development and testing of our first client project in six weeks. Our process is completely automated. Total cost of the infrastructure averages around $13,000 per month. Because of the richness of the algorithm and the flexibility of the platform to support it at scale, our first client campaign experienced a 500% increase in their return on ad spend from a similar campaign a year before.”

Read more about how RazorFish uses Cascading to process big data.