Key Takeaways
Cascading gives data scientists at The Climate Corporation a solid foundation for developing advanced machine learning applications in Cascalog and deploying them directly onto Amazon EMR clusters of 2,000+ cores. The result is significantly improved productivity at lower operating costs.
Solution
Data scientists at The Climate Corporation chose to write their algorithms in Cascalog, a high-level, Clojure-based data processing and query language built on Cascading. Cascading is a Java application framework that abstracts the MapReduce APIs in Apache Hadoop and gives developers a simpler way to create powerful data processing workflows. Programming in Cascalog, data scientists write compact expressions that represent complex batch-oriented AI and machine learning workflows. This improves productivity for the data scientists, many of whom are mathematicians rather than computer scientists, and lets them quickly analyze complex data sets without having to write large, complicated MapReduce programs. Programmers at The Climate Corporation also use Cascading directly, together with Hadoop Streaming, to process additional batch-oriented data workflows.

All of these workflows and data processing jobs are deployed directly onto Amazon Elastic MapReduce, each into its own dedicated cluster. Depending on the size of the data sets and the complexity of the algorithms, clusters of up to 200 processor cores are used for data normalization workflows, while clusters of more than 2,000 processor cores are used for risk analysis and climate modeling workflows.
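To give a sense of how compact such expressions are, the following is a minimal Cascalog sketch of a batch aggregation. It is not taken from The Climate Corporation's code: the namespace, field names, tab-separated input layout, and paths are illustrative assumptions only.

    (ns climate.example.avg-temp
      (:use cascalog.api)
      (:require [cascalog.ops :as c]
                [clojure.string :as str]))

    ;; Parse one tab-separated line of "station<TAB>date<TAB>temperature".
    ;; The input layout and field names are illustrative only.
    (defmapop parse-observation [line]
      (let [[station _date temp] (str/split line #"\t")]
        [station (Double/parseDouble temp)]))

    ;; Average temperature per station, expressed as one declarative query.
    ;; Cascalog turns this into a Cascading flow, which Cascading in turn
    ;; plans into the underlying Hadoop MapReduce jobs.
    (defn avg-temp-query [in-path]
      (<- [?station ?avg-temp]
          ((hfs-textline in-path) ?line)
          (parse-observation ?line :> ?station ?temp)
          (c/avg ?temp :> ?avg-temp)))

    ;; Execute the query, writing tab-separated results to the output path.
    (defn -main [in-path out-path]
      (?- (hfs-textline out-path) (avg-temp-query in-path)))

The whole pipeline, from parsing through the reduce-side average, fits in a handful of declarative lines rather than a hand-written chain of mappers and reducers.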
Benefits
By using Amazon Elastic MapReduce and Cascalog, data scientists at The Climate Corporation can focus on solving business challenges rather than on setting up complex infrastructure, or on working out how to use it to process vast amounts of complex data. The Climate Corporation manages its costs effectively by running each workflow on its own dedicated Amazon Elastic MapReduce cluster. This lets the company use resources only when they are needed, instead of investing in hardware and systems administrators to run a private shared cluster on which workflows would have to be tuned and scheduled to avoid resource contention.

Cascading also provides data scientists at The Climate Corporation with a common foundation for creating both their batch-oriented machine learning workflows in Cascalog and their Hadoop Streaming workflows written directly against Cascading. These applications are developed locally on developers' desktops and then deployed immediately onto dedicated Amazon Elastic MapReduce clusters for testing and production use. This minimizes iterative use of cluster resources, allowing The Climate Corporation to control costs by using the infrastructure only for productive data processing.
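As a rough sketch of that develop-locally, deploy-to-the-cluster pattern (the sample data, field names, and S3 paths below are hypothetical, not The Climate Corporation's own), the same Cascalog query can be run unchanged against an in-memory sample on a desktop and against S3-backed taps on an Amazon Elastic MapReduce cluster:

    (ns climate.example.local-dev
      (:use cascalog.api)
      (:require [cascalog.ops :as c]))

    ;; In-memory [station temperature] tuples used while developing on a desktop.
    (def sample-observations
      [["station-1" 21.4]
       ["station-1" 19.8]
       ["station-2" 25.1]])

    ;; The query is written once, against whatever source it is handed.
    (defn avg-temp-query [observations]
      (<- [?station ?avg-temp]
          (observations ?station ?temp)
          (c/avg ?temp :> ?avg-temp)))

    ;; Local run: in-memory source, results printed to the console.
    (defn run-locally []
      (?- (stdout) (avg-temp-query sample-observations)))

    ;; Cluster run: the identical query pointed at S3-backed taps from an
    ;; Amazon Elastic MapReduce cluster. The sequence file is assumed to hold
    ;; two-field [station temperature] tuples; bucket names are placeholders.
    (defn run-on-emr []
      (?- (hfs-textline "s3n://example-output-bucket/avg-temps")
          (avg-temp-query (hfs-seqfile "s3n://example-input-bucket/observations"))))

Because taps are just arguments to the query, the only thing that changes between desktop testing and a production run on EMR is where the data is read from and written to.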