The Business of Data
Jelani Harper, Dataversity
April 24, 2014
http://www.dataversity.net/business-data/

A health insurance company leverages its massive amounts of data to customize customer service for patients by conducting product assessments and medication reconciliations.

An exceedingly large and well-known power company makes substantial additions to the Internet of Things by manufacturing jet engines and windmills, and by monitoring the data they generate to schedule maintenance.

An automobile manufacturing company merges manufacturing telemetry with the telemetry of its vehicles to optimize assembly line production.

In any number of industries, the familiar confluence of Data Scientists, engineers, and operations personnel designs recommendation engines for their websites to boost traffic.

Regardless of the industry, and regardless of a particular enterprise's area of specialty, a growing number of organizations are involved in the business of data – the use of data and data-driven processes to operate and improve their businesses.

At the core of this process of automating and operationalizing data to create meaningful, value-adding action is application building, a necessity for Big Data, conventional data, or any degree of integration of the two. Without it, data might yield insight, but it cannot actually drive any of the aforementioned processes that are revolutionizing the way business is practiced in the 21st century:

“It is very clear that the broader world is moving towards enterprise data applications,” observed Chris Wensel, founder and Chief Technology Officer of Concurrent, whose open source application building framework Cascading is receiving 100,000 downloads a month with 11 percent month-over-month growth and thousands of deployments.

“It’s about businesses taking their assets and their data and building applications that enhance their business processes. It’s very clear that the market is moving on to the next level of maturity, and it has been doing so at a very rapid pace during the past 18 months.”

Without Data Science

Much of the current hype revolving around Data Science and the shortage of viable Data Scientists pertains to the myriad responsibilities these workers have. A large part of that responsibility has to do with integrating the architecture of newer technologies (such as Hadoop and various NoSQL offerings) and their forms of data with legacy systems in a way that can produce meaningful action for the business. Part of the appeal of an application building platform such as Cascading is that it is a Java-based API, which means that engineers and developers can use whatever Java-compatible language they have been working in for years to build applications over all forms of data, big or otherwise.

Consequently, a whole set of skills for manipulating Big Data platforms such as Hadoop, skills otherwise associated with Data Science, is no longer needed to create actionable applications from data. Organizations don't necessarily need to wait for Data Scientists to graduate and can instead concentrate on solving their business problems in a much more expedient and economical fashion, using current personnel. Wensel acknowledged the impact of Cascading's single-API approach:

“What does the enterprise have on the bench? Java developers. They have to learn new APIs [to build data applications] and understand the business problems. Where Cascading comes in is we normalize that API. We give them one API to learn so they can solve multiple business problems and focus on the quality of the business stack underneath and not focus on anything else, while continuing to leverage their existing talent.”
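To make that concrete, here is a minimal sketch of the canonical Cascading word-count flow in plain Java; the input and output paths are placeholders and the class name is illustrative. It is not Concurrent's reference code, just an example of the Fields/Pipe/Tap vocabulary the single API exposes, assuming a Cascading 2.x Hadoop deployment:

    import java.util.Properties;

    import cascading.flow.FlowDef;
    import cascading.flow.hadoop.HadoopFlowConnector;
    import cascading.operation.aggregator.Count;
    import cascading.operation.regex.RegexSplitGenerator;
    import cascading.pipe.Each;
    import cascading.pipe.Every;
    import cascading.pipe.GroupBy;
    import cascading.pipe.Pipe;
    import cascading.scheme.hadoop.TextDelimited;
    import cascading.scheme.hadoop.TextLine;
    import cascading.tap.Tap;
    import cascading.tap.hadoop.Hfs;
    import cascading.tuple.Fields;

    public class WordCount {
      public static void main(String[] args) {
        String inputPath = args[0];   // e.g. an HDFS directory of text files
        String outputPath = args[1];  // where the word counts are written

        // Taps bind the flow to storage; schemes describe the record layout.
        Tap source = new Hfs(new TextLine(new Fields("line")), inputPath);
        Tap sink = new Hfs(new TextDelimited(new Fields("word", "count"), "\t"), outputPath);

        // The pipe assembly is ordinary Java: split lines into words, group, count.
        Pipe head = new Pipe("wordcount");
        Pipe words = new Each(head, new Fields("line"),
            new RegexSplitGenerator(new Fields("word"), "\\s+"));
        Pipe counts = new Every(new GroupBy(words, new Fields("word")),
            new Count(new Fields("count")));

        FlowDef flowDef = FlowDef.flowDef()
            .setName("wordcount")
            .addSource(head, source)
            .addTailSink(counts, sink);

        // The connector plans the assembly into MapReduce jobs and runs them.
        new HadoopFlowConnector(new Properties()).connect(flowDef).complete();
      }
    }

No MapReduce-specific code appears anywhere in the assembly; the planner generates the jobs, which is the point Wensel is making about leveraging existing Java talent.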

Operationalizing Data

In addition to considerably reducing the skills needed to manipulate Big Data and transform it with applications that fulfill business objectives, application building platforms also produce a degree of reliability and automation that applications require to function properly. Although scripting languages such as R are useful for creating analytics and building applications, they frequently lack the scalability to handle Big Data sets.

The true advantage of application building is that it effectively operationalizes data, going a step further than the conventional insight of Business Intelligence or analytics to actually provide action. A number of critical prerequisites must be accounted for to operationalize data consistently and reliably enough for crucial business processes to hinge on it. The action must be continuous, self-sustaining, and ideally usable even by laymen.

Without a comprehensive application building platform, Data Scientists and IT are simply linking together a variety of disparate technologies in a way that may introduce latency or, even worse, the potential for malfunction. With such a platform, IT and operations personnel can use a single framework, in a language with which they are already familiar, to provide the perpetual reliability needed to automate valuable business processes. And with application monitoring tools such as Concurrent's Driven, which provide unparalleled insight and specificity into the behavior of applications and dramatically shorten the time required to address latency issues and malfunctions, those same personnel can ensure that those processes are actually optimized. Concurrent CEO Gary Nakamura commented:

“The skills gap is broadly systemic across most organizations. They all want to access the data, and they all want to build things on top of new data stores like Hadoop, but the challenge is that they’ve been using SQL for the last 30 years or they’ve been building Java applications for the last 20 years. There is a situation where only the elite can do low-level things.”

That’s where Cascading comes in, providing the instruction and the interfaces so that the rest of the organization, with its varying levels of skill, can leverage the data in Hadoop and operationalize it, putting applications into business processes.

Integration Logic

In addition to facilitating the business logic and the operationalized processes that are integral to maximizing business value, application building also plays a crucial role in integrating a variety of data sources and management tools. As Big Data technologies mature, there is an emerging trend for the enterprise to use platforms such as Hadoop to store all of its data and to integrate Big Data with traditional proprietary data in order to conduct comprehensive analytics on aggregated data sets.

A similar capability exists in data application frameworks such as Cascading, whose integration API lets developers and Data Scientists use a single tool to access the various data types, programming languages, and platforms that are pivotal to a particular application. The result is that developers, operations personnel, and engineers can focus on business logic regardless of the technologies involved, since Cascading provides a dedicated API for integration. Wensel illustrated Cascading's integration capabilities with the following example:

“Why can’t you just read data from Oracle and write it in Cassandra with a single application, and not have five applications to do that and have that data be stale the moment you actually get your hands on it? Cascading is designed to solve that problem; it’s first-class. It’s focused on solving any kind of problem where integration with existing legacy systems is a priority.”
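A minimal sketch of what Wensel is describing, with the class and method names invented for illustration: the pipe assembly carries the application logic, while the source and sink arrive as Tap instances, so Oracle- and Cassandra-backed taps (available as separate Cascading tap extensions, not shown here) can be swapped in without touching the logic.

    import java.util.Properties;

    import cascading.flow.FlowDef;
    import cascading.flow.hadoop.HadoopFlowConnector;
    import cascading.pipe.Pipe;
    import cascading.tap.Tap;

    public class CopyFlow {

      // A single application moves data between any two stores the taps can reach.
      public static void copy(Tap source, Tap sink) {
        Pipe copyPipe = new Pipe("copy");

        FlowDef flowDef = FlowDef.flowDef()
            .setName("store-to-store")
            .addSource(copyPipe, source)   // e.g. a JDBC-backed tap reading Oracle
            .addTailSink(copyPipe, sink);  // e.g. a Cassandra tap

        new HadoopFlowConnector(new Properties()).connect(flowDef).complete();
      }
    }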

Better than BI?

The reality of the influence and influx of data on business and operational processes is that, in more and more instances, organizations are actually becoming involved in the business of data. Data is no longer simply used to enhance their processes; instead it is sold as a product or service, as is the case with geo-spatial data or gene-sequencing information. In such situations, data applications are a requisite for the business and effectively represent the vital means by which that business is provided and functions.

In this respect, data application building represents the evolution of BI. Organizations are moving beyond simply conducting analysis with data and are instead using that data to produce specific action that creates business value:

“That’s truly what’s happening here,” Wensel reflected. “BI is being overrun by people who are actually delivering data as a product based on data. They’re taking raw materials, and delivering things. BI is a subset of that.”

In which case, data application building is the larger picture.

Hortonworks, Concurrent Team Up on Cascading SDK
Barry Levin, Newsfactor
April 22, 2014
http://www.newsfactor.com/story.xhtml?story_id=112009BFL3Z4

Concurrent and Hortonworks are teaming up to streamline enterprise application development of data-centric applications. The strategic partnership was announced Tuesday.

Concurrent, an enterprise data application platform company, will integrate its Cascading SDK, a widely used application development framework for Hadoop-based data applications, with Hortonworks Data Platform. Hortonworks, a provider of enterprise-targeted Apache Hadoop, will certify, support and deliver Cascading.

Hortonworks said it will also guarantee the ongoing compatibility of Cascading-based applications with future releases of the Hortonworks Data Platform, through continuous testing for compatibility and direct customer support.

Apache Tez

Cascading, in a future release, is expected to support Apache Tez, which will allow projects to accommodate faster response times and near real-time data processing. Companies that are using Cascading, Lingual, Scalding, or other dynamic programming language APIs and frameworks will now have the ability to migrate to new releases of the data platform that support Apache Tez.

On its Web site, Hortonworks said Apache Tez allowed “projects in the Apache Hadoop ecosystem…to meet demands for fast response times and extreme throughput at petabyte scale.”

John Kreisa, vice president of strategic marketing at Hortonworks, said in a statement: “By expanding our alliance with Concurrent and integrating with the Cascading application platform, Hortonworks’ customers can now drive even more value from their enterprise data by enabling the rapid development of data-driven applications.”

Cascading is open-source software used for developing and implementing complex data processing workflows on Hadoop, providing abstracted, higher-level control over MapReduce processing, the data processing backbone in Hadoop. It offers a computation engine, a systems integration framework, and data processing and scheduling capabilities. Cascading was originally created by Concurrent founder Chris Wensel, and Concurrent's services include commercial support for it.

‘A Must-Have’

Also open source, Apache Hadoop is a framework for processing large data sets. It is intended to provide insights into large stores of structured and unstructured data. Hortonworks was founded in 2011 by members of the original Hadoop development and operations team at Yahoo.

“The need for simple, powerful tools for big-data application development is a must-have to survive in today’s competitive climate,” Concurrent CEO Gary Nakamura told news media.

Hortonworks’ alliance with Concurrent is only the latest in a series. Last week, the Palo Alto, Calif.-based Hortonworks announced a new addition to its joint engineering alliance with open-source provider Red Hat. The two companies will integrate Red Hat’s platform-as-a-service OpenShift with the Hortonworks Data Platform, so that Hadoop can be used in an open hybrid cloud. This follows an announcement in February that the companies will deliver an open-source initiative for bringing Hadoop to the hybrid cloud.

In early April, Hortonworks announced a strategic partnership with Lucidworks, allowing users of Hortonworks Data Platform to access and analyze their data through the open-source enterprise search platform, Solr.

At the beginning of April, Hortonworks released version 2.1 of its data platform. Highlights of the new release included interactive SQL query, improved capabilities for data governance and security, and two new processing engines for streaming and search.

Hortonworks Adds Cascading For Big Data App Development
Doug Henschen, Information Week
April 21, 2014
http://www.informationweek.com/big-data/software-platforms/hortonworks-adds-cascading-for-big-data-app-development/d/d-id/1204594

Hortonworks adds Concurrent’s Cascading SDK to its Hadoop distribution to help developers operationalize big data applications.

Hortonworks wants to make it easier to build big data applications, so on Monday it announced that it will add software and support for the popular Cascading app-development framework to its Hadoop distribution.

Developed and supported by Concurrent Inc., Cascading is a Java-based framework for app development popularized by Internet giants including eBay, LinkedIn, and Twitter, and increasingly used by more conventional enterprises to operationalize big data applications. Where data analysts tend to do interactive, ad hoc analyses on Hadoop, the Cascading framework is geared to application developers who have to create repeatable big-data systems that run day after day. For example, one cable-company customer is using Cascading to develop applications based on set-top box data now analyzed on Hadoop.

“This company brings in 19 terabytes of set-top-box data per day, and they need to build applications that consume that data, process it, and deliver data products to different constituents including marketing and sales,” said Gary Nakamura, Concurrent’s CEO, in a phone interview with InformationWeek.

Cascading shields developers from the complexities of Hadoop programming, and with recent updates it has been certified by Hortonworks to work with Hadoop 2.0 and its YARN resource management framework. Cascading will also make use of Tez, a new feature of Hadoop 2.0 that eliminates the intermediate writes and delays associated with first-generation MapReduce programming.

“We’ve gone a lot deeper with Hortonworks with this announcement so that the 6,500-plus deployments that we have of Cascading can migrate from using MapReduce to Apache Tez without any code changes,” said Nakamura.
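What “without any code changes” looks like in practice, as a hedged sketch assuming the Cascading 3.x planner artifacts (cascading-hadoop2-mr1 and cascading-hadoop2-tez): the pipe assembly and FlowDef stay exactly as they are, and only the flow connector binding is swapped.

    import java.util.Properties;

    import cascading.flow.FlowConnector;
    import cascading.flow.FlowDef;
    import cascading.flow.hadoop2.Hadoop2MR1FlowConnector;  // MapReduce planner
    import cascading.flow.tez.Hadoop2TezFlowConnector;      // Tez planner

    public class ConnectorSwap {

      // The same FlowDef (sources, sinks, pipe assembly) runs on either fabric;
      // choosing Tez is a one-line change in how the flow is connected.
      public static void run(FlowDef flowDef, boolean useTez) {
        Properties properties = new Properties();

        FlowConnector connector = useTez
            ? new Hadoop2TezFlowConnector(properties)
            : new Hadoop2MR1FlowConnector(properties);

        connector.connect(flowDef).complete();
      }
    }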

Concurrent’s partnership with Hortonworks is non-exclusive, according to Nakamura, but he described Hortonworks as having “open arms to other technologies that help with the broader ecosystem and enterprise adoption of Hadoop.” Nakamura didn’t elaborate, but one reason for the tighter partnership with Hortonworks might be Cloudera’s efforts to go beyond Hive with Impala, which provides an interactive SQL interface for Hadoop. Concurrent offers Cascading Lingual as a SQL-on-Hadoop interface for developers building analytic applications.

With this week’s announcement, Hortonworks will ship the Cascading Software Development Kit as part of its Hortonworks Data Platform distribution, and it will also offer first- and second-tier support for the software.

What Is Data Flow and Why Should You Care?
Eric Kavanagh, Inside Analysis
February 24, 2014
http://insideanalysis.com/2014/02/what-is-data-flow-and-why-should-you-care/

What goes around surely comes back around, which in the world of data is often called lifecycle management. To be blunt, very few organizations have ever formalized and implemented such a grandiose practice, but that’s not a pejorative statement, for only recently has the concept become seriously doable without great expense.

Lifecycle management means following data from cradle to grave, or more precisely, from acquisition through integration, transformation, aggregation, crunching, archiving, and ultimately (if ever) deletion. That last leg is often a real kicker, and has entered the spotlight largely in the context of eDiscovery, which tends to be discussed in the legal arena – too much old data lying around can become a definite liability if some lawyer can use it against you.

But there’s a new, much more granular version of lifecycle management circulating these days, and it’s described by Dr. Robin Bloor as Data Flow. In fact, he even talks of a data flow architecture which can be leveraged to get value from data long before it ever enters a data warehouse (if it ever even does). Data Flow in this context can be a really big deal, because it can deliver immediate value without ever beating at the door of a data warehouse.

Data streams embody one of the hotter trends in data flow. Streams are essentially live feeds of data spinning out of various systems of all kinds – air traffic control data, stock ticker data, and a vast array of machine-generated data, whether in manufacturing, healthcare, Web traffic, you name it. Several innovative vendors are focused intently on data streams these days, such as IBM, SQLstream, Vitria, Extrahop Networks and others. The use cases typically revolve around the growing field of Operational Intelligence.

Data Flow Oriented Vendors

Finding ways to effectively visualize data flows can be a real treat. Some of the most talked-about vendors these days have worked to provide windows into the world of data flow – or at least basic systems management – primarily using so-called big data. Both Cloudera and Hortonworks have built their enterprises on the shoulders of Apache Hadoop, the powerful if somewhat cryptic engine that has the entire world of enterprise software in a genuine tizzy.

But there are a few other vendors who have excelled in the domain of providing detailed visibility into how data flows in certain contexts. The first that comes to mind is Concurrent, which just unveiled their Driven product. This offering is almost like the enterprise data version of a glass-encased ant farm. Remember those from childhood days? You could actually watch the ants build their tunnels, take care of business, cruise all around, get things done. For Driven, this is systems management 3.0 – you can actually see how the data moves through applications, where things go awry, and thus fine-tune your architecture.

Another vendor that talks all about data flow is Actian. Formerly known as Ingres, the recently renamed vendor went on an acquisition spree in recent times, folding ParAccel, Pervasive Software, and Versant into its data-platform portfolio of products. Mike Hoskins, once the CTO of Pervasive and now the CTO of Actian, can be credited with having had the vision years ago to build a parallel data flow platform, which originally went by the name of DataRush but is now simply referred to as Actian DataFlow. Actian’s view of the Big Data landscape involves Hadoop as a natural data collection vehicle and its DataFlow product as a means either of processing Hadoop data in situ or flowing it (also employing its data integration products) to an appropriate data engine, of which it has several, including Matrix (a scale-out analytical engine once known as ParAccel) and Vector (a scale-up engine).

And then there’s Alpine Data Labs, whose cloud-based solution offers a data science workbench. The collection of Alpine offerings, some of which derive from Greenplum Chorus, provides a wide range of functionality for doing all things data: modeling, visualizing, crunching, analyzing. And when you push the big red button to make the magic happen, you get a neat visual display of where the process is at any given point. This is both functional and didactic, helping aspiring data scientists better understand what’s happening where.

Like almost all data management vendors, Alpine touts a user-friendly, self-service environment. That said, the “self” who serves in such an atmosphere needs to be a very savvy information executive, someone who understands a fair amount about all the nuts and bolts of data lifecycle management. And though Alpine also talks of no data movement, what they really mean by that is data movement in the old ETL-around-the-mulberry bush sense. You still need to move data into the cloud, and set your update schedule, which incidentally runs via REST.

Of course, in a certain sense, there’s not too much new under the sun. After all, data flow happened in the very earliest information systems, even in the punch card era. But these days, the visibility into that movement will provide game-changing awareness of what data is, does, and can be.

Concurrent aims to smash the Hadoop “black box”
Lucy Carey, JAXenter
February 20, 2014
http://jaxenter.com/concurrent-aims-to-smash-the-hadoop-black-box-49506.html

Traditionally, one of the biggest productivity snags for app developers has been the Hadoop “black box” scenario, which makes it difficult to decipher what’s actually going on inside their project. Stepping up to solve this problem are Concurrent, with a (currently beta) solution called Driven, billed as “the industry’s first application performance management product” for Big Data apps.

Concurrent are probably best known for Cascading – a highly extensible Java app framework which makes it quick and easy for devs to build rich Data Analytics and Data Management apps and to deploy and manage them across diverse computing environments. CEO Gary Nakamura describes it primarily as “a tool that does the heavy lifting and converts your application logic built on Cascading into MapReduce jobs so that the developer doesn’t have to deal with the low level assembly language of Hadoop.”

On top of this, it delivers a computation engine, systems integration, data processing and scheduling capabilities through common interfaces, and runs on all popular Hadoop distributions – though it’s also capable of extending beyond the elephant ecosphere.

On the topic of Hadoop, although it may have lost some of its lustre in recent months, Gary is optimistic about the future of the technology, believing that the dimming of the “buzz” around it is a positive.

According to Gary, “Hadoop is maturing, and the user behavior is maturing along with it. Enterprises are taking a more pragmatic approach to formulating their data strategy and looking for the right tools to ensure success.” Whilst the Hadoop ecosystem is still quite “convoluted and confusing”, he believes that “those that have taken a deliberate and informed approach are the ones that are succeeding.”

He adds that, “enterprises have moved on to building data applications on their Hadoop investments and are driving business process and strategy through these data applications.” It’s this new wealth of data and applications that will underpin innovation and business advantage going forward, he affirms, “not necessarily the fabric that an application runs on.”

For this reason, Concurrent are confident that 2014 is the year for enterprise data applications to come to the fore, as Hadoop-based apps become “business critical and essential for businesses to move forward.”

Cascading is well placed for this scenario, with a number of differentiators from rival offerings to give devs that all-important productivity edge when building robust data applications.

As Gary puts it, “With Cascading, instead of becoming an expert in MapReduce, you can now leverage existing skill sets such as enterprise Java, Scala, SQL, R, etc.” Moreover, he says, Cascading’s local mode enables test-driven development practices where developers can efficiently test code and processes on their laptops before deploying on a Hadoop cluster.
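A minimal sketch of that local-mode workflow, assuming the cascading-local artifact and placeholder file paths: the same kind of pipe assembly used in production is wired to local file taps and run in-process, so it can be exercised from a test on a laptop with no Hadoop cluster involved.

    import java.util.Properties;

    import cascading.flow.FlowDef;
    import cascading.flow.local.LocalFlowConnector;
    import cascading.pipe.Pipe;
    import cascading.scheme.local.TextLine;
    import cascading.tap.Tap;
    import cascading.tap.local.FileTap;
    import cascading.tuple.Fields;

    public class LocalSmokeTest {

      // Stand-in for whatever pipe assembly the production job builds.
      static Pipe buildAssembly(Pipe head) {
        return head;  // placeholder: identity assembly
      }

      public static void main(String[] args) {
        // Local taps read and write ordinary files instead of HDFS.
        Tap source = new FileTap(new TextLine(new Fields("line")), "src/test/data/input.txt");
        Tap sink = new FileTap(new TextLine(new Fields("line")), "build/test-output.txt");

        Pipe head = new Pipe("smoke");
        Pipe tail = buildAssembly(head);

        FlowDef flowDef = FlowDef.flowDef()
            .setName("local-smoke-test")
            .addSource(head, source)
            .addTailSink(tail, sink);

        // LocalFlowConnector runs the whole flow in-process, no cluster needed.
        new LocalFlowConnector(new Properties()).connect(flowDef).complete();
      }
    }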

On to the newer offering – Driven, which was inspired by “years of painstaking experience with building enterprise data applications.” Its raison d’être is to make the process of developing, debugging and managing Cascading apps that bit more painless, as well as allowing for easy management of production data applications.

It’s been a long time in the making, with the key challenge centering on waiting for the market to mature enough for it to be a viable product, as opposed to any specific development challenges. For Concurrent, the magic time for “broad applicability” of Driven has now come.

Gary cites the key “Driven difference” as being a significantly faster timeline – up to ten times faster, in fact – for enterprise data applications to reach production. With Driven’s real-time app visualisation, instant app performance analytics, and data management and monitoring capabilities, Gary thinks that users will find they have “unprecedented visibility into your applications that don’t exist in the Hadoop ecosystem today” – and, most crucially, a way to circumvent the aforementioned “black box” scenario.

Although Concurrent doesn’t currently see any direct competitors for their tools, there are “indirect entities” around which they compete – for example, the open-source tools in the Apache Hadoop ecosystem.

The majority of Cascading’s popularity to date is something that Gary puts down to the “attrition” of developers using Pig and Hive. For now, the hope is that devs will ultimately adopt Cascading and Driven (once it moves out of beta status) as a tandem productivity enhancement solution, continuing this upward trajectory.

We’re All in the Data Business
Gary Nakamura, CEO, Concurrent, Inc.
February 11, 2014
http://recode.net/2014/02/11/were-all-in-the-data-business/

Imagine finding out that your headquarters is sitting on a diamond mine. But you’re an architectural firm, oil company, or a commercial real estate company — what do you know about diamonds?

Data is like that. Simply put, no matter what kind of company you are, you’re in the data business and you’re sitting on a kind of mine, whether or not you’ve tapped it. “You’ve got us mixed up with Twitter,” you might say. But hear me out. Every enterprise collects data. Facebook, Google and Twitter are clearly in the business of data. These organizations have mastered wrangling the complexity of data and turning it into products or services.

Smaller organizations are now looking at their data in new ways. Organizations that recruit employees, as well as temp agencies, may find they can sell data about employment trends to others who are looking for more sources of information about this key economic indicator.

For startups, monetizing data is a common strategy. Kaggle runs contests for analyzing big data. In the process, it created one of the largest collections of data about active data scientists in the world, an asset that can be used in dozens of ways.

A Japanese firm found that by looking at elevator activity from a building-automation system they could predict the likelihood of lease renewals. Building-automation data is being used for dozens of other money-saving purposes.

As more and more “things” like elevators become smart, more data will arrive, and the obvious question is what value can be obtained from it. Figuring out how you will profit from data means looking at your data from new angles and determining how the data you have is valuable to you or to your partners.

Mashing up data sources for business insight

You have more data than you think you do. Twitter, for all its servers, is no New York Stock Exchange; it’s also not a chemical plant, which generates more data in a day than Twitter does in a month. The enterprise grabs its data from an astounding number of sources: From Salesforce.com or its CRM system, its back-office systems such as SAP or Oracle, not to mention its downloads, click-throughs, Facebook “Likes” and Twitter followers. If you’re a manufacturer, you can pile on data from the factory-automation system and remote sensors. If you’re a retailer, you can add data from your point-of-sale and warehouse management systems. The key point here is not just having so many rich-data sources, but creating new alloys by mashing up those data sources in different ways to increase their business value.

Data-driven decisions = data monetization

Any enterprise that uses data to drive decisions is monetizing its data. In 2011, MIT researchers analyzed 179 publicly traded companies to find that data-driven decision-making (versus relying on a leader’s gut instincts) translated into about five percent higher productivity and profit. I’m surprised those numbers aren’t higher. Those same companies tended to score better in performance measures like asset utilization, return on equity, and market value, which are all forms of monetization.

Hard goods, big data

Manufacturers understand the value data can bring and, according to IDC, 43 percent of manufacturers are actively designing automated, connected factories of the future. Six in 10 manufacturers expect their production processes to be mostly or completely digitized in the next five years. The key in all this is not just the automation of production processes, but the ability to “listen” to all the data the machines in the factories are generating. Manufacturers can then aggregate that machine data, along with other data sources, for everything from optimizing maintenance schedules to saving energy costs to creating a more resilient supply chain. In other words, automation saves money, but it also generates voluminous data that can be used to create fine-grained operational models. Viewing that information in new ways can lead not only to new insights but also to new lines of business.

General Electric is in the data business. The company once known for “Bringing Good Things to Life” with consumer products is now leading with what it calls “The Industrial Internet,” aimed at connecting intelligent machines (like its own gas and steam turbines) with advanced analytics. The payoff is in preventing downtime and eliminating unnecessary labor. GE estimates that it costs the industry 52 million labor hours and more than $7 billion a year to service the gas and steam turbines at work across 56,000 power plants. Servicing machines that are in perfect working order wastes much of that labor. Sensor data tells them which machines are running outside normal ranges (for example, for temperature and vibration) and need servicing. Apply data like that to commercial aircraft, fleet vehicles, conveyors, medical equipment like MRI scanners, and the savings are enormous — especially if you eliminate a costly breakdown.

The new competitive landscape is data

Data is the new competitive landscape, and how we can use data in new ways is becoming clearer all the time. If we think of data as a product, then certainly Amazon, Google, Facebook and Twitter come to mind — they aggregate and sell user data for geospatial advertising (among other uses).

But data can be a product for internal use, as well. A bank client of my company, Concurrent, creates numerous products using its data, both customer-facing products and internal products geared toward risk management, compliance and trading policies. That bank assigns product managers to both types of products — which sounds novel for data products, but I predict that it will become the rule, not the exception. These data product managers bridge business requirements, imperatives and processes with their data. As more organizations recognize that they are in the business of data, data products will need to be managed like any other product line.

Trading data for data

Another way that companies are in the business of data is by trading data that they have — partial data sets — for more coherent data sets. One example is Jigsaw, the crowdsourced database of companies and contacts. “No more renting or buying costly company directory lists,” Jigsaw promises (presumably from costly services like Bloomberg). Instead, you share the contact information you have, and receive credits toward a more complete data set. A second example is Factual, with location-based data for mobile personalization and ad targeting. User companies contribute their own consumer location data for access to bigger data sets and to product data on more than 650,000 consumer packaged goods. (That’s data that you historically paid for with a Universal Product Code (UPC) membership.) Factual promises that companies can “share and mash open data on any subject,” and share and mash they do. Both Jigsaw and Factual offer premium services like data cleansing, but the price of admission is your partial data set.

Learning to mine big data

Enterprises of all kinds will only grow more data-rich, meaning there are more riches to be found. If data is a kind of rich repository (like diamonds), then it requires both crude and refined tools. Hadoop is like dynamite, just the first tool in mining diamonds. Yes, it can aggregate data into a whole, but none of those data leaders (like Amazon or GE) installs Hadoop and declares its job done. They are developing testable, reusable data processing applications that turn data into products they can use.

It will take creativity and business acumen to look at the data we have and begin to imagine its uses, either for us or for others. What I can tell you is that you are sitting on that mine. Exactly how that data is valuable to you is the question that I invite you to consider. Since we’re all in the business of data, I guarantee that the answers to that question will be worth the time you take to ponder it.

What you missed in Big Data: New solutions shift analytics landscape
Maria Deutscher, SiliconANGLE
February 10, 2014
http://siliconangle.com/blog/2014/02/10/what-you-missed-in-big-data-new-solutions-shift-analytics-landscape/

It’s been an exciting month for the Big Data ecosystem, with Cloudera repackaging its portfolio in an effort to make Hadoop more accessible to organizations of different sizes. Two emerging players also made headlines with new solutions for developing analytical applications.

Announced on Monday, the Cloudera Enterprise Data Hub combines the vendor’s open source distribution with a set of “advanced components” that consists of HBase, the Impala structured query engine and homegrown search and auditing tools. The offering also includes across-the-board support services and integration with Spark, an in-memory alternative to MapReduce that executes queries up to 100 times faster.

Organizations that don’t need all this extra functionality can opt for the Flex Edition, a slimmed-down version of the Enterprise Data Hub that comes with just one premium component. For the most economy-minded organizations, Cloudera is offering a Basic Edition that features only the core Hadoop distro plus support.

Over on the NoSQL front, Orchestrate launched a cloud-based service that allows developers to use a single API for accessing and managing information across different databases. The firm says that this abstraction reduces complexity and empowers users to drive more value from their data. The platform supports text search, graph, activity feed, time-ordered event and key-value queries, with more types to come.

Concurrent is taking a different approach to streamlining the development of data-driven apps. The company last week introduced a free performance monitoring tool that aims to simplify the maintenance of software built using Cascading, an open source Java framework for Hadoop. Dubbed Driven, the solution tracks data flows at runtime to ensure information quality and includes a broad set of metrics around program logic, giving operations professionals visibility into application behavior.

Concurrent Launches Driven APM Service for Hadoop Apps
Mike Vizard, IT Business Edge
February 7, 2014
http://www.itbusinessedge.com/blogs/it-unmasked/concurrent-launches-driven-apm-service-for-hadoop-apps.html

Cascading has emerged as an open source alternative to the native Java MapReduce application programming interface (API) for building Big Data applications on top of Hadoop. Developed by Concurrent, Cascading has been deployed in over 6,000 applications, which makes managing all of those applications a challenge.

To address that issue, Concurrent this week unfurled Driven, an application performance management (APM) service that organizations using Cascading can employ via the cloud. Accessed via a plug-in that developers insert inside a Cascading application, Driven exposes an API that allows organizations to track a variety of application performance metrics in real time.

With more organizations building applications on top of Hadoop, Concurrent CTO Chris Wensel says that figuring out where a specific performance issue may lie across a cluster that can consist of hundreds of servers is now a major challenge.

Wensel says Driven, currently in beta, makes it a lot easier to identify not only the applications that are failing to run, but also those that are running poorly for one reason or another. That capability is especially critical, says Wensel, when dealing with Hadoop applications that routinely span multiple terabytes of data.

While the development of Big Data applications is clearly still in its infancy, managing those applications represents a major challenge for IT operations teams that often have little familiarity with Hadoop. Rather than acquiring APM tools for Big Data applications and the infrastructure needed to run them, organizations can use Driven as a convenient service alternative that can be quickly deployed. Best of all, that approach doesn’t require the IT operations team to become a master of all things Hadoop overnight.

Concurrent’s new Hadoop monitoring tool looks out for big-data snags
Jordan Novet, VentureBeat
February 4, 2014
http://venturebeat.com/2014/02/03/concurrents-new-hadoop-monitoring-tool-looks-out-for-big-data-snags/

Software developers want to know if their applications are running properly on the Internet. Just look at the runaway success of application-performance management companies like New Relic and AppDynamics.

Similarly, engineers who process lots of different kinds of data using the Hadoop ecosystem of open-source software need a monitoring tool to make sure their Hadoop jobs are running right.

Just such a tool is coming out today from Concurrent, the company behind the open-source Cascading framework for implementing Hadoop jobs.

It’s called Driven. And it’s been a long time in the making. Founder Chris Wensel has been talking about it for at least two and a half years.

“The net of it is that we’re going to be able to provide very detailed, high-fidelity [information] about data applications as they’re running on Hadoop,” Wensel said in an interview with VentureBeat. “So things like, you know, your joins and merges and filters — you’ll be able to see them progress … . We’ll notify you of failures, and we’ll take you right to the line of code where the application failed essentially and give you insight to go and fix the issue.”

Concurrent is making Driven available as a free service that runs in the cloud, so Cascading users can get a taste of it while building applications, Gary Nakamura, Concurrent’s chief executive, told VentureBeat. If people want to use Driven in production or in on-premises data centers, they’ll have to pay for an annual license, Nakamura said.

Engineers using the service in an early adopter program have been pleased with Driven’s ability to identify problems, so debugging can take less time, Wensel said.

Driven sounds like the kind of thing that could make more companies comfortable with the idea of experimenting with Hadoop. Hadoop can work in addition to or in the place of more traditional, and typically more expensive, data warehousing technology from companies like IBM and Teradata. Hadoop still isn’t extremely widely adopted, but it’s gotten more popular each year since the Apache Hadoop project began in 2006.

The complexity of Hadoop has inhibited its adoption, along with concerns about security and other issues. If Driven does manage to make Hadoop easier to operate, Concurrent could stand to enjoy a degree of the success that Hadoop companies like Cloudera have. Cloudera is in a position to go public later this year.

San Francisco-based Concurrent started in 2008, and to date it has raised $5 million, including a $4 million round last year.

How Concurrent hopes to help ID Hadoop bottlenecks
Nancy Gohring, IT World
February 4, 2014
http://www.itworld.com/cloud-computing/403167/how-concurrent-hopes-help-id-hadoop-bottlenecks

Concurrent, the company behind Cascading, the application framework for building big data apps, is now hoping to take the pain out of managing those apps.

Concurrent is releasing today in beta a product called Driven that lets developers identify problems that are holding up their Hadoop applications.

“Enterprise developers will be able to see their apps running and understand where their apps are failing, down to the line of code,” said Gary Nakamura, CEO of Concurrent.

He hopes Driven will solve a problem that is eating up loads of time for developers. “Hadoop is a complete black box,” he said. “In order to find out what happened to an app running on Hadoop, you have to scrape logs from thousands of nodes and look for a needle in a haystack. That’s often an untenable task that takes weeks or months,” he said.

In fact, companies are now filling positions for a person whose entire job is to “stare at Hadoop clusters all day long and make life or death decisions,” said Chris Wensel, CTO of Concurrent.

Complicating matters, when administrators look at Hadoop jobs, they may see what appears to be one job but is in fact several jobs bundled together. Without insight into those component jobs, an administrator may kill a job because it is clearly operating poorly and unknowingly take down an important job that has been running for a day.

With Driven, users will not only be able to see individual jobs but also see who owns them. So if an admin finds a job that’s behaving badly but belongs to a very important, time-sensitive project that’s nearly finished, the admin can decide to let it run.

“Or if it turns out that an intern wrote some bad code, you can kill it and go talk to that intern,” Wensel said.

Driven will show enterprise developers where slow-downs are occurring so that they can fix the problem. It also includes some collaboration features designed to make it easy for developers, data scientists and operators, all of whom may work together on an app, to share views of the app in order to discuss areas that might need improvement.

Driven is a cloud service that can work on Hadoop applications running internally or in a public cloud. For now, it supports Cascading apps, which Concurrent says means it’ll be available for multiple thousands of app deployments. It’s free to use in a development environment now; a version for commercial deployments, which will include support, is scheduled to launch in the second quarter.

It’s a safe bet that more products like this will appear, given the growth of big data and the emerging challenges enterprises face in managing it.