EMC and Pivotal are offering the Data Lake Hadoop Bundle 2.0, a turn-key solution that combines compute, analytics and storage for customers building scale-out Data Lakes for predictive analytics.
Data Lakes are gaining interest in and around IT water coolers as an infinitely scalable repository for critical data generated by traditional and next-generation workloads. EMC’s interpretation of the scale-out Data Lake is designed to be enterprise-ready, helping organizations derive immediate business value from Big Data.
The Data Lake Hadoop Bundle 2.0 includes EMC’s Data Computing Appliance (DCA), Isilon scale-out NAS storage and Pivotal HD, Pivotal’s Hadoop distribution.
Ashish Nadkarni, Research Director at IDC, says, “Combining storage, compute and analytics capabilities for scale-out Data Lakes is tremendously valuable in this age of Big Data—and the addition of predictive functionality means that customers can quickly put this solution to good use to help positively impact their bottom line.”
Gartner, however, warns that vendors are marketing data lakes as an essential component to capitalize on Big Data opportunities, but there is little alignment between vendors about what comprises a data lake, or how to get value from it. The result is confusion.
"In broad terms, data lakes are marketed as enterprise-wide data management platforms for analyzing disparate sources of data in its native format," said Nick Heudecker, research director at Gartner. "The idea is simple: instead of placing data in a purpose-built data store, you move it into a data lake in its original format. This eliminates the upfront costs of data ingestion, like transformation. Once data is placed into the lake, it's available for analysis by everyone in the organization."
However, while the marketing hype suggests audiences throughout an enterprise will leverage data lakes, this positioning assumes that all those audiences are highly skilled at data manipulation and analysis, as data lakes lack semantic consistency and governed metadata.
Data lakes focus on storing disparate data and ignore how or why data is used, governed, defined and secured. The data lake concept hopes to solve two problems, one old and one new. The old problem it tries to solve is information silos. Rather than having dozens of independently managed collections of data, you can combine these sources in the unmanaged data lake. The consolidation theoretically results in increased information use and sharing, while cutting costs through server and license reduction.
The new problem data lakes conceptually tackle pertains to Big Data initiatives. Big Data projects require a large amount of varied information. The information is so varied that it's not clear what it is when it is received, and constraining it in something as structured as a data warehouse or relational database management system (RDBMS) constrains future analysis.
"Addressing both of these issues with a data lake certainly benefits IT in the short term in that IT no longer has to spend time understanding how information is used – data is simply dumped into the data lake," said Andrew White, vice president and distinguished analyst at Gartner. "However, getting value out of the data remains the responsibility of the business end user. Of course, technology could be applied or added to the lake to do this, but without at least some semblance of information governance, the lake will end up being a collection of disconnected data pools or information silos all in one place."
Gartner warns that data lakes carry substantial risks, chief of which is the inability to determine data quality or to trace the lineage of findings by other analysts or users who have previously found value in the same data. Because a data lake accepts any data without oversight or governance, and lacks descriptive metadata and a mechanism to maintain it, the lake risks turning into a data swamp. And without metadata, every subsequent use of the data means analysts start from scratch.
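The "start from scratch" problem comes down to the absence of even a minimal catalog. As a hypothetical sketch (the field names and paths below are illustrative, not any real product's schema), recording a small descriptive metadata entry at ingestion time is enough to let a later analyst discover what a dataset is and who owns it, rather than reverse-engineering raw files:

```python
from datetime import datetime, timezone

# A minimal catalog: each dataset dropped into the lake gets a descriptive
# metadata record so later analysts do not have to start from scratch.
catalog = {}

def register_dataset(path, owner, description, source, fields):
    catalog[path] = {
        "owner": owner,
        "description": description,
        "source": source,
        "fields": fields,
        "registered_at": datetime.now(timezone.utc).isoformat(),
    }

def find_datasets(keyword):
    # Without a lookup like this, every analysis begins by re-discovering
    # what the raw files in the lake actually contain.
    return [path for path, meta in catalog.items()
            if keyword.lower() in meta["description"].lower()]

register_dataset(
    path="/lake/raw/web_clicks/2014-09",
    owner="web-team",
    description="Raw clickstream events from the public site",
    source="nginx access logs",
    fields=["timestamp", "user_id", "url"],
)
print(find_datasets("clickstream"))  # ['/lake/raw/web_clicks/2014-09']
```

The catch, as White notes below, is that someone has to be responsible for keeping these records accurate — which is precisely the governance work the "just dump it in the lake" approach defers.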
Another risk is security and access control. Data can be placed into the data lake with no oversight of the contents. Many data lakes are being used for data whose privacy and regulatory requirements are likely to represent risk exposure. The security capabilities of central data lake technologies are still embryonic. These issues will not be addressed if left to non-IT personnel.
Finally, performance aspects should not be overlooked. Tools and data interfaces simply cannot perform at the same level against a general-purpose store as they can against optimized and purpose-built infrastructure. For these reasons, Gartner recommends that organizations focus on semantic consistency and performance in upstream applications and data stores instead of information consolidation in a data lake.
"There is always value to be found in data but the question your organization has to address is this — do we allow or even encourage one-off, independent analysis of information in silos or a data lake, bringing said data together, or do we formalize to a degree that effort, and try to sustain the value-generating skills we develop?" said White. "If the option is the former, it is quite likely that a data lake will appeal. If the decision tends toward the latter, it is beneficial to move beyond a data lake concept quite quickly in order to develop a more robust logical data warehouse strategy."
So before you even think about inviting EMC and Pivotal to demo their new offering, it is best to consult a third party (preferably one not affiliated with either vendor) to interpret what a data lake means to your business and to the business users working in your organization.