Data lakes have gained widespread popularity over the past few years. Despite the lack of consensus on a precise definition, global tech giants such as Amazon, Alibaba, Tencent, and Huawei have all drawn up plans to build their own.
In the age of big data and artificial intelligence, data lakes are expected to become a key platform for the convergence of storage, computing, and analytics, and this trend is even more evident when data lakes are complemented by cloud-native technologies.
Data lakes are on the rise
In 2010, James Dixon, founder and CTO of Pentaho, introduced the concept of the data lake, likening it to a body of raw water: it holds unprocessed data that retains its original structure.
Various types of users can access the lake to obtain, distill, and purify the data (water) flowing in from multiple sources. A data lake is therefore typically characterized as a centralized repository that stores structured, semi-structured, and unstructured data in its original format.
As big data technologies converge and develop, the boundaries of the data lake keep expanding. It has grown into a comprehensive big data solution offering unified storage, multi-paradigm computation and analysis over multi-source heterogeneous data, and unified management and invocation.
In this regard, data lakes differ significantly from data warehouses.
A data warehouse is a solution that converts data into a particular format and loads it into a separate repository, typically with columnar storage, at regular intervals to meet enterprise querying and analysis needs.
Business data used to consist primarily of ERP and CRM data, often terabytes in size, so enterprises typically deployed data warehouse solutions on-premises to store and analyze it. A data warehouse imposes a fixed schema up front, and the data underlying that schema is difficult to change.
The development of the Internet has produced an explosion of data, especially unstructured data, and accelerated changes in enterprise systems. Digital transformation, now a hot topic in the IT industry, calls for a deeper understanding of data. It is therefore imperative to retain the original information contained in the data to meet the changing needs of the future.
With the advent of big data, traditional data warehouses can no longer meet enterprises' demands for real-time and interactive analysis. Data lakes instead adopt a "loose in, tight out" design principle: the strict model is eliminated at ingestion time and the schema is applied later, at read time, for greater flexibility, while unified storage and computation optimization ensure data consistency and performance. As a result, the big data community has gradually turned its attention to the data lake model.
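The "loose in, tight out" (schema-on-read) idea can be sketched with a toy example: raw records land in the lake untouched, and each consumer applies its own schema only when it reads. The snippet below is a minimal illustration in plain Python, with invented file contents and field names, not a real lake engine:

```python
import json

# Raw, heterogeneous events are written to the lake exactly as they
# arrive -- no upfront schema is enforced ("loose in").
raw_events = [
    '{"user": "alice", "action": "login", "ts": 1}',
    '{"user": "bob", "action": "purchase", "amount": 42.0, "ts": 2}',
    '{"device": "sensor-7", "temp_c": 21.5, "ts": 3}',  # a different shape entirely
]

def read_with_schema(lines, fields):
    """Apply a schema at read time ("tight out"): keep only records
    that carry every requested field, projecting just those fields."""
    out = []
    for line in lines:
        record = json.loads(line)
        if all(f in record for f in fields):
            out.append({f: record[f] for f in fields})
    return out

# Two consumers read the same raw data with different schemas.
user_actions = read_with_schema(raw_events, ["user", "action", "ts"])
sensor_readings = read_with_schema(raw_events, ["device", "temp_c"])

print(user_actions)     # the two user events, projected to three fields
print(sensor_readings)  # only the sensor event
```

The same raw bytes serve both consumers; neither schema had to exist when the data was written, which is exactly the flexibility a fixed-paradigm warehouse gives up.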
The concept of the data lake is no longer restricted to a specific technology or software product but covers a wide range of applications such as storage, computing, and artificial intelligence to meet the needs of enterprise-level users in terms of production management.
Cloud-native and data lakes: Why do they make great partners?
With the rapid evolution of enterprise business, traditional databases such as Oracle have found it increasingly difficult to keep up with changing data processing needs.
For this reason, the IT industry continuously produces new computing engines.
Several enterprises have built their own open-source Hadoop data lake architectures, in which raw data is stored uniformly on HDFS and the engines are drawn from the Hadoop and Spark open-source ecosystems, with storage and computing coupled on the same cluster.
This architecture, however, has drawbacks: enterprises must operate and manage the whole cluster themselves, which is costly and results in poor stability.
In response, the cloud-hosted Hadoop data lake emerged (i.e., the EMR open-source data lake). Cloud vendors provide and manage the underlying physical servers and open-source software versions, while the data is still stored uniformly in HDFS and the engines remain based on the Hadoop and Spark open-source ecosystems. With this architecture, enterprises can improve machine resilience and stability using cloud-based IaaS, thereby reducing overall operational costs; however, they are still responsible for application-level operations, such as managing and governing the HDFS system and its services.
Because storage and computing remain coupled, stability is suboptimal, resources cannot be scaled independently, and usage costs stay high. Meanwhile, owing to the inherent limitations of open-source software, traditional data lakes cannot meet enterprises' needs for data scale, storage cost, query performance, and flexible computing architectures. In other words, the data lake architecture is not yet ideal.
Cloud computing allows the data lake to reach its full potential: it offers highly flexible, resilient, and scalable computing and storage resources, making it far easier to store, analyze, and apply data.
Moreover, the greatest value of the data lake lies in unifying the various data formats within an enterprise and enabling multiple kinds of analysis over a single copy of the data, with cost-effective and efficient mining. Since 2010, when the idea of the data lake was first proposed, cloud service providers have played an essential role in its implementation.
In the cloud-native age, data lakes are increasingly deployed in a cloud-native manner. When people hear the term cloud-native, they immediately think of serverless computing, containerization, and the like; in recent years, however, the term has been extended to cover a wide range of products and services.
Essentially, cloud-native is a paradigm for designing distributed systems whose resilience, security, stability, and other advantages can be exploited to the fullest to enhance performance.
The data lake can benefit from the performance enhancements the cloud provides. One advantage of cloud computing is high availability: compared with on-premises data centers (IDCs), the cloud offers greater resource redundancy and can seamlessly fail over to other nodes in the event of a failure, ensuring business continuity.
Meanwhile, it offers resilience. Being scalable and affordable, cloud computing can absorb massive business volumes and handle both the enormous resource scale and the bursty nature of big data analytics.
The final factor is agility. By eliminating repetitive and complex IT work, the cloud enables enterprises to iterate, deploy, operate, and innovate quickly.
Furthermore, data lakes can optimize performance more effectively in a cloud-native environment, through features such as a rich set of analytics accelerators, real-time value mining from converged stream and batch processing, and security and quality improvements via a one-stop solution for data management.
Enterprises can now make effective use of public cloud infrastructure, and data lake platforms have a wider range of technology options: fully managed cloud storage is gradually replacing HDFS as the storage layer, and the richness of available engines keeps improving. By leveraging the cloud's characteristics of pooling, resilience, and agility, many data and application layers can be realized, making cloud-native a natural choice for data lakes and for big data in general.
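In practice, pointing an engine at cloud object storage instead of HDFS often amounts to little more than a configuration change. As a hedged illustration, a Spark deployment might direct its warehouse at an S3-compatible bucket through the `s3a` connector; the bucket name and endpoint below are placeholders, not a vendor-specific recipe:

```properties
# spark-defaults.conf (illustrative fragment; values are placeholders)
spark.hadoop.fs.s3a.impl        org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.endpoint    s3.example-region.amazonaws.com
spark.sql.warehouse.dir         s3a://example-datalake/warehouse
```

Because compute then reads directly from the object store, clusters can be scaled, or even torn down, independently of the data, which is precisely the decoupling that HDFS-based lakes lack.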
The future of cloud-native data lakes
Essentially, cloud-native data lakes are new technical products built by big data computing platforms on top of cloud computing. They support flexible storage of heterogeneous data and elastic scaling of computing resources, helping enterprises cope with increasingly complex data structures and ever-tighter requirements for the timeliness of data processing.
The cloud-native data lake is therefore only an architectural principle, and there are a number of ways to implement it, including EMR- and Flink-based solutions.
Although data lake technology is developing rapidly in China, with more and more public cloud vendors innovating in this space, the implementation of data lakes still faces many difficulties.
Barriers remain in data collection, categorization, and cleaning, and experience in data lake modeling is lacking. Overall, China's data lake market is still at an early stage, with inconsistent roadmaps and fragmented product capabilities across the industry.
At the product level, the data governance capabilities and end-to-end coverage of data lakes still need to be further strengthened.
Data governance requires that data classifications and rules be recorded in a catalog. If an enterprise's control over its data lake is insufficient, the lake's catalog and overall architecture will be poorly designed, and the data in the lake will not be adequately archived or maintained, turning the data lake into a data swamp. Lacking contextual metadata, a data swamp cannot be searched, so users cannot effectively analyze and utilize the data.
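The role of contextual metadata can be illustrated with a toy catalog: each dataset entering the lake is registered with an owner, a classification, and tags, so it remains discoverable later. This is a simplified sketch in plain Python, with invented paths and tags, not a real governance tool:

```python
# A minimal metadata catalog: without entries like these, data in the
# lake cannot be searched or classified, and the lake degrades into a
# "swamp".
catalog = {}

def register(path, owner, classification, tags):
    """Record contextual metadata for a dataset as it enters the lake."""
    catalog[path] = {
        "owner": owner,
        "classification": classification,
        "tags": set(tags),
    }

def find_by_tag(tag):
    """Discovery relies entirely on the catalog, not the raw files."""
    return sorted(p for p, meta in catalog.items() if tag in meta["tags"])

register("/lake/raw/orders", "sales", "internal", ["orders", "daily"])
register("/lake/raw/clickstream", "web", "internal", ["events", "daily"])
register("/lake/raw/customers", "crm", "pii", ["customers"])

print(find_by_tag("daily"))  # ['/lake/raw/clickstream', '/lake/raw/orders']
```

Note that the classification field (e.g., `pii`) is also what later makes fine-grained access control possible; datasets that never enter the catalog can be neither found nor protected.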
Chinese domestic vendors offering end-to-end cloud-native data lake services are still scarce, and most support only individual data lake components. Downstream companies are therefore forced to rely on multiple vendors for data collection, governance, analysis, and visualization.
Furthermore, application-level training and industry awareness of cloud-native data lakes are lacking. As the big data and artificial intelligence technology stacks continue to evolve, skilled professionals are in high demand. Managers sometimes have little knowledge of data governance and build a data lake blindly, without thoroughly analyzing their current situation, leading to poor business outcomes. Despite widespread recognition of the value of data, promoting data lakes and raising awareness of them remain challenging, as many enterprises stay cautious and adopt a wait-and-see attitude.
Aside from this, as enterprises move toward digital transformation, data has become one of the most critical factors of production, and one of the biggest risks is security, particularly access control. Large volumes of data enter the lake without regulation; when some of that data is subject to privacy and regulatory requirements that the rest is not, data leakage and loss become likely, with incalculable consequences.
A new industry faces numerous challenges in its early stages, but such imperfections are precisely what leave room for growth. As iResearch's "China Cloud-Native Data Lake Application White Paper" shows, the maturity of the big data industry in China has been boosted by favorable national policies, such as the "Action Plan for Promoting the Development of Big Data", the "Implementation Plan for National Big Data Center and Collaborative Innovation System of Computing Hub", and other documents related to the advancement of Internet technology and digital transformation.
With China's cloud-native data lake market expected to grow at a 39.7% CAGR over the next five years, its development in the near future is well worth watching.