In 2010 we witnessed the rise of the data lake – a cost-effective way for organizations to store and manage their raw data. A data lake was a single repository for storing all organizational data, including data generated from all internal transactions and interactions as well as from third-party and public sources.
Data lakes promised a powerful, agile, and cost-effective environment that could scale out to assemble, prepare, enrich, and analyze diverse structured and unstructured data sources. But unfortunately, for a long time, these data lakes only stored data and failed to generate any value. They ended up as desolate dumping grounds while the deluge of data around us continued to increase. Today, however, data lakes are recapturing their lost luster. And we have the cloud to thank for that.
We have previously visited the topic of why moving your data lake to the cloud makes sense. Scalability, dynamic processing capabilities, better footing to navigate the governance and compliance minefield, access to the latest tools and technologies for working with the data, stronger business continuity capabilities, and a significant cost advantage are the usual benefits of this move.
But can you haul your data lake and move it to the cloud?
Moving your data lake to the cloud needs preparation and a rock-solid strategy. It is not an overnight process, and you have a long list of things to consider.
Account for the needs of your modern-day data lake
Data lakes are diverse. They come in different shapes and sizes and are used for different purposes. Some businesses house their data lakes on the Hadoop Distributed File System (HDFS) and store large files across Hadoop clusters. Others also keep data outside of HDFS and need to process data in object storage systems such as Amazon Simple Storage Service (S3).
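If the target is object storage, much of the processing code can stay the same. As a minimal sketch, assuming a Spark environment with the S3A connector available (the bucket and paths below are hypothetical placeholders), the same PySpark job can read from HDFS or from S3:

```python
from pyspark.sql import SparkSession

# Minimal sketch: the bucket and paths below are hypothetical placeholders.
spark = SparkSession.builder.appName("data-lake-read-sketch").getOrCreate()

# Reading large files stored across the Hadoop cluster's file system...
hdfs_events = spark.read.parquet("hdfs:///data/lake/events/")

# ...versus reading from an object store such as Amazon S3
# (assumes the hadoop-aws / S3A connector is on the classpath).
s3_events = spark.read.parquet("s3a://example-data-lake-bucket/events/")

# Downstream transformations stay the same regardless of where the data lives.
s3_events.groupBy("event_date").count().show()
```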
When moving the data lake to the cloud, the initial effort should focus on picking and choosing the right use cases for the migration. Moving the data lake to the cloud is not an ‘all or nothing’ proposition. Choose use cases that have high business value, are easy to migrate, and can show results quickly.
The on-premise to cloud journey
Moving the data lake to the cloud requires meticulous planning. This journey usually has four phases:
- Greenfield: Identify a small set of use cases in a particular business line and then move that infrastructure to the cloud.
- Hybrid: Move a percentage of the data to the cloud while the rest remains on-premise. Establish processes to manage this new environment and build integration between the on-premise platform and the cloud to satisfy all user needs.
- Full cloud: Leverage the ‘hybrid’ experience to move the data lake entirely to the cloud.
- Multi-cloud: Assess how you can make the platform agnostic to facilitate data movement across cloud providers and technologies (such as AWS, Azure, Google Cloud, etc.). You can also evaluate how to containerize applications to enable robust scalability; a minimal sketch of provider-agnostic data movement follows below.
At each of these phases, you also need to determine which cluster will best suit your needs for specific jobs.
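As one way to keep data movement provider-agnostic in the multi-cloud phase, a filesystem abstraction can address different object stores through a single interface. The sketch below uses the fsspec library (with the s3fs and gcsfs packages installed); the bucket names and object paths are hypothetical placeholders:

```python
import fsspec

# Minimal sketch: copy an object between cloud providers through a common
# filesystem abstraction so application code is not tied to one vendor.
# Bucket names are hypothetical; credentials are assumed to come from each
# provider's default credential chain.
SOURCE = "s3://example-source-bucket/lake/events/2024-01-01.parquet"
TARGET = "gcs://example-target-bucket/lake/events/2024-01-01.parquet"

with fsspec.open(SOURCE, "rb") as src, fsspec.open(TARGET, "wb") as dst:
    dst.write(src.read())  # fine for a sketch; stream in chunks for large files
```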
Technology choices
Your cloud environment will determine the kind of technologies you use. In a single-cloud environment, for example, it makes sense to choose native components and technologies that are tightly integrated with the rest of the ecosystem. This helps create a solution that enables accelerated development. It also means, however, that your application will be more dependent on those services. Evaluating this trade-off carefully is essential to determine the correct approach for your business.
Security
As the volume of data in the data lake increases, especially sensitive data, security has to be a top priority. Along with ensuring that client data is secure, you also need to evaluate other security parameters such as data encryption at rest, data transfer monitoring, and administrator monitoring.
You also need strong security audits and certifications (such as ISO 27001, FedRAMP, DoD SRG, and PCI DSS), robust security key management controls, and systems in place for intrusion detection and prevention to ensure that data is secure both at rest and in motion.
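For example, if the raw zone of your data lake sits in Amazon S3, default encryption at rest can be enforced per bucket. The sketch below uses boto3 with a hypothetical bucket name and KMS key:

```python
import boto3

# Minimal sketch: enforce encryption at rest for a data lake bucket.
# The bucket name and KMS key ARN are hypothetical placeholders.
s3 = boto3.client("s3")

s3.put_bucket_encryption(
    Bucket="example-data-lake-raw",
    ServerSideEncryptionConfiguration={
        "Rules": [
            {
                "ApplyServerSideEncryptionByDefault": {
                    "SSEAlgorithm": "aws:kms",
                    "KMSMasterKeyID": "arn:aws:kms:us-east-1:111122223333:key/example-key-id",
                }
            }
        ]
    },
)
```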
Compute Needs
Compute needs differ according to the analytics workload. Streaming analytics, for example, needs high throughput, while batch processing may be more process-intensive. AI might work better with GPUs, while Spark might consume more memory. The list goes on.
When moving the data lake to the cloud, it no longer makes sense to tie storage directly to compute in each node. Focus instead on storage-optimized, compute-optimized, or memory-optimized instances. By decoupling compute from storage, you will also be able to use more specialized resources flexibly and efficiently.
You can also leverage compute-on-demand using transient clusters. Evaluate carefully where this pay-for-what-you-use model can reduce overall costs. Metadata management here, however, can be a challenge.
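As a rough sketch of compute-on-demand, the example below launches a transient Amazon EMR cluster with boto3 that runs a single Spark step and terminates itself when the step finishes. The cluster name, roles, instance types, and script location are hypothetical placeholders:

```python
import boto3

# Minimal sketch: a transient EMR cluster that runs one Spark step and then
# terminates. Names, roles, and S3 paths are hypothetical placeholders.
emr = boto3.client("emr")

response = emr.run_job_flow(
    Name="transient-nightly-enrichment",
    ReleaseLabel="emr-6.15.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "r5.2xlarge", "InstanceCount": 2},
        ],
        # Transient behaviour: shut the cluster down once all steps complete.
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    Steps=[
        {
            "Name": "enrich-events",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "s3://example-bucket/jobs/enrich_events.py"],
            },
        }
    ],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])
```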
Manage metadata, its challenges, and business definitions
In the cloud, storage and compute clusters are separate, and with transient clusters, cloud providers automatically delete the metadata when a cluster is shut down. It thus becomes essential to use a data lake management platform to maintain the metadata. Clearly outline business definitions so you understand which tables are in the data lake, where they are located, and how they are structured.
Having such a platform also helps track data lineage, especially in a hybrid environment. Data in such an environment comes from different sources, moves across clusters, and is combined and enriched, all of which makes the data lifecycle more complex. Given the strict compliance and regulatory landscape, proper management of the data lifecycle and tracking lineage are essential.
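One common pattern, sketched below, is to point Spark at a metastore that lives outside any single cluster (for example the AWS Glue Data Catalog on EMR) so that table definitions survive when a transient cluster is terminated. The database, table, and bucket names are hypothetical placeholders:

```python
from pyspark.sql import SparkSession

# Minimal sketch: use an external, cluster-independent metastore so that table
# definitions outlive any single transient cluster. The factory class below is
# the one EMR documents for the Glue Data Catalog; database, table, and bucket
# names are hypothetical placeholders.
spark = (
    SparkSession.builder.appName("external-metastore-sketch")
    .config(
        "hive.metastore.client.factory.class",
        "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory",
    )
    .enableHiveSupport()
    .getOrCreate()
)

# Register a table whose metadata (schema, location, partitioning) lives in the
# external catalog, while the data itself stays in object storage.
spark.sql("""
    CREATE TABLE IF NOT EXISTS lake_db.events (
        event_id STRING,
        event_date DATE,
        payload STRING
    )
    USING PARQUET
    LOCATION 's3://example-data-lake-bucket/curated/events/'
""")
```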
Test often, test continuously
Yes, we need to focus on testing when migrating a data lake to the cloud. You might have a great new environment for your data lake, but one weak link can bring the whole migration down. It therefore makes sense to test this new environment continuously and as thoroughly as possible to ensure that no critical piece of the migration process is left untouched. A strong testing focus also ensures that no vulnerability, no matter how small, can negatively affect the cloud data lake environment.
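One simple check that can run continuously during and after the migration is to reconcile the migrated data against its on-premise source. The sketch below uses pytest; the migration_checks module and its count_rows helper are hypothetical pieces you would implement against your own on-premise and cloud engines:

```python
import pytest

# Minimal sketch of a reconciliation test. `count_rows` is a hypothetical
# helper you would implement against your own on-premise and cloud engines
# (e.g. via JDBC on one side and a cloud SQL endpoint on the other).
from migration_checks import count_rows  # hypothetical module

TABLES = ["events", "customers", "transactions"]

@pytest.mark.parametrize("table", TABLES)
def test_row_counts_match(table):
    on_prem = count_rows(environment="on_prem", table=table)
    cloud = count_rows(environment="cloud", table=table)
    assert on_prem == cloud, (
        f"Row count mismatch for {table}: on-prem={on_prem}, cloud={cloud}"
    )
```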
Clearly, in the age of the cloud, moving the data lake there makes complete sense. However, we must proceed with careful planning and make sure we also have the right resources in place, with clear roles and responsibilities. It is these resources who will carry this endeavor to a successful conclusion and give you more opportunities to drive business growth.