Step By Step Guide To Implement Data Lake in AWS
Data lake — it’s not a new term for a techie like you! Many organizations uncover valuable insights and grow their revenue with this technique, and AWS gives you everything you need to use the cloud to its fullest for it. In this article, let’s take a detailed look at implementing a data lake on AWS.
Services that can be integrated with data lake
AWS offers many services for very specific functions. Listed below are the services you can integrate with a data lake:
- Amazon Simple Storage Service (Amazon S3)
- Amazon Redshift
- Amazon Kinesis
- Amazon Athena
- AWS Glue
- Amazon Elasticsearch Service (Amazon ES)
- Amazon SageMaker
- Amazon QuickSight
These are the fundamental components of a data lake. Depending on the kind of data and the business flow, these components interact with each other repeatedly.
Features of Data Lake with Integrated AWS Services
- Data ingestion, including batch submissions to Amazon S3 and streaming submissions to Amazon Kinesis (a minimal ingestion sketch follows this list)
- Processing of incoming data, such as data validation, metadata extraction, and indexing, with Amazon S3, Amazon SNS, AWS Lambda, Amazon Kinesis Data Analytics, and Amazon ES
- Dataset administration with Amazon Redshift transformations and Kinesis Data Analytics
- Data transformation with AWS Glue, and data analysis with Amazon Athena and Amazon Redshift Spectrum
- Amazon SageMaker to build and deploy machine learning models
- Metadata indexing via Amazon ES, displayed in Kibana dashboards
- Multiple visualization tools for enhanced visualizations
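To make the streaming ingestion feature concrete, here is a minimal boto3 sketch of pushing one record into a Kinesis data stream. The stream name and the payload are hypothetical placeholders, not part of this guide's setup.

```python
import json

import boto3

kinesis = boto3.client("kinesis")

# Hypothetical example payload for a zipcode record.
record = {"zipcode": "10001", "city": "New York"}

response = kinesis.put_record(
    StreamName="zipcode-stream",        # assumed stream name
    Data=json.dumps(record).encode(),   # Kinesis expects bytes
    PartitionKey=record["zipcode"],     # controls shard placement
)
print(response["SequenceNumber"])
```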
How to implement Data Lake in AWS?
Before getting into the data lake implementation, you need to be familiar with the features and services offered by AWS. If you are not, I highly recommend going through the basics at Getting Started with AWS.
What do you need to get started?
1. An AWS account
2. An IAM user with the AWS Lake Formation Data Admin policy
3. An Amazon S3 bucket
4. A folder named “zipcode” inside the new S3 bucket
Download the dataset and upload it to the S3 bucket.
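If you prefer to script these prerequisites, here is a minimal boto3 sketch. It assumes the bucket name s3://datalake-yourname-region used later in this guide, and a local file named zipcode.csv as a placeholder for whatever dataset you downloaded.

```python
import boto3

s3 = boto3.client("s3")
bucket = "datalake-yourname-region"  # replace with your own, globally unique name

# Create the bucket (add a CreateBucketConfiguration for regions other than us-east-1).
s3.create_bucket(Bucket=bucket)

# Create the "zipcode" folder by writing a zero-byte object with a trailing slash.
s3.put_object(Bucket=bucket, Key="zipcode/")

# Upload the downloaded dataset; "zipcode.csv" is a placeholder file name.
s3.upload_file("zipcode.csv", bucket, "zipcode/zipcode.csv")
```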
How to Create a Data Lake?
#1 Designate a data lake administrator
To grant access to any Lake Formation resource, you must first designate yourself as the data lake administrator.
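In the Lake Formation console this is done under Administrative roles and tasks; if you prefer to script it, a minimal boto3 sketch might look like the following. The account ID and user name are placeholders. Note that put_data_lake_settings replaces the entire settings object, so include every administrator you want to keep.

```python
import boto3

lakeformation = boto3.client("lakeformation")

# Placeholder ARN: substitute your own account ID and IAM user name.
admin_arn = "arn:aws:iam::123456789012:user/datalake-admin"

# put_data_lake_settings overwrites the current settings,
# so list every administrator you want to keep.
lakeformation.put_data_lake_settings(
    DataLakeSettings={
        "DataLakeAdmins": [{"DataLakePrincipalIdentifier": admin_arn}]
    }
)
```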
#2 Register an Amazon S3 path
To store your data in the data lake, register its Amazon S3 path with Lake Formation.
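Scripted, the registration is a single call; a sketch, using the s3://datalake-yourname-region bucket referenced later in this guide:

```python
import boto3

lakeformation = boto3.client("lakeformation")

# Register the S3 path with Lake Formation, letting the
# AWSServiceRoleForLakeFormationDataAccess service-linked role access it.
lakeformation.register_resource(
    ResourceArn="arn:aws:s3:::datalake-yourname-region",
    UseServiceLinkedRole=True,
)
```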
#3 Database Creation
Next, create a database in the AWS Glue Data Catalog to hold the zipcode table definitions (a scripted version follows the list below).
- Type zipcode-db into the Database field.
- Enter your S3 bucket’s zipcode folder under Location.
- Selecting Grant All to Everyone for New tables in this database is not recommended.
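The same database can be created with boto3. A minimal sketch, assuming the zipcode folder path from the prerequisites:

```python
import boto3

glue = boto3.client("glue")

# Create the zipcode-db database in the AWS Glue Data Catalog.
glue.create_database(
    DatabaseInput={
        "Name": "zipcode-db",
        "LocationUri": "s3://datalake-yourname-region/zipcode",  # assumed path
    }
)
```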
#4 Give Permissions
Next, give AWS Glue authorization to access the zipcode-db database. Choose your user and AWSGlueServiceRoleDefault for the IAM role.
Then grant your user and AWSServiceRoleForLakeFormationDataAccess access to your data lake using a data location permission:
Enter s3://datalake-yourname-region for the storage location.
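Both grants can also be issued through the Lake Formation API. A sketch, with placeholder principal ARNs (the console steps above additionally cover the AWSServiceRoleForLakeFormationDataAccess service-linked role):

```python
import boto3

lakeformation = boto3.client("lakeformation")

user_arn = "arn:aws:iam::123456789012:user/datalake-admin"            # placeholder
role_arn = "arn:aws:iam::123456789012:role/AWSGlueServiceRoleDefault"  # placeholder

# Let your user and the Glue role create and manage tables in zipcode-db.
for principal in (user_arn, role_arn):
    lakeformation.grant_permissions(
        Principal={"DataLakePrincipalIdentifier": principal},
        Resource={"Database": {"Name": "zipcode-db"}},
        Permissions=["CREATE_TABLE", "ALTER", "DROP"],
    )

# Grant data location access on the registered S3 path.
lakeformation.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": user_arn},
    Resource={"DataLocation": {"ResourceArn": "arn:aws:s3:::datalake-yourname-region"}},
    Permissions=["DATA_LOCATION_ACCESS"],
)
```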
#5 Crawl the data with AWS Glue to create the metadata and table
To identify the schema for your data, a crawler connects to a data store, works its way through a prioritized list of classifiers, and then produces metadata tables in your AWS Glue Data Catalog.
Use the following configuration settings to create a table with the AWS Glue crawler (a scripted version follows the list):
- Enter zipcodecrawler as the crawler name
- Select S3 as the data store
- Include the path s3://datalake-yourname-location/zipcode
- Choose No for adding another data store
- Choose AWSGlueServiceRoleDefault for the IAM role
- Click Run on demand and select zipcode-db
- Choose Run it now and wait until the crawler is done
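The equivalent crawler can be defined and started with boto3; a sketch under the same settings:

```python
import time

import boto3

glue = boto3.client("glue")

# Define the crawler with the settings listed above.
glue.create_crawler(
    Name="zipcodecrawler",
    Role="AWSGlueServiceRoleDefault",
    DatabaseName="zipcode-db",
    Targets={"S3Targets": [{"Path": "s3://datalake-yourname-location/zipcode"}]},
)

# Run on demand and poll until the crawl finishes.
glue.start_crawler(Name="zipcodecrawler")
while glue.get_crawler(Name="zipcodecrawler")["Crawler"]["State"] != "READY":
    time.sleep(10)
```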
#6 Grant permissions on the table data
To enable others to manage the data, configure the permissions on your AWS Glue Data Catalog. To provide and revoke access to database tables, use the Lake Formation console.
- Select Tables in the navigation pane
- Choose Grant
- Select your user and AWSGlueServiceRoleDefault for the IAM role
- Select All for the table permissions
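A boto3 sketch of the same grant (the principal ARNs are placeholders):

```python
import boto3

lakeformation = boto3.client("lakeformation")

# Grant all table permissions to your user and the Glue role.
for principal in (
    "arn:aws:iam::123456789012:user/datalake-admin",              # placeholder
    "arn:aws:iam::123456789012:role/AWSGlueServiceRoleDefault",   # placeholder
):
    lakeformation.grant_permissions(
        Principal={"DataLakePrincipalIdentifier": principal},
        Resource={"Table": {"DatabaseName": "zipcode-db", "Name": "zipcode"}},
        Permissions=["ALL"],
    )
```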
#7 Query the data with Athena
- Select Query Editor and zipcode-db in the Athena console
- Choose Tables and zipcode
- Click Preview
Athena generates the following query:
SELECT * FROM "zipcode-db"."zipcode" limit 10;
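The same preview query can be run programmatically. A sketch, assuming a results bucket of your own for Athena’s query output:

```python
import time

import boto3

athena = boto3.client("athena")

# Start the preview query; the output bucket is a placeholder.
execution = athena.start_query_execution(
    QueryString='SELECT * FROM "zipcode-db"."zipcode" limit 10;',
    QueryExecutionContext={"Database": "zipcode-db"},
    ResultConfiguration={"OutputLocation": "s3://your-athena-results-bucket/"},
)
query_id = execution["QueryExecutionId"]

# Poll until the query finishes, then fetch and print the rows.
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

for row in athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]:
    print([col.get("VarCharValue") for col in row["Data"]])
```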
#8 Verify the outcomes after adding a new user with restricted access.
- Choose zipcode-db for the database
- Choose zipcode for the table
- Include the columns
- Choose the name and participant-count columns
- Grant the permissions on the table
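Column-level restriction can be expressed with a TableWithColumns grant. A sketch; the user ARN is a placeholder, and the column names follow the step above but are assumptions about what your dataset actually contains:

```python
import boto3

lakeformation = boto3.client("lakeformation")

# Placeholder ARN for the new restricted user.
restricted_user = "arn:aws:iam::123456789012:user/datalake-analyst"

# Allow SELECT on only the name and participant-count columns (assumed names).
lakeformation.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": restricted_user},
    Resource={
        "TableWithColumns": {
            "DatabaseName": "zipcode-db",
            "Name": "zipcode",
            "ColumnNames": ["name", "participants"],  # assumed column names
        }
    },
    Permissions=["SELECT"],
)
```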
Wrapping up!
Hope this article helps you deploy a data lake in AWS. If you still need help deploying a data lake in your organization, we at Continuuminnovations are ready to assist you at any time. We are a US-based managed cloud service provider offering end-to-end cloud solutions for various industries.
Services We Offer
- AWS Managed Service
- Azure Managed Service
- Data Migration
- Cloud Migration
- Data Analytics