AWS Glue - Hands On Demo

Step 1: Creating Sample CSV Data

Create Sample Data:

Open a text editor and create a CSV file named sample_data.csv.
Add the following content:

id,name,age
1,Alice,23
2,Bob,34
3,Carol,29
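
As an alternative to a text editor, the same file could be generated with a short script. This is a minimal sketch using only the Python standard library; the function name write_sample_csv is hypothetical, and the rows are the tutorial's sample content.

```python
import csv
from pathlib import Path

# The tutorial's sample content; the first row is the header.
ROWS = [
    ["id", "name", "age"],
    [1, "Alice", 23],
    [2, "Bob", 34],
    [3, "Carol", 29],
]


def write_sample_csv(path="sample_data.csv"):
    """Write the sample rows to a CSV file and return the file's path."""
    target = Path(path)
    # newline="" lets the csv module control line endings itself.
    with target.open("w", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerows(ROWS)
    return target
```

Calling write_sample_csv() creates sample_data.csv in the current directory, ready for the upload in Step 2.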

Step 2: Store Sample Data in AWS S3

Log in to AWS Console
Navigate to S3:
- Go to Services → S3.
Create a New Bucket:
- Click “Create bucket”.
- Give it a globally unique name, e.g., glue-demo-SOME-RANDOM-NUMBER.
- Select the region.
- Click “Create”.
Upload Sample Data:
- Open the newly created bucket.
- Click “Upload” and upload sample_data.csv.
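
The console steps above could also be scripted. The following is a minimal sketch, assuming the boto3 SDK (which this demo does not otherwise use); the function name create_bucket_and_upload is hypothetical, and the s3 argument is expected to be a boto3 S3 client.

```python
from pathlib import Path


def create_bucket_and_upload(s3, bucket, region, local_path="sample_data.csv"):
    """Create the bucket in the given region, upload the file, return its S3 URI."""
    if region == "us-east-1":
        # us-east-1 is the default location and is not passed as a constraint.
        s3.create_bucket(Bucket=bucket)
    else:
        s3.create_bucket(
            Bucket=bucket,
            CreateBucketConfiguration={"LocationConstraint": region},
        )
    # Store the file at the bucket root under its own file name.
    key = Path(local_path).name
    s3.upload_file(local_path, bucket, key)
    return f"s3://{bucket}/{key}"
```

With boto3 installed and credentials configured, this might be called as create_bucket_and_upload(boto3.client("s3", region_name="eu-west-1"), "glue-demo-SOME-RANDOM-NUMBER", "eu-west-1"), where the region is only an example.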

Step 3: Set Up AWS Glue

1. Navigate to AWS Glue:

Go to Services → Glue.

2. Create an IAM Role for Glue:

Go to Services → IAM → Roles.
Click “Create role”.
Select AWS service → Glue.
Attach policies like AWSGlueServiceRole and AmazonS3FullAccess.
Name the role, e.g., GlueDemoRole, and create it.
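
Behind the console wizard, the role consists of a trust policy that lets the Glue service assume it, plus the two attached managed policies. A minimal sketch of the same setup, assuming the boto3 SDK; the function name create_glue_role is hypothetical, and the iam argument is expected to be a boto3 IAM client.

```python
import json

# Trust policy: allows the AWS Glue service to assume the role.
GLUE_TRUST_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "glue.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

# The two AWS managed policies named in the step above.
MANAGED_POLICY_ARNS = [
    "arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole",
    "arn:aws:iam::aws:policy/AmazonS3FullAccess",
]


def create_glue_role(iam, role_name="GlueDemoRole"):
    """Create the role with the Glue trust policy and attach the managed policies."""
    iam.create_role(
        RoleName=role_name,
        AssumeRolePolicyDocument=json.dumps(GLUE_TRUST_POLICY),
    )
    for arn in MANAGED_POLICY_ARNS:
        iam.attach_role_policy(RoleName=role_name, PolicyArn=arn)
    return role_name
```

AmazonS3FullAccess keeps the demo simple; a policy scoped to the demo bucket would be the narrower choice outside a throwaway account.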

3. Create a Glue Crawler:

In the Glue console, go to Crawlers → Add crawler.
Name the crawler, e.g., SampleDataCrawler.
Choose the previously created IAM role (GlueDemoRole).
For the data source, choose S3 and enter the path to the location in your bucket that holds sample_data.csv.
Choose “Run on demand”.
Configure the output to a database in Glue Data Catalog, e.g., sample_db.
Review and create the crawler.
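
The crawler settings above map fairly directly onto the Glue API. A minimal sketch, assuming the boto3 SDK; the function name create_sample_crawler is hypothetical, the glue argument is expected to be a boto3 Glue client, and the explicit create_database call is an assumption made here so that the target database exists before the crawler is defined.

```python
def create_sample_crawler(
    glue,
    s3_path,
    role="GlueDemoRole",
    database="sample_db",
    name="SampleDataCrawler",
):
    """Create the catalog database and an on-demand crawler over s3_path."""
    # Assumption: the database does not exist yet in this account and region.
    glue.create_database(DatabaseInput={"Name": database})
    # No Schedule argument is passed, so the crawler only runs on demand.
    glue.create_crawler(
        Name=name,
        Role=role,
        DatabaseName=database,
        Targets={"S3Targets": [{"Path": s3_path}]},
    )
    return name
```

Here s3_path would be something like "s3://glue-demo-SOME-RANDOM-NUMBER/", matching the bucket created in Step 2.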

4. Run the Crawler:

Select the crawler and click “Run crawler”.
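When the run completes, the crawler creates a table in the Glue Data Catalog. The script in Step 4 expects this table to be named sample_data_csv, so check the name under Tables in sample_db.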
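
Running the crawler and checking the resulting table name could be scripted as follows. This is a minimal sketch, assuming the boto3 SDK and assuming the crawler reports a non-READY state as soon as it has been started; the function name run_crawler_and_wait is hypothetical, and the sleep parameter exists only so the wait can be replaced in tests.

```python
import time


def run_crawler_and_wait(
    glue,
    name="SampleDataCrawler",
    database="sample_db",
    poll_seconds=15,
    sleep=time.sleep,
):
    """Start the crawler, wait until it is READY again, return the table names."""
    glue.start_crawler(Name=name)
    # The crawler moves through RUNNING and STOPPING before returning to READY.
    while glue.get_crawler(Name=name)["Crawler"]["State"] != "READY":
        sleep(poll_seconds)
    # Only the first page of tables is read, which is enough for this demo.
    tables = glue.get_tables(DatabaseName=database)["TableList"]
    return [table["Name"] for table in tables]
```

If the returned list does not contain sample_data_csv, the table_name in the ETL script needs to be changed to whatever name the crawler chose.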

Step 4: Data Transformation

Create a Glue ETL Job:
- In the Glue console, go to ETL jobs → Create job → Visual ETL.
- Name the job, e.g., SampleTransformationJob.
- Open the Script tab and paste the code below.
- Select the role GlueDemoRole.
- Review the job settings and click “Finish”.

import sys

from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql.functions import col, when

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ["JOB_NAME"])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

## Data source
# Read the table that the crawler created in the Glue Data Catalog.
datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database="sample_db",
    table_name="sample_data_csv",
    transformation_ctx="datasource0",
)

## Transformation
# Convert to a Spark DataFrame for column-level operations.
df = datasource0.toDF()

# Add a "category" column: "Young" for ages under 30, "Adult" otherwise.
df = df.withColumn("category", when(col("age") < 30, "Young").otherwise("Adult"))

# Filter out records with age < 18.
df = df.filter(col("age") >= 18)

# Convert back to a DynamicFrame so that Glue can write it.
transformed_dyF = DynamicFrame.fromDF(df, glueContext, "transformed_dyF")

## Data sink
# Write the result to S3 as Parquet.
# Replace glue-demo-6100 with the name of the bucket you created in Step 2.
datasink4 = glueContext.write_dynamic_frame.from_options(
    frame=transformed_dyF,
    connection_type="s3",
    connection_options={"path": "s3://glue-demo-6100/processed_data/"},
    format="parquet",
    transformation_ctx="datasink4",
)
job.commit()

Run the ETL Job:

Click “Run job” and wait for the job to complete. The job reads the crawled table from the Glue Data Catalog, adds a category column (Young for ages under 30, Adult otherwise), drops any records with age under 18, and writes the result as Parquet files to the processed_data/ path in S3.
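
Starting the job and waiting for it to finish can also be done outside the console. A minimal sketch, assuming the boto3 SDK; the function name run_job_and_wait is hypothetical, the glue argument is expected to be a boto3 Glue client, and the sleep parameter exists only so the wait can be replaced in tests.

```python
import time

# Job run states after which the run will not change any further.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT", "ERROR", "EXPIRED"}


def run_job_and_wait(
    glue,
    job_name="SampleTransformationJob",
    poll_seconds=30,
    sleep=time.sleep,
):
    """Start a job run and poll until it reaches a terminal state."""
    run_id = glue.start_job_run(JobName=job_name)["JobRunId"]
    while True:
        run = glue.get_job_run(JobName=job_name, RunId=run_id)["JobRun"]
        state = run["JobRunState"]
        if state in TERMINAL_STATES:
            return state
        sleep(poll_seconds)
```

A return value of "SUCCEEDED" means the output should be present in S3; any other value is a reason to check the run's logs in the Glue console.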

Step 5: Verify the Output

Check the S3 Bucket:
- Go back to the S3 console.
- Navigate to the processed_data folder in your bucket.
- You should see the transformed data files here.
Download and Inspect the Data:
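- Download the transformed data file to your local machine.
- The output is in Parquet format, which is binary, so open it with a Parquet-aware tool rather than a text editor, and check that it contains the category column and the expected rows.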
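
To know what "as expected" means for the three sample rows, the job's logic can be replayed in plain Python. This is only a sketch that mirrors the two transformations in the script (the category column and the age filter); the function name expected_rows is hypothetical and the code does not touch Glue or Spark.

```python
def expected_rows(rows):
    """Mirror the ETL job: add a category, then keep ages of 18 and over."""
    result = []
    for row in rows:
        if row["age"] < 18:
            # The job filters these records out.
            continue
        category = "Young" if row["age"] < 30 else "Adult"
        result.append({**row, "category": category})
    return result


SAMPLE = [
    {"id": 1, "name": "Alice", "age": 23},
    {"id": 2, "name": "Bob", "age": 34},
    {"id": 3, "name": "Carol", "age": 29},
]

for row in expected_rows(SAMPLE):
    print(row)
# {'id': 1, 'name': 'Alice', 'age': 23, 'category': 'Young'}
# {'id': 2, 'name': 'Bob', 'age': 34, 'category': 'Adult'}
# {'id': 3, 'name': 'Carol', 'age': 29, 'category': 'Young'}
```

The downloaded Parquet file can then be compared against these rows, for example by loading it with pandas.read_parquet, which needs a Parquet engine such as pyarrow installed.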

Delete S3 Bucket

Note: The bucket must be empty before it can be deleted.

Find your bucket in the list of S3 buckets.
Empty the bucket:
- Click on the bucket name.
- Select all the files and folders inside.
- Click on the “Delete” button.
- Confirm the deletion.
Delete the bucket:
- Go back to the list of buckets.
- Select the bucket you want to delete.
- Click on the “Delete” button.
- You will be prompted to confirm the deletion by entering the bucket name. Do so, then click “Confirm”.
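
The empty-then-delete order described above can be sketched in code as well. This assumes the boto3 SDK and a bucket without versioning enabled; the function name empty_and_delete_bucket is hypothetical, and the s3 argument is expected to be a boto3 S3 client.

```python
def empty_and_delete_bucket(s3, bucket):
    """Delete every object in the bucket, then the bucket; return the object count."""
    deleted = 0
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket):
        # An empty bucket yields a page without a "Contents" entry.
        keys = [{"Key": obj["Key"]} for obj in page.get("Contents", [])]
        if keys:
            s3.delete_objects(Bucket=bucket, Delete={"Objects": keys})
            deleted += len(keys)
    # The bucket can only be deleted once it is empty.
    s3.delete_bucket(Bucket=bucket)
    return deleted
```

For this demo the bucket should hold only sample_data.csv and the files under processed_data/, so the returned count is a quick check that nothing unexpected was stored there.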

Delete AWS Glue Job

Open the AWS Glue Console.
In the navigation pane, choose “ETL jobs”.
Select the job you want to delete.
Choose “Action”, then “Delete job”.
Confirm the deletion when prompted.

Delete IAM Role

Open the IAM Console.
In the navigation pane, click on “Roles”.
Find and select the role you wish to delete.
Click on the “Delete role” button.
Confirm the deletion.
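
When the role is deleted through the API rather than the console, its attached managed policies have to be detached first. A minimal sketch, assuming the boto3 SDK and assuming the role has no inline policies or instance profiles (the demo role has neither); the function name delete_role_with_policies is hypothetical.

```python
def delete_role_with_policies(iam, role_name="GlueDemoRole"):
    """Detach all managed policies from the role, delete it, return the detached ARNs."""
    response = iam.list_attached_role_policies(RoleName=role_name)
    attached = response["AttachedPolicies"]
    for policy in attached:
        iam.detach_role_policy(RoleName=role_name, PolicyArn=policy["PolicyArn"])
    # The role can be deleted once nothing is attached to it.
    iam.delete_role(RoleName=role_name)
    return [policy["PolicyArn"] for policy in attached]
```

For the role from Step 3, the returned list should contain the ARNs of AWSGlueServiceRole and AmazonS3FullAccess.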

Delete Glue Crawler

Open the AWS Glue Console.
In the navigation pane, under the Data Catalog section, choose “Crawlers”.
Select the crawler you want to delete.
Choose “Action”, then “Delete crawler”.
Confirm the deletion.

Delete Glue Data Catalog Database

Note: Ensure that the database is not being used by any crawlers, jobs, or other resources.

Open the AWS Glue Console.
In the navigation pane, under the Data Catalog section, choose “Databases”.
Select the database you want to delete.
Choose “Action”, then “Delete database”.
Confirm the deletion.
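
The three Glue deletions (job, crawler, database) could be combined into one helper. A minimal sketch, assuming the boto3 SDK; the function name delete_glue_resources is hypothetical, the default names are the example names used in this demo, and the glue argument is expected to be a boto3 Glue client.

```python
def delete_glue_resources(
    glue,
    job="SampleTransformationJob",
    crawler="SampleDataCrawler",
    database="sample_db",
):
    """Delete the demo's job and crawler first, then the catalog database."""
    glue.delete_job(JobName=job)
    glue.delete_crawler(Name=crawler)
    # Deleting the database also removes the tables it contains.
    glue.delete_database(Name=database)
    return [job, crawler, database]
```

The job and crawler are removed before the database so that nothing still refers to the database when it is deleted.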