id, name, age
1, Alice, 23
2, Bob, 34
3, Carol, 29
Create Sample Data: Save the rows above as sample_data.csv.
Create an S3 bucket named glue-demo-SOME-RANDOM-NUMBER.
Upload sample_data.csv to the bucket.
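If you prefer to script this step, the following is a minimal sketch rather than part of the original walkthrough. The function names `write_sample_csv` and `upload_sample` are hypothetical helpers. The upload assumes boto3 is installed, AWS credentials are configured, and the bucket already exists. Note that `csv.writer` emits no space after the commas.

```python
import csv
from pathlib import Path

# Rows from the sample data above (header first).
ROWS = [
    ["id", "name", "age"],
    [1, "Alice", 23],
    [2, "Bob", 34],
    [3, "Carol", 29],
]


def write_sample_csv(path="sample_data.csv"):
    """Write the sample rows to a local CSV file and return its path."""
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(ROWS)
    return Path(path)


def upload_sample(s3_client, bucket, path="sample_data.csv"):
    """Upload the CSV to the bucket root using a boto3 S3 client."""
    s3_client.upload_file(str(path), bucket, Path(path).name)


# Usage (assumption: replace the bucket name with your own):
#   import boto3
#   upload_sample(boto3.client("s3"), "glue-demo-SOME-RANDOM-NUMBER", write_sample_csv())
```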
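1. Navigate to AWS Glue:
2. Create an IAM Role for Glue: Attach the AWSGlueServiceRole and AmazonS3FullAccess policies, name the role GlueDemoRole, and create it.
3. Create a Glue Crawler: Name it SampleDataCrawler, choose the IAM role created in step 2 (GlueDemoRole), and set the target database to sample_db.
4. Run the Crawler: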
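The crawler steps can also be done through the Glue API. The sketch below assumes a boto3 Glue client, an existing GlueDemoRole, and that the crawler should scan the root of your bucket (the exact S3 path is an assumption, since it is not stated above). `create_and_run_crawler` is a hypothetical helper name.

```python
import time


def create_and_run_crawler(glue, bucket, name="SampleDataCrawler",
                           role="GlueDemoRole", database="sample_db",
                           poll_seconds=10):
    """Create the crawler, start it, and wait until it is READY again."""
    glue.create_crawler(
        Name=name,
        Role=role,
        DatabaseName=database,
        # Assumption: the crawler targets the bucket root.
        Targets={"S3Targets": [{"Path": f"s3://{bucket}/"}]},
    )
    glue.start_crawler(Name=name)
    # The crawler reports RUNNING, then STOPPING, then READY once it is done.
    while glue.get_crawler(Name=name)["Crawler"]["State"] != "READY":
        time.sleep(poll_seconds)


# Usage (assumption: credentials and region are configured):
#   import boto3
#   create_and_run_crawler(boto3.client("glue"), "glue-demo-SOME-RANDOM-NUMBER")
```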
Create the Glue job: Name it SampleTransformationJob, select the IAM role GlueDemoRole, and use the following script.
import sys
from awsglue.transforms import *
from awsglue.dynamicframe import DynamicFrame
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.sql.functions import when
## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
## Data source
datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database="sample_db",
    table_name="sample_data_csv",
    transformation_ctx="datasource0",
)
## Transformation
# Convert to DataFrame for more complex operations
df = datasource0.toDF()
# Add a new column based on the age
df = df.withColumn("category", when(df.age < 30, 'Young').otherwise('Adult'))
# Filter out records with age < 18
df = df.filter(df.age >= 18)
# Convert back to DynamicFrame
transformed_dyF = DynamicFrame.fromDF(df, glueContext, "transformed_dyF")
## Data sink
# Replace glue-demo-6100 with the name of your own bucket
datasink4 = glueContext.write_dynamic_frame.from_options(
    frame=transformed_dyF,
    connection_type="s3",
    connection_options={"path": "s3://glue-demo-6100/processed_data/"},
    format="parquet",
    transformation_ctx="datasink4",
)
job.commit()
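To sanity-check what the job should produce without running Spark, the two rules in the script (add a category column, drop rows with age below 18) can be mirrored in plain Python. This only illustrates the logic on the three sample rows. It is not how Glue executes the job, and `transform` is a hypothetical helper.

```python
rows = [
    {"id": 1, "name": "Alice", "age": 23},
    {"id": 2, "name": "Bob", "age": 34},
    {"id": 3, "name": "Carol", "age": 29},
]


def transform(records):
    """Mirror the job: add 'category', keep only rows with age >= 18."""
    out = []
    for r in records:
        if r["age"] < 18:
            continue  # same effect as df.filter(df.age >= 18)
        category = "Young" if r["age"] < 30 else "Adult"
        out.append({**r, "category": category})
    return out


for row in transform(rows):
    print(row)
# {'id': 1, 'name': 'Alice', 'age': 23, 'category': 'Young'}
# {'id': 2, 'name': 'Bob', 'age': 34, 'category': 'Adult'}
# {'id': 3, 'name': 'Carol', 'age': 29, 'category': 'Young'}
```

All three sample rows pass the age filter, so the Parquet output should contain three records.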
Check the S3 Bucket: Open the processed_data folder in your bucket.
Download and Inspect the Data:
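For the download step, one possible approach is sketched below. It assumes a boto3 S3 client and the processed_data/ prefix used by the job. `download_processed` is a hypothetical helper, and a single `list_objects_v2` call (up to 1,000 keys) is assumed to be enough for this small demo.

```python
from pathlib import Path


def download_processed(s3, bucket, prefix="processed_data/", dest="processed_data"):
    """Download every object under the prefix and return the local paths."""
    Path(dest).mkdir(parents=True, exist_ok=True)
    local_paths = []
    response = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
    for obj in response.get("Contents", []):
        key = obj["Key"]
        if key.endswith("/"):  # skip folder placeholder objects
            continue
        target = Path(dest) / Path(key).name
        s3.download_file(bucket, key, str(target))
        local_paths.append(target)
    return local_paths


# Usage (assumption: pandas and pyarrow are installed for reading Parquet):
#   import boto3, pandas as pd
#   download_processed(boto3.client("s3"), "glue-demo-SOME-RANDOM-NUMBER")
#   print(pd.read_parquet("processed_data"))
```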
Note: The bucket must be empty before it can be deleted.
Find your bucket in the list of S3 buckets.
Empty the bucket:
Delete the bucket:
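The same empty-then-delete order can be scripted. This sketch assumes a boto3 S3 Bucket resource and an unversioned bucket. A versioned bucket would also need its object versions removed, which is not handled here. `empty_and_delete_bucket` is a hypothetical helper.

```python
def empty_and_delete_bucket(bucket):
    """Delete all objects in the bucket, then the bucket itself.

    `bucket` is a boto3 S3 Bucket resource, for example
    boto3.resource("s3").Bucket("glue-demo-SOME-RANDOM-NUMBER").
    """
    bucket.objects.all().delete()  # the bucket must be empty before deletion
    bucket.delete()


# Usage (assumption: credentials are configured and the name is your bucket):
#   import boto3
#   empty_and_delete_bucket(boto3.resource("s3").Bucket("glue-demo-SOME-RANDOM-NUMBER"))
```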
Open the AWS Glue Console.
In the navigation pane, under the ETL section, choose “Jobs”.
Select the job you want to delete.
Choose “Action”, then “Delete job”.
Confirm the deletion when prompted.
Open the IAM Console.
In the navigation pane, click on “Roles”.
Find and select the role you wish to delete.
Click on the “Delete role” button.
Confirm the deletion.
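When deleting the role through the IAM API instead of the console, attached managed policies have to be detached first. A minimal sketch, assuming a boto3 IAM client and that the role only has managed policies attached (inline policies and instance profiles are not handled). `delete_role_with_policies` is a hypothetical helper.

```python
def delete_role_with_policies(iam, role_name="GlueDemoRole"):
    """Detach all managed policies from the role, then delete the role."""
    attached = iam.list_attached_role_policies(RoleName=role_name)
    for policy in attached["AttachedPolicies"]:
        iam.detach_role_policy(RoleName=role_name, PolicyArn=policy["PolicyArn"])
    iam.delete_role(RoleName=role_name)


# Usage (assumption: credentials with IAM permissions are configured):
#   import boto3
#   delete_role_with_policies(boto3.client("iam"))
```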
Open the AWS Glue Console.
In the navigation pane, under the ETL section, choose “Crawlers”.
Select the crawler you want to delete.
Choose “Action”, then “Delete crawler”.
Confirm the deletion.
Note: Ensure that the database is not being used by any crawlers, jobs, or other resources.
Open the AWS Glue Console.
In the navigation pane, under the Data Catalog section, choose “Databases”.
Select the database you want to delete.
Choose “Action”, then “Delete database”.
Confirm the deletion.
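The three Glue console deletions above (job, crawler, database) can also be scripted. This is a sketch assuming a boto3 Glue client and the resource names used in this walkthrough. `delete_glue_resources` is a hypothetical helper. The job and crawler are removed before the database so that nothing still references it.

```python
def delete_glue_resources(glue, job="SampleTransformationJob",
                          crawler="SampleDataCrawler", database="sample_db"):
    """Delete the demo job, crawler, and database, in that order."""
    glue.delete_job(JobName=job)
    glue.delete_crawler(Name=crawler)
    glue.delete_database(Name=database)


# Usage (assumption: credentials and region are configured):
#   import boto3
#   delete_glue_resources(boto3.client("glue"))
```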