Amazon EMR - Hands On Demo
Create a Sample Data CSV File.
1. Launch AWS Cloud Shell in us-east-1 region.
2. Copy and paste the following code.
cat > data.csv << EOF
UserID,PurchaseAmount,Date
1,100,2023-01-01
1,120,2023-01-02
1,130,2023-01-03
2,150,2023-01-04
2,160,2023-01-05
2,170,2023-01-06
3,100,2023-01-07
3,110,2023-01-08
3,90,2023-01-09
4,200,2023-01-10
4,210,2023-01-11
4,220,2023-01-12
5,150,2023-01-13
5,160,2023-01-14
5,170,2023-01-15
6,100,2023-01-16
6,130,2023-01-17
6,140,2023-01-18
7,110,2023-01-19
7,120,2023-01-20
7,130,2023-01-21
8,200,2023-01-22
8,210,2023-01-23
8,220,2023-01-24
9,150,2023-01-25
9,160,2023-01-26
9,170,2023-01-27
10,100,2023-01-28
10,110,2023-01-29
10,120,2023-01-30
1,140,2023-01-31
1,150,2023-02-01
2,180,2023-02-02
2,190,2023-02-03
3,120,2023-02-04
3,130,2023-02-05
4,230,2023-02-06
4,240,2023-02-07
5,180,2023-02-08
5,190,2023-02-09
EOF
Create EC2 Key Pair to logon to EC2 Machine Later On.
Copy and paste the following code.
aws ec2 create-key-pair \
--key-name emr-key-pair \
--key-type rsa \
--key-format pem \
--query "KeyMaterial" \
--output text > emr-key-pair.pem
Create S3 Bucket to store Sample Data.
Copy and paste the following code.
AWS_PAGER=""
RANDOMNUM=$RANDOM
BUCKET_NAME="emrdemo-$RANDOMNUM"
aws s3api create-bucket \
--bucket $BUCKET_NAME \
--region us-east-1
Copy Sample Data to S3 Bucket.
Copy and paste the following code.
# Copy data.csv to the S3 bucket
aws s3 cp data.csv s3://$BUCKET_NAME/data.csv
Create a dedicated VPC to host your EMR Cluster shown as below - part 1
Create a dedicated VPC to host your EMR Cluster shown as below - part 2
Create a dedicated VPC to host your EMR Cluster shown as below - part 3
Create an Amazon EMR Cluster as follows - Name and applications
Create an Amazon EMR Cluster as follows - Cluster Configuration
Create an Amazon EMR Cluster as follows - Networking
Create an Amazon EMR Cluster as follows - Cluster Logs
Create an Amazon EMR Cluster as follows - Security configuration and EC2 key pair
Create an Amazon EMR Cluster as follows - Amazon EMR service role
Create an Amazon EMR Cluster as follows - EC2 instance profile for Amazon EMR
Wait for some time till Cluster Status changes to Waiting
Open an Inbound Rule to SSH from Cloud Shell to Master Node
Edit Inbound Rules and Save
SSH from Cloud Shell Instructions are available to EMR Cluster Summary Page
SSH from Cloud Shell
chmod 600 ~/emr-key-pair.pem
ssh -i ~/emr-key-pair.pem hadoop@ec2-54-145-16-237.compute-1.amazonaws.com
Create Apache Spark PySpark program to execute some aggregation.
Copy and paste the following code.
cat > test.py << EOF
import argparse
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit
from pyspark.sql.types import DateType
def aggregate_data(data_source, output_uri):
with SparkSession.builder.appName("SampleAggregation").getOrCreate() as spark:
# Load the CSV data
df = spark.read.csv(data_source, header=True, inferSchema=True)
# Processing logic
df = df.withColumn("Date", col("Date").cast(DateType()))
filtered_df = df.filter(col("Date") >= lit("2023-01-10"))
aggregated_df = filtered_df.groupBy("UserID").sum("PurchaseAmount")
# Write the results
aggregated_df.write.mode("overwrite").csv(output_uri, header=True)
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument('--data_source', help="The URI for your CSV data, like an S3 bucket location.")
parser.add_argument('--output_uri', help="The URI where output is saved, like an S3 bucket location.")
args = parser.parse_args()
aggregate_data(args.data_source, args.output_uri)
EOF
Execute the Below Command to Submit the Apache Spark Job
spark-submit test.py \
--data_source s3://emrdemo-28248/data.csv \
--output_uri s3://emrdemo-28248/output
Verify S3 Output folder
Verify Output Result
Clean Up
1. Terminate EMR Cluster
2. Delete S3 Buckets
3. Delete IAM Roles
4. Delete VPC
Thanks
For
Watching
Amazon EMR - Hands On Demo
By Deepak Dubey
Amazon EMR - Hands On Demo
Amazon EMR - Hands On Demo
- 56