AWS Data Exchange

Hands-On

Demo

In this demo, we will:

  1. Explore the AWS Data Exchange catalog
  2. See how to Subscribe to a dataset
  3. Check out the Covid Data
  4. Set up S3 bucket for data storage
  5. Download and save data from AWS Data Exchange to S3
  6. Access and analyze the data
  7. Clean up resources

Agenda

AWS Data Exchange

COVID-19 Data Lake

Resources on AWS

Usage examples

S3 Bucket 

Download Under csv/ folder

Enigma-JHU.csv.gz

head -n 1000 Enigma-JHU.csv > Enigma-JHU-1000.csv
fips,admin2,province_state,country_region,last_update,latitude,longitude,confirmed,deaths,recovered,active,combined_key
,,Anhui,China,2020-01-22T17:00:00,31.826,117.226,1,,,,"Anhui, China"
,,Beijing,China,2020-01-22T17:00:00,40.182,116.414,14,,,,"Beijing, China"
,,Chongqing,China,2020-01-22T17:00:00,30.057,107.874,6,,,,"Chongqing, China"
,,Fujian,China,2020-01-22T17:00:00,26.079,117.987,1,,,,"Fujian, China"
,,Gansu,China,2020-01-22T17:00:00,36.061,103.834,,,,,"Gansu, China"
code Enigma-JHU-1000.csv

Create S3 bucket

covid19-data-lake-demo-917212

Upload CSV File 

CREATE EXTERNAL TABLE covid_data (
    fips STRING,
    admin2 STRING,
    province_state STRING,
    country_region STRING,
    last_update STRING,
    latitude STRING,
    longitude STRING,
    confirmed STRING,
    deaths STRING,
    recovered STRING,
    active STRING,
    combined_key STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
   "separatorChar" = ",",
   "quoteChar"     = "\"",
   "escapeChar"    = "\\"
)
STORED AS TEXTFILE
LOCATION 's3://covid19-data-lake-demo-917212/'
TBLPROPERTIES ('skip.header.line.count'='1');

Create Table in Athena

SELECT * FROM covid_data LIMIT 10;

Verify Table Data

Results

SELECT 
    country_region, 
    SUM(CASE 
        WHEN confirmed = '' THEN 0 
        ELSE CAST(confirmed AS BIGINT) 
    END) as total_cases, 
    SUM(CASE 
        WHEN deaths = '' THEN 0 
        ELSE CAST(deaths AS BIGINT) 
    END) as total_deaths
FROM covid_data
GROUP BY country_region
ORDER BY total_cases DESC
LIMIT 10;
  1. This query summarizes COVID-19 data by country, showing the top 10 countries with the highest number of confirmed cases.
  2. It calculates total cases and deaths for each country, handling empty values by treating them as zeros.
  3. The query groups the data by country, sums up the case and death counts, and then sorts the results to show the most affected countries first.

Clean Up

DROP TABLE IF EXISTS covid_data;

Delete File from S3 Bucket

permanently delete

Delete S3 Bucket

🙏

Thanks

for

Watching

AWS Data Exchange - Hands-On Demo

By Deepak Dubey

AWS Data Exchange - Hands-On Demo

AWS Data Exchange - Hands-On Demo

  • 114