Serverless SQL Queries

Lynn Langit

What
Is
Data?

Fast, Cheap SQL from Thin Air

Got $ 50 ?

Cost?  ~ $ 25 USD

Big Query

Serverless SQL since 2011

  • LOAD CSV, JSON, Avro
  • QUERY via ANSI SQL 
  • Includes public datasets
  • Pay by Query Execution
  • Cost control - alerts/quotas

Serverless SQL

Details

  • Partition tables by date
  • Stream 100k rows/sec
  • Supports DML
  • Includes JSON functions
  • Supports UDFs
  • Includes JDBC/ODBC
  • Granular Security
  • Cost control - alerts/quotas
  • Integration w/GCP

Serverless SQL

Werner says...

AWS Athena

Use it...

3 Steps

  • Store in S3
  • Define Table Schema
  • Run ANSI SQL query

Use it better...

Save 30-90% $/perf 

  • Compress
  • Partition by any key
  • Convert to columnar

Details

  • Uses Presto - serverless and auto-scaling
  • Supports joins, windows and arrays
  • Input via CSV, JSON, ORC, Avro, Parquet, CloudTrail
  • Includes JDBC driver
  • Supported by QuickSight
  • Secured via S3 IAM permissions and encryption
  • No UDFs or stored_procs
  • Compression - gzip, LZO, snappy or zlib
  • Query results streamed to console and also stored in S3

Devs        NoOps

Code

CREATE EXTERNAL TABLE ....

SELECT...FROM
 WHERE...GROUP BY...HAVING...ORDER BY

ALTER TABLE ...ADD PARTITION

Libraries

var JDBC = require('jdbc');
var jinst = require('jdbc/lib/jinst');
 
if (!jinst.isJvmCreated()) {
  jinst.addOption("-Xrs");
  jinst.setupClasspath(['./AthenaJDBC41-*.jar']);
}
 
var config = { 
  url: 'jdbc:awsathena://athena.*.amazonaws.com:443', 
  drivername: 'com.amazonaws.athena.jdbc.AthenaDriver',
  minpoolsize: 10,
  maxpoolsize: 100,
  properties: {
                s3_staging_dir: 's3://aws-athena-query-results-*/',
                log_path: '/logs/athenajdbc.log',
                user: 'access_key',
                password: 'secret_key'
   }
};
 
var hsqldb = new JDBC(config);
 
hsqldb.initialize(function(err) {
  if (err) {
    console.log(err);
  }
});

More Libraries

import java.sql.*;
import java.util.Properties;
import com.amazonaws.athena.jdbc.AthenaDriver;
import com.amazonaws.auth.PropertiesFileCredentialsProvider;


public class AthenaJDBCDemo {

  static final String athenaUrl = "jdbc:awsathena://athena.us-east-1.amazonaws.com:443";

  public static void main(String[] args) {

      Connection conn = null;
      Statement statement = null;

      try {
          Class.forName("com.amazonaws.athena.jdbc.AthenaDriver");
          Properties info = new Properties();
          info.put("s3_staging_dir", "s3://my-athena-result-bucket/test/");
          info.put("log_path", "/Users/myUser/.athena/athenajdbc.log");
          info.put("aws_credentials_provider_class","com.amazonaws.auth.PropertiesFileCredentialsProvider");
          info.put("aws_credentials_provider_arguments","/Users/myUser/.athenaCredentials");
          String databaseName = "default";

          System.out.println("Connecting to Athena...");
          conn = DriverManager.getConnection(athenaUrl, info);

          System.out.println("Listing tables...");
          String sql = "show tables in "+ databaseName;
          statement = conn.createStatement();
          ResultSet rs = statement.executeQuery(sql);

          while (rs.next()) {
              //Retrieve table column.
              String name = rs.getString("tab_name");

              //Display values.
              System.out.println("Name: " + name);
          }
          rs.close();
          conn.close();
      } catch (Exception ex) {
          ex.printStackTrace();
      } finally {
          try {
              if (statement != null)
                  statement.close();
          } catch (Exception ex) {

          }
          try {
              if (conn != null)
                  conn.close();
          } catch (Exception ex) {

              ex.printStackTrace();
          }
      }
      System.out.printf("Finished connectivity test.");
  }
}

Cost

Item AWS Athena GCP Big Query
Query $ 5 / TB scanned $ 5 / TB scanned
Store Data $ 20 / TB** in S3 $ 20 / TB** in BQ

*   Loading and Exporting data is FREE

** Price per TB per month

Cost Example - Athena

Speed

Item AWS Athena GCP Big Query
Query
Input data
Output data

TBD

Use Cases

  • Explore / Experiment
  • Data Warehouse replacement
  • Ad hoc reporting

Use the Right Service...

Spectrum

Athena

Best for 'ad-hoc' SQL queries of data in S3

Redshift Spectrum

Best for 'frequently-accessed' and 'highly-structured' data stored in S3

More about Redshift Spectrum

  • Datatypes - parquet*, textfile, sequencefile, rcfile
  • Compression - gzip, snappy, bz2
  • Encryption - server-side only
  • TIP - break files into equal sizes (100 MB - 1 GB)
  • CREATE EXTERNAL SCHEMA - Athena, Hive or external

Serverless SQL?

Use Case AWS  GCP 
Ad hoc Athena* BigQuery*
Data Warehouse Redshift BigQuery*
Batch Transform EMR / Spark Dataproc / Spark
Streaming Manual Process BigQuery*

*Serverless

Wait...what about
serverless Spark SQL?

Databricks Catalyst

Well actually...

Well actually...

Not quite yet 

Serverless SQL

@LynnLangit

Made with Slides.com