DATAFRAMES

WHY Data Scientists don't use SQL

Online transaction processing (OLTP):

transaction safe
CRUD (Create, Read, Update, Delete) rows of data
data stored row-wise
SQL
LINQ (C#)
Examples: PostgreSQL, MySQL, SQLite, ...

OLTP

see: Wikipedia:LINQ

Online analytical processing (OLAP):

complex queries
reading, aggregation, filtering of historical data
data stored column-wise
MDX (MultiDimensional eXpressions)
ELT pipeline to ingest data
Examples: Clickhouse, Apache Druid, IBM Cognos, ...

OLAP

Stitchdata.com: OLTP and OLAP: a practical comparison

Wikipedia:Comparison_of_OLAP_servers

in-memory OLAP tables
complex queries
reading, aggregation, filtering of historical data
data stored column-wise
custom query API
Examples: pandas (Python), data.tables (R), pola-rs (Rust)

DATAFRAMES

see: https://www.pola.rs

Python dataframes

In-memory, single-threaded: pandas
In-memory, parallel on clusters: dask, ray
Out-of-memory: vaex, modin
Out-of-memory, parallel on clusters: pyspark
In-memory, GPU-accelerated: RAPIDS cuDF
In-memory, multi-threaded: pola-rs

pandas-vs-dask-vs-vaex-vs-modin-vs-rapids-vs-ray

flying-high-with-vaex

ist-es-zeit-matplotlib-und-pandas-zu-ersetzen-a-962275

R dataframes

In-memory, multithreaded: data.table
In-memory, out-of-memory, on clusters: dplyr

https://atrebas.github.io/post/2019-03-03-datatable-dplyr/

https://atrebas.github.io/post/2020-06-14-datatable-pandas/

https://stackoverflow.com/questions/21435339/data-table-vs-dplyr-can-one-do-something-well-the-other-cant-or-does-poorly

Other dataframes

Benchmark of dataframes

https://h2oai.github.io/db-benchmark/

SQL PROS

in-memory SQL DBs are as fast as dataframes
no need to learn a new query language/API
SQL statements are shorter
SQL query optimiser (multithreaded)
easy to scale data processing steps to persistent DBs
SQL is a battle-proof industry standard

SQL CONS

distributed SQL on clusters very difficult
SQL standard has many dialects
A single huge and confusing SQL statement is needed for best performance

SQL is illogical (fixed with LINQ)

```
SELECT column FROM table
```
```
FROM table SELECT column
```

SQL strings are not typesafe (fixed with LINQ)
No GPU support

DATAFrames PROS

in-memory dataframes need no database connection
object oriented and type-safe query APIs
API better suited for data exploration
Easy to scale out to clusters
GPU support

DATAFrames CONS

every dataframe has a different query API
difficult to scale data processing steps to persistent DBs (but see DuckDB)
No query optimizer (except pyspark)
Out-of-memory dataframe performance often slow
No transaction safety
No concurrent multi-user access

SQL

query = """
    select
        a.name,
        p.year,
        p.title,
        p.abstract,
        p.paper_text
    from authors a
    inner join paper_authors pa on pa.author_id = a.id
    left join papers p on p.id = pa.paper_id
        and p.event_type not in ('Oral', 'Spotlight', 'Poster')
    order by name, year asc
"""
pd.read_sql(query, con=con)

Pandas

df_authors \
    .merge(
        df_papers_authors.drop(columns=["id"]),
        left_on="id",
        right_on="author_id",
        how="inner") \
    .drop(columns=["id"]) \
    .merge(
        df_papers \
            .query("event_type in ('Oral', 'Spotlight', 'Poster')") \
            [["id", "year", "title", "abstract", "paper_text"]],
        left_on="paper_id",
        right_on="id",
        how="left") \
    .drop(columns=["id", "paper_id", "author_id"]) \
    .sort_values(by=["name", "year"], ascending=True)

https://zerowithdot.com/sql-or-pandas/

OLAP + OLTR combined:

SAP HANA is an OLTAP system

it is a column-oriented, in-memory database, that combines OLAP and OLTP operations into a single system

But it better be open-source, easy-to-install and support more client languages
Other interesting developments:
- pola-rs
- trino
- DuckDB
- Clickhouse
see https://db-engines.com/en/ranking

FUTURE

THE END

dataframes

By andimai

dataframes

A first look

198

andimai