Quick Apache Arrow


28 Feb 2019

I first heard about Arrow last year

Why would anyone need a pipeline between R and Python?

...or is it a broader need?

Drawing of multiple big data storage and analytics platforms connected to each other via multiple edges.
Drawing of multiple storage and analytics platforms connected to a single in-memory concept of data.



Images lifted from arrow.apache.org/

Drawing of multiple big data storage and analytics platforms connected to each other via multiple edges.


Image lifted from arrow.apache.org/


  • Every platform *still* has its unique representation of a dataset.
  • But communication via only one common in-memory format reduces complexity.
Drawing of multiple storage and analytics platforms connected to a single in-memory concept of data.

Image lifted from arrow.apache.org/


Where did the need come from?

It all started in 2004*

Drawing of Dave Matthews Band with poo

*Except for the Google File System paper (the inspiration for HDFS) from 2003 (Ghemawat, Gobioff, Leung)

In Chicago, people were throwing poo in the river...

...in California

Jeff Dean and Sanjay Ghemawat were finishing their OSDI paper: MapReduce, the inspiration for Hadoop.

image from giphy search for California

Google logo


Teradata Aster logo
Vertica Logo
Hadoop logo


Amazon DynamoDB logo



(also Wes McKinney starts learning Python) *source: DataCamp podcast

CHUG-Chicago Hadoop User Group logo


Apache Pig logo
Apache Hive Logo
IBM Netezza Logo




HP Logo
Teradata Aster logo

Big data meets Pandas


The backstory crosses paths with Wes McKinney in 2015, while he was at Cloudera. (The story is in the DataCamp interview above.)

Drawing of multiple big data storage and analytics platforms connected to each other via multiple edges.
Drawing of multiple storage and analytics platforms connected to a single in-memory concept of data.
  • Leaders from multiple infrastructure projects
  • All developing multiple format conversion tools
  • Why not just one in-memory standard?

Now we know "Why"

So, *what* is Arrow?


  • Feather (Flatbuffers-based serialization between pandas and R)

    • proof of concept

  • Parquet (on-disk columnar storage format)

  • Arrow (in-memory columnar format)

    • C++, plus R, Python, and even MATLAB (bindings to the C++ library)

    • Go, Rust, Ruby, Java, JavaScript (reimplemented)

  • Plasma (in-memory shared object store)

  • Gandiva (SQL engine for Arrow)

  • Flight (remote procedure calls based on gRPC)


Feather

(A proof of concept; still in the codebase)

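Python (write)

import pandas as pd
import pyarrow.feather as feather
import numpy as np

x = np.random.sample(1000)
y = 10 * x**2 + 2 * x + 0.05 * np.random.sample(1000)
df = pd.DataFrame(dict(x=x, y=y))
# Write the data frame to disk so that R can read it.
feather.write_feather(df, 'testing.feather')

R (read)

library(arrow)  # provides read_feather(); the standalone feather package does too
dataframe <- read_feather('testing.feather')
plot(y ~ x, data=dataframe)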


Parquet

On-disk data storage format

(Joined with Arrow December 2018)

import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame(
    {'one': [-1, np.nan], 'two': ['foo', 'bar'], 'three': [True, False]})
table = pa.Table.from_pandas(df)
pq.write_table(table, 'example.parquet')



import pyarrow.parquet as pq

table2 = pq.read_table('example.parquet')
df = table2.to_pandas()


  • Why not just use Parquet? (or ORC)
    • Those are designed for on-disk storage and have
      • compression
      • optional separate files for metadata
    • Arrow is in-memory and is
      • for speed of access
      • for cross-framework use (i.e. polyglot pickles)

Arrow

(cross-language in-memory columnar data format)

Column store

Like second-generation on-disk data stores (Parquet, ORC, etc.), in-memory columnar layouts allow faster computation over columns (mean, stdev, count of categories, etc.).

Image lifted from arrow.apache.org/
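To make the row-versus-column idea concrete, here is a minimal sketch using only the Python standard library. It is not how Arrow stores data internally; the names `rows` and `columns` and the sample values are made up for illustration.

```python
from array import array
from collections import Counter
from statistics import mean

# Row-oriented: each record keeps all of its fields together.
rows = [(1, 0.5, "a"), (2, 1.5, "b"), (3, 2.5, "a")]

# Column-oriented: each column is its own sequence; the numeric ones
# are contiguous, typed arrays (hypothetical sample data).
columns = {
    "id": array("q", [1, 2, 3]),
    "x": array("d", [0.5, 1.5, 2.5]),
    "label": ["a", "b", "a"],
}

# Row layout: every record has to be visited to pull out one field.
mean_from_rows = mean(row[1] for row in rows)

# Column layout: the values are already sitting next to each other.
mean_from_columns = mean(columns["x"])

# Counting categories only needs the one column, too.
label_counts = Counter(columns["label"])
```

Both means are the same number; the difference is how much unrelated data has to be read to get there.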


(speed example: Spark, from the blog)

      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.3.0-SNAPSHOT

Using Python version 2.7.13 (default, Dec 20 2016 23:09:15)
SparkSession available as 'spark'.

In [1]: from pyspark.sql.functions import rand
   ...: df = spark.range(1 << 22).toDF("id").withColumn("x", rand())
   ...: df.printSchema()
 |-- id: long (nullable = false)
 |-- x: double (nullable = false)

In [2]: %time pdf = df.toPandas()
CPU times: user 17.4 s, sys: 792 ms, total: 18.1 s
Wall time: 20.7 s

In [3]: spark.conf.set("spark.sql.execution.arrow.enabled", "true")

In [4]: %time pdf = df.toPandas()
CPU times: user 40 ms, sys: 32 ms, total: 72 ms                                 
Wall time: 737 ms

(Arrow-based conversion is not enabled by default.)



import pyarrow as pa

arr1 = pa.array([1,2])
arr2 = pa.array([3,4])

field1 = pa.field('col1', pa.int64())
field2 = pa.field('col2', pa.int64())

field1 = field1.add_metadata({'meta': 'foo'})
field2 = field2.add_metadata({'meta': 'bar'})

col1 = pa.column(field1, arr1)
col2 = pa.column(field2, arr2)

table = pa.Table.from_arrays([col1, col2])

batches = table.to_batches(chunksize=1)

# table.to_pandas() gives:
#    col1  col2
# 0     1     3
# 1     2     4


An array is a sequence of values with known length and type.

A column name + data type

+ metadata = a field

A field + an array = a column

A table is a set of columns

You can split a table into row batches

You can convert between pyarrow tables and pandas data frames

(both directions)


Plasma

  • In-memory object store. Documentation here.
  • The goal is zero-copy data exchange between frameworks

(announced mid-2017)

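import pyarrow.plasma  # Plasma was deprecated and later removed from pyarrow

# Assumes a plasma store is running and a client is connected, for example
# client = pyarrow.plasma.connect("/tmp/plasma")  (the signature varied by release)

# Create an object.
object_id = pyarrow.plasma.ObjectID(20 * b'a')
object_size = 1000
buffer = memoryview(client.create(object_id, object_size))

# Write to the buffer.
for i in range(1000):
    buffer[i] = 0

# Seal the object making it immutable and available to other clients.
client.seal(object_id)

# Get the object from the store. This blocks until the object has been sealed.
object_id = pyarrow.plasma.ObjectID(20 * b'a')
[buff] = client.get([object_id])
buffer = memoryview(buff)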



Can be copied because it's now immutable.
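The zero-copy idea can be tried without a Plasma store. The sketch below swaps in the standard library's `multiprocessing.shared_memory` (Python 3.8+) as a stand-in. It is not Plasma and has no sealing or object IDs; it only shows two handles looking at the same bytes without copying them.

```python
from multiprocessing import shared_memory

# Producer: allocate a named shared-memory block and write into it.
shm = shared_memory.SharedMemory(create=True, size=1000)
shm.buf[:4] = bytes([1, 2, 3, 4])

# Consumer: attach to the same block by name; no bytes are copied.
view = shared_memory.SharedMemory(name=shm.name)
first_four = bytes(view.buf[:4])  # copied out only to inspect the values

# A write through one handle is visible through the other,
# which shows that both refer to the same memory.
shm.buf[0] = 42
shared_value = view.buf[0]

# Clean up: close both handles, then free the block.
view.close()
shm.close()
shm.unlink()
```

In a real setup the consumer would be a different process (or a different framework), attaching by name in the same way.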


Gandiva

  • Uses LLVM to JIT-compile SQL queries on the in-memory Arrow data
  • The examples in the original docs use literal SQL (not ORM-style SQL): you pass the query as a string to the compiler, then execute it

(Donated by Dremio November 2018)

Named after a mythical bow from an Indian legend that makes the arrows it fires 1000 times more powerful.


Flight

  • Goal is to reduce serialization and deserialization time when moving data between processes or machines
  • Built on gRPC, a cross-language universal remote procedure call framework

(Announced as a new initiative in late 2018)

What the future holds

Wes McKinney has been thinking out loud on the listserv about the future of the project, and posted some ideas here.

(Note to Tanya: Click on this. It will be the start point for discussions. Scroll to "Goals")


  • Apache Roadshow in Chicago May 13-14
  • https://arrow.apache.org/
  • https://issues.apache.org/jira/projects/ARROW
  • https://cwiki.apache.org/confluence/display/ARROW


Tanya's beautiful Mom, sharing a muffin

To my beautiful Mom, whose only joy in life was her children's happiness.

Rest in peace. You did a wonderful job. I love you.

Quick Apache Arrow

By Tanya Schlusser
