I have to Confess,
I still Love Pandas

How familiar are you with Pandas?

What is your pet peeve?

Number 1.

Not able to handle large datasets

Have you tried...

If you know the column types in the CSV, try using the dtype option

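# df and filename are placeholder names; the start of the call is missing from the slide
df = pd.read_csv(filename,
                 usecols=['movieId', 'title'],
                 dtype={'movieId': 'int32', 'title': 'str'})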

Not every number needs to be float64 or int64

Bonus: no more wrong type guesses, and faster loading
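
To make the dtype point concrete, here is a minimal sketch. The CSV text is made up for illustration and is read from memory rather than from a real file; only the usecols and dtype arguments come from the slide.

```python
import io
import pandas as pd

# Made-up CSV content standing in for a real movies file.
csv_text = (
    "movieId,title,genres\n"
    "1,Toy Story,Animation\n"
    "2,Jumanji,Adventure\n"
    "3,Heat,Action\n"
)

# Let pandas guess the types: whole numbers come back as int64.
df_guess = pd.read_csv(io.StringIO(csv_text))

# Load only the columns we need and state their types up front.
df_typed = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["movieId", "title"],
    dtype={"movieId": "int32", "title": "str"},
)

print(df_guess["movieId"].dtype, df_guess["movieId"].nbytes)
print(df_typed["movieId"].dtype, df_typed["movieId"].nbytes)
```

The int32 column takes half the bytes of the int64 one, and the unused genres column is never loaded at all.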

Have you tried...

Processing in chunks

chunksize = 10 ** 6
with pd.read_csv(filename, chunksize=chunksize) as reader:
    for chunk in reader:
        process(chunk)  # placeholder for your own per-chunk logic

Especially useful when the processed data is then stored in a database or on disk
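
A small runnable sketch of chunked processing, assuming made-up ratings data held in memory and a deliberately tiny chunk size. The running-total aggregation is only one example of per-chunk work, and using the reader as a context manager needs a reasonably recent pandas.

```python
import io
import pandas as pd

# Made-up ratings data: 10 rows, ratings cycle through 1..5.
csv_text = "userId,rating\n" + "\n".join(
    f"{i % 3},{i % 5 + 1}" for i in range(10)
)

chunksize = 4  # tiny on purpose; the slide uses 10 ** 6
total = 0
rows = 0

with pd.read_csv(io.StringIO(csv_text), chunksize=chunksize) as reader:
    for chunk in reader:
        # Each chunk is an ordinary DataFrame with at most `chunksize` rows.
        total += chunk["rating"].sum()
        rows += len(chunk)

mean_rating = total / rows
print(rows, mean_rating)
```

Only one chunk is held in memory at a time, so the same loop works when the file is far larger than RAM.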

Number 2.

Clumsy to use

Get confused sometimes...

Have you noticed...

There are multiple ways of doing things, for example:

  • Getting a column: df.user vs df['user']
  • Sum, min, max etc: sum(df) vs df.sum()
  • Missing values: isnull vs isna

I prefer the latter to the former for three reasons:

1. practicality

2. performance

3. popularity
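
One possible illustration of the practicality point, using a made-up DataFrame whose column name collides with a DataFrame method. This is a sketch, not a benchmark, so it says nothing about the performance claim.

```python
import pandas as pd

df = pd.DataFrame({"user": [1, 2, 3], "count": [10, 20, 30]})

# Bracket access always returns the column.
col = df["count"]

# Attribute access finds the DataFrame.count method, not the column.
attr = df.count
is_method = callable(attr)

# df.sum() adds up each column. The built-in sum(df) iterates over the
# column names (strings) and raises a TypeError.
col_sums = df.sum()
try:
    sum(df)
    builtin_sum_failed = False
except TypeError:
    builtin_sum_failed = True

# isnull is an alias of isna, so both give the same mask.
same_mask = df.isna().equals(df.isnull())

print(is_method, builtin_sum_failed, same_mask)
```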

Have you tried...

Flattening the multi-index


Very useful after groupby, aggregation and pivot_table
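
The code for this slide did not survive, so the following is only a sketch of one common way to flatten, not necessarily the approach originally shown. The data and the joined column names are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "movieId": [1, 1, 2, 2],
    "rating": [4.0, 5.0, 3.0, 2.0],
})

# Aggregating with several functions gives MultiIndex columns:
# ("rating", "mean") and ("rating", "count").
agg = df.groupby("movieId").agg({"rating": ["mean", "count"]})

# Join each (level0, level1) tuple into a single flat column name.
flat = agg.copy()
flat.columns = ["_".join(col) for col in flat.columns.to_flat_index()]

# Move the group key out of the index and back into a regular column.
flat = flat.reset_index()

print(list(flat.columns))
```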

* check out this blog post

Have you tried...

Using SQL-like queries on a DataFrame

df_movies_cnt.query('count >= @self.movie_rating_thres')

Handy if you are familiar with SQL
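
A self-contained sketch of the same idea, assuming a made-up counts table and a plain local variable as the threshold instead of an attribute on self. The @ prefix tells query to look the name up in the surrounding Python scope.

```python
import pandas as pd

# Made-up rating counts per movie.
df_movies_cnt = pd.DataFrame(
    {"count": [5, 50, 500]},
    index=pd.Index([1, 2, 3], name="movieId"),
)

movie_rating_thres = 50  # hypothetical threshold for this example

# query() version: reads much like a SQL WHERE clause.
popular = df_movies_cnt.query("count >= @movie_rating_thres")

# Equivalent boolean-mask version, for comparison.
same = df_movies_cnt[df_movies_cnt["count"] >= movie_rating_thres]

print(popular)
```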

Number 3.

Series? DataFrame?? WT*

Have you noticed...

df['user']    # gives you a Series
df[['user']]  # gives you a DataFrame with 1 column
  • Series is a 1D labelled array capable of holding any data type.
  • DataFrame is a 2D labelled data structure with columns of potentially different types.
  • A DataFrame consists of one or more Series (columns)
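
The three points above can be checked in a few lines. This is a sketch on a made-up two-column DataFrame; the variable names are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({"user": ["ann", "bob"], "rating": [4.5, 3.0]})

# Single brackets with one name: a 1D Series.
one_col_series = df["user"]

# Double brackets (a list of names): a 2D DataFrame with one column.
one_col_frame = df[["user"]]

# Every column of a DataFrame is itself a Series, each with its own dtype.
column_types = {name: type(df[name]).__name__ for name in df.columns}

print(type(one_col_series).__name__, one_col_series.shape)
print(type(one_col_frame).__name__, one_col_frame.shape)
print(column_types)
print(df.dtypes)
```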


Still a very useful skill to have

Ways to master pandas:

  • James Powell @dontusethiscode
  • Find your way of doing things and stick to it
  • Look at how others are doing it and compare it with how you do it
  • Understand the very complicated nature of pandas
  • Contribute to it

Details: https://www.euroscipy.org/2022/

See you there ❤️

I still love pandas

By Cheuk Ting Ho
