I have to Confess,
I still Love Pandas

Cheuk Ting Ho

Cheukting

@cheukting_ho

https://cheuk.dev

How familiar are you with Pandas?

What is your pet peeve about Pandas?

https://www.twitch.tv/cheukting_ho

Number 1.

Not able to handle large datasets

Have you tried...

If you know the column types of the CSV, try using the dtype option

pd.read_csv(os.path.join(self.path_movies),
            usecols=['movieId', 'title'],
            dtype={'movieId': 'int32', 'title': 'str'})

Not every number needs to be float64 or int64

Bonus: no more wrong type guesses, and faster loading
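
To see the effect, here is a minimal sketch that reads the same data twice, once with inferred types and once with explicit ones. The CSV content is made up for illustration and held in memory rather than read from a real file.

```python
import io

import pandas as pd

# Hypothetical in-memory CSV standing in for a real movies file.
csv_text = "movieId,title,year\n1,Toy Story,1995\n2,Jumanji,1995\n3,Heat,1995\n"

# Default: pandas infers the types, and whole numbers become int64.
inferred = pd.read_csv(io.StringIO(csv_text), usecols=["movieId", "title"])

# Explicit: state the types up front and pick a smaller integer.
typed = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["movieId", "title"],
    dtype={"movieId": "int32", "title": "str"},
)

print(inferred["movieId"].dtype)
print(typed["movieId"].dtype)

# Bytes used by the movieId column alone (index excluded).
print(inferred["movieId"].memory_usage(index=False))
print(typed["movieId"].memory_usage(index=False))
```

On three rows the saving is tiny, but an int32 column takes half the space of an int64 one, and that adds up over millions of rows.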

Have you tried...

Processing in chunks

chunksize = 10 ** 6
with pd.read_csv(filename, chunksize=chunksize) as reader:
    for chunk in reader:
        process(chunk)

Especially useful when the processed data is then stored in a database or on disk
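
A rough sketch of that pattern: filter each chunk and append it to a database table, so the full dataset never sits in memory at once. The data, the table name good_ratings and the in-memory SQLite database are all assumptions for illustration; the context-manager form of read_csv needs pandas 1.2 or later.

```python
import io
import sqlite3

import pandas as pd

# Hypothetical ratings data; in practice this would be a large file on disk.
csv_text = "userId,movieId,rating\n" + "\n".join(
    f"{i % 5},{i % 7},{(i % 5) + 1}" for i in range(20)
)

conn = sqlite3.connect(":memory:")  # stand-in for a real database

# Read 6 rows at a time so only one chunk is in memory at once.
with pd.read_csv(io.StringIO(csv_text), chunksize=6) as reader:
    for chunk in reader:
        # Keep only the rows we care about, then append them to the table.
        good = chunk[chunk["rating"] >= 4]
        good.to_sql("good_ratings", conn, if_exists="append", index=False)

# Check how many rows ended up in the database.
stored = pd.read_sql("SELECT COUNT(*) AS n FROM good_ratings", conn)
n_stored = int(stored["n"].iloc[0])
conn.close()
```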

Number 2.

Clumsy to use

Got confused sometimes...

Have you noticed...

There are multiple ways of doing things, for example:

Getting a column: df.user vs df['user']
Sum, min, max etc: sum(df) vs df.sum()
Missing values: isnull vs isna

I prefer the latter to the former, for three reasons:

1. practicality

2. performance

3. popularity
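
A small sketch of why the latter forms are the safer habit. The frame is made up, and one column is deliberately called "count" to show where attribute access goes wrong.

```python
import pandas as pd

# Hypothetical frame; note that one column is called "count".
df = pd.DataFrame({"user": ["ann", "bob"], "count": [10, 20]})

# Attribute access works only when the name does not clash with a method.
users = df.user        # same Series as df["user"]
clash = df.count       # the DataFrame.count method, NOT the column
counts = df["count"]   # bracket access always returns the column

# The method form is vectorised; the builtin loops over values in Python.
total_method = counts.sum()
total_builtin = sum(counts)

# isnull is an alias of isna, so the two always agree.
same_mask = df.isna().equals(df.isnull())
```

Note that iterating over a DataFrame yields its column labels, so calling the builtin sum on a whole frame with string column names raises a TypeError instead of adding up the values.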

Have you tried...

Flattening the multi-index

df.reset_index()

Very useful after groupby, aggregation and pivot_table

* check out this blog post
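
For example, grouping by two columns gives a result indexed by a MultiIndex, and reset_index turns those index levels back into ordinary columns. A minimal sketch with made-up ratings data:

```python
import pandas as pd

# Hypothetical ratings data for illustration.
ratings = pd.DataFrame({
    "user": ["ann", "ann", "bob", "bob"],
    "genre": ["drama", "comedy", "drama", "drama"],
    "rating": [4.0, 3.0, 5.0, 3.0],
})

# Grouping by two keys produces a Series with a two-level MultiIndex.
grouped = ratings.groupby(["user", "genre"])["rating"].mean()
print(type(grouped.index).__name__)

# reset_index moves both levels back into regular columns.
flat = grouped.reset_index()
print(list(flat.columns))
```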

Have you tried...

Using SQL-like queries on a DataFrame

df_movies_cnt.query('count >= @self.movie_rating_thres')

Handy if you are familiar with SQL
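
A minimal sketch of query with a local variable, which is what the @ prefix refers to. The frame, the column name n_ratings and the threshold are made up for illustration.

```python
import pandas as pd

# Hypothetical rating counts per movie.
df_counts = pd.DataFrame({"movieId": [1, 2, 3], "n_ratings": [120, 8, 50]})

threshold = 50  # a plain Python variable, referenced with @ inside the query

# Query form: reads much like a SQL WHERE clause.
popular = df_counts.query("n_ratings >= @threshold")

# Equivalent boolean-mask form, for comparison.
same = df_counts[df_counts["n_ratings"] >= threshold]
```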

Number 3.

Series? DataFrame?? WT*

Have you noticed...

df['user']
# gives you a Series

df[['user']]
# gives you a DataFrame with 1 column

A Series is a 1D labelled array capable of holding any data type.
A DataFrame is a 2D labelled data structure with columns of potentially different types.
A DataFrame consists of one or more Series (its columns).
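
The difference is easy to check by looking at the type and shape of each result. A small sketch with a made-up frame:

```python
import pandas as pd

# Hypothetical frame with columns of different types.
df = pd.DataFrame({"user": ["ann", "bob", "cat"], "age": [31, 25, 40]})

as_series = df["user"]    # single brackets: one column as a Series
as_frame = df[["user"]]   # double brackets: a DataFrame with one column

print(type(as_series).__name__, as_series.shape)  # Series (3,)
print(type(as_frame).__name__, as_frame.shape)    # DataFrame (3, 1)

# Pulling the column back out of the one-column frame gives the same Series.
print(as_frame["user"].equals(as_series))  # True
```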

Pandas

Still a very useful skill to have

Ways to master pandas:

James Powell @dontusethiscode
Find your way of doing things and stick to it
Look at how others are doing it and compare it with how you do it
Understand the very complicated nature of pandas
Contribute to it

Details: https://www.euroscipy.org/2022/

See you there ❤️

Cheuk Ting Ho

Developer advocate / Data Scientist - support open-source and building the community.
