Using Pandas for Data Science

Tips for New Users, Experts, and Everyone in Between

Cheuk Ting Ho

Cheukting

@cheukting_ho

https://cheuk.dev

How familiar

are you with Pandas?

Do you think

Pandas

is difficult to master?

https://www.twitch.tv/cheukting_ho

Number 1.

Set the Dtypes of your data

Have you tried...

If you know the column types of the CSV, try using the dtype option

pd.read_csv(os.path.join(self.path_movies),
            usecols=['movieId', 'title'],
            dtype={'movieId': 'int32', 'title': 'str'})

Not every number needs to be float64 or int64

Bonus: no more wrong type guesses, and faster loading
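
To see the effect of the dtype option, here is a minimal sketch. The CSV text and its values are made up for illustration, and it is read from memory rather than from the movies file used above.

```python
import io
import pandas as pd

# Hypothetical sample data standing in for a movies CSV file.
csv_text = "movieId,title,genres\n1,Toy Story,Animation\n2,Jumanji,Adventure\n"

# Let pandas guess the types: integer columns default to int64.
guessed = pd.read_csv(io.StringIO(csv_text), usecols=["movieId", "title"])

# State the types up front: movieId is stored as int32 instead.
typed = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["movieId", "title"],
    dtype={"movieId": "int32", "title": "str"},
)

# Bytes used by the movieId column in each case (index excluded).
guessed_bytes = guessed["movieId"].memory_usage(index=False)
typed_bytes = typed["movieId"].memory_usage(index=False)
print(guessed["movieId"].dtype, guessed_bytes)
print(typed["movieId"].dtype, typed_bytes)
```

On a column this small the saving is trivial, but the same halving applies to every row of a large file.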

Number 2.

Handle Data By Batch

Have you tried...

Processing in chunks

chunksize = 10 ** 6
with pd.read_csv(filename, chunksize=chunksize) as reader:
    for chunk in reader:
        process(chunk)

Especially useful when the processed data is then stored in a database or on disk
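
A small sketch of what process(chunk) might do. Here it keeps a running total, so that only one chunk is in memory at a time. The data, the column names and the tiny chunk size are assumptions chosen to keep the example short.

```python
import io
import pandas as pd

# Hypothetical ratings data: 10 rows, read 4 rows at a time.
csv_text = "user,rating\n" + "\n".join(f"u{i % 3},{i % 5}" for i in range(10))

total = 0
n_chunks = 0
with pd.read_csv(io.StringIO(csv_text), chunksize=4) as reader:
    for chunk in reader:
        # Each chunk is an ordinary DataFrame with at most 4 rows.
        total += chunk["rating"].sum()
        n_chunks += 1

print(n_chunks, total)
```

In a real pipeline the loop body could instead append each chunk to a database table or a file on disk.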

Number 3.

Choose a Better "Way" of Doing Things

Have you noticed...

There are multiple ways of doing things, for example:

Getting a column: df.user vs df['user']
Sum, min, max etc: sum(df) vs df.sum()
Missing values: isnull vs isna

I prefer the latter to the former, for three reasons:

1. practicality

2. performance

3. popularity
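
A short sketch of the practicality point, using a made-up DataFrame. The column named count is chosen on purpose because it clashes with an existing DataFrame method.

```python
import pandas as pd

# Hypothetical data; "count" deliberately collides with DataFrame.count.
df = pd.DataFrame({"user": ["a", "b", None], "count": [1, 2, 3]})

# Attribute access and bracket access agree for a simple column name...
same = df.user.equals(df["user"])

# ...but df.count is the DataFrame method, not the column.
attr_is_method = callable(df.count)
count_col = df["count"]

# The built-in sum loops in Python; Series.sum is the pandas method.
builtin_total = sum(df["count"])
method_total = df["count"].sum()

# isnull is an alias of isna, so both give the same mask.
missing_same = df["user"].isna().equals(df["user"].isnull())
```

Bracket access also works for column names containing spaces, which attribute access cannot express.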

Number 4.

Remember to flatten your dataframe

Have you tried...

Flattening the multi-index

df.reset_index()

Very useful after groupby, aggregation and pivot_table

* check out this blog post
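
A minimal sketch of flattening after a groupby, with a made-up ratings table. Grouping by two columns produces a MultiIndex, and reset_index turns those index levels back into ordinary columns.

```python
import pandas as pd

# Hypothetical ratings data.
df = pd.DataFrame({
    "user": ["a", "a", "b", "b"],
    "movie": ["x", "y", "x", "x"],
    "rating": [4, 5, 3, 2],
})

# Grouping by two keys gives a result indexed by (user, movie).
grouped = df.groupby(["user", "movie"]).agg(mean_rating=("rating", "mean"))

# reset_index moves both index levels back into columns.
flat = grouped.reset_index()
print(flat)
```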

Number 5.

Use json_normalize

To Flatten DataFrames

Talking about flatten...

About json_normalize
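
A minimal sketch of pd.json_normalize on nested records. The records are invented for illustration. Nested keys become column names joined with a dot.

```python
import pandas as pd

# Hypothetical nested records, e.g. parsed from a JSON API response.
records = [
    {"id": 1, "user": {"name": "Ann", "city": "Leeds"}},
    {"id": 2, "user": {"name": "Bob", "city": "York"}},
]

# Each nested "user" dict is expanded into user.name and user.city columns.
flat = pd.json_normalize(records)
print(flat)
```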

Number 6.

Use SQL Queries

To Manipulate DataFrames

Have you tried...

Using a SQL-like query on a DataFrame

df_movies_cnt.query('count >= @self.movie_rating_thres')

Handy if you are familiar with SQL
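
A self-contained sketch of the same idea, with a made-up counts table and a plain local variable in place of the class attribute. Inside the query string, the @ prefix refers to a Python variable rather than a column.

```python
import pandas as pd

# Hypothetical counts of ratings per movie.
df_movies_cnt = pd.DataFrame({"movieId": [1, 2, 3], "count": [10, 50, 120]})
movie_rating_thres = 50

# query() filters rows with an expression string, similar to a WHERE clause.
popular = df_movies_cnt.query("count >= @movie_rating_thres")

# The equivalent boolean-mask version, for comparison.
same = df_movies_cnt[df_movies_cnt["count"] >= movie_rating_thres]
print(popular)
```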

Number 7.

Understand Series and DataFrames

Have you noticed...

df['user']
# gives you a Series

df[['user']]
# gives you a DataFrame with 1 column

Series is a 1D labelled array capable of holding any data type.
DataFrame is a 2D labelled data structure with columns of potentially different types.
A DataFrame consists of one or more Series (columns).
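
A quick way to check the difference yourself, using a made-up two-column DataFrame:

```python
import pandas as pd

# Hypothetical data.
df = pd.DataFrame({"user": ["a", "b"], "rating": [4, 5]})

one = df["user"]     # single brackets: a Series
two = df[["user"]]   # double brackets: a DataFrame with 1 column

print(type(one).__name__, one.shape)
print(type(two).__name__, two.shape)
```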

Number 8.

Understand GroupBy Objects

about groupby
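
A minimal sketch of what a GroupBy object is, with made-up data. Calling groupby alone computes no result. It returns an object that can be iterated over as (name, group) pairs, or aggregated.

```python
import pandas as pd

# Hypothetical ratings data.
df = pd.DataFrame({
    "user": ["a", "a", "b", "b"],
    "rating": [4, 5, 3, 2],
})

# No aggregation happens yet: this is a DataFrameGroupBy object.
grouped = df.groupby("user")

# Iterating yields the group key and the sub-DataFrame for that key.
group_sizes = {name: len(group) for name, group in grouped}

# Aggregating returns a regular Series indexed by the group key.
mean_rating = grouped["rating"].mean()
print(group_sizes)
print(mean_rating)
```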

Number 9.

Using Pandas Options
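
One possible illustration, assuming the display options are the ones of interest here: pd.get_option reads a setting, and pd.option_context changes it only inside a with block.

```python
import pandas as pd

# Read the current setting for how many rows are displayed.
default_rows = pd.get_option("display.max_rows")

# Temporarily show at most 5 rows; the old value is restored on exit.
with pd.option_context("display.max_rows", 5):
    inside = pd.get_option("display.max_rows")

after = pd.get_option("display.max_rows")
print(default_rows, inside, after)
```

pd.set_option changes a setting for the rest of the session instead.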

Number 10.

Quick Plotting with Pandas

Pandas

Still a very useful skill to have

Ways to master pandas:

James Powell @dontusethiscode
Find your way of doing things and stick to it
Look at how others are doing it and compare it with how you do it
Understand the very complicated nature of pandas
Contribute to it

Cheuk Ting Ho

Developer advocate / Data Scientist - support open-source and building the community.