Using Pandas for Data Science

Tips for New Users, Experts, and Everyone in Between

How familiar

are you with Pandas?

Do you think

Pandas

is difficult to master?

Number 1.

Set the Dtypes of your data

Have you tried...

If you know the column types in the CSV, try the dtype option

# self.path_movies holds the path to the movies CSV file
pd.read_csv(self.path_movies,
            usecols=['movieId', 'title'],
            dtype={'movieId': 'int32', 'title': 'str'})

Not every number needs to be float64 or int64

Bonus: no more wrong type guesses, and faster loading
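
To see what declaring dtypes changes, here is a minimal sketch. The in-memory CSV and the rating column are made up for illustration; only movieId and title come from the slide.

```python
import io

import pandas as pd

# Hypothetical in-memory CSV standing in for a real file on disk.
csv_text = "movieId,title,rating\n1,Toy Story,4.0\n2,Jumanji,3.5\n3,Heat,5.0\n"

# Let pandas guess: integers become int64 and floats become float64.
guessed = pd.read_csv(io.StringIO(csv_text))

# Declare the types up front instead.
declared = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"movieId": "int32", "title": "str", "rating": "float32"},
)

print(guessed.dtypes)
print(declared.dtypes)

# Compare the per-column memory footprint in bytes.
print(guessed.memory_usage(deep=True))
print(declared.memory_usage(deep=True))
```

On a three-row table the saving is tiny, but the numeric columns take half the space, and that ratio holds as the row count grows.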

Number 2.

Handle Data By Batch

Have you tried...

Processing in chunks

chunksize = 10 ** 6
with pd.read_csv(filename, chunksize=chunksize) as reader:
    for chunk in reader:
        process(chunk)

Especially useful when the processed data is then stored in a database or on disk
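
A runnable sketch of the same pattern, assuming a small in-memory CSV in place of a large file. The column names and the running-mean computation are illustrative, not from the talk.

```python
import io

import pandas as pd

# Hypothetical data: ten ratings standing in for a file too large for memory.
csv_text = "userId,rating\n" + "\n".join(
    f"{i % 3},{i % 5 + 1}" for i in range(10)
)

total = 0
rows = 0

# chunksize=4 yields DataFrames of at most 4 rows each.
with pd.read_csv(io.StringIO(csv_text), chunksize=4) as reader:
    for chunk in reader:
        # Keep only running aggregates, never the whole table.
        total += chunk["rating"].sum()
        rows += len(chunk)

mean_rating = total / rows
print(rows, mean_rating)
```

Only one chunk is held in memory at a time, so the same loop could write each chunk to a database instead of accumulating totals.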

Number 3.

Choose a Better "Way" of Doing Things

Have you noticed...

There are multiple ways of doing things, for example:

  • Getting a column: df.user vs df['user']
  • Sum, min, max etc: sum(df) vs df.sum()
  • Missing values: isnull vs isna

I prefer the latter to the former, for three reasons:

1. practicality

2. performance

3. popularity
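
A small sketch of why the latter forms tend to be the safer choice. The DataFrame and its count column are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({"user": ["ann", "bob"], "count": [3, 5]})

# Attribute access works for simple column names...
same_column = df.user.equals(df["user"])

# ...but "count" collides with the DataFrame.count method,
# so df.count is the method and only df["count"] is the column.
is_method = callable(df.count)
count_column = df["count"]

# The builtin sum() iterates over column labels, not values;
# the pandas method sums the values of each column.
column_totals = df[["count"]].sum()

# isnull is an alias of isna, so both return the same mask.
same_mask = df.isnull().equals(df.isna())

print(same_column, is_method, column_totals["count"], same_mask)
```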

Number 4.

Remember to flatten your dataframe

Have you tried...

Flattening the multi-index

df.reset_index()

Very useful after groupby, aggregation and pivot_table

* check out this blog post
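
As a minimal sketch (the data is invented for illustration), grouping by two keys gives a MultiIndex, and reset_index() turns the index levels back into ordinary columns:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "user": ["ann", "ann", "bob", "bob"],
        "genre": ["drama", "comedy", "drama", "drama"],
        "rating": [4.0, 3.0, 5.0, 2.0],
    }
)

# Grouping by two keys produces a Series indexed by a MultiIndex.
grouped = df.groupby(["user", "genre"])["rating"].mean()

# reset_index() moves both index levels back into regular columns.
flat = grouped.reset_index()

print(grouped)
print(flat)
```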

Number 5.

Use json_normalize

To Flatten Nested JSON into DataFrames

Talking about flattening...

About json_normalize
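
The slide presumably refers to pd.json_normalize. A minimal sketch, with made-up records, of how it flattens nested dictionaries into dotted column names:

```python
import pandas as pd

# Hypothetical nested records, e.g. as returned by a JSON API.
records = [
    {"id": 1, "user": {"name": "ann", "city": "London"}},
    {"id": 2, "user": {"name": "bob", "city": "Paris"}},
]

# Nested keys become flat columns such as "user.name" and "user.city".
flat = pd.json_normalize(records)

print(flat)
```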

Number 6.

Use SQL Queries

To Manipulate DataFrames

Have you tried...

Filtering a DataFrame with a SQL-like query string

df_movies_cnt.query('count >= @self.movie_rating_thres')

Handy if you are familiar with SQL
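
A self-contained sketch of the same idea. The data and the threshold variable are illustrative stand-ins for the class attribute used on the slide; the @ prefix refers to a Python variable in the calling scope.

```python
import pandas as pd

# Hypothetical rating counts per movie.
df_movies_cnt = pd.DataFrame({"movieId": [1, 2, 3], "count": [120, 8, 45]})
threshold = 40

# Query-string form: @threshold pulls in the local Python variable.
popular = df_movies_cnt.query("count >= @threshold")

# Equivalent boolean-mask form, for comparison.
same = df_movies_cnt[df_movies_cnt["count"] >= threshold]

print(popular)
```

Note that query() takes a Python-like expression, roughly the WHERE clause of a SQL statement, rather than full SQL.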

Number 7.

Understand Series and DataFrames

Have you noticed...

df['user']
# gives you a Series
df[['user']]
# gives you a DataFrame with 1 column
  • Series is a 1D labelled array capable of holding any data type.
  • DataFrame is a 2D labelled data structure with columns of potentially different types.
  • A DataFrame consists of one or more Series (columns)
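
A quick check of the difference, using a made-up two-column DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"user": ["ann", "bob"], "rating": [4.0, 5.0]})

as_series = df["user"]    # single brackets: one column as a Series
as_frame = df[["user"]]   # double brackets: a one-column DataFrame

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)

# Both share the same row index as the original DataFrame.
shares_index = as_series.index.equals(df.index)
```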

Number 8.

Understand GroupBy Objects

About groupby
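
The slide gives no detail here, so the following is only a sketch of one thing worth knowing: groupby() returns a GroupBy object that can be iterated or aggregated later. The data is invented for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {"user": ["ann", "ann", "bob"], "rating": [4.0, 3.0, 5.0]}
)

# groupby() returns a DataFrameGroupBy object; no aggregation has run yet.
grouped = df.groupby("user")

# Iterating yields (group key, sub-DataFrame) pairs.
group_sizes = {name: len(group) for name, group in grouped}

# Aggregation happens only when a method such as mean() is called.
mean_rating = grouped["rating"].mean()

print(group_sizes)
print(mean_rating)
```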

Number 9.

Using Pandas Options
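
The slide does not say which options it has in mind. As one possible illustration, display.max_rows can be read with pd.get_option and changed temporarily with pd.option_context:

```python
import pandas as pd

# Read the current value of a display option.
default_rows = pd.get_option("display.max_rows")

# Change it only inside the with-block; it is restored afterwards.
with pd.option_context("display.max_rows", 5):
    inside = pd.get_option("display.max_rows")

restored = pd.get_option("display.max_rows")

print(default_rows, inside, restored)
```

pd.set_option("display.max_rows", 5) would make the same change for the rest of the session.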

Number 10.

Quick Plotting with Pandas

Pandas

Still a very useful skill to have

Ways to master pandas:

  • James Powell @dontusethiscode
  • Find your way of doing things and stick to it
  • Look at how others do it and compare it with how you do it
  • Understand the very complicated nature of pandas
  • Contribute to it

USING PANDAS FOR DATA SCIENCE

By Cheuk Ting Ho
