Cheuk Ting Ho
Developer Advocate / Data Scientist - supporting open source and building the community.
Tips for New Users, Experts, and Everyone in Between
Cheuk Ting Ho
How good are you with pandas?
If you know the column types of the CSV, try using the dtype option
import os
import pandas as pd

pd.read_csv(os.path.join(self.path_movies),
            usecols=['movieId', 'title'],
            dtype={'movieId': 'int32', 'title': 'str'})
Not every number needs to be float64 or int64
Bonus: no more wrong type guesses, and reading is faster
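A rough sketch of the saving, using made-up data (not from the talk): compare the memory of an int64 column with an int32 one via memory_usage.

import pandas as pd

df = pd.DataFrame({'movieId': range(1_000_000)})
print(df['movieId'].memory_usage(deep=True))                  # default int64: about 8 MB
print(df['movieId'].astype('int32').memory_usage(deep=True))  # roughly half the bytes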
Processing in chunks
chunksize = 10 ** 6
with pd.read_csv(filename, chunksize=chunksize) as reader:
    for chunk in reader:
        process(chunk)
Especially useful when the processed data is then stored in a database or on disk
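As a minimal sketch (the file name and column name are made up, not from the talk), here is a per-chunk aggregation that never holds the whole CSV in memory:

import pandas as pd

chunksize = 10 ** 6
counts = None
with pd.read_csv('ratings.csv', chunksize=chunksize) as reader:
    for chunk in reader:
        part = chunk['movieId'].value_counts()
        counts = part if counts is None else counts.add(part, fill_value=0)
# counts is now the number of ratings per movieId, built chunk by chunk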
There are multiple ways of doing things, for example:
I prefer the latter to the former for three reasons:
1. practicality
2. performance
3. popularity
Flattening the MultiIndex
df.reset_index()
Very useful after groupby, aggregation and pivot_table
* check out this blog post
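For example (made-up data): a groupby on two keys produces a MultiIndex, and reset_index() turns it back into ordinary columns.

import pandas as pd

df = pd.DataFrame({'user': ['a', 'a', 'b'],
                   'movieId': [1, 2, 1],
                   'rating': [3, 4, 5]})
summary = df.groupby(['user', 'movieId'])['rating'].mean()  # Series with a (user, movieId) MultiIndex
flat = summary.reset_index()  # back to plain columns: user, movieId, rating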
About json_normalize
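A minimal sketch with made-up nested records: pd.json_normalize flattens nested dictionaries into flat columns.

import pandas as pd

records = [{'movieId': 1, 'info': {'title': 'Toy Story', 'year': 1995}},
           {'movieId': 2, 'info': {'title': 'Jumanji', 'year': 1995}}]
df = pd.json_normalize(records)
# columns: movieId, info.title, info.year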
Using SQL-like queries on a DataFrame
df_movies_cnt.query('count >= @self.movie_rating_thres')
Handy if you are familiar with SQL
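A standalone version of the call above (data and threshold made up): @ lets query() reference a local variable, much like a parameter in a SQL WHERE clause.

import pandas as pd

df_movies_cnt = pd.DataFrame({'movieId': [1, 2, 3], 'count': [50, 5, 120]})
movie_rating_thres = 10
popular = df_movies_cnt.query('count >= @movie_rating_thres')  # keeps movieId 1 and 3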
df['user']
# gives you a Series
df[['user']]
# gives you a DataFrame with 1 column
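A quick way to see the difference for yourself (made-up DataFrame):

import pandas as pd

df = pd.DataFrame({'user': ['a', 'b'], 'rating': [3, 5]})
print(type(df['user']))    # <class 'pandas.core.series.Series'>
print(type(df[['user']]))  # <class 'pandas.core.frame.DataFrame'>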
About groupby
Still a very useful skill to have
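A minimal sketch of a common groupby pattern (made-up data; named aggregation is my illustration, not necessarily the slide's example):

import pandas as pd

df = pd.DataFrame({'movieId': [1, 1, 2], 'rating': [3, 4, 5]})
stats = df.groupby('movieId').agg(
    mean_rating=('rating', 'mean'),
    n_ratings=('rating', 'count'),
)
# one flat row per movieId, with explicit column names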
Ways to master pandas:
By Cheuk Ting Ho