Cheuk Ting Ho
are you with Pandas?
If you know the column types of the CSV, try using the dtype option
pd.read_csv(os.path.join(self.path_movies),
            usecols=['movieId', 'title'],
            dtype={'movieId': 'int32', 'title': 'str'})
Not every number needs to be float64 or int64
Bonus: no more wrong type guesses, and faster loading
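To see the effect of the dtype option, here is a minimal sketch. The CSV content is made up for illustration and read from memory rather than from a real movies file.

```python
import io

import pandas as pd

# Hypothetical in-memory CSV standing in for a movies file
csv_text = "movieId,title,genres\n1,Toy Story,Animation\n2,Jumanji,Adventure\n"

# Default: pandas guesses the types, and whole numbers become int64
df_guess = pd.read_csv(io.StringIO(csv_text))

# Explicit: read only the columns we need, with the types we choose
df_typed = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["movieId", "title"],
    dtype={"movieId": "int32", "title": "str"},
)

print(df_guess["movieId"].dtype)  # int64
print(df_typed["movieId"].dtype)  # int32
```

An int32 column uses half the memory of an int64 one, which adds up on large files.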
Processing in chunks
chunksize = 10 ** 6
with pd.read_csv(filename, chunksize=chunksize) as reader:
    for chunk in reader:
        process(chunk)
Especially useful when the processed data is then stored in a database or on disk
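As a rough illustration of chunked processing, the sketch below computes a mean rating without ever holding the whole file in memory. The data, the column names and the tiny chunk size are assumptions for the demo, and using the reader as a context manager assumes pandas 1.2 or later.

```python
import io

import pandas as pd

# Hypothetical ratings data; a real file would be far larger
csv_text = "userId,rating\n1,4.0\n2,3.5\n3,5.0\n4,2.0\n5,4.5\n"

total = 0.0
n_rows = 0
n_chunks = 0

# chunksize=2 only so this small example yields several chunks
with pd.read_csv(io.StringIO(csv_text), chunksize=2) as reader:
    for chunk in reader:
        # Each chunk is an ordinary DataFrame with at most 2 rows
        total += chunk["rating"].sum()
        n_rows += len(chunk)
        n_chunks += 1

mean_rating = total / n_rows
print(n_chunks, n_rows, mean_rating)
```

Only running totals are kept between chunks, so memory use depends on the chunk size and not on the file size.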
There are multiple ways of doing things, for example:
I prefer the latter to the former for reasons:
1. practicality
2. performance
3. popularity
Flattening the multi-index
df.reset_index()
Very useful after groupby, aggregation and pivot_table
* check out this blog post
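A small sketch of what reset_index() does after a groupby on two keys. The DataFrame and its column names are invented for the example.

```python
import pandas as pd

# Hypothetical ratings table
df = pd.DataFrame({
    "user": ["a", "a", "b", "b"],
    "movie": ["x", "y", "x", "x"],
    "rating": [4.0, 3.0, 5.0, 1.0],
})

# Grouping by two keys gives a Series indexed by a (user, movie) MultiIndex
grouped = df.groupby(["user", "movie"])["rating"].mean()

# reset_index() turns the index levels back into ordinary columns
flat = grouped.reset_index()

print(grouped.index.nlevels)  # 2
print(list(flat.columns))     # ['user', 'movie', 'rating']
```

The flat version is usually easier to merge, filter or export.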
Using SQL-like queries on a DataFrame
df_movies_cnt.query('count >= @self.movie_rating_thres')
Handy if you are familiar with SQL
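Here is a minimal sketch of query() with a threshold taken from a Python variable via the @ prefix. The data, the n_ratings column and the plain movie_rating_thres variable are hypothetical stand-ins for the names on the slide.

```python
import pandas as pd

# Hypothetical per-movie rating counts
df_movies_cnt = pd.DataFrame({
    "movieId": [1, 2, 3],
    "n_ratings": [120, 8, 50],
})

movie_rating_thres = 50

# @name refers to a Python variable in the calling scope
popular = df_movies_cnt.query("n_ratings >= @movie_rating_thres")

# The equivalent boolean-mask version, for comparison
same = df_movies_cnt[df_movies_cnt["n_ratings"] >= movie_rating_thres]

print(popular)
```

Both forms select the same rows; query() tends to read better when several conditions are combined.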
df['user']
# gives you a Series
df[['user']]
# gives you a DataFrame with 1 column
Still a very useful skill to have
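The difference between df['user'] and df[['user']] can be checked directly. This sketch uses a made-up DataFrame.

```python
import pandas as pd

# Hypothetical data
df = pd.DataFrame({"user": [1, 2, 3], "rating": [4.0, 3.5, 5.0]})

s = df["user"]    # single brackets: one column as a Series
d = df[["user"]]  # double brackets: a list of columns, so a DataFrame

print(type(s).__name__, s.shape)  # Series (3,)
print(type(d).__name__, d.shape)  # DataFrame (3, 1)
```

This matters when a later step expects two-dimensional input, for example a function that takes a table of features.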
Way to master pandas:
Details: https://www.euroscipy.org/2022/
See you there ❤️