An Introduction to Pandas
ayush1997
a skill to be mastered...
About me
- CS Sophomore
- Pythonista!
- Hackathon Lover
- FOSS Enthusiast
- Mentor @ DevSocMSIT
What is it?
Pandas is a powerful data analysis toolkit providing fast, flexible, and expressive data structures designed to make working with "relational" or "labeled" data both easy and intuitive.
Pandas = Python + NumPy + R
Why Pandas?
- Highly optimized for performance, with critical code paths written in Cython or C
- Easy handling of missing data (represented as NaN)
- Robust I/O tools for loading/saving data from/to different formats (CSV, HDF5, JSON, ...)
- Intuitive merging and joining of data sets
- Easy label-based slicing, indexing, and subsetting of large data sets
- Powerful, flexible group by functionality to perform split-apply-combine operations on data sets
- Works well together with the IPython toolkit and other libraries
Installation
Using Pip
pip install pandas
Using Conda
conda install pandas
import pandas as pd
Agenda
- Data Structures
- I/O Tools
- Basic Functions
- Indexing and selecting data
- Working with missing data
- GroupBy
- Merge, Join and Concatenate
The Data
DATA STRUCTURES
DataFrame
It is a tabular data structure made up of rows and columns.
Series
A Series is a one-dimensional object similar to an array, list, or column in a table.
Series
- Using random list
- Using Dictionary
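The two ways of building a Series listed above can be sketched as follows. The values are made up for illustration (a fixed list stands in for the random one), and the variable names are hypothetical.

```python
import pandas as pd

# From a list: pandas assigns a default integer index 0..n-1.
s_list = pd.Series([7, 3, 9, 1])

# From a dictionary: the keys become the index labels.
s_dict = pd.Series({"a": 10, "b": 20, "c": 30})

print(s_list)
print(s_dict)
```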
DataFrame
- From Dictionary
- Using list of lists
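A minimal sketch of both DataFrame constructions, assuming a small made-up table (the names and ages are not from the tutorial's data set):

```python
import pandas as pd

# From a dictionary: each key becomes a column name,
# each value is the list of entries for that column.
df_dict = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cara"],
    "Age": [22, 38, 26],
})

# From a list of lists: each inner list is one row;
# column names are passed separately.
df_rows = pd.DataFrame(
    [["Ann", 22], ["Bob", 38], ["Cara", 26]],
    columns=["Name", "Age"],
)

print(df_dict)
print(df_rows)
```

Both calls should produce the same 3x2 table; which one is more convenient depends on whether the data arrives column-wise or row-wise.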
I/O Tools
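The pandas I/O API is a set of top-level reader functions, such as pd.read_csv(), that generally return a pandas object, and matching writer methods, such as DataFrame.to_csv(), that are called on the object itself.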
- read_csv
- read_excel
- read_hdf
- read_sql
- read_json
- read_html
- read_pickle
- to_csv
- to_excel
- to_hdf
- to_sql
- to_json
- to_html
- to_pickle
- Reading CSV
- Writing to CSV
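A short sketch of a CSV round trip. To keep it self-contained, the CSV content is an in-memory string with made-up rows; with a real file you would pass a path such as a hypothetical "train.csv" instead.

```python
import io
import pandas as pd

# Made-up CSV content standing in for a file on disk.
csv_text = "Name,Age,Pclass\nAnn,22,3\nBob,38,1\n"

# read_csv accepts a path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

# to_csv with no path returns the CSV as a string;
# index=False leaves out the row index column.
out = df.to_csv(index=False)

print(df)
print(out)
```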
Essential Basic Functions
Head and Tail
Columns and indexes
Descriptive Statistics
Data Summary
Deleting
Rename
Unique
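The basic functions named in this section can be tried on a small frame. This is a sketch with made-up data; the column names only imitate a Titanic-style table and are an assumption.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cara", "Dev"],
    "Age": [22, 38, 26, 35],
    "Pclass": [3, 1, 3, 1],
})

first_two = df.head(2)      # Head: first 2 rows
last_two = df.tail(2)       # Tail: last 2 rows
cols = list(df.columns)     # column labels
idx = list(df.index)        # row labels
stats = df.describe()       # descriptive statistics for numeric columns
df.info()                   # data summary: dtypes and non-null counts (printed)

# Deleting and renaming return new frames; df itself is unchanged.
no_name = df.drop(columns=["Name"])
renamed = df.rename(columns={"Pclass": "Class"})

# Unique: distinct values in order of first appearance.
classes = df["Pclass"].unique()
```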
Indexing & Selecting Data
Different selection methods
pandas provides a suite of methods for integer-based and label-based indexing. The semantics closely follow Python and NumPy slicing.
- Some Basic indexing
- .loc is used for label based selection
- .iloc is basically integer position based
- .ix was for mixed label and integer position based selection (deprecated in pandas 0.20 and removed in 1.0; use .loc or .iloc instead)
- Boolean indexing
Basic Indexing [ ]
- To get columns
- Slicing rows
- To get a cell
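A sketch of the three basic [ ] operations on a made-up frame. The frame uses the default integer index, so the row label 0 in the cell lookup is also the first position.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cara", "Dev"],
    "Age": [22, 38, 26, 35],
})

# To get columns: one name gives a Series, a list of names gives a DataFrame.
ages = df["Age"]
subset = df[["Name", "Age"]]

# Slicing rows: positional, end exclusive (rows 1 and 2 here).
rows = df[1:3]

# To get a cell: pick the column, then the row label.
cell = df["Age"][0]
```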
.loc
Selection by Label
.iloc
Selection by Position
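The difference between .loc and .iloc is easiest to see on a frame whose index is not 0..n-1. A minimal sketch with made-up string labels:

```python
import pandas as pd

df = pd.DataFrame(
    {"Age": [22, 38, 26], "Pclass": [3, 1, 3]},
    index=["ann", "bob", "cara"],
)

# .loc selects by label; label slices include both endpoints.
age_bob = df.loc["bob", "Age"]
ann_to_bob = df.loc["ann":"bob"]

# .iloc selects by integer position; slices exclude the end, as in Python.
same_cell = df.iloc[1, 0]
first_two = df.iloc[0:2]
```

Here both slices return the same two rows, but for different reasons: the label slice is inclusive, while the positional slice stops before position 2.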
.ix
Boolean Indexing
Boolean vector to filter data
| for or, & for and, ~ for not
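A sketch of filtering with boolean vectors, using made-up data. Each comparison needs its own parentheses because |, & and ~ bind more tightly than comparison operators in Python.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cara", "Dev"],
    "Age": [22, 38, 26, 35],
    "Pclass": [3, 1, 3, 1],
})

# and: older than 30 AND in first class
older_first = df[(df["Age"] > 30) & (df["Pclass"] == 1)]

# or: younger than 25 OR in first class
young_or_first = df[(df["Age"] < 25) | (df["Pclass"] == 1)]

# not: everyone who is NOT in first class
not_first = df[~(df["Pclass"] == 1)]
```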
Working With Missing Data
In pandas the missing data is represented by NaN
Check for Null values
Filling Missing Data
Dropping Data
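Checking, filling and dropping can be sketched on a small frame with gaps. The data and the fill values (column mean, the string "Unknown") are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Age": [22, np.nan, 26],
    "Cabin": ["C85", None, None],
})

# Check for null values: a boolean frame, and a per-column count.
null_mask = df.isnull()
null_counts = df.isnull().sum()

# Filling: a dict gives a different fill value per column.
filled = df.fillna({"Age": df["Age"].mean(), "Cabin": "Unknown"})

# Dropping: any row with a missing value, or only rows missing "Age".
complete_rows = df.dropna()
has_age = df.dropna(subset=["Age"])
```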
GroupBy
split-apply-combine
By “group by” we are referring to a process involving one or more of the following steps:
- Splitting the data into groups based on some criteria
- Applying a function to each group independently
- Combining the results into a data structure
Splitting an object into Groups
df.groupby(["Pclass"])
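A sketch of what the split step produces, assuming a made-up frame with a Pclass column like the one in the call above. Nothing is computed until a function is applied; the GroupBy object only records how rows map to groups.

```python
import pandas as pd

df = pd.DataFrame({
    "Pclass": [1, 3, 3, 1, 2],
    "Fare": [70.0, 8.0, 7.5, 53.0, 13.0],
})

grouped = df.groupby(["Pclass"])

n_groups = grouped.ngroups   # number of distinct Pclass values
sizes = grouped.size()       # rows per group, indexed by Pclass

print(n_groups)
print(sizes)
```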
Applying
Once GroupBy objects have been created we can compute a summary statistic (or statistics) about each group
aggregate( )
-
Applying multiple functions at once
-
Applying different functions to DataFrame Columns
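The apply step can be sketched as below, again on made-up Titanic-style columns. agg is the short alias of aggregate; the choice of functions per column is only an example.

```python
import pandas as pd

df = pd.DataFrame({
    "Pclass": [1, 3, 3, 1],
    "Fare": [70.0, 8.0, 10.0, 50.0],
    "Age": [38, 22, 26, 35],
})
grouped = df.groupby("Pclass")

# A single summary statistic per group.
mean_fare = grouped["Fare"].mean()

# Applying multiple functions at once: one result column per function.
fare_stats = grouped["Fare"].agg(["min", "max", "mean"])

# Applying different functions to different DataFrame columns.
per_column = grouped.agg({"Fare": "sum", "Age": "max"})
```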
Merge, Join and Concatenate
Concatenating Objects
The concat function performs concatenation operations along an axis while performing optional set logic (union or intersection) of the indexes (if any) on the other axes.
df1
df2
pd.concat(objs, axis=0, join='outer', ignore_index=False,
keys=None, levels=None, names=None, verify_integrity=False)
- along column axis
- along row axis
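A minimal sketch of concatenation along each axis. The three frames are made up for illustration; df3 is an extra hypothetical frame used to show the column-wise case.

```python
import pandas as pd

df1 = pd.DataFrame({"A": ["A0", "A1"], "B": ["B0", "B1"]})
df2 = pd.DataFrame({"A": ["A2", "A3"], "B": ["B2", "B3"]})
df3 = pd.DataFrame({"C": ["C0", "C1"]})

# Along the row axis (axis=0): frames are stacked vertically.
# ignore_index=True renumbers the rows 0..n-1.
stacked = pd.concat([df1, df2], axis=0, ignore_index=True)

# Along the column axis (axis=1): frames are placed side by side,
# aligned on the row index.
side_by_side = pd.concat([df1, df3], axis=1)
```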
Concatenating Using append( )
A useful shortcut to concat was the append instance method on Series and DataFrame. It was deprecated in pandas 1.4 and removed in 2.0; pd.concat is the replacement.
It concatenated along axis=0.
To add a row
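Because DataFrame.append is no longer available in pandas 2.0 and later, the sketch below adds a row with pd.concat instead. This is a substitute technique, not the original slide's code, and the data is made up.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Ann", "Bob"], "Age": [22, 38]})

# Wrap the new row in a one-row DataFrame, then concatenate along axis=0.
new_row = pd.DataFrame([{"Name": "Cara", "Age": 26}])
df_more = pd.concat([df, new_row], ignore_index=True)

print(df_more)
```

On older pandas versions, df.append(new_row, ignore_index=True) gave the same result.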
Merging/Joining
pandas has full-featured, high-performance in-memory join/merge operations that are idiomatically very similar to those of relational databases like SQL.
pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None,
left_index=False, right_index=False, sort=False,
suffixes=('_x', '_y'), copy=True, indicator=False)
df1
df2
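A sketch of an inner and an outer merge on a shared key column. The frames and the column name "key" are made up; they stand in for df1 and df2.

```python
import pandas as pd

left = pd.DataFrame({"key": ["K0", "K1", "K2"], "A": ["A0", "A1", "A2"]})
right = pd.DataFrame({"key": ["K1", "K2", "K3"], "B": ["B1", "B2", "B3"]})

# Inner join: only keys present in both frames survive.
inner = pd.merge(left, right, how="inner", on="key")

# Outer join: all keys from either side; gaps are filled with NaN.
# indicator=True adds a "_merge" column saying where each row came from.
outer = pd.merge(left, right, how="outer", on="key", indicator=True)

print(inner)
print(outer)
```

The how argument plays the role of the SQL join type: 'inner', 'outer', 'left' and 'right' are all accepted.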
Questions?
ayush0016
ayush1997
ayushkumar97
https://github.com/ayush1997/Pandas-Tutorial
Pandas
By Ayush Singh
Introduction To Pandas : Python Data Analysis Toolkit