Python Programming
Class 6
- Python Data Analysis Library (pandas) is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
Source: https://pandas.pydata.org/
Pandas Series
- A Pandas Series is a one-dimensional array of indexed data.
- The Pandas Series is much more general and flexible than the one-dimensional NumPy array.
data = pd.Series([0.25, 0.5, 0.75, 1.0])
data
data.values #Values are a numpy array
data.index #The index is an array-like object of type pd.Index
#Associated index
data[1]
data[1:3]
Constructing Series objects
Note:
- NumPy array has an implicitly defined integer index used to access the values.
- Pandas Series has an explicitly defined index associated with the values.
#Constructing Series objects
data = pd.Series([0.25, 0.5, 0.75, 1.0],index=['a', 'b', 'c', 'd'])
data
data['a'] #Access item
#We can even use noncontiguous or nonsequential indices:
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=[2, 5, 3, 7])
data[7]
#Python dictionaries to Pandas Series
serie = pd.Series({2:'a', 1:'b', 3:'c'})
Constructing Series objects
- Pandas Series makes it much more efficient than Python dictionaries for certain data manipulation operations.
#Python dictionaries to Pandas Series
serie = pd.Series({2:'a', 1:'b', 3:'c'})
Pandas Dataframe
- DataFrames are essentially multidimensional arrays with attached row and column labels, and often with heterogeneous types and/or missing data.
- You can think of it as a spreadsheet data representation.
Constructing DataFrame objects
#Create an Empty DataFrame
import pandas as pd
df = pd.DataFrame()
print df
#Create a DataFrame from Lists
import pandas as pd
data = [1,2,3,4,5]
df = pd.DataFrame(data)
print df
import pandas as pd
dataFrutas = [['Manzana',100],['Pera',105],['Banano',130]]
df = pd.DataFrame(dataFrutas,columns=['Nombre','Peso(gr)'])
print df
Constructing DataFrame objects
#Create a DataFrame from Dict of ndarrays / Lists
import pandas as pd
dataFrutas = {'Nombre':['Manzana', 'Pera', 'Banano', 'Fresa'],'Peso(gr)':[100,105,130,42]}
df = pd.DataFrame(dataFrutas)
print df
import pandas as pd
dataFrutas = {'Nombre':['Manzana', 'Pera', 'Banano', 'Fresa'],'Peso(gr)':[100,105,130,42]}
df = pd.DataFrame(dataFrutas, index=['primera','segunda','tercera','cuarta'])
print df
df['Nombre']['primera']
#Create a DataFrame from List of Dicts
import pandas as pd
dataFrutas = [{'Nombre': 'Manzana', 'Peso(gr)': 100},
{'Nombre': 'Pera', 'Peso(gr)': 105, 'ciudad': 'Bogota'}]
df = pd.DataFrame(dataFrutas)
print df
Dataframe Indexing
(loc vs iloc)
- loc gets rows (or columns) with particular labels from the index.
- iloc gets rows (or columns) at particular positions in the index (so it only takes integers).
Dataframe Indexing
(loc vs iloc)
import numpy as np
import pandas as pd
df = pd.DataFrame(data=np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
index= [2, 'A', 2], columns=['primera', 49, 'tercera'])
#loc and iloc differences
ans = df.loc[2] # index named 2
ansA = df.loc['A'] # index named 'A'as a Series
ansA = df.loc[['A']] # index named 'A' as DataFrame
# 2 to iloc
ans2 = df.iloc[2] #index at position 2 as a Series
ans2 = df.iloc[[2]] #index at position 2 as a Dataframe
Read CSV into a DataFrame object
#Read csv
import pandas as pd
movies = pd.read_csv('../Dropbox/movies.csv',sep=',')
DataFrame data manipulation
#Rename a column
movies.rename(columns={'movieId': 'peliculaId', 'title': 'titulo'}, inplace=True)
#Select a column from dataframe
titulo = movies['titulo']
#Select row
movies.iloc[2]
titulo.iloc[2]
#select rows
primeras_150 = movies.iloc[0:150] #First 150 movies titles
#Addition of rows
movies = movies.append(primeras_150.iloc[10])
#reset index .reset_index()
movies = movies.reset_index(drop=True)
#Delete rows
movies = movies.drop(9125)
#Column deletion
del movies['peliculaId']
#get column names
movies.columns
Vectorized string operations
Vectorized string operations
#Vectorized String Operations
movies['titulo'] = movies['titulo'].str.lower()
movies['genres'] = movies['genres'].str.upper()
#Replacing All Occurrences of a String in a DataFrame
movies.replace(['ACTION', 'COMEDY', ], ['ACCION','COMEDIA'],inplace = True)
Iterate over a DataFrame or serie
for index, row in movies.iterrows() :
print(row['titulo'], row['genres'])
Save DataFrame object as CSV file
#Write csv
movies.to_csv('../Dropbox/moviesFinal.csv',sep=',')
Challenge 1
- Download cvs file from https://goo.gl/9kPkcE
- Read cvs file into DataFrame and call it titanic_train.
- Rename column 'name' to 'nombre'.
- Upper case 'nombre' column.
- Select rows from index 300 to 400 into a new DataFrame
- Select columns 'PassengerId', 'nombre', 'Survived' from last DataFrame into a new one.
- Save final DataFrame as Result.csv
Answer 1
#Read cvs file into DataFrame and call it titanic_train.
titanic_train = pd.read_csv('../Dropbox/train.csv',sep=',')
#Rename column 'name' to 'nombre'.
titanic_train.rename(columns={'Name': 'nombre'}, inplace=True)
#Upper case 'nombre' column.
titanic_train['nombre'] = titanic_train['nombre'].str.upper()
#Select rows that are from index 300 to 400 into a new DataFrame
select_rows_df = titanic_train.iloc[300:401]
#Select columns 'PassengerId', 'nombre', 'Survived' into a new DataFrame
result_df = select_rows_df[['PassengerId','nombre','Survived']]
#Save final DataFrame as Result.csv
result_df.to_csv('../Dropbox/Result.csv',sep=',',index=False)
Resources
Python-Programming [Class 6]
By Jose Arrieta
Python-Programming [Class 6]
Pandas, Series, DataFrame, Vectorized string operations, Iterate over a DataFrame, Save DataFrame object as CSV file
- 3,346