Python Programming

Class 6

  • Python Data Analysis Library (pandas) is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.  

Pandas Series

  • A Pandas Series is a one-dimensional array of indexed data. 
  • The Pandas Series is much more general and flexible than the one-dimensional NumPy array.
data = pd.Series([0.25, 0.5, 0.75, 1.0])
data

data.values #Values are a numpy array 

data.index #The index is an array-like object of type pd.Index

#Associated index
data[1]

data[1:3]

Constructing Series objects

Note:  

  • NumPy array has an implicitly defined integer index used to access the values.
  • Pandas Series has an explicitly defined index associated with the values.
#Constructing Series objects
data = pd.Series([0.25, 0.5, 0.75, 1.0],index=['a', 'b', 'c', 'd'])
data

data['a'] #Access item 

#We can even use noncontiguous or nonsequential indices:
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=[2, 5, 3, 7])
data[7]

#Python dictionaries to Pandas Series 
serie = pd.Series({2:'a', 1:'b', 3:'c'})

Constructing Series objects

  • Pandas Series makes it much more efficient than Python dictionaries for certain data manipulation operations.
#Python dictionaries to Pandas Series 
serie = pd.Series({2:'a', 1:'b', 3:'c'})

Pandas Dataframe

  • DataFrames are essentially multidimensional arrays with attached row and column labels, and often with heterogeneous types and/or missing data.
  • You can think of it as a spreadsheet data representation.

Constructing DataFrame objects

#Create an Empty DataFrame
import pandas as pd
df = pd.DataFrame()
print df

#Create a DataFrame from Lists
import pandas as pd
data = [1,2,3,4,5]
df = pd.DataFrame(data)
print df

import pandas as pd
dataFrutas = [['Manzana',100],['Pera',105],['Banano',130]]
df = pd.DataFrame(dataFrutas,columns=['Nombre','Peso(gr)'])
print df

Constructing DataFrame objects

#Create a DataFrame from Dict of ndarrays / Lists
import pandas as pd
dataFrutas = {'Nombre':['Manzana', 'Pera', 'Banano', 'Fresa'],'Peso(gr)':[100,105,130,42]}
df = pd.DataFrame(dataFrutas)
print df

import pandas as pd
dataFrutas = {'Nombre':['Manzana', 'Pera', 'Banano', 'Fresa'],'Peso(gr)':[100,105,130,42]}
df = pd.DataFrame(dataFrutas, index=['primera','segunda','tercera','cuarta'])
print df

df['Nombre']['primera']

#Create a DataFrame from List of Dicts
import pandas as pd
dataFrutas = [{'Nombre': 'Manzana', 'Peso(gr)': 100},
{'Nombre': 'Pera', 'Peso(gr)': 105, 'ciudad': 'Bogota'}]
df = pd.DataFrame(dataFrutas)
print df

Dataframe Indexing

(loc vs iloc)

  • loc gets rows (or columns) with particular labels from the index.

 

  • iloc gets rows (or columns) at particular positions in the index (so it only takes integers).

Dataframe Indexing

(loc vs iloc)

import numpy as np
import pandas as pd

df = pd.DataFrame(data=np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]), 
                  index= [2, 'A', 2], columns=['primera', 49, 'tercera'])

#loc and iloc differences 

ans = df.loc[2] # index named 2
ansA = df.loc['A'] # index named 'A'as a Series
ansA = df.loc[['A']] # index named 'A' as DataFrame

# 2 to iloc
ans2 = df.iloc[2] #index at position 2 as a Series
ans2 = df.iloc[[2]] #index at position 2 as a Dataframe

Read CSV into a DataFrame object

#Read csv
import pandas as pd 

movies = pd.read_csv('../Dropbox/movies.csv',sep=',') 

DataFrame data manipulation

#Rename a column
movies.rename(columns={'movieId': 'peliculaId', 'title': 'titulo'}, inplace=True)

#Select a column from dataframe 
titulo = movies['titulo']

#Select row 
movies.iloc[2]
titulo.iloc[2]

#select rows
primeras_150 = movies.iloc[0:150] #First 150 movies titles

#Addition of rows
movies = movies.append(primeras_150.iloc[10])

#reset index .reset_index()
movies = movies.reset_index(drop=True)

#Delete rows 
movies = movies.drop(9125)

#Column deletion
del movies['peliculaId']

#get column names
movies.columns

Vectorized string operations

 

Vectorized string operations

 

#Vectorized String Operations
movies['titulo'] = movies['titulo'].str.lower()
movies['genres'] = movies['genres'].str.upper()

#Replacing All Occurrences of a String in a DataFrame
movies.replace(['ACTION', 'COMEDY', ], ['ACCION','COMEDIA'],inplace = True) 

Iterate over a DataFrame  or serie

 

for index, row in movies.iterrows() :
    print(row['titulo'], row['genres'])

Save DataFrame object as CSV file

#Write csv
movies.to_csv('../Dropbox/moviesFinal.csv',sep=',') 

Challenge 1

  • Download cvs file from https://goo.gl/9kPkcE
  • Read cvs file into DataFrame and call it titanic_train.
  • Rename column 'name' to 'nombre'.
  • Upper case 'nombre' column.
  • Select rows from index 300 to 400 into a new DataFrame
  • Select columns 'PassengerId', 'nombre', 'Survived' from last DataFrame into a new one.  
  • Save final DataFrame as Result.csv

Answer 1

#Read cvs file into DataFrame and call it titanic_train.
titanic_train = pd.read_csv('../Dropbox/train.csv',sep=',') 

#Rename column 'name' to 'nombre'. 
titanic_train.rename(columns={'Name': 'nombre'}, inplace=True)

#Upper case 'nombre' column. 
titanic_train['nombre'] = titanic_train['nombre'].str.upper()

#Select rows that are from index 300 to 400 into a new DataFrame
select_rows_df = titanic_train.iloc[300:401]

#Select columns 'PassengerId', 'nombre', 'Survived' into a new DataFrame
result_df = select_rows_df[['PassengerId','nombre','Survived']]

#Save final DataFrame as Result.csv
result_df.to_csv('../Dropbox/Result.csv',sep=',',index=False)

Resources

Python-Programming [Class 6]

By Jose Arrieta

Python-Programming [Class 6]

Pandas, Series, DataFrame, Vectorized string operations, Iterate over a DataFrame, Save DataFrame object as CSV file

  • 3,346