Introduction to

and

Creator: Wes McKinney - Benevolent Dictator for Life
Motivation: quantitative analysis in financial data
Objective: in-memory data wrangling, preparation, and analytics

Creator: John D. Hunter - Neurobiologist (1968-2012)
Motivation: visualize Electrocorticography data of epilepsy patients
Objective: plotting (easy things easy and hard things possible)

Travis Oliphant

  
  ivanleon@ilg40: ~ $ pip3 show pandas
  Name: pandas
  Version: 0.23.4
  Summary: Powerful data structures for data analysis, time series, and statistics
  Home-page: http://pandas.pydata.org
  Author: None
  Author-email: None
  License: BSD
  Location: /usr/local/lib/python3.5/dist-packages
  Requires: pytz, numpy, python-dateutil
  Required-by: 

  ivanleon@ilg40: ~ $ pip3 show matplotlib
  Name: matplotlib
  Version: 3.0.2
  Summary: Python plotting package
  Home-page: http://matplotlib.org
  Author: John D. Hunter, Michael Droettboom
  Author-email: matplotlib-users@python.org
  License: BSD
  Location: /usr/local/lib/python3.5/dist-packages
  Requires: cycler, python-dateutil, pyparsing, numpy, kiwisolver
  Required-by:

NumPy Super Powers !!!


  import numpy as np

Range of Numbers


  >>> numbers = []
  >>> 
  >>> for i in range(0, 11):
  ...     numbers.append(i)
  ... 
  >>> numbers
  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]


  >>> numbers_2 = [n for n in range(0, 11)]
  >>> 
  >>> numbers_2
  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

  
  >>> numbers_3 = np.arange(0, 11)
  >>>
  >>> numbers_3
  array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10])
  >>>
  >>> list(numbers_3)
  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

Array Math


  >>> l1 = [1, 1, 1, 1]
  >>> l2 = [2, 2, 2, 2]
  >>> l1 + l2
  [1, 1, 1, 1, 2, 2, 2, 2]
  

  >>> l1 = [1, 1, 1, 1]
  >>> l2 = [2, 2, 2, 2]
  >>> l3 = []
  >>> for i, j in zip(l1, l2):
  ...     l3.append(i + j)
  ... 
  >>> l3
  [3, 3, 3, 3]

  >>> a1 = np.array(l1)
  >>> a2 = np.array(l2)
  >>> a1 + a2
  array([3, 3, 3, 3])
  >>> a3 = list(a1 + a2)
  >>> a3
  [3, 3, 3, 3]

Array Math

>>> l3
[3, 3, 3, 3]
>>> l3 * 3
[3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3]

>>> 
>>> result = [n*3 for n in l3]
>>> result
[9, 9, 9, 9]

>>> result_2 * 3
array([9, 9, 9, 9])
>>>
>>> list(result_2 * 3)
[9, 9, 9, 9]

Pandas


  $ cat sample_countries_gdp.py

  import pandas as pd

  ds = {
        2016: [3.4, 1.6, 2.9, 3.6],
        2017: [3.6, 1.7, 2.7, 3.4],
        2018: [3.9, 1.5, 2.9, 3.3]
  }

  countries = ["Norway", "India", "Argentina", "China"]

  df = pd.DataFrame(ds, index=countries)

  print("\n", df)

Pandas - DataFrame

Dictionary as DataFrame


  $ python3 sample_countries_gdp.py

             2016  2017  2018
  Australia   3.4   3.6   3.9
  India       1.6   1.7   1.5
  Argentina   2.9   2.7   2.9
  China       3.6   3.4   3.3

Pandas - DataFrame

Dictionary as DataFrame


  $ cat it_companies.csv 
  IBM,3.6,3.2,3.5
  Microsoft,3.4, 3.5,3.4
  Apple,2.8,3.9,3.7
  HP,3.8,3.3,3.7

Pandas - DataFrames

Read CSV


  $ cat read_csv.py
 
  import pandas as pd

  df = pd.read_csv("it_companies.csv", header=None)

  print(df)

Pandas - DataFrames

Read CSV


  $ python3 read_csv.py 
             0    1    2    3
  0        IBM  3.6  3.2  3.5
  1  Microsoft  3.4  3.5  3.4
  2      Apple  2.8  3.9  3.7
  3         HP  3.8  3.3  3.7

Pandas - DataFrames

Read CSV


  ### Before ######################################

             0    1    2    3
  0        IBM  3.6  3.2  3.5
  1  Microsoft  3.4  3.5  3.4
  2      Apple  2.8  3.9  3.7
  3         HP  3.8  3.3  3.7
  #################################################

  import pandas as pd

  df = pd.read_csv("it_companies.csv", header=None) <------- reading .CSV
  new_index = list(df[0])                                                         
  new_columns = [2016, 2017, 2018]                                                
  del df[0]                                                                       
  df.columns = new_columns                                                        
  df.index = new_index                                                            
  print(df)                                                                       


  ### After ######################################

             2016  2017  2018
  IBM         3.6   3.2   3.5
  Microsoft   3.4   3.5   3.4
  Apple       2.8   3.9   3.7
  HP          3.8   3.3   3.7
  ################################################

Pandas - DataFrames

Data Wrangling (pt.1)


  ### After ######################################
             2016  2017  2018
  IBM         3.6   3.2   3.5
  Microsoft   3.4   3.5   3.4
  Apple       2.8   3.9   3.7
  HP          3.8   3.3   3.7
  ################################################

  import pandas as pd

  df = pd.read_csv("it_companies.csv", header=None) <------- reading .CSV
  new_index = list(df[0])
  new_columns = [2016, 2017, 2018]                                                
  del df[0]                                                                       
  df.columns = new_columns
  df.index = new_index
  df.at['IBM', 2018] = 6.9                                                        
  df.at['Microsoft', 2018] = 2.8
  df.to_csv("it_companies_new.csv", header=None)    <------- saving to new .CSV
  print(df)                                                                       

  ### After ######################################
             2016  2017  2018
  IBM         3.6   3.2   6.9
  Microsoft   3.4   3.5   2.8
  Apple       2.8   3.9   3.7
  HP          3.8   3.3   3.7
  ################################################

Pandas - DataFrames

Data Wrangling (pt.2)

Pandas - DataFrames

Data Wrangling (pt.3)

DATOS.GOB.MX

<JALISCO>

Indicadores de Eficiencia por Escuela

Pandas - DataFrames

Data Wrangling (pt.3)

>>> df.head()
     CLAVE CT NOM TURNO     NOMBRE CT    ... REPROBACIÓN CON REGULARIZADOS EFICIENCIA TER..
0  14DJN0128O  MATUTINO   JUAN DE LAR..  ...           N.A.                N.A.
1  14DJN0129N  MATUTINO   JOSEFA DOMI..  ...           N.A.                N.A.
2  14DJN0130C  MATUTINO   ROSAURA ZAP..  ...           N.A.                N.A.q
3  14DJN0132A  MATUTINO   UNIDAD MODE..  ...           N.A.                N.A.
4  14DJN0133Z  MATUTINO   CIPRIANA GU..  ...           N.A.                N.A.
....

[14018 rows x 17 columns]


>>> df.columns
Index(['CLAVE CT', 'NOM TURNO', 'NOMBRE CT', 'DOMICILIO', 'LOCALIDAD INEGI',
       'NOMBRE LOCALIDAD', 'COLONIA', 'NOMBRE COLONIA',
       'MUNICIPIO CLAVE INEGI', 'NOMBRE MUNICIPIO', 'SOSTENIMIENTO', 'NIVEL',
       'PROGRAMA', 'DESERCIÓN INTRACURRICULAR', 'REPROBACIÓN',
       'REPROBACIÓN CON REGULARIZADOS', 'EFICIENCIA TERMINAL'],
      dtype='object')

Pandas - DataFrames

Data Wrangling (pt.3)


  >>> df["DESERCIÓN INTRACURRICULAR"].head()
  0    N.A.
  1    N.A.
  2    N.A.
  3    N.A.
  4    N.A.
  Name: DESERCIÓN INTRACURRICULAR, dtype: object

  >>> df["DESERCIÓN INTRACURRICULAR"].tail()
  14013    0.00%
  14014    0.00%
  14015    0.00%
  14016    0.00%
  14017    0.00%
  Name: DESERCIÓN INTRACURRICULAR, dtype: object

Pandas - DataFrames

Data Wrangling (pt.3)


  >>> dic_matutino = df.loc[df['NOM TURNO'] == "MATUTINO"]["DESERCIÓN INTRACURRICULAR"]
  >>> dic_matutino = [float(d.strip("%")) for d in dic_matutino if d != "N.A."]
  >>> dic_matutino_med = pd.Series(dic_matutino).mean()
  >>> dic_matutino
  3.420325159914712

  >>> dic_vespertino = df.loc[df['NOM TURNO'] == "VESPERTINO"]["DESERCIÓN INTRACURRICULAR"]
  >>> dic_vespertino = [float(d.strip("%")) for d in dic_vespertino if d != "N.A."]
  >>> dic_vespertino = pd.Series(dic_vespertino).mean()
  >>> dic_vespertino
  5.200691310609698
 
  >>> 
  >>> dic_nocturno = df.loc[df['NOM TURNO'] == "NOCTURNO"]["DESERCIÓN INTRACURRICULAR"]
  >>> dic_nocturno = [float(d.strip("%")) for d in dic_nocturno if d != "N.A."]
  >>> dic_nocturno = pd.Series(dic_nocturno).mean()
  >>> dic_nocturno
  16.28141176470588

Matplotlib


  import matplotlib.pyplot as plt       
                                                               
  years = [1950, 1960, 1970, 1980, 1990, 2000, 2010] 
  population = [2.584, 3.033, 3.700, 4.458, 5.331, 6.145, 6.958]
                                                               
  plt.plot(years, population)
                                                     
  plt.grid()     
  plt.title("World Population (1950-2010)") 
  plt.xlabel("Years")                         
  plt.ylabel("Population (billions)")       

  plt.show()

Matplotlib

  
  import matplotlib.pyplot as plt       
  import pandas as pd 
                              
  df = pd.read_csv("it_companies.csv", header=None)  
                                                                                                                                                              
  new_index = list(df[0])
  new_columns = [2013, 2014, 2015, 2016, 2017, 2018]   
  del df[0]                               
  df.columns = new_columns 
  df.index = new_index 
                                                   
  plt.plot(df.columns, df.iloc[0], label="IBM")         
  plt.plot(df.columns, df.iloc[1], label="Microsoft")    
  plt.plot(df.columns, df.iloc[2], label="Apple") 
  plt.plot(df.columns, df.iloc[3], label="HP") 
                               
  plt.ylim(0, 7)       
  plt.xlabel("Years")    
  plt.ylabel("Revenue in U$ (Bi)")  
  plt.title("Revenue x Years") 
  plt.legend() 
  plt.grid() 
                        
  plt.savefig('plot.png')  
  plt.show()

Matplotlib

Profile

Ivan Leon
Brazilian (Gaúcho)
Software Developer

https://xalapacode.com/directorio/ivanleoncz/

https://github.com/ivanlmj

https://stackoverflow.com/users/5780109/ivanleoncz?tab=profile

Thanks!

Bye!

Introduction to Pandas and Matplotlib

By Ivan Leon

Introduction to Pandas and Matplotlib

Key concepts, techniques and importance of both tools.

Ivan Leon

Software Developer

Travis Oliphant

NumPy Super Powers !!!

Range of Numbers

Array Math

Array Math

Pandas

DATOS.GOB.MX

<JALISCO>

Indicadores de Eficiencia por Escuela

Matplotlib

Matplotlib

Matplotlib

Matplotlib

Matplotlib

Thanks!

Bye!

Introduction to Pandas and Matplotlib

More from Ivan Leon