Pandas

Pandas es la librería por excelencia para el análisis de datos del lenguaje Python. Su nombre proviene de “panel data” (término econométrico). Inspirada en las funcionalidades de R, pero con el potencial de este lenguaje de propósito general.

Pandas incluye todas las funcionalidades necesarias para el proceso de análisis de datos: carga, filtrado, tratamiento, síntesis, agrupamiento, almacenamiento y visualización. Además, se integra con el resto de librerías de cálculo numérico como Numpy, Matplotlib, scikit-learn, … y de despliegue: HPC, Cloud, etc.

En resumen, es como una hoja de cálculo -por ejemplo excel- pero con más mucho más potencial!!!

Características principales

Todo el trabajo que realizaremos es sobre la estructura de datos básica: el dataFrame.

Un dataFrame es un objeto de dos dimensiones que contiene información. También puede verse como una hoja de cálculo, como una tabla de un modelo entidad-relación, o como una colección de una base de datos no relacional.

Documentación

[2]:

%pip install pandas

Requirement already satisfied: pandas in /Users/isaac/.pyenv/versions/3.11.0rc2/envs/my3110/lib/python3.11/site-packages (1.5.3)
Requirement already satisfied: python-dateutil>=2.8.1 in /Users/isaac/.pyenv/versions/3.11.0rc2/envs/my3110/lib/python3.11/site-packages (from pandas) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /Users/isaac/.pyenv/versions/3.11.0rc2/envs/my3110/lib/python3.11/site-packages (from pandas) (2023.3)
Requirement already satisfied: numpy>=1.21.0 in /Users/isaac/.pyenv/versions/3.11.0rc2/envs/my3110/lib/python3.11/site-packages (from pandas) (1.23.5)
Requirement already satisfied: six>=1.5 in /Users/isaac/.pyenv/versions/3.11.0rc2/envs/my3110/lib/python3.11/site-packages (from python-dateutil>=2.8.1->pandas) (1.16.0)

[notice] A new release of pip is available: 24.0 -> 24.3.1
[notice] To update, run: pip install --upgrade pip
Note: you may need to restart the kernel to use updated packages.

Importación de la libreria

import pandas as pd
import pandas
from pandas import *

Por convención se hace de la siguiente manera, todas las funciones de la libreria se tienen que llamar con el prefijo pd.*:

[1]:

import pandas as pd

Carga de datos

Vamos a aprender Pandas a través de una serie de proyectos y ejemplos. En esta primera fase, vamos a cargar datos de un fichero CSV, recordad que son ficheros donde los atributos/valores de una observación están separados por una coma y las observaciones se separan mediante un salto de línea.

Podemos descargar los datos con los que trabajaremos del siguiente enlace WHO dataset

Empezamos viendo como se carga un dataframe a partir de un fichero en formato CSV.

[2]:

df = pd.read_csv("data/WHO.csv")

[6]:

df

[6]:

	Country	CountryID	Continent	Adolescent fertility rate (%)	Adult literacy rate (%)	Gross national income per capita (PPP international $)	Net primary school enrolment ratio female (%)	Net primary school enrolment ratio male (%)	Population (in thousands) total	Population annual growth rate (%)	...	Total_CO2_emissions	Total_income	Total_reserves	Trade_balance_goods_and_services	Under_five_mortality_from_CME	Under_five_mortality_from_IHME	Under_five_mortality_rate	Urban_population	Urban_population_growth	Urban_population_pct_of_total
0	Afghanistan	1	1	151.0	28.0	NaN	NaN	NaN	26088.0	4.0	...	692.50	NaN	NaN	NaN	257.00	231.9	257.00	5740436.0	5.44	22.9
1	Albania	2	2	27.0	98.7	6000.0	93.0	94.0	3172.0	0.6	...	3499.12	4.790000e+09	78.14	-2.040000e+09	18.47	15.5	18.47	1431793.9	2.21	45.4
2	Algeria	3	3	6.0	69.9	5940.0	94.0	96.0	33351.0	1.5	...	137535.56	6.970000e+10	351.36	4.700000e+09	40.00	31.2	40.00	20800000.0	2.61	63.3
3	Andorra	4	2	NaN	NaN	NaN	83.0	83.0	74.0	1.0	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
4	Angola	5	3	146.0	67.4	3890.0	49.0	51.0	16557.0	2.8	...	8991.46	1.490000e+10	27.13	9.140000e+09	164.10	242.5	164.10	8578749.0	4.14	53.3
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
197	Vietnam	198	6	25.0	90.3	2310.0	91.0	96.0	86206.0	1.4	...	101826.23	4.480000e+10	47.11	-1.940000e+09	20.20	23.4	20.20	21900000.0	2.90	26.4
198	West Bank and Gaza	199	1	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	655.86	3.780000e+09	NaN	NaN	28.00	25.8	28.00	2596216.0	3.33	71.6
199	Yemen	200	1	83.0	54.1	2090.0	65.0	85.0	21732.0	3.0	...	20148.34	1.150000e+10	114.52	8.310000e+08	82.40	87.9	82.40	5759120.5	4.37	27.3
200	Zambia	201	3	161.0	68.0	1140.0	94.0	90.0	11696.0	1.9	...	2366.94	4.090000e+09	10.41	-4.470000e+08	175.30	163.8	175.30	4017411.0	1.95	35.0
201	Zimbabwe	202	3	101.0	89.5	NaN	88.0	87.0	13228.0	0.8	...	11457.33	5.620000e+09	3.39	-1.710000e+08	106.50	67.0	106.50	4709965.0	1.90	35.9

202 rows × 358 columns

A continuación se muestra la estructura interna del DataFrame. Se puede ver que és muy parecido a una tabla bidimensional:

[3]:

df

[3]:

	Country	CountryID	Continent	Adolescent fertility rate (%)	Adult literacy rate (%)	Gross national income per capita (PPP international $)	Net primary school enrolment ratio female (%)	Net primary school enrolment ratio male (%)	Population (in thousands) total	Population annual growth rate (%)	...	Total_CO2_emissions	Total_income	Total_reserves	Trade_balance_goods_and_services	Under_five_mortality_from_CME	Under_five_mortality_from_IHME	Under_five_mortality_rate	Urban_population	Urban_population_growth	Urban_population_pct_of_total
0	Afghanistan	1	1	151.0	28.0	NaN	NaN	NaN	26088.0	4.0	...	692.50	NaN	NaN	NaN	257.00	231.9	257.00	5740436.0	5.44	22.9
1	Albania	2	2	27.0	98.7	6000.0	93.0	94.0	3172.0	0.6	...	3499.12	4.790000e+09	78.14	-2.040000e+09	18.47	15.5	18.47	1431793.9	2.21	45.4
2	Algeria	3	3	6.0	69.9	5940.0	94.0	96.0	33351.0	1.5	...	137535.56	6.970000e+10	351.36	4.700000e+09	40.00	31.2	40.00	20800000.0	2.61	63.3
3	Andorra	4	2	NaN	NaN	NaN	83.0	83.0	74.0	1.0	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
4	Angola	5	3	146.0	67.4	3890.0	49.0	51.0	16557.0	2.8	...	8991.46	1.490000e+10	27.13	9.140000e+09	164.10	242.5	164.10	8578749.0	4.14	53.3
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
197	Vietnam	198	6	25.0	90.3	2310.0	91.0	96.0	86206.0	1.4	...	101826.23	4.480000e+10	47.11	-1.940000e+09	20.20	23.4	20.20	21900000.0	2.90	26.4
198	West Bank and Gaza	199	1	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	655.86	3.780000e+09	NaN	NaN	28.00	25.8	28.00	2596216.0	3.33	71.6
199	Yemen	200	1	83.0	54.1	2090.0	65.0	85.0	21732.0	3.0	...	20148.34	1.150000e+10	114.52	8.310000e+08	82.40	87.9	82.40	5759120.5	4.37	27.3
200	Zambia	201	3	161.0	68.0	1140.0	94.0	90.0	11696.0	1.9	...	2366.94	4.090000e+09	10.41	-4.470000e+08	175.30	163.8	175.30	4017411.0	1.95	35.0
201	Zimbabwe	202	3	101.0	89.5	NaN	88.0	87.0	13228.0	0.8	...	11457.33	5.620000e+09	3.39	-1.710000e+08	106.50	67.0	106.50	4709965.0	1.90	35.9

202 rows × 358 columns

df.shape

[3]:

df.shape

[3]:

(202, 358)

[4]:

df.columns

[4]:

Index(['Country', 'CountryID', 'Continent', 'Adolescent fertility rate (%)',
       'Adult literacy rate (%)',
       'Gross national income per capita (PPP international $)',
       'Net primary school enrolment ratio female (%)',
       'Net primary school enrolment ratio male (%)',
       'Population (in thousands) total', 'Population annual growth rate (%)',
       ...
       'Total_CO2_emissions', 'Total_income', 'Total_reserves',
       'Trade_balance_goods_and_services', 'Under_five_mortality_from_CME',
       'Under_five_mortality_from_IHME', 'Under_five_mortality_rate',
       'Urban_population', 'Urban_population_growth',
       'Urban_population_pct_of_total'],
      dtype='object', length=358)

[6]:

df.index

[6]:

RangeIndex(start=0, stop=202, step=1)

Atributos de un DataFrame

Un dataframe dispone de diferentes atributos con los que podemos obtener su información o metainformación. Los siguientes ejemplos muestran cómo se pueden consultar sus dimensiones o un listado del nombre de sus columnas:

[4]:

df.shape # Ver las dimensiones **

[4]:

(202, 358)

[11]:

df.columns[:5]

[11]:

Index(['Country', 'CountryID', 'Continent', 'Adolescent fertility rate (%)',
       'Adult literacy rate (%)'],
      dtype='object')

Podemos aplicar sobre el listado de columnas todas las operaciones sobre listas que hemos visto en la introducción del curso. A continuación tenemos dos ejemplos de indexación:

[12]:

df.columns[-2]

[12]:

'Urban_population_growth'

[13]:

df.columns[0:5]

[13]:

Index(['Country', 'CountryID', 'Continent', 'Adolescent fertility rate (%)',
       'Adult literacy rate (%)'],
      dtype='object')

[14]:

df.columns[-5:-1]

[14]:

Index(['Under_five_mortality_from_IHME', 'Under_five_mortality_rate',
       'Urban_population', 'Urban_population_growth'],
      dtype='object')

¿Cómo consultariais el nombre de la columna 10? ¿y los de las columnas de la 200 a la 225?

[12]:

len(df.columns)

[12]:

[8]:

df.columns[200:226]

df.columns[len(df.columns)-3:len(df.columns)]

[8]:

Index(['Urban_population', 'Urban_population_growth',
       'Urban_population_pct_of_total'],
      dtype='object')

Funciones descriptivas de un dataframe

Pandas ofrece una colección de funciones que permiten realizar una inspección general de la tabla de datos:

describe: muestra estadísticas descriptivas básicas para todas las columnas numéricas.
info: muestra todas las columnas y sus tipos de datos.
head i tail: muestra las $n$ primeras/últimas filas. El valor de $n$ es un parámetro de este método.

[15]:

df.describe()

[15]:

	CountryID	Continent	Adolescent fertility rate (%)	Adult literacy rate (%)	Gross national income per capita (PPP international $)	Net primary school enrolment ratio female (%)	Net primary school enrolment ratio male (%)	Population (in thousands) total	Population annual growth rate (%)	Population in urban areas (%)	...	Total_CO2_emissions	Total_income	Total_reserves	Trade_balance_goods_and_services	Under_five_mortality_from_CME	Under_five_mortality_from_IHME	Under_five_mortality_rate	Urban_population	Urban_population_growth	Urban_population_pct_of_total
count	202.000000	202.000000	177.000000	131.000000	178.000000	179.000000	179.000000	1.930000e+02	193.000000	193.000000	...	1.860000e+02	1.780000e+02	128.000000	1.710000e+02	181.000000	170.000000	181.000000	1.880000e+02	188.000000	188.000000
mean	101.500000	3.579208	59.457627	78.871756	11250.112360	84.033520	85.698324	3.409805e+04	1.297927	54.911917	...	1.483596e+05	2.015567e+11	57.253516	3.424012e+08	56.677624	54.356471	56.677624	1.665763e+07	2.165851	55.195213
std	58.456537	1.808263	49.105286	20.415760	12586.753417	17.788047	15.451212	1.304957e+05	1.163864	23.554182	...	6.133091e+05	9.400689e+11	138.669298	5.943043e+10	60.060929	61.160556	60.060929	5.094867e+07	1.596628	23.742122
min	1.000000	1.000000	0.000000	23.600000	260.000000	6.000000	11.000000	2.000000e+00	-2.500000	10.000000	...	2.565000e+01	5.190000e+07	0.990000	-7.140000e+11	2.900000	3.000000	2.900000	1.545600e+04	-1.160000	10.000000
25%	51.250000	2.000000	19.000000	68.400000	2112.500000	79.000000	79.500000	1.340000e+03	0.500000	36.000000	...	1.672615e+03	3.317500e+09	16.292500	-1.210000e+09	12.400000	8.475000	12.400000	9.171623e+05	1.105000	35.650000
50%	101.500000	3.000000	46.000000	86.500000	6175.000000	90.000000	90.000000	6.762000e+03	1.300000	57.000000	...	1.021157e+04	1.145000e+10	28.515000	-2.240000e+08	29.980000	27.600000	29.980000	3.427661e+06	1.945000	57.300000
75%	151.750000	5.000000	91.000000	95.300000	14502.500000	96.000000	96.000000	2.173200e+04	2.100000	73.000000	...	6.549217e+04	8.680000e+10	55.310000	1.024000e+09	88.700000	82.900000	88.700000	9.837113e+06	3.252500	72.750000
max	202.000000	7.000000	199.000000	99.800000	60870.000000	100.000000	100.000000	1.328474e+06	4.300000	100.000000	...	5.776432e+06	1.100000e+13	1334.860000	1.390000e+11	267.000000	253.700000	267.000000	5.270000e+08	7.850000	100.000000

8 rows × 357 columns

[13]:

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 202 entries, 0 to 201
Columns: 358 entries, Country to Urban_population_pct_of_total
dtypes: float64(355), int64(2), object(1)
memory usage: 565.1+ KB

[17]:

df.head(2)

[17]:

	Country	CountryID	Continent	Adolescent fertility rate (%)	Adult literacy rate (%)	Gross national income per capita (PPP international $)	Net primary school enrolment ratio female (%)	Net primary school enrolment ratio male (%)	Population (in thousands) total	Population annual growth rate (%)	...	Total_CO2_emissions	Total_income	Total_reserves	Trade_balance_goods_and_services	Under_five_mortality_from_CME	Under_five_mortality_from_IHME	Under_five_mortality_rate	Urban_population	Urban_population_growth	Urban_population_pct_of_total
0	Afghanistan	1	1	151.0	28.0	NaN	NaN	NaN	26088.0	4.0	...	692.50	NaN	NaN	NaN	257.00	231.9	257.00	5740436.0	5.44	22.9
1	Albania	2	2	27.0	98.7	6000.0	93.0	94.0	3172.0	0.6	...	3499.12	4.790000e+09	78.14	-2.040000e+09	18.47	15.5	18.47	1431793.9	2.21	45.4

2 rows × 358 columns

[21]:

df.tail(2)

[21]:

	Country	CountryID	Continent	Adolescent fertility rate (%)	Adult literacy rate (%)	Gross national income per capita (PPP international $)	Net primary school enrolment ratio female (%)	Net primary school enrolment ratio male (%)	Population (in thousands) total	Population annual growth rate (%)	...	Total_CO2_emissions	Total_income	Total_reserves	Trade_balance_goods_and_services	Under_five_mortality_from_CME	Under_five_mortality_from_IHME	Under_five_mortality_rate	Urban_population	Urban_population_growth	Urban_population_pct_of_total
200	Zambia	201	3	161.0	68.0	1140.0	94.0	90.0	11696.0	1.9	...	2366.94	4.090000e+09	10.41	-447000000.0	175.3	163.8	175.3	4017411.0	1.95	35.0
201	Zimbabwe	202	3	101.0	89.5	NaN	88.0	87.0	13228.0	0.8	...	11457.33	5.620000e+09	3.39	-171000000.0	106.5	67.0	106.5	4709965.0	1.90	35.9

2 rows × 358 columns

[20]:

df.tail(2)

[20]:

	Country	CountryID	Continent	Adolescent fertility rate (%)	Adult literacy rate (%)	Gross national income per capita (PPP international $)	Net primary school enrolment ratio female (%)	Net primary school enrolment ratio male (%)	Population (in thousands) total	Population annual growth rate (%)	...	Total_CO2_emissions	Total_income	Total_reserves	Trade_balance_goods_and_services	Under_five_mortality_from_CME	Under_five_mortality_from_IHME	Under_five_mortality_rate	Urban_population	Urban_population_growth	Urban_population_pct_of_total
200	Zambia	201	3	161.0	68.0	1140.0	94.0	90.0	11696.0	1.9	...	2366.94	4.090000e+09	10.41	-447000000.0	175.3	163.8	175.3	4017411.0	1.95	35.0
201	Zimbabwe	202	3	101.0	89.5	NaN	88.0	87.0	13228.0	0.8	...	11457.33	5.620000e+09	3.39	-171000000.0	106.5	67.0	106.5	4709965.0	1.90	35.9

2 rows × 358 columns

Carga de datos (segunda parte)

Desafortunadamente, la estructura y codificación de los datos en los archivos CSV varía según la herramienta o el sistema operativo. Por lo tanto, podemos encontrarnos con separadores entre columnas que no sean la típica coma (‘,’) o formatos de codificación de texto que no sean abiertos (por ejemplo, utf-8, ansi, etc.).

Es por esto que la función read_csv es muy versátil. Puedes consultar su documentación.

Vamos a ver qué sucede cuando se obtienen datos de una administración pública. Puedes encontrar el archivo disponible en: ‘data/presupuesto_gastos_2023.csv’. Puedes acceder a estos datos a través del siguiente enlace.

[20]:

df_gastos = pd.read_csv("data/presupuesto_gastos_2023.csv",encoding="cp1250",sep=";") # quins errors genera ?

[21]:

df_gastos

[21]:

	_id	SOZIETATEA_EU/SOCIEDAD_EU	SOZIETATEA_CAS/SOCIEDAD_CAS	EKITALDIA/EJERCICIO	SAILA/DEPARTAMENTO	SAILAREN DESKRIBAPENA_EU/DESCRIPCION DEPARTAMENTO_EU	SAILAREN DESKRIBAPENA_EU/DESCRIPCION DEPARTAMENTO_CAS	ZENTRU KUDEATZAILEA/CENTRO GESTOR	ZENTRO KUDEATZAILEAREN DESKR._EUS/DESCR. CENTRO GESTOR_EUS	ZENTRO KUDEATZAILEAREN DESKR._CAS/DESCR. CENTRO GESTOR_CAS	...	ARTIKULUAREN DESKRIBAPENA_CAS/DESCRIPCION ARTICULO_CAS	KONTZEPTUA/CONCEPTO	KONTZEPTUAREN DESKRIBAPENA_EUS/DESCRIPCION CONCEPTO_EUS	KONTZEPTUAREN DESKRIBAPENA_CAS/DESCRIPCION CONCEPTO_CAS	AZPIKONTZEPTUA/SUBCONCEPTO	AZPIKONTZEPTUAREN DESKRIBAPENA_EUS/DESCRIPCION SUBCONCEPTO_EUS	AZPIKONTZEPTUAREN DESKRIBAPENA_CAS/DESCRIPCION SUBCONCEPTO_CAS	PROIEKTUA/PROYECTO	PROIEKTUAREN DESKRIBAPENA/DESCRIPCION PROYECTO	HASIERAKO KREDITUA/CREDITO INICIAL 2023
0	1	UDAL	UDAL	2023	110	KULTURA ETA GOBERNANTZA	CULTURA Y GOBERNANZA	1100	KULTURA ETA GOBERNANTZA	CULTURA Y GOBERNANZA	...	RETRIBUCIONES DE ALTOS CARGOS	100	GOI KARGUEN OINARRIZKO SOLDATAK ETA BESTELAKO ...	RETRIBUCIONES BASICAS Y OTRAS REMUNERACIONES D...	10001	GOI KARGUEN OINARRIZKO SOLDATAK ETA BESTELAKO ...	RETRIBUCIONES BASICAS Y OTRAS REMUNERACIONES D...	9999/99999	GENERIKOA/ GENÉRICO	220.619,00
1	2	UDAL	UDAL	2023	110	KULTURA ETA GOBERNANTZA	CULTURA Y GOBERNANZA	1100	KULTURA ETA GOBERNANTZA	CULTURA Y GOBERNANZA	...	RETRIBUCIONES DEL PERSONAL EVENTUAL DE GABINETES	110	KABINETEETAKO ALDI BAT. PERTSONALAREN OINARRIZ...	RETRIBUCIONES BASICAS Y OTRAS REMUNERACIONES D...	11001	KABINETEETAKO ALDI BAT. PERTSONALAREN OINARRIZ...	RETRIBUCIONES BASICAS Y OTRAS REMUNERACIONES D...	9999/99999	GENERIKOA/ GENÉRICO	589.261,00
2	3	UDAL	UDAL	2023	110	KULTURA ETA GOBERNANTZA	CULTURA Y GOBERNANZA	1100	KULTURA ETA GOBERNANTZA	CULTURA Y GOBERNANZA	...	RETRIBUCIONES DEL PERSONAL FUNCIONARIO	120	FUNTZIONARIOEN OINARRIZKO SOLDATAK	RETRIBUCIONES BASICAS DEL PERSONAL FUNCIONARIO	12001	FUNTZIONARIOEN OINARRIZKO SOLDATAK	RETRIBUCIONES BASICAS DEL PERSONAL FUNCIONARIO	9999/99999	GENERIKOA/ GENÉRICO	383.369,00
3	4	UDAL	UDAL	2023	110	KULTURA ETA GOBERNANTZA	CULTURA Y GOBERNANZA	1100	KULTURA ETA GOBERNANTZA	CULTURA Y GOBERNANZA	...	RETRIBUCIONES DEL PERSONAL FUNCIONARIO	121	FUNTZIONARIOEN ORDAINSARI OSAGARRIAK	RETRIBUCIONES COMPLEMENTARIAS DEL PERSONAL FUN...	12101	LANTOKI OSAGARRIA	COMPLEMENTO DE DESTINO	9999/99999	GENERIKOA/ GENÉRICO	131.722,00
4	5	UDAL	UDAL	2023	110	KULTURA ETA GOBERNANTZA	CULTURA Y GOBERNANZA	1100	KULTURA ETA GOBERNANTZA	CULTURA Y GOBERNANZA	...	RETRIBUCIONES DEL PERSONAL FUNCIONARIO	121	FUNTZIONARIOEN ORDAINSARI OSAGARRIAK	RETRIBUCIONES COMPLEMENTARIAS DEL PERSONAL FUN...	12102	OSAGARRI BEREZIA	COMPLEMENTO ESPECIFICO	9999/99999	GENERIKOA/ GENÉRICO	396.609,00
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
2088	2089	UDAL	UDAL	2023	A10	ALKATETZAKO KABINETEA	GABINETE DE ALCALDÍA	A101	PROTOKOLOKO ETA HARREMAN PUBLIKOEN BULEGOA	OFICINA PROTOCOLO, RR.PP. E NSTITUCIONALES	...	MATERIAL, SUMINISTROS Y OTROS	227	KANPOKO ENPRESEK EGINDAKO LANAK	TRABAJOS REALIZADOS POR OTRAS EMPRESAS EXTERNAS	22717	BIDAI AGENTZIETAKO ZERBITZUAK	SERVICIOS DE AGENCIAS DE VIAJES	2006/00184	Harreman Publikoetako Kabineteari lotutako gas...	500,00
2089	2090	UDAL	UDAL	2023	A10	ALKATETZAKO KABINETEA	GABINETE DE ALCALDÍA	A101	PROTOKOLOKO ETA HARREMAN PUBLIKOEN BULEGOA	OFICINA PROTOCOLO, RR.PP. E NSTITUCIONALES	...	MATERIAL, SUMINISTROS Y OTROS	227	KANPOKO ENPRESEK EGINDAKO LANAK	TRABAJOS REALIZADOS POR OTRAS EMPRESAS EXTERNAS	22720	ITZULPEN ETA INTERPRETARITZA ZERBITZUAK	SERVICIOS DE TRADUCCION E INTERPRETES	2006/00182	Protokoloko eta ordezkaritzako arreta/Harreman...	2.000,00
2090	2091	UDAL	UDAL	2023	A10	ALKATETZAKO KABINETEA	GABINETE DE ALCALDÍA	A101	PROTOKOLOKO ETA HARREMAN PUBLIKOEN BULEGOA	OFICINA PROTOCOLO, RR.PP. E NSTITUCIONALES	...	MATERIAL, SUMINISTROS Y OTROS	227	KANPOKO ENPRESEK EGINDAKO LANAK	TRABAJOS REALIZADOS POR OTRAS EMPRESAS EXTERNAS	22723	ZERBITZUAK:ARGIAK ETA SOINUAK	SERVICIOS DE ILUMINACIÓN Y SONIDO	2006/00182	Protokoloko eta ordezkaritzako arreta/Harreman...	10.000,00
2091	2092	UDAL	UDAL	2023	A10	ALKATETZAKO KABINETEA	GABINETE DE ALCALDÍA	A101	PROTOKOLOKO ETA HARREMAN PUBLIKOEN BULEGOA	OFICINA PROTOCOLO, RR.PP. E NSTITUCIONALES	...	MATERIAL, SUMINISTROS Y OTROS	227	KANPOKO ENPRESEK EGINDAKO LANAK	TRABAJOS REALIZADOS POR OTRAS EMPRESAS EXTERNAS	22799	KANPOKO ENPRESEK EGINDAKO BESTELAKO LANAK	OTROS TRABAJOS REALIZADOS POR EMPRESAS EXTERNAS	2006/00182	Protokoloko eta ordezkaritzako arreta/Harreman...	48.400,00
2092	2093	UDAL	UDAL	2023	A10	ALKATETZAKO KABINETEA	GABINETE DE ALCALDÍA	A101	PROTOKOLOKO ETA HARREMAN PUBLIKOEN BULEGOA	OFICINA PROTOCOLO, RR.PP. E NSTITUCIONALES	...	INDEMNIZACIONES POR RAZON DEL SERVICIO	230	BIDAI SARIAK, PERTSONALAREN GARRAIOA ETA LOKOM...	DIETAS, LOCOMOCION Y TRASLADO DEL PERSONAL	23001	BIDAI SARIAK, PERTSONALAREN GARRAIOA ETA LOKOM...	DIETAS, LOCOMOCION Y TRASLADO DEL PERSONAL	2006/00184	Harreman Publikoetako Kabineteari lotutako gas...	500,00

2093 rows × 37 columns

Codificación de caracteres

Python utilitza una representación basada en Unicode (https://home.unicode.org/). Otros sistemas operativos y programas utilizan otro tipo de representaciones.

var = "camión"
var = "lul·lià"
var = "Ζεύς"
var = "ประเทศไทย"
var = "日本語で"

Codificaciones: - listado de codificaciones - UTF-8

[ ]:

# cp1250 | windows-1250 | Central and Eastern Europe
df_gastos = pd.read_csv("data/presupuesto_gastos_2023.csv",sep=";")  #Aun tenemos otro fallo, vamos a verlo

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
/Users/isaac/Projects/TxADM_notebooks/notebooks/Part2/00_Pandas/01_Introduccion.ipynb Cell 32 line <cell line: 3>()
      <a href='vscode-notebook-cell:/Users/isaac/Projects/TxADM_notebooks/notebooks/Part2/00_Pandas/01_Introduccion.ipynb#X35sZmlsZQ%3D%3D?line=0'>1</a> # cp1250 | windows-1250 | Central and Eastern Europe
----> <a href='vscode-notebook-cell:/Users/isaac/Projects/TxADM_notebooks/notebooks/Part2/00_Pandas/01_Introduccion.ipynb#X35sZmlsZQ%3D%3D?line=2'>3</a> df_gastos = pd.read_csv("data/presupuesto_gastos_2023.csv",sep=";")

File ~/.pyenv/versions/3.9.7/lib/python3.9/site-packages/pandas/util/_decorators.py:211, in deprecate_kwarg.<locals>._deprecate_kwarg.<locals>.wrapper(*args, **kwargs)
    209     else:
    210         kwargs[new_arg_name] = new_arg_value
--> 211 return func(*args, **kwargs)

File ~/.pyenv/versions/3.9.7/lib/python3.9/site-packages/pandas/util/_decorators.py:331, in deprecate_nonkeyword_arguments.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
    325 if len(args) > num_allow_args:
    326     warnings.warn(
    327         msg.format(arguments=_format_argument_list(allow_args)),
    328         FutureWarning,
    329         stacklevel=find_stack_level(),
    330     )
--> 331 return func(*args, **kwargs)

File ~/.pyenv/versions/3.9.7/lib/python3.9/site-packages/pandas/io/parsers/readers.py:950, in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, encoding_errors, dialect, error_bad_lines, warn_bad_lines, on_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options)
    935 kwds_defaults = _refine_defaults_read(
    936     dialect,
    937     delimiter,
   (...)
    946     defaults={"delimiter": ","},
    947 )
    948 kwds.update(kwds_defaults)
--> 950 return _read(filepath_or_buffer, kwds)

File ~/.pyenv/versions/3.9.7/lib/python3.9/site-packages/pandas/io/parsers/readers.py:605, in _read(filepath_or_buffer, kwds)
    602 _validate_names(kwds.get("names", None))
    604 # Create the parser.
--> 605 parser = TextFileReader(filepath_or_buffer, **kwds)
    607 if chunksize or iterator:
    608     return parser

File ~/.pyenv/versions/3.9.7/lib/python3.9/site-packages/pandas/io/parsers/readers.py:1442, in TextFileReader.__init__(self, f, engine, **kwds)
   1439     self.options["has_index_names"] = kwds["has_index_names"]
   1441 self.handles: IOHandles | None = None
-> 1442 self._engine = self._make_engine(f, self.engine)

File ~/.pyenv/versions/3.9.7/lib/python3.9/site-packages/pandas/io/parsers/readers.py:1753, in TextFileReader._make_engine(self, f, engine)
   1750     raise ValueError(msg)
   1752 try:
-> 1753     return mapping[engine](f, **self.options)
   1754 except Exception:
   1755     if self.handles is not None:

File ~/.pyenv/versions/3.9.7/lib/python3.9/site-packages/pandas/io/parsers/c_parser_wrapper.py:79, in CParserWrapper.__init__(self, src, **kwds)
     76     kwds.pop(key, None)
     78 kwds["dtype"] = ensure_dtype_objs(kwds.get("dtype", None))
---> 79 self._reader = parsers.TextReader(src, **kwds)
     81 self.unnamed_cols = self._reader.unnamed_cols
     83 # error: Cannot determine type of 'names'

File ~/.pyenv/versions/3.9.7/lib/python3.9/site-packages/pandas/_libs/parsers.pyx:547, in pandas._libs.parsers.TextReader.__cinit__()

File ~/.pyenv/versions/3.9.7/lib/python3.9/site-packages/pandas/_libs/parsers.pyx:636, in pandas._libs.parsers.TextReader._get_header()

File ~/.pyenv/versions/3.9.7/lib/python3.9/site-packages/pandas/_libs/parsers.pyx:852, in pandas._libs.parsers.TextReader._tokenize_rows()

File ~/.pyenv/versions/3.9.7/lib/python3.9/site-packages/pandas/_libs/parsers.pyx:1965, in pandas._libs.parsers.raise_parser_error()

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc1 in position 1697: invalid start byte

[ ]:

df_gastos

[6]:

df_gastos = pd.read_csv("data/presupuesto_gastos_2023.csv",delimiter="\t",encoding="cp1250")

Otras maneras de cargar datos

[22]:

# por la dirección del fichero en web
df_who = pd.read_csv("http://www.exploredata.net/ftp/WHO.csv") #dataframe

[23]:

df_who.head(2)

[23]:

	Country	CountryID	Continent	Adolescent fertility rate (%)	Adult literacy rate (%)	Gross national income per capita (PPP international $)	Net primary school enrolment ratio female (%)	Net primary school enrolment ratio male (%)	Population (in thousands) total	Population annual growth rate (%)	...	Total_CO2_emissions	Total_income	Total_reserves	Trade_balance_goods_and_services	Under_five_mortality_from_CME	Under_five_mortality_from_IHME	Under_five_mortality_rate	Urban_population	Urban_population_growth	Urban_population_pct_of_total
0	Afghanistan	1	1	151.0	28.0	NaN	NaN	NaN	26088.0	4.0	...	692.50	NaN	NaN	NaN	257.00	231.9	257.00	5740436.0	5.44	22.9
1	Albania	2	2	27.0	98.7	6000.0	93.0	94.0	3172.0	0.6	...	3499.12	4.790000e+09	78.14	-2.040000e+09	18.47	15.5	18.47	1431793.9	2.21	45.4

2 rows × 358 columns

[33]:

# extracción de contenido en una página web
url = "https://es.wikipedia.org/wiki/Anexo:Comunidades_y_ciudades_aut%C3%B3nomas_de_Espa%C3%B1a"

comunidades_esp = pd.io.html.read_html(url)
comunidades_esp[0] # Alerta!

[33]:

	Unnamed: 0	Nombre	Capital (de iure o en su defecto de facto)	Población (2022)	Porcentaje población	Densidad (hab./km²)	Superficie (km²)	Porcentaje superficie	Mapa	PIB per cápita en € (2021)
0	1	Andalucía	Sevilla	NaN	17,87 %	9705	NaN	17,31 %	NaN	NaN
1	2	Cataluña	Barcelona	NaN	16,38 %	23884	NaN	6,35 %	NaN	NaN
2	3	Comunidad de Madrid	Madrid	NaN	14,24 %	84115	NaN	1,59 %	NaN	NaN
3	4	Comunidad Valenciana	Valencia	NaN	10,67 %	21698	NaN	4,60 %	NaN	NaN
4	5	Galicia	Santiago de Compostela	NaN	5,68 %	9119	NaN	5,84 %	NaN	NaN
5	6	Castilla y León	Valladolid[nota 1]	NaN	5,02 %	2534	NaN	18,62 %	NaN	NaN
6	7	País Vasco	Vitoria[nota 1]	NaN	4,67 %	30213	NaN	1,43 %	NaN	NaN
7	8	Canarias	Las Palmas de Gran Canaria y Santa Cruz de Ten...	NaN	4,58 %	30139	[nota 3]	1,47 %	NaN	NaN
8	9	Castilla-La Mancha	Toledo[nota 1]	NaN	4,32 %	2579	NaN	15,70 %	NaN	NaN
9	10	Región de Murcia	Murcia	NaN	3,20 %	13374	NaN	2,24 %	NaN	NaN
10	11	Aragón	Zaragoza	NaN	2,79 %	2790	NaN	9,43 %	NaN	NaN
11	12	Islas Baleares	Palma	NaN	2,47 %	24427	NaN	0,99 %	NaN	NaN
12	13	Extremadura	Mérida	NaN	2,23 %	2541	NaN	8,23 %	NaN	NaN
13	14	Principado de Asturias	Oviedo	NaN	2,14 %	9553	NaN	2,13 %	NaN	NaN
14	15	Comunidad Foral de Navarra	Pamplona	NaN	1,39 %	6330	NaN	2,05 %	NaN	NaN
15	16	Cantabria	Santander	NaN	1,23 %	10974	NaN	1,05 %	NaN	NaN
16	17	La Rioja	Logroño	NaN	0,67 %	6267	NaN	1,00 %	NaN	NaN
17	18	Melilla	Melilla	NaN	0,18 %	700158	NaN	<0,01 %	NaN	NaN
18	19	Ceuta	Ceuta	NaN	0,18 %	417510	NaN	<0,01 %	NaN	NaN
19	TOTAL	España	Madrid	NaN	100 %	9367	NaN	100 %	—	NaN

[10]:

print(type(comunidades_esp[0]))
df_comunidades_esp = comunidades_esp[0]
df_comunidades_esp.head()

<class 'pandas.core.frame.DataFrame'>

[10]:

	Unnamed: 0	Nombre	Capital (de iure o en su defecto de facto)	Población (2022)	Porcentaje población	Densidad (hab./km²)	Superficie (km²)	Porcentaje superficie	Mapa	PIB per cápita en € (2021)
0	1	Andalucía	Sevilla	NaN	17,87 %	9705	NaN	17,31 %	NaN	NaN
1	2	Cataluña	Barcelona	NaN	16,38 %	23884	NaN	6,35 %	NaN	NaN
2	3	Comunidad de Madrid	Madrid	NaN	14,24 %	84115	NaN	1,59 %	NaN	NaN
3	4	Comunidad Valenciana	Valencia	NaN	10,67 %	21698	NaN	4,60 %	NaN	NaN
4	5	Galicia	Santiago de Compostela	NaN	5,68 %	9119	NaN	5,84 %	NaN	NaN

[34]:

# Contenido JSON
#Fuente: https://data.cityofnewyork.us/Environment/Urban-Park-Ranger-Animal-Condition-Response/fuhs-xmg2
url = 'https://data.cityofnewyork.us/resource/s3vf-x992.json'
df = pd.read_json(url)
df.head()

[34]:

	date_and_time_of_initial	date_and_time_of_ranger	borough	property	location	species_description	call_source	species_status	animal_condition	duration_of_response	...	_311sr_number	final_ranger_action	of_animals	pep_response	animal_monitored	police_response	esu_response	acc_intake_number	hours_spent_monitoring	rehabilitator
0	2021-06-23T16:45:00.000	2021-06-24T08:00:00.000	Brooklyn	Sternberg Park	Inside locked athletic field under construction	Chukar	Other	Exotic	Healthy	6.00	...	311-06712416	ACC	6	False	False	False	False	163537	NaN	NaN
1	2021-06-24T10:00:00.000	2021-06-24T11:00:00.000	Bronx	Haffen Park	Haffen Pool	Sparrow	Central	Native	Healthy	1.75	...	311-06714879	Rehabilitator	4	False	False	False	False	NaN	NaN	NaN
2	2021-06-23T14:30:00.000	2021-06-23T14:30:00.000	Bronx	Pelham Bay Park	Pelham Bay South	White-tailed Deer	Employee	Native	N/A	1.00	...	NaN	Unfounded	0	False	False	False	False	NaN	NaN	NaN
3	2021-06-23T13:00:00.000	2021-06-23T13:10:00.000	Staten Island	Willowbrook Park	The carousel	Raccoon	Employee	Native	N/A	2.00	...	NaN	Unfounded	0	False	False	False	False	NaN	NaN	NaN
4	2021-06-23T09:20:00.000	2021-06-23T09:20:00.000	Queens	Judge Moses Weinstein Playground	Garbage can	Virginia Opossum	Central	Native	Healthy	2.25	...	311-06699415	ACC	1	False	False	False	False	119833	NaN	NaN

5 rows × 22 columns

Actividades

En esta actividad practicaremos la carga de datos en diferentes formatos. En el mundo real, los datos no siempre tienen una estructura y un formato como desearíamos.

El objetivo es que analices la carga de estos datos con los datos originales: - ¿Qué dimensión tienen los datos reales y los cargados? - ¿Cuáles son las columnas? - ¿El concepto de columna como atributo o característica y el concepto de fila como muestra están presentes en la estructura de los datos? - ¿Coinciden con la información del archivo?

[13]:

# 1. Descarga y carga el siguiente fichero:
# https://data.cityofnewyork.us/Housing-Development/Speculation-Watch-List/adax-9mit

[14]:

# 2. Descarga y carga el siguiente fichero:
# https://ibestat.caib.es/ibestat/estadistiques/poblacio/naixements/414cab4f-b402-4cd1-af05-6617443de384

[15]:

# 3. Y ahora Descargalo y abrelo comprimido!!!. Es decir, sin descomprimir!
# https://ec.europa.eu/eurostat/databrowser/view/tin00171/default/table?lang=en
# Los archivos comprimidos en formato .gz se pueden abrir directamente como si fueran archivos de datos con Pandas, y en este caso, son archivos del tipo CSV. Es decir, no es necesario descomprimirlos.

Guardando un dataframe

Durante este curso aprenderemos a modificar los dataframes, agregaremos y eliminaremos columnas, y también modificaremos las que ya tenemos. Por lo tanto, después de realizar este trabajo, es necesario guardar los nuevos datos en un archivo.

[ ]:

df_gastos.to_csv('data/tmp_file.csv',encoding='utf-8') # guardant un dataframe en un fitxer, especificant el format

Estructura del dataframe

Ahora que ya sabemos cargar dataframes desde archivos, descubriremos cómo podemos acceder a la información que se encuentra dentro de los tipos de variables utilizados por Pandas.

Un dataframe tiene columnas y filas. Las filas son muestras y las columnas representan características de una muestra. Una columna es del tipo Serie.

El objetivo de esta unidad es adquirir herramientas para comprender y seleccionar los datos representados en un dataframe y una serie con Pandas.

Comenzaremos seleccionando columnas y obteniendo resúmenes estadísticos de ellas. Más adelante, pasaremos a realizar selecciones de filas en el dataframe. Finalmente, realizaremos selecciones combinadas creando nuestros propios dataframes a partir de los subconjuntos seleccionados.”

[3]:

import pandas as pd

df_who = pd.read_csv("http://www.exploredata.net/ftp/WHO.csv") #dataframe

[4]:

df_who.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 202 entries, 0 to 201
Columns: 358 entries, Country to Urban_population_pct_of_total
dtypes: float64(355), int64(2), object(1)
memory usage: 565.1+ KB

[5]:

df_who.describe()

[5]:

	CountryID	Continent	Adolescent fertility rate (%)	Adult literacy rate (%)	Gross national income per capita (PPP international $)	Net primary school enrolment ratio female (%)	Net primary school enrolment ratio male (%)	Population (in thousands) total	Population annual growth rate (%)	Population in urban areas (%)	...	Total_CO2_emissions	Total_income	Total_reserves	Trade_balance_goods_and_services	Under_five_mortality_from_CME	Under_five_mortality_from_IHME	Under_five_mortality_rate	Urban_population	Urban_population_growth	Urban_population_pct_of_total
count	202.000000	202.000000	177.000000	131.000000	178.000000	179.000000	179.000000	1.930000e+02	193.000000	193.000000	...	1.860000e+02	1.780000e+02	128.000000	1.710000e+02	181.000000	170.000000	181.000000	1.880000e+02	188.000000	188.000000
mean	101.500000	3.579208	59.457627	78.871756	11250.112360	84.033520	85.698324	3.409805e+04	1.297927	54.911917	...	1.483596e+05	2.015567e+11	57.253516	3.424012e+08	56.677624	54.356471	56.677624	1.665763e+07	2.165851	55.195213
std	58.456537	1.808263	49.105286	20.415760	12586.753417	17.788047	15.451212	1.304957e+05	1.163864	23.554182	...	6.133091e+05	9.400689e+11	138.669298	5.943043e+10	60.060929	61.160556	60.060929	5.094867e+07	1.596628	23.742122
min	1.000000	1.000000	0.000000	23.600000	260.000000	6.000000	11.000000	2.000000e+00	-2.500000	10.000000	...	2.565000e+01	5.190000e+07	0.990000	-7.140000e+11	2.900000	3.000000	2.900000	1.545600e+04	-1.160000	10.000000
25%	51.250000	2.000000	19.000000	68.400000	2112.500000	79.000000	79.500000	1.340000e+03	0.500000	36.000000	...	1.672615e+03	3.317500e+09	16.292500	-1.210000e+09	12.400000	8.475000	12.400000	9.171623e+05	1.105000	35.650000
50%	101.500000	3.000000	46.000000	86.500000	6175.000000	90.000000	90.000000	6.762000e+03	1.300000	57.000000	...	1.021157e+04	1.145000e+10	28.515000	-2.240000e+08	29.980000	27.600000	29.980000	3.427661e+06	1.945000	57.300000
75%	151.750000	5.000000	91.000000	95.300000	14502.500000	96.000000	96.000000	2.173200e+04	2.100000	73.000000	...	6.549217e+04	8.680000e+10	55.310000	1.024000e+09	88.700000	82.900000	88.700000	9.837113e+06	3.252500	72.750000
max	202.000000	7.000000	199.000000	99.800000	60870.000000	100.000000	100.000000	1.328474e+06	4.300000	100.000000	...	5.776432e+06	1.100000e+13	1334.860000	1.390000e+11	267.000000	253.700000	267.000000	5.270000e+08	7.850000	100.000000

8 rows × 357 columns

Columnes

Com hem comentat a l’introducció, començarem amb les columnes

[6]:

df_who.head()

[6]:

	Country	CountryID	Continent	Adolescent fertility rate (%)	Adult literacy rate (%)	Gross national income per capita (PPP international $)	Net primary school enrolment ratio female (%)	Net primary school enrolment ratio male (%)	Population (in thousands) total	Population annual growth rate (%)	...	Total_CO2_emissions	Total_income	Total_reserves	Trade_balance_goods_and_services	Under_five_mortality_from_CME	Under_five_mortality_from_IHME	Under_five_mortality_rate	Urban_population	Urban_population_growth	Urban_population_pct_of_total
0	Afghanistan	1	1	151.0	28.0	NaN	NaN	NaN	26088.0	4.0	...	692.50	NaN	NaN	NaN	257.00	231.9	257.00	5740436.0	5.44	22.9
1	Albania	2	2	27.0	98.7	6000.0	93.0	94.0	3172.0	0.6	...	3499.12	4.790000e+09	78.14	-2.040000e+09	18.47	15.5	18.47	1431793.9	2.21	45.4
2	Algeria	3	3	6.0	69.9	5940.0	94.0	96.0	33351.0	1.5	...	137535.56	6.970000e+10	351.36	4.700000e+09	40.00	31.2	40.00	20800000.0	2.61	63.3
3	Andorra	4	2	NaN	NaN	NaN	83.0	83.0	74.0	1.0	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
4	Angola	5	3	146.0	67.4	3890.0	49.0	51.0	16557.0	2.8	...	8991.46	1.490000e+10	27.13	9.140000e+09	164.10	242.5	164.10	8578749.0	4.14	53.3

5 rows × 358 columns

[8]:

df_who.shape

[8]:

(202, 358)

[9]:

# Columnas o características de cada muestra
print(df_who.columns)

Index(['Country', 'CountryID', 'Continent', 'Adolescent fertility rate (%)',
       'Adult literacy rate (%)',
       'Gross national income per capita (PPP international $)',
       'Net primary school enrolment ratio female (%)',
       'Net primary school enrolment ratio male (%)',
       'Population (in thousands) total', 'Population annual growth rate (%)',
       ...
       'Total_CO2_emissions', 'Total_income', 'Total_reserves',
       'Trade_balance_goods_and_services', 'Under_five_mortality_from_CME',
       'Under_five_mortality_from_IHME', 'Under_five_mortality_rate',
       'Urban_population', 'Urban_population_growth',
       'Urban_population_pct_of_total'],
      dtype='object', length=358)

[11]:

for ix, col in enumerate(df_who.columns): #label
    print("ix:%i\tlabel:%s"%(ix,col))

ix:0    label:Country
ix:1    label:CountryID
ix:2    label:Continent
ix:3    label:Adolescent fertility rate (%)
ix:4    label:Adult literacy rate (%)
ix:5    label:Gross national income per capita (PPP international $)
ix:6    label:Net primary school enrolment ratio female (%)
ix:7    label:Net primary school enrolment ratio male (%)
ix:8    label:Population (in thousands) total
ix:9    label:Population annual growth rate (%)
ix:10   label:Population in urban areas (%)
ix:11   label:Population living below the poverty line (% living on &lt; US$1 per day)
ix:12   label:Population median age (years)
ix:13   label:Population proportion over 60 (%)
ix:14   label:Population proportion under 15 (%)
ix:15   label:Registration coverage of births (%)
ix:16   label:Total fertility rate (per woman)
ix:17   label:Antenatal care coverage - at least four visits (%)
ix:18   label:Antiretroviral therapy coverage among HIV-infected pregt women for PMTCT (%)
ix:19   label:Antiretroviral therapy coverage among people with advanced HIV infections (%)
ix:20   label:Births attended by skilled health personnel (%)
ix:21   label:Births by caesarean section (%)
ix:22   label:Children aged 6-59 months who received vitamin A supplementation (%)
ix:23   label:Children aged &lt;5 years sleeping under insecticide-treated nets (%)
ix:24   label:Children aged &lt;5 years who received any antimalarial treatment for fever (%)
ix:25   label:Children aged &lt;5 years with ARI symptoms taken to facility (%)
ix:26   label:Children aged &lt;5 years with diarrhoea receiving ORT (%)
ix:27   label:Contraceptive prevalence (%)
ix:28   label:Neonates protected at birth against neonatal tetanus (PAB) (%)
ix:29   label:One-year-olds immunized with MCV
ix:30   label:One-year-olds immunized with three doses of diphtheria tetanus toxoid and pertussis (DTP3) (%)
ix:31   label:One-year-olds immunized with three doses of Hepatitis B (HepB3) (%)
ix:32   label:One-year-olds immunized with three doses of Hib (Hib3) vaccine (%)
ix:33   label:Tuberculosis detection rate under DOTS (%)
ix:34   label:Tuberculosis treatment success under DOTS (%)
ix:35   label:Women who have had mammography (%)
ix:36   label:Women who have had PAP smear (%)
ix:37   label:Community and traditional health workers density (per 10 000 population)
ix:38   label:Dentistry personnel density (per 10 000 population)
ix:39   label:Environment and public health workers density (per 10 000 population)
ix:40   label:External resources for health as percentage of total expenditure on health
ix:41   label:General government expenditure on health as percentage of total expenditure on health
ix:42   label:General government expenditure on health as percentage of total government expenditure
ix:43   label:Hospital beds (per 10 000 population)
ix:44   label:Laboratory health workers density (per 10 000 population)
ix:45   label:Number of community and traditional health workers
ix:46   label:Number of dentistry personnel
ix:47   label:Number of environment and public health workers
ix:48   label:Number of laboratory health workers
ix:49   label:Number of nursing and midwifery personnel
ix:50   label:Number of other health service providers
ix:51   label:Number of pharmaceutical personnel
ix:52   label:Number of physicians
ix:53   label:Nursing and midwifery personnel density (per 10 000 population)
ix:54   label:Other health service providers density (per 10 000 population)
ix:55   label:Out-of-pocket expenditure as percentage of private expenditure on health
ix:56   label:Per capita government expenditure on health (PPP int. $)
ix:57   label:Per capita government expenditure on health at average exchange rate (US$)
ix:58   label:Per capita total expenditure on health (PPP int. $)
ix:59   label:Per capita total expenditure on health at average exchange rate (US$)
ix:60   label:Pharmaceutical personnel density (per 10 000 population)
ix:61   label:Physicians density (per 10 000 population)
ix:62   label:Private expenditure on health as percentage of total expenditure on health
ix:63   label:Private prepaid plans as percentage of private expenditure on health
ix:64   label:Ratio of health management and support workers to health service providers
ix:65   label:Ratio of nurses and midwives to physicians
ix:66   label:Social security expenditure on health as percentage of general government expenditure on health
ix:67   label:Total expenditure on health as percentage of gross domestic product
ix:68   label:Births attended by skilled health personnel (%) highest educational level of mother
ix:69   label:Births attended by skilled health personnel (%) highest wealth quintile
ix:70   label:Births attended by skilled health personnel (%) lowest educational level of mother
ix:71   label:Births attended by skilled health personnel (%) lowest wealth quintile
ix:72   label:Births attended by skilled health personnel (%) rural
ix:73   label:Births attended by skilled health personnel (%) urban
ix:74   label:Births attended by skilled health personnel difference highest lowest educational level of mother
ix:75   label:Births attended by skilled health personnel difference highest-lowest wealth quintile
ix:76   label:Births attended by skilled health personnel difference urban-rural
ix:77   label:Births attended by skilled health personnel ratio highest-lowest educational level of mother
ix:78   label:Births attended by skilled health personnel ratio highest-lowest wealth quintile
ix:79   label:Births attended by skilled health personnel ratio urban-rural
ix:80   label:Measles immunization coverage among one-year-olds (%) highest educational level of mother
ix:81   label:Measles immunization coverage among one-year-olds (%) highest wealth quintile
ix:82   label:Measles immunization coverage among one-year-olds (%) lowest educational level of mother
ix:83   label:Measles immunization coverage among one-year-olds (%) lowest wealth quintile
ix:84   label:Measles immunization coverage among one-year-olds (%) rural
ix:85   label:Measles immunization coverage among one-year-olds (%) urban
ix:86   label:Measles immunization coverage among one-year-olds difference highest-lowest educational level of mother
ix:87   label:Measles immunization coverage among one-year-olds difference highest-lowest wealth quintile
ix:88   label:Measles immunization coverage among one-year-olds difference urban-rural
ix:89   label:Measles immunization coverage among one-year-olds ratio highest-lowest educational level of mother
ix:90   label:Measles immunization coverage among one-year-olds ratio highest-lowest wealth quintile
ix:91   label:Measles immunization coverage among one-year-olds ratio urban-rural
ix:92   label:Under-5 mortality rate (Probability of dying aged &lt; 5 years per 1 000 live births) difference lowest-highest educational level of mother
ix:93   label:Under-5 mortality rate (Probability of dying aged &lt; 5 years per 1 000 live births) difference lowest-highest wealth quintile
ix:94   label:Under-5 mortality rate (Probability of dying aged &lt; 5 years per 1 000 live births) difference rural-urban
ix:95   label:Under-5 mortality rate (Probability of dying aged &lt; 5 years per 1 000 live births) highest educational level of mother
ix:96   label:Under-5 mortality rate (Probability of dying aged &lt; 5 years per 1 000 live births) highest wealth quintile
ix:97   label:Under-5 mortality rate (Probability of dying aged &lt; 5 years per 1 000 live births) lowest educational level of mother
ix:98   label:Under-5 mortality rate (Probability of dying aged &lt; 5 years per 1 000 live births) lowest wealth quintile
ix:99   label:Under-5 mortality rate (Probability of dying aged &lt; 5 years per 1 000 live births) ratio lowest-highest educational level of mother
ix:100  label:Under-5 mortality rate (Probability of dying aged &lt; 5 years per 1 000 live births) ratio lowest-highest wealth quintile
ix:101  label:Under-5 mortality rate (Probability of dying aged &lt; 5 years per 1 000 live births) ratio rural-urban
ix:102  label:Under-5 mortality rate (Probability of dying aged &lt; 5 years per 1 000 live births) rural
ix:103  label:Under-5 mortality rate (Probability of dying aged &lt; 5 years per 1 000 live births) urban
ix:104  label:Adult mortality rate (probability of dying between 15 to 60 years per 1000 population) both sexes
ix:105  label:Adult mortality rate (probability of dying between 15 to 60 years per 1000 population) female
ix:106  label:Adult mortality rate (probability of dying between 15 to 60 years per 1000 population) male
ix:107  label:Age-standardized mortality rate for cancer (per 100 000 population)
ix:108  label:Age-standardized mortality rate for cardiovascular diseases (per 100 000 population)
ix:109  label:Age-standardized mortality rate for injuries (per 100 000 population)
ix:110  label:Age-standardized mortality rate for non-communicable diseases (per 100 000 population)
ix:111  label:Deaths among children under five years of age due to diarrhoeal diseases (%)
ix:112  label:Deaths among children under five years of age due to HIV/AIDS (%)
ix:113  label:Deaths among children under five years of age due to injuries (%)
ix:114  label:Deaths among children under five years of age due to malaria (%)
ix:115  label:Deaths among children under five years of age due to measles (%)
ix:116  label:Deaths among children under five years of age due to neonatal causes (%)
ix:117  label:Deaths among children under five years of age due to other causes (%)
ix:118  label:Deaths among children under five years of age due to pneumonia (%)
ix:119  label:Deaths due to HIV/AIDS (per 100 000 population per year)
ix:120  label:Deaths due to tuberculosis among HIV-negative people (per 100 000 population)
ix:121  label:Deaths due to tuberculosis among HIV-positive people (per 100 000 population)
ix:122  label:Healthy life expectancy (HALE) at birth (years) both sexes
ix:123  label:Healthy life expectancy (HALE) at birth (years) female
ix:124  label:Healthy life expectancy (HALE) at birth (years) male
ix:125  label:Incidence of tuberculosis (per 100 000 population per year)
ix:126  label:Infant mortality rate (per 1 000 live births) both sexes
ix:127  label:Infant mortality rate (per 1 000 live births) female
ix:128  label:Infant mortality rate (per 1 000 live births) male
ix:129  label:Life expectancy at birth (years) both sexes
ix:130  label:Life expectancy at birth (years) female
ix:131  label:Life expectancy at birth (years) male
ix:132  label:Maternal mortality ratio (per 100 000 live births)
ix:133  label:Neonatal mortality rate (per 1 000 live births)
ix:134  label:Number of confirmed poliomyelitis cases
ix:135  label:Prevalence of HIV among adults aged &gt;=15 years (per 100 000 population)
ix:136  label:Prevalence of tuberculosis (per 100 000 population)
ix:137  label:Under-5 mortality rate (probability of dying by age 5 per 1000 live births) both sexes
ix:138  label:Under-5 mortality rate (probability of dying by age 5 per 1000 live births) female
ix:139  label:Under-5 mortality rate (probability of dying by age 5 per 1000 live births) male
ix:140  label:Years of life lost to communicable diseases (%)
ix:141  label:Years of life lost to injuries (%)
ix:142  label:Years of life lost to non-communicable diseases (%)
ix:143  label:Children under five years of age overweight for age (%)
ix:144  label:Children under five years of age stunted for age (%)
ix:145  label:Children under five years of age underweight for age (%)
ix:146  label:Newborns with low birth weight (%)
ix:147  label:Per capita recorded alcohol consumption (litres of pure alcohol) among adults (&gt;=15 years)
ix:148  label:Population using solid fuels (%) rural
ix:149  label:Population using solid fuels (%) urban
ix:150  label:Population with sustainable access to improved drinking water sources (%) rural
ix:151  label:Population with sustainable access to improved drinking water sources (%) total
ix:152  label:Population with sustainable access to improved drinking water sources (%) urban
ix:153  label:Population with sustainable access to improved sanitation (%) rural
ix:154  label:Population with sustainable access to improved sanitation (%) total
ix:155  label:Population with sustainable access to improved sanitation (%) urban
ix:156  label:Prevalence of adults (&gt;=15 years) who are obese (%) female
ix:157  label:Prevalence of adults (&gt;=15 years) who are obese (%) male
ix:158  label:Prevalence of condom use by young people (15-24 years) at higher risk sex (%) female
ix:159  label:Prevalence of condom use by young people (15-24 years) at higher risk sex (%) male
ix:160  label:Prevalence of current tobacco use among adolescents (13-15 years) (%) both sexes
ix:161  label:Prevalence of current tobacco use among adolescents (13-15 years) (%) female
ix:162  label:Prevalence of current tobacco use among adolescents (13-15 years) (%) male
ix:163  label:Prevalence of current tobacco use among adults (&gt;=15 years) (%) both sexes
ix:164  label:Prevalence of current tobacco use among adults (&gt;=15 years) (%) female
ix:165  label:Prevalence of current tobacco use among adults (&gt;=15 years) (%) male
ix:166  label:Adolescent_fertility_rate
ix:167  label:Agricultural_land
ix:168  label:Agriculture_contribution_to_economy
ix:169  label:Aid_given
ix:170  label:Aid_received
ix:171  label:Aid_received_total
ix:172  label:All_forms_of_TB_new_cases_per_100_000_estimated
ix:173  label:All_forms_of_TB_new_cases_per_100_000_reported
ix:174  label:Annual_freshwater_withdrawals_total
ix:175  label:Arms_exports
ix:176  label:Arms_imports
ix:177  label:Bad_teeth_per_child
ix:178  label:Births_attended_by_skilled_health_staff
ix:179  label:Breast_cancer_deaths_per_100_000_women
ix:180  label:Breast_cancer_new_cases_per_100_000_women
ix:181  label:Breast_cancer_number_of_female_deaths
ix:182  label:Breast_cancer_number_of_new_female_cases
ix:183  label:Broadband_subscribers
ix:184  label:Broadband_subscribers_per_100_people
ix:185  label:CO2_emissions
ix:186  label:CO2_intensity_of_economic_output
ix:187  label:Capital_formation
ix:188  label:Cell_phones_per_100_people
ix:189  label:Cell_phones_total
ix:190  label:Central_bank_discount_rate
ix:191  label:Cervical_cancer_deaths_per_100_000_women
ix:192  label:Cervical_cancer_new_cases_per_100_000_women
ix:193  label:Cervical_cancer_number_of_female_deaths
ix:194  label:Cervical_cancer_number_of_new_female_cases
ix:195  label:Children_and_elderly
ix:196  label:Children_out_of_school_primary
ix:197  label:Children_out_of_school_primary_female
ix:198  label:Children_out_of_school_primary_male
ix:199  label:Children_per_woman
ix:200  label:Coal_consumption
ix:201  label:Coal_consumption_per_person
ix:202  label:Coal_production
ix:203  label:Coal_production_per_person
ix:204  label:Colon_and_Rectum_cancer_deaths_per_100_000_men
ix:205  label:Colon_and_Rectum_cancer_deaths_per_100_000_women
ix:206  label:Colon_and_Rectum_cancer_new_cases_per_100_000_men
ix:207  label:Colon_and_Rectum_cancer_new_cases_per_100_000_women
ix:208  label:Colon_and_Rectum_cancer_number_of_female_deaths
ix:209  label:Colon_and_Rectum_cancer_number_of_male_deaths
ix:210  label:Colon_and_Rectum_cancer_number_of_new_female_cases
ix:211  label:Colon_and_Rectum_cancer_number_of_new_male_cases
ix:212  label:Consumer_price_index
ix:213  label:Contraceptive_use
ix:214  label:Deaths_from_TB_per_100_000_estimated
ix:215  label:Debt_servicing_costs
ix:216  label:Democracy_score
ix:217  label:Electric_power_consumption
ix:218  label:Electricity_generation
ix:219  label:Electricity_generation_per_person
ix:220  label:Energy_use
ix:221  label:Expenditure_per_student_primary
ix:222  label:Expenditure_per_student_secondary
ix:223  label:Expenditure_per_student_tertiary
ix:224  label:Exports_of_goods_and_services
ix:225  label:Exports_unit_value
ix:226  label:External_debt_total_DOD_current_USdollars
ix:227  label:External_debt_total_pct_of_GNI
ix:228  label:Female_labour_force
ix:229  label:Fixed_line_and_mobile_phone_subscribers
ix:230  label:Foreign_direct_investment_net_inflows
ix:231  label:Foreign_direct_investment_net_outflows
ix:232  label:Forest_area
ix:233  label:Gross_capital_formation
ix:234  label:HIV_infected
ix:235  label:Health_expenditure_per_person
ix:236  label:Health_expenditure_private
ix:237  label:Health_expenditure_public_pct_of_GDP
ix:238  label:Health_expenditure_public_pct_of_government_expenditure
ix:239  label:Health_expenditure_public_pct_of_total_health_expenditure
ix:240  label:Health_expenditure_total
ix:241  label:High_technology_exports
ix:242  label:Hydroelectricity_consumption
ix:243  label:Hydroelectricity_consumption_per_person
ix:244  label:Imports_of_goods_and_services
ix:245  label:Imports_unit_value
ix:246  label:Improved_sanitation_facilities_urban
ix:247  label:Improved_water_source
ix:248  label:Income_growth
ix:249  label:Income_per_person
ix:250  label:Income_share_held_by_lowest_20pct
ix:251  label:Industry_contribution_to_economy
ix:252  label:Inequality_index
ix:253  label:Infant_mortality_rate
ix:254  label:Infectious_TB_new_cases_per_100_000_estimated
ix:255  label:Infectious_TB_new_cases_per_100_000_reported
ix:256  label:Infectious_TB_treatment_completeness
ix:257  label:Inflation_GDP_deflator
ix:258  label:Internet_users
ix:259  label:Life_expectancy_at_birth
ix:260  label:Literacy_rate_adult_female
ix:261  label:Literacy_rate_adult_male
ix:262  label:Literacy_rate_adult_total
ix:263  label:Literacy_rate_youth_female
ix:264  label:Literacy_rate_youth_male
ix:265  label:Literacy_rate_youth_total
ix:266  label:Liver_cancer_deaths_per_100_000_men
ix:267  label:Liver_cancer_deaths_per_100_000_women
ix:268  label:Liver_cancer_new_cases_per_100_000_men
ix:269  label:Liver_cancer_new_cases_per_100_000_women
ix:270  label:Liver_cancer_number_of_female_deaths
ix:271  label:Liver_cancer_number_of_male_deaths
ix:272  label:Liver_cancer_number_of_new_female_cases
ix:273  label:Liver_cancer_number_of_new_male_cases
ix:274  label:Lung_cancer_deaths_per_100_000_men
ix:275  label:Lung_cancer_deaths_per_100_000_women
ix:276  label:Lung_cancer_new_cases_per_100_000_men
ix:277  label:Lung_cancer_new_cases_per_100_000_women
ix:278  label:Lung_cancer_number_of_female_deaths
ix:279  label:Lung_cancer_number_of_male_deaths
ix:280  label:Lung_cancer_number_of_new_female_cases
ix:281  label:Lung_cancer_number_of_new_male_cases
ix:282  label:Malaria_prevention_insecticide_treated_bed_nets_usage
ix:283  label:Malaria_treatment
ix:284  label:Malnutrition_weight_for_age
ix:285  label:Market_value_of_listed_companies
ix:286  label:Maternal_mortality
ix:287  label:Math_achievement_4th_grade
ix:288  label:Math_achievement_8th_grade
ix:289  label:Measles_immunization
ix:290  label:Medical_Doctors
ix:291  label:Merchandise_trade
ix:292  label:Military_expenditure
ix:293  label:Natural_gas_consumption
ix:294  label:Natural_gas_consumption_per_person
ix:295  label:Natural_gas_production
ix:296  label:Natural_gas_production_per_person
ix:297  label:Natural_gas_proved_reserves
ix:298  label:Natural_gas_proven_reserves_per_person
ix:299  label:Net_barter_terms_of_trade
ix:300  label:Nuclear_consumption
ix:301  label:Nuclear_consumption_per_person
ix:302  label:Number_of_deaths_from_TB_estimated
ix:303  label:Number_of_existing_TB_cases_estimated
ix:304  label:Oil_consumption
ix:305  label:Oil_consumption_per_person
ix:306  label:Oil_production
ix:307  label:Oil_production_per_person
ix:308  label:Oil_proved_reserves
ix:309  label:Oil_proven_reserves_per_person
ix:310  label:Old_version_of_Income_per_person
ix:311  label:Patent_applications
ix:312  label:Patents_granted
ix:313  label:Patents_in_force
ix:314  label:People_living_with_HIV
ix:315  label:Personal_computers_per_100_people
ix:316  label:Personal_computers_total
ix:317  label:Population_growth
ix:318  label:Population_in_urban_agglomerations_more_than_1_million
ix:319  label:Population_total
ix:320  label:Poverty_headcount_ratio_at_national_poverty_line
ix:321  label:Present_value_of_debt
ix:322  label:Primary_completion_rate_total
ix:323  label:Primary_energy_consumption
ix:324  label:Primary_energy_consumption_per_person
ix:325  label:Primary_school_completion_pct_of_boys
ix:326  label:Primary_school_completion_pct_of_girls
ix:327  label:Prostate_cancer_deaths_per_100_000_men
ix:328  label:Prostate_cancer_new_cases_per_100_000_men
ix:329  label:Prostate_cancer_number_of_male_deaths
ix:330  label:Prostate_cancer_number_of_new_male_cases
ix:331  label:Pump_price_for_gasoline
ix:332  label:Ratio_of_girls_to_boys_in_primary_and_secondary_education
ix:333  label:Ratio_of_young_literate_females_to_males
ix:334  label:Roads_paved
ix:335  label:SO2_emissions_per_person
ix:336  label:Services_contribution_to_economy
ix:337  label:Stomach_cancer_deaths_per_100_000_men
ix:338  label:Stomach_cancer_deaths_per_100_000_women
ix:339  label:Stomach_cancer_new_cases_per_100_000_men
ix:340  label:Stomach_cancer_new_cases_per_100_000_women
ix:341  label:Stomach_cancer_number_of_female_deaths
ix:342  label:Stomach_cancer_number_of_male_deaths
ix:343  label:Stomach_cancer_number_of_new_female_cases
ix:344  label:Stomach_cancer_number_of_new_male_cases
ix:345  label:Sugar_per_person
ix:346  label:Surface_area
ix:347  label:Tax_revenue
ix:348  label:Total_CO2_emissions
ix:349  label:Total_income
ix:350  label:Total_reserves
ix:351  label:Trade_balance_goods_and_services
ix:352  label:Under_five_mortality_from_CME
ix:353  label:Under_five_mortality_from_IHME
ix:354  label:Under_five_mortality_rate
ix:355  label:Urban_population
ix:356  label:Urban_population_growth
ix:357  label:Urban_population_pct_of_total

Si inspeccionamos y comparamos los tipos del dataframe y de las columnas…

[12]:

type(df_who)

[12]:

pandas.core.frame.DataFrame

[13]:

type(df_who.columns)

[13]:

pandas.core.indexes.base.Index

Podemos utilizar el nombre de una columna para obtener los datos de dicha columna, tal como lo hacíamos con un diccionario python. Veremos dos maneras diferentes de hacerlo:

[15]:

paises = df_who["Country"]
paises

[15]:

0             Afghanistan
1                 Albania
2                 Algeria
3                 Andorra
4                  Angola
              ...
197               Vietnam
198    West Bank and Gaza
199                 Yemen
200                Zambia
201              Zimbabwe
Name: Country, Length: 202, dtype: object

¿Qué tipo de datos es una columna?

[45]:

type(df_who["Country"])

[45]:

pandas.core.series.Series

Las Series són la otra estructura básica de Pandas. Las filas y las columnas se estructuran en Series, se pueden ver cómo un tipo de lista que solamente puede contener un único tipo de datos, acepta operaciones vectoriales y se puede indexar de manera similar a un diccionario.

Una vez que seleccionamos una columna, podemos acceder a sus elementos como si fueran una lista:

[46]:

print(df_who["Country"][0])
print("-"*30)
print(df_who["Country"][:5]) # slicing
print("-"*30)
print(df_who["Country"].values)

Afghanistan
------------------------------
0    Afghanistan
1        Albania
2        Algeria
3        Andorra
4         Angola
Name: Country, dtype: object
------------------------------
['Afghanistan' 'Albania' 'Algeria' 'Andorra' 'Angola'
 'Antigua and Barbuda' 'Argentina' 'Armenia' 'Australia' 'Austria'
 'Azerbaijan' 'Bahamas' 'Bahrain' 'Bangladesh' 'Barbados' 'Belarus'
 'Belgium' 'Belize' 'Benin' 'Bermuda' 'Bhutan' 'Bolivia'
 'Bosnia and Herzegovina' 'Botswana' 'Brazil' 'Brunei Darussalam'
 'Bulgaria' 'Burkina Faso' 'Burundi' 'Cambodia' 'Cameroon' 'Canada'
 'Cape Verde' 'Central African Republic' 'Chad' 'Chile' 'China' 'Colombia'
 'Comoros' 'Congo, Dem. Rep.' 'Congo, Rep.' 'Cook Islands' 'Costa Rica'
 "Cote d'Ivoire" 'Croatia' 'Cuba' 'Cyprus' 'Czech Republic' 'Denmark'
 'Djibouti' 'Dominica' 'Dominican Republic' 'Ecuador' 'Egypt'
 'El Salvador' 'Equatorial Guinea' 'Eritrea' 'Estonia' 'Ethiopia' 'Fiji'
 'Finland' 'France' 'French Polynesia' 'Gabon' 'Gambia' 'Georgia'
 'Germany' 'Ghana' 'Greece' 'Grenada' 'Guatemala' 'Guinea' 'Guinea-Bissau'
 'Guyana' 'Haiti' 'Honduras' 'Hong Kong, China' 'Hungary' 'Iceland'
 'India' 'Indonesia' 'Iran (Islamic Republic of)' 'Iraq' 'Ireland'
 'Israel' 'Italy' 'Jamaica' 'Japan' 'Jordan' 'Kazakhstan' 'Kenya'
 'Kiribati' 'Korea, Dem. Rep.' 'Korea, Rep.' 'Kuwait' 'Kyrgyzstan'
 "Lao People's Democratic Republic" 'Latvia' 'Lebanon' 'Lesotho' 'Liberia'
 'Libyan Arab Jamahiriya' 'Lithuania' 'Luxembourg' 'Macao, China'
 'Macedonia' 'Madagascar' 'Malawi' 'Malaysia' 'Maldives' 'Mali' 'Malta'
 'Marshall Islands' 'Mauritania' 'Mauritius' 'Mexico'
 'Micronesia (Federated States of)' 'Moldova' 'Monaco' 'Mongolia'
 'Montenegro' 'Morocco' 'Mozambique' 'Myanmar' 'Namibia' 'Nauru' 'Nepal'
 'Netherlands' 'Netherlands Antilles' 'New Caledonia' 'New Zealand'
 'Nicaragua' 'Niger' 'Nigeria' 'Niue' 'Norway' 'Oman' 'Pakistan' 'Palau'
 'Panama' 'Papua New Guinea' 'Paraguay' 'Peru' 'Philippines' 'Poland'
 'Portugal' 'Puerto Rico' 'Qatar' 'Romania' 'Russia' 'Rwanda'
 'Saint Kitts and Nevis' 'Saint Lucia' 'Saint Vincent and the Grenadines'
 'Samoa' 'San Marino' 'Sao Tome and Principe' 'Saudi Arabia' 'Senegal'
 'Serbia' 'Seychelles' 'Sierra Leone' 'Singapore' 'Slovakia' 'Slovenia'
 'Solomon Islands' 'Somalia' 'South Africa' 'Spain' 'Sri Lanka' 'Sudan'
 'Suriname' 'Swaziland' 'Sweden' 'Switzerland' 'Syria' 'Taiwan'
 'Tajikistan' 'Tanzania' 'Thailand' 'Timor-Leste' 'Togo' 'Tonga'
 'Trinidad and Tobago' 'Tunisia' 'Turkey' 'Turkmenistan' 'Tuvalu' 'Uganda'
 'Ukraine' 'United Arab Emirates' 'United Kingdom'
 'United States of America' 'Uruguay' 'Uzbekistan' 'Vanuatu' 'Venezuela'
 'Vietnam' 'West Bank and Gaza' 'Yemen' 'Zambia' 'Zimbabwe']

Existe otra manera más sencilla de seleccionar una única columna, pero existen nombres de columna muy largos: ‘Children aged <5 years who received any antimalarial treatment for fever (%)’.

Nota: En la creación de documentos, es importante usar nombres adecuados para las columnas.

[47]:

df_who.Country

[47]:

0             Afghanistan
1                 Albania
2                 Algeria
3                 Andorra
4                  Angola
              ...
197               Vietnam
198    West Bank and Gaza
199                 Yemen
200                Zambia
201              Zimbabwe
Name: Country, Length: 202, dtype: object

[21]:

print(df_who.columns[9])
print("-"*30)
print(df_who[df_who.columns[9]])

Population annual growth rate (%)
------------------------------
0      4.0
1      0.6
2      1.5
3      1.0
4      2.8
      ...
197    1.4
198    NaN
199    3.0
200    1.9
201    0.8
Name: Population annual growth rate (%), Length: 202, dtype: float64

[23]:

# Multiples columnas
df_who[df_who.columns[0:5]]

[23]:

	Country	CountryID	Continent	Adolescent fertility rate (%)	Adult literacy rate (%)
0	Afghanistan	1	1	151.0	28.0
1	Albania	2	2	27.0	98.7
2	Algeria	3	3	6.0	69.9
3	Andorra	4	2	NaN	NaN
4	Angola	5	3	146.0	67.4
...	...	...	...	...	...
197	Vietnam	198	6	25.0	90.3
198	West Bank and Gaza	199	1	NaN	NaN
199	Yemen	200	1	83.0	54.1
200	Zambia	201	3	161.0	68.0
201	Zimbabwe	202	3	101.0	89.5

202 rows × 5 columns

[24]:

# Dos columnas específicas
df_who[["CountryID","Continent"]]

[24]:

	CountryID	Continent
0	1	1
1	2	2
2	3	3
3	4	2
4	5	3
...	...	...
197	198	6
198	199	1
199	200	1
200	201	3
201	202	3

202 rows × 2 columns

[17]:

# Existen dos métodos loc e iloc para acceder a sub-regiones de los datos
# funció: .iloc(filas, columnas)
df2 = df_who.iloc[:,3:5]
print(df2.shape)
print(df2.head())

(202, 2)
   Adolescent fertility rate (%)  Adult literacy rate (%)
0                          151.0                     28.0
1                           27.0                     98.7
2                            6.0                     69.9
3                            NaN                      NaN
4                          146.0                     67.4

Sobre Series

Cuando seleccionamos una columna de un DataFrame, obtenemos una Serie. Las Series tienen ciertas características, como la capacidad de aplicar métodos estadísticos (si son Series numéricas).

[18]:

fertilitat = df_who[df_who.columns[3]]

fertilitat

[18]:

0      151.0
1       27.0
2        6.0
3        NaN
4      146.0
       ...
197     25.0
198      NaN
199     83.0
200    161.0
201    101.0
Name: Adolescent fertility rate (%), Length: 202, dtype: float64

[19]:

print("Min ", fertilitat.min()) # a Pandas el concepte d'iterar "no té sentit"
print("Max ", fertilitat.max())
print("Count ", fertilitat.count())

Min  0.0
Max  199.0
Count  177

Tabla con las funciones descriptivas 2d9a4998ca5b4f69b609d6047a5fce84

Veremos que obtener esta información estadística nos puede ayudar a extraer información muy concreta de la tabla, por ejemplo, si queremos saber:

¿Qué país tiene la mayor emisión de CO2?

[20]:

co2 = df_who["Total_CO2_emissions"]
co2.max()

[20]:

5776431.5

[28]:

df_who[co2==co2.max()]["Country"].values[0]

[28]:

'United States of America'

[50]:

co2 = df_who["Total_CO2_emissions"]
co2

[50]:

0         692.50
1        3499.12
2      137535.56
3            NaN
4        8991.46
         ...
197    101826.23
198       655.86
199     20148.34
200      2366.94
201     11457.33
Name: Total_CO2_emissions, Length: 202, dtype: float64

[51]:

co2.max()

[51]:

5776431.5

[55]:

co2==co2.max()

[55]:

0      False
1      False
2      False
3      False
4      False
       ...
197    False
198    False
199    False
200    False
201    False
Name: Total_CO2_emissions, Length: 202, dtype: bool

[56]:

df_who[co2==co2.max()]

[56]:

	Country	CountryID	Continent	Adolescent fertility rate (%)	Adult literacy rate (%)	Gross national income per capita (PPP international $)	Net primary school enrolment ratio female (%)	Net primary school enrolment ratio male (%)	Population (in thousands) total	Population annual growth rate (%)	...	Total_CO2_emissions	Total_income	Total_reserves	Trade_balance_goods_and_services	Under_five_mortality_from_CME	Under_five_mortality_from_IHME	Under_five_mortality_rate	Urban_population	Urban_population_growth	Urban_population_pct_of_total
192	United States of America	193	4	43.0	NaN	44070.0	93.0	91.0	302841.0	1.0	...	5776431.5	1.100000e+13	NaN	-7.140000e+11	8.0	7.1	8.0	240000000.0	1.39	80.8

1 rows × 358 columns

[57]:

df_who[co2==co2.max()]["Country"]

[57]:

192    United States of America
Name: Country, dtype: object

[58]:

df_who[co2==co2.max()]["Country"].values

[58]:

array(['United States of America'], dtype=object)

[59]:

df_who[co2==co2.max()]["Country"].values[0]

[59]:

'United States of America'

[64]:

# También existen métodos para seleccionar de manera aleatoria muestras dentro de una serie
# Tip: Métodos montecarlo
fertilidad = df_who[df_who.columns[3]]
some = fertilidad.sample(n=3)
print(some)
fertilidad.sample(n=3,random_state=2) #on random_state és la llavor/seed del aleatori

112     NaN
24     71.0
15     22.0
Name: Adolescent fertility rate (%), dtype: float64

[64]:

53      48.0
149     28.0
13     135.0
Name: Adolescent fertility rate (%), dtype: float64

Filas

Cada fila tiene un índice. El índice puede ser numérico, alfabético o temporal.

[65]:

df_who.index

[65]:

RangeIndex(start=0, stop=202, step=1)

[66]:

df_who.index.values

[66]:

array([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,
        13,  14,  15,  16,  17,  18,  19,  20,  21,  22,  23,  24,  25,
        26,  27,  28,  29,  30,  31,  32,  33,  34,  35,  36,  37,  38,
        39,  40,  41,  42,  43,  44,  45,  46,  47,  48,  49,  50,  51,
        52,  53,  54,  55,  56,  57,  58,  59,  60,  61,  62,  63,  64,
        65,  66,  67,  68,  69,  70,  71,  72,  73,  74,  75,  76,  77,
        78,  79,  80,  81,  82,  83,  84,  85,  86,  87,  88,  89,  90,
        91,  92,  93,  94,  95,  96,  97,  98,  99, 100, 101, 102, 103,
       104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116,
       117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129,
       130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142,
       143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155,
       156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168,
       169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181,
       182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194,
       195, 196, 197, 198, 199, 200, 201])

Así como seleccionamos columnas, podemos seleccionar información con para obtener filas. Para realizar la consulta de una fila concreta usaremos el atributo loc de los dataframes.

[30]:

import pandas as pd

df = pd.read_csv("http://www.exploredata.net/ftp/WHO.csv") #dataframe

[31]:

print(df.loc[0])

Country                           Afghanistan
CountryID                                   1
Continent                                   1
Adolescent fertility rate (%)           151.0
Adult literacy rate (%)                  28.0
                                     ...
Under_five_mortality_from_IHME          231.9
Under_five_mortality_rate               257.0
Urban_population                    5740436.0
Urban_population_growth                  5.44
Urban_population_pct_of_total            22.9
Name: 0, Length: 358, dtype: object

Si lo que necesitamos es obtener son los valores, necesitaremos el atributo values:

[3]:

df.loc[0].values

[3]:

array(['Afghanistan', 1, 1, 151.0, 28.0, nan, nan, nan, 26088.0, 4.0,
       23.0, nan, 16.0, 4.0, 47.0, 6.0, 7.2, nan, nan, nan, 14.0, nan,
       nan, nan, nan, nan, nan, 10.3, 73.0, 70.0, 83.0, 83.0, nan, 66.0,
       90.0, nan, nan, nan, nan, nan, 20.1, 27.5, 4.4, 4.0, nan, nan,
       900.0, nan, nan, 14930.0, nan, 900.0, 5970.0, 5.0, nan, 97.2, 8.0,
       6.0, 29.0, 23.0, nan, 2.0, 72.5, 0.0, nan, 2.5, 0.0, 5.4, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, 473.0, 443.0, 500.0, 153.0,
       706.0, 134.0, 1269.0, 18.9, 0.3, 1.1, 1.0, 5.9, 26.0, 22.1, 24.8,
       10.0, 32.0, 0.0, 36.0, 36.0, 35.0, 161.0, 165.0, 154.0, 176.0,
       42.0, 43.0, 42.0, 1800.0, 60.0, 16.0, 100.0, 231.0, 257.0, 254.0,
       260.0, 76.0, 6.0, 18.0, 4.6, 59.3, 32.9, nan, 0.01, nan, nan, 17.0,
       22.0, 37.0, 25.0, 30.0, 45.0, nan, nan, nan, nan, 9.8, 3.2, 13.1,
       nan, nan, nan, nan, 58.35, 36.1, nan, 108.83, 2750000000.0, 168.0,
       87.0, 42.29, 0.0, 28000000.0, 2.9, 14.3, 11.7, 26.8, 874.0, 2021.0,
       220.0, 0.000878, 0.02, 0.04, 62300000000.0, 4.0, 600000.0, nan,
       3.6, 6.9, 254.0, 511.0, 96.8, nan, nan, 1792633.0, 7.07, nan, nan,
       nan, nan, 3.3, 2.8, 5.2, 4.5, 193.0, 236.0, 316.0, 377.0, nan,
       10.3, 33.0, nan, -7.0, nan, nan, nan, nan, nan, nan, nan, 12.43,
       nan, nan, nan, nan, 5.19, 0.0, nan, 8670.0, 25.05, nan, 20.0, 4.16,
       1.04, 3.3, 20.0, 5.2, nan, nan, nan, 55.66, nan, 49.0, 39.0, nan,
       874.0, nan, 24.48, nan, 165.0, 76.0, 40.0, 90.0, 11.9, 1.0, 43.4,
       12.59, 43.14, 28.0, 18.39, 50.81, 34.26, 3.5, 2.3, 3.7, 2.5, 147.0,
       218.0, 155.0, 233.0, 11.3, 2.7, 12.2, 2.9, 173.0, 675.0, 190.0,
       732.0, nan, nan, nan, nan, 1900.0, nan, nan, 64.0, 0.19, 39.42,
       9.93, nan, nan, nan, nan, nan, nan, nan, nan, nan, 8242.0, 66826.0,
       nan, nan, nan, nan, nan, nan, 717.04, nan, nan, nan, nan, nan, nan,
       nan, nan, 29900000.0, nan, nan, 37.73, nan, nan, nan, nan, 2.8,
       4.5, 151.0, 249.0, 0.68, 55.57, 36.2, 23.66, 3.14, 39.42, 15.8,
       8.3, 18.5, 9.7, 499.0, 936.0, 592.0, 1108.0, nan, 652090.0, nan,
       692.5, nan, nan, nan, 257.0, 231.9, 257.0, 5740436.0, 5.44, 22.9],
      dtype=object)

[32]:

print(df_who.loc[0].Country)
print(df_who.loc[0][0])
print(df_who.loc[0][3])

Afghanistan
Afghanistan
151.0

Utilizando el atributo loc del dataframe podemos seleccionar y filtrar las filas (y columnas) mediante labels. En el caso de filas, el label es el índice y si éste es númerico podemos usar los slicing típicos de Python.

.loc(row_labels,columns_labels)

Recordemos el slicing:

sublista = lista[start:stop:step]

Dónde: * start: Posición de la lista original dónde empieza la sublista. Si no se indica és 0. * stop: Posición de la lista original hasta donde seleccionar. Se selecciona hasta la posición stop - 1. * step: Incremento entre cada índice de la selección, por defecto 1.

Si entendemos el concepto para un array…

[22]:

array =[1,2,3,4,5,6,7,9,0]
print(array[0:2]) #**
print(array[3:]) #**
print(array[:3]) #**

[1, 2]
[4, 5, 6, 7, 9, 0]
[1, 2, 3]

… podemos hacer las mismas operaciones con las filas de un dataFrame.

[4]:

df.loc[4:10:2]

[4]:

	Country	CountryID	Continent	Adolescent fertility rate (%)	Adult literacy rate (%)	Gross national income per capita (PPP international $)	Net primary school enrolment ratio female (%)	Net primary school enrolment ratio male (%)	Population (in thousands) total	Population annual growth rate (%)	...	Total_CO2_emissions	Total_income	Total_reserves	Trade_balance_goods_and_services	Under_five_mortality_from_CME	Under_five_mortality_from_IHME	Under_five_mortality_rate	Urban_population	Urban_population_growth	Urban_population_pct_of_total
4	Angola	5	3	146.0	67.4	3890.0	49.0	51.0	16557.0	2.8	...	8991.46	1.490000e+10	27.13	9.140000e+09	164.1	242.5	164.1	8578749.0	4.14	53.3
6	Argentina	7	5	62.0	97.2	11670.0	98.0	99.0	39134.0	1.0	...	152711.86	3.140000e+11	21.11	1.190000e+10	18.1	16.7	18.1	34900000.0	1.17	90.1
8	Australia	9	6	16.0	NaN	33940.0	97.0	96.0	20530.0	1.1	...	368858.53	4.680000e+11	NaN	-1.280000e+10	5.9	5.1	5.9	18000000.0	1.54	88.2
10	Azerbaijan	11	2	31.0	98.8	5430.0	83.0	86.0	8406.0	0.6	...	36629.01	9.930000e+09	64.89	1.330000e+09	50.0	63.7	50.0	4321803.0	1.26	51.5

4 rows × 358 columns

También se puede realizar una selección particular mediante una lista:

[5]:

df.loc[[3,10,29,34]]

[5]:

	Country	CountryID	Continent	Adolescent fertility rate (%)	Adult literacy rate (%)	Gross national income per capita (PPP international $)	Net primary school enrolment ratio female (%)	Net primary school enrolment ratio male (%)	Population (in thousands) total	Population annual growth rate (%)	...	Total_CO2_emissions	Total_income	Total_reserves	Trade_balance_goods_and_services	Under_five_mortality_from_CME	Under_five_mortality_from_IHME	Under_five_mortality_rate	Urban_population	Urban_population_growth	Urban_population_pct_of_total
3	Andorra	4	2	NaN	NaN	NaN	83.0	83.0	74.0	1.0	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
10	Azerbaijan	11	2	31.0	98.8	5430.0	83.0	86.0	8406.0	0.6	...	36629.01	9.930000e+09	64.89	1.330000e+09	50.0	63.7	50.0	4321803.0	1.26	51.5
29	Cambodia	30	7	52.0	73.6	1550.0	89.0	91.0	14197.0	1.7	...	538.61	5.680000e+09	32.94	-5.470000e+08	97.3	94.3	97.3	2749235.0	4.58	19.7
34	Chad	35	3	193.0	25.7	1170.0	49.0	71.0	10468.0	3.1	...	139.23	2.790000e+09	14.17	-2.210000e+08	208.2	180.1	208.2	2566839.0	4.88	25.3

4 rows × 358 columns

Selección de filas y columnas

Si seguimos con la misma lógica, usando el atributo loc de los dataFrames.

[6]:

import pandas as pd

df = pd.read_csv("http://www.exploredata.net/ftp/WHO.csv") #dataframe

[11]:

df.loc[0:1]

[11]:

	Country	CountryID	Continent	Adolescent fertility rate (%)	Adult literacy rate (%)	Gross national income per capita (PPP international $)	Net primary school enrolment ratio female (%)	Net primary school enrolment ratio male (%)	Population (in thousands) total	Population annual growth rate (%)	...	Total_CO2_emissions	Total_income	Total_reserves	Trade_balance_goods_and_services	Under_five_mortality_from_CME	Under_five_mortality_from_IHME	Under_five_mortality_rate	Urban_population	Urban_population_growth	Urban_population_pct_of_total
0	Afghanistan	1	1	151.0	28.0	NaN	NaN	NaN	26088.0	4.0	...	692.50	NaN	NaN	NaN	257.00	231.9	257.00	5740436.0	5.44	22.9
1	Albania	2	2	27.0	98.7	6000.0	93.0	94.0	3172.0	0.6	...	3499.12	4.790000e+09	78.14	-2.040000e+09	18.47	15.5	18.47	1431793.9	2.21	45.4

2 rows × 358 columns

Las columnas se deben seleccionar con una lista que debe contener el nombre de las columnas deseadas.

[12]:

df.loc[0:1,["Continent"]]

[12]:

	Continent
0	1
1	2

[13]:

df.loc[0:3,["Continent","Total_CO2_emissions"]]

[13]:

	Continent	Total_CO2_emissions
0	1	692.50
1	2	3499.12
2	3	137535.56
3	2	NaN

[15]:

df.loc[[0,3,20,100],["Country","Continent","Total_CO2_emissions"]]

[15]:

	Country	Continent	Total_CO2_emissions
0	Afghanistan	1	692.50
3	Andorra	2	NaN
20	Bhutan	7	414.03
100	Liberia	3	472.66

Alternativamente, con el atributo iloc podemos seleccionar las columnas con su índice numérico: su posicion en la lista de columnas.

[16]:

df.iloc[0]

[16]:

Country                           Afghanistan
CountryID                                   1
Continent                                   1
Adolescent fertility rate (%)           151.0
Adult literacy rate (%)                  28.0
                                     ...
Under_five_mortality_from_IHME          231.9
Under_five_mortality_rate               257.0
Urban_population                    5740436.0
Urban_population_growth                  5.44
Urban_population_pct_of_total            22.9
Name: 0, Length: 358, dtype: object

[17]:

df.iloc[0:4, 3:7] # ídem a una matriz

[17]:

	Adolescent fertility rate (%)	Adult literacy rate (%)	Gross national income per capita (PPP international $)	Net primary school enrolment ratio female (%)
0	151.0	28.0	NaN	NaN
1	27.0	98.7	6000.0	93.0
2	6.0	69.9	5940.0	94.0
3	NaN	NaN	NaN	83.0

[18]:

df.iloc[0][0]

[18]:

'Afghanistan'

[19]:

df.iloc[0][0:4]

[19]:

Country                          Afghanistan
CountryID                                  1
Continent                                  1
Adolescent fertility rate (%)          151.0
Name: 0, dtype: object

[20]:

df.iloc[0][0:4].values

[20]:

array(['Afghanistan', 1, 1, 151.0], dtype=object)

[21]:

df.iloc[0][0:4].values[0]

[21]:

'Afghanistan'

Selección condicional

Además de la selección con base a índices, lo interesante es realizar selecciones mediante condiciones lógicas que permiten filtrar las filas del dataset. Por ejemplo:

[34]:

df_who = pd.read_csv("http://www.exploredata.net/ftp/WHO.csv") #dataframe

alfabetitzacio = df_who[df_who['Adult literacy rate (%)'] > 70][["Country","Adult literacy rate (%)"]]

alfabetitzacio[alfabetitzacio["Country"] == "Italy"]

[34]:

	Country	Adult literacy rate (%)
85	Italy	98.4

[8]:

df_who['Adult literacy rate (%)'] > 70

[8]:

0      False
1       True
2      False
3      False
4      False
       ...
197     True
198    False
199    False
200    False
201     True
Name: Adult literacy rate (%), Length: 202, dtype: bool

[37]:

seleccio = df_who['Adult literacy rate (%)'] > 70

[38]:

# En aquest codi, de les files on seleccio == True agafam les dues columnes que ens interessen
df_who[seleccio][["Country","Adult literacy rate (%)"]]

[38]:

	Country	Adult literacy rate (%)
1	Albania	98.7
6	Argentina	97.2
7	Armenia	99.4
10	Azerbaijan	98.8
12	Bahrain	86.5
...	...	...
193	Uruguay	96.8
195	Vanuatu	75.5
196	Venezuela	93.0
197	Vietnam	90.3
201	Zimbabwe	89.5

93 rows × 2 columns

[39]:

# Multiples criterios
# CO_emisions y Fertilidad
import numpy as np
ix = np.where((df_who["Total_CO2_emissions"] > 10) & (df_who[df_who.columns[3]] <=0.6))
ix

[39]:

(array([92]),)

Estadísticas en un DataFrame

[35]:

# Creamos un dataframe  con valores aleatorios
import numpy as np

np.random.seed(10)

df = pd.DataFrame({"one":np.random.randint(-10,10,5),
                  "two":np.random.random(5),
                  "three":np.random.randint(0,5,5)})
df

[35]:

	one	two	three
0	-1	0.748804	0
1	-6	0.498507	2
2	5	0.224797	0
3	-10	0.198063	4
4	7	0.760531	3

[36]:

df.T

[36]:

	0	1	2	3	4
one	-1.000000	-6.000000	5.000000	-10.000000	7.000000
two	0.748804	0.498507	0.224797	0.198063	0.760531
three	0.000000	2.000000	0.000000	4.000000	3.000000

[12]:

df.sum()

[12]:

one     -5.000000
two      2.430701
three    9.000000
dtype: float64

[13]:

df.sum(axis=1) # concepto de axis

[13]:

0    -0.251196
1    -3.501493
2     5.224797
3    -5.801937
4    10.760531
dtype: float64

[14]:

df.cumsum()

[14]:

	one	two	three
0	-1	0.748804	0
1	-7	1.247311	2
2	-2	1.472108	2
3	-12	1.670170	6
4	-5	2.430701	9

[15]:

df.cumsum(axis=1)

[15]:

	one	two	three
0	-1.0	-0.251196	-0.251196
1	-6.0	-5.501493	-3.501493
2	5.0	5.224797	5.224797
3	-10.0	-9.801937	-5.801937
4	7.0	7.760531	10.760531

[16]:

df.apply(np.abs,axis=1).cumsum() # función apply

[16]:

	one	two	three
0	1	0.748804	0
1	7	1.247311	2
2	12	1.472108	2
3	22	1.670170	6
4	29	2.430701	9

[17]:

df.apply(np.abs).cumsum(axis=1)

[17]:

	one	two	three
0	1.0	1.748804	1.748804
1	6.0	6.498507	8.498507
2	5.0	5.224797	5.224797
3	10.0	10.198063	14.198063
4	7.0	7.760531	10.760531

[18]:

df.cumprod()

[18]:

	one	two
0	-1	0.748804
1	6	0.373284
2	30	0.083913
3	-300	0.016620
4	-2100	0.012640

[19]:

df.cumprod(axis=1)

[19]:

	one	two	three
0	-1.0	-0.748804	-0.000000
1	-6.0	-2.991042	-5.982084
2	5.0	1.123983	0.000000
3	-10.0	-1.980629	-7.922515
4	7.0	5.323715	15.971145

[21]:

ones = np.ones(5) # numpy arrays
print(ones)

[1. 1. 1. 1. 1.]

[22]:

df.sub(ones,axis=0)

[22]:

	one	two	three
0	-2.0	-0.251196	-1.0
1	-7.0	-0.501493	1.0
2	4.0	-0.775203	-1.0
3	-11.0	-0.801937	3.0
4	6.0	-0.239469	2.0

[23]:

df.sub(ones) #Alerta!

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/Users/isaac/Projects/TxADM_notebooks/notebooks/Part2/00_Pandas/01_Introduccion.ipynb Cell 131 line <cell line: 1>()
----> <a href='vscode-notebook-cell:/Users/isaac/Projects/TxADM_notebooks/notebooks/Part2/00_Pandas/01_Introduccion.ipynb#Y241sZmlsZQ%3D%3D?line=0'>1</a> df.sub(ones)

File ~/.pyenv/versions/3.9.7/lib/python3.9/site-packages/pandas/core/ops/__init__.py:436, in flex_arith_method_FRAME.<locals>.f(self, other, axis, level, fill_value)
    433 axis = self._get_axis_number(axis) if axis is not None else 1
    435 other = maybe_prepare_scalar_for_op(other, self.shape)
--> 436 self, other = align_method_FRAME(self, other, axis, flex=True, level=level)
    438 if isinstance(other, ABCDataFrame):
    439     # Another DataFrame
    440     new_data = self._combine_frame(other, na_op, fill_value)

File ~/.pyenv/versions/3.9.7/lib/python3.9/site-packages/pandas/core/ops/__init__.py:248, in align_method_FRAME(left, right, axis, flex, level)
    245 if isinstance(right, np.ndarray):
    247     if right.ndim == 1:
--> 248         right = to_series(right)
    250     elif right.ndim == 2:
    251         if right.shape == left.shape:

File ~/.pyenv/versions/3.9.7/lib/python3.9/site-packages/pandas/core/ops/__init__.py:239, in align_method_FRAME.<locals>.to_series(right)
    237 else:
    238     if len(left.columns) != len(right):
--> 239         raise ValueError(
    240             msg.format(req_len=len(left.columns), given_len=len(right))
    241         )
    242     right = left._constructor_sliced(right, index=left.columns)
    243 return right

ValueError: Unable to coerce to Series, length must be 3: given 5

[48]:

# nornalitzación z-score (media a 0 y desviación a 1)
ts_stand = (df - df.mean()) / df.std()
ts_stand

[48]:

	one	two	three
0	0.000000	0.966021	-1.006231
1	-0.696733	0.045482	0.111803
2	0.836080	-0.961166	-1.006231
3	-1.254119	-1.059487	1.229837
4	1.114773	1.009150	0.670820

[49]:

print(ts_stand.mean())
print("----")
print(ts_stand.std())

one      0.000000e+00
two      2.664535e-16
three    6.661338e-17
dtype: float64
----
one      1.0
two      1.0
three    1.0
dtype: float64

Series no númericas

[24]:

df_who["Country"]

[24]:

0             Afghanistan
1                 Albania
2                 Algeria
3                 Andorra
4                  Angola
              ...
197               Vietnam
198    West Bank and Gaza
199                 Yemen
200                Zambia
201              Zimbabwe
Name: Country, Length: 202, dtype: object

[53]:

#https://www.w3schools.com/python/python_ref_string.asp
df_who.Country.str.casefold()

[53]:

0             afghanistan
1                 albania
2                 algeria
3                 andorra
4                  angola
              ...
197               vietnam
198    west bank and gaza
199                 yemen
200                zambia
201              zimbabwe
Name: Country, Length: 202, dtype: object

[25]:

df_who.Country.str.zfill(12)

[25]:

0            0Afghanistan
1            00000Albania
2            00000Algeria
3            00000Andorra
4            000000Angola
              ...
197          00000Vietnam
198    West Bank and Gaza
199          0000000Yemen
200          000000Zambia
201          0000Zimbabwe
Name: Country, Length: 202, dtype: object

Modificación del dataframe

Además de realizar selecciones, en algunos momentos necesitaremos incorporar nueva información a nuestras tablas de datos.

Vamos a crear un pequeño conjunto para practicar:

[37]:

df2 = pd.DataFrame([('Foreign Cinema', 'Restaurant', 289.0),
                   ('Liho Liho', 'Restaurant', 224.0),
                   ('500 Club', 'bar', 80.5),
                   ('The Square', 'bar', 25.30)],
           columns=('name', 'type', 'AvgBill')
                 )
df2

[37]:

	name	type	AvgBill
0	Foreign Cinema	Restaurant	289.0
1	Liho Liho	Restaurant	224.0
2	500 Club	bar	80.5
3	The Square	bar	25.3

Tenemos diversas maneras de añadir columnas a un dataFrame:

Mediante el nombre de la columna que queremos añadir, tal como añadimos una nueva clave a un diccionario.
insert: es un método que necesita 3 parámetros. La posición en la que queremos añadir la columna (loc), su nombre (´column´) y la lista de valores (value).
assign: muy similar a la anterior, pero permite añadir múltiples columnas.
concat: no se suele usar para concatenar columnas, en el caso que queramos usarlo para este caso, deberemos poner el parámetro axis=1.

Veamos algunos ejemplos:

[38]:

df2['Day'] = "Monday" # Como un diccionario
df2

[38]:

	name	type	AvgBill	Day
0	Foreign Cinema	Restaurant	289.0	Monday
1	Liho Liho	Restaurant	224.0	Monday
2	500 Club	bar	80.5	Monday
3	The Square	bar	25.3	Monday

[39]:

df2['Day'] = ['Monday', 'Tuesday', 'Wednesday']
df2

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[39], line 1
----> 1 df2['Day'] = ['Monday', 'Tuesday', 'Wednesday']
      2 df2

File ~/.pyenv/versions/3.11.0rc2/envs/my3110/lib/python3.11/site-packages/pandas/core/frame.py:3980, in DataFrame.__setitem__(self, key, value)
   3977     self._setitem_array([key], value)
   3978 else:
   3979     # set column
-> 3980     self._set_item(key, value)

File ~/.pyenv/versions/3.11.0rc2/envs/my3110/lib/python3.11/site-packages/pandas/core/frame.py:4174, in DataFrame._set_item(self, key, value)
   4164 def _set_item(self, key, value) -> None:
   4165     """
   4166     Add series to DataFrame in specified column.
   4167
   (...)
   4172     ensure homogeneity.
   4173     """
-> 4174     value = self._sanitize_column(value)
   4176     if (
   4177         key in self.columns
   4178         and value.ndim == 1
   4179         and not is_extension_array_dtype(value)
   4180     ):
   4181         # broadcast across multiple columns if necessary
   4182         if not self.columns.is_unique or isinstance(self.columns, MultiIndex):

File ~/.pyenv/versions/3.11.0rc2/envs/my3110/lib/python3.11/site-packages/pandas/core/frame.py:4915, in DataFrame._sanitize_column(self, value)
   4912     return _reindex_for_setitem(Series(value), self.index)
   4914 if is_list_like(value):
-> 4915     com.require_length_match(value, self.index)
   4916 return sanitize_array(value, self.index, copy=True, allow_2d=True)

File ~/.pyenv/versions/3.11.0rc2/envs/my3110/lib/python3.11/site-packages/pandas/core/common.py:571, in require_length_match(data, index)
    567 """
    568 Check the length of data matches the length of the index.
    569 """
    570 if len(data) != len(index):
--> 571     raise ValueError(
    572         "Length of values "
    573         f"({len(data)}) "
    574         "does not match length of index "
    575         f"({len(index)})"
    576     )

ValueError: Length of values (3) does not match length of index (4)

[40]:

#Vamos a usar el método insert
df2.insert(loc=1, column="Stars", value=[2,2,3,4])
df2

[40]:

	name	Stars	type	AvgBill	Day
0	Foreign Cinema	2	Restaurant	289.0	Monday
1	Liho Liho	2	Restaurant	224.0	Monday
2	500 Club	3	bar	80.5	Monday
3	The Square	4	bar	25.3	Monday

[42]:

df2["AvgBillIVA"] = df2["AvgBill"]*1.21
df2

[42]:

	name	Stars	type	AvgBill	Day	AvgBillIVA
0	Foreign Cinema	2	Restaurant	289.0	Monday	349.690
1	Liho Liho	2	Restaurant	224.0	Monday	271.040
2	500 Club	3	bar	80.5	Monday	97.405
3	The Square	4	bar	25.3	Monday	30.613

[41]:

df3 = df2.assign(AvgHalfBill=df2.AvgBill / 2, Michelin_Star=3)
df3

df3["HOLA"] = df3.name.str.capitalize()

[34]:

df3

[34]:

	name	Stars	type	AvgBill	Day	AvgHalfBill	Michelin_Star	HOLA
0	Foreign Cinema	2	Restaurant	289.0	Monday	144.50	3	Foreign cinema
1	Liho Liho	2	Restaurant	224.0	Tuesday	112.00	3	Liho liho
2	500 Club	3	bar	80.5	Wednesday	40.25	3	500 club
3	The Square	4	bar	25.3	Thursday	12.65	3	The square

Para añadir filas a un dataframe tenemos dos métodos que realizan la tarea y devuelven un nuevo dataFrame:

append: añade una fila a un dataFrame. La fila puede ser un diccionario, una Serie o otro Dataframe. El parámetro ignore_index = True significa que se ignorará el índice de la serie o el dataFrame de origen. En su lugar, se utilizará el índice disponible en el dataFrame de destino. El valor False significa lo contrario.
concat: concatena dos o más dataFrames separados por comas.

Tenemos el método drop que nos proporciona un nuevo dataFrame sin la(s) fila(s) o la(s) columna(s) que seleccionemos. Si queremos eliminar columnas podemos hacerlo especificando la lista de columnas en el parámetro columns de la siguiente manera:

[43]:

df3

[43]:

	name	Stars	type	AvgBill	Day	AvgHalfBill	Michelin_Star	HOLA
0	Foreign Cinema	2	Restaurant	289.0	Monday	144.50	3	Foreign cinema
1	Liho Liho	2	Restaurant	224.0	Monday	112.00	3	Liho liho
2	500 Club	3	bar	80.5	Monday	40.25	3	500 club
3	The Square	4	bar	25.3	Monday	12.65	3	The square

[49]:

df3.drop(columns=["Stars"],inplace=True) # Eliminamos la última columna que hemos creado

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[49], line 1
----> 1 df3.drop(columns=["Stars"],inplace=True) # Eliminamos la última columna que hemos creado

File ~/.pyenv/versions/3.11.0rc2/envs/my3110/lib/python3.11/site-packages/pandas/util/_decorators.py:331, in deprecate_nonkeyword_arguments.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
    325 if len(args) > num_allow_args:
    326     warnings.warn(
    327         msg.format(arguments=_format_argument_list(allow_args)),
    328         FutureWarning,
    329         stacklevel=find_stack_level(),
    330     )
--> 331 return func(*args, **kwargs)

File ~/.pyenv/versions/3.11.0rc2/envs/my3110/lib/python3.11/site-packages/pandas/core/frame.py:5399, in DataFrame.drop(self, labels, axis, index, columns, level, inplace, errors)
   5251 @deprecate_nonkeyword_arguments(version=None, allowed_args=["self", "labels"])
   5252 def drop(  # type: ignore[override]
   5253     self,
   (...)
   5260     errors: IgnoreRaise = "raise",
   5261 ) -> DataFrame | None:
   5262     """
   5263     Drop specified labels from rows or columns.
   5264
   (...)
   5397             weight  1.0     0.8
   5398     """
-> 5399     return super().drop(
   5400         labels=labels,
   5401         axis=axis,
   5402         index=index,
   5403         columns=columns,
   5404         level=level,
   5405         inplace=inplace,
   5406         errors=errors,
   5407     )

File ~/.pyenv/versions/3.11.0rc2/envs/my3110/lib/python3.11/site-packages/pandas/util/_decorators.py:331, in deprecate_nonkeyword_arguments.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
    325 if len(args) > num_allow_args:
    326     warnings.warn(
    327         msg.format(arguments=_format_argument_list(allow_args)),
    328         FutureWarning,
    329         stacklevel=find_stack_level(),
    330     )
--> 331 return func(*args, **kwargs)

File ~/.pyenv/versions/3.11.0rc2/envs/my3110/lib/python3.11/site-packages/pandas/core/generic.py:4505, in NDFrame.drop(self, labels, axis, index, columns, level, inplace, errors)
   4503 for axis, labels in axes.items():
   4504     if labels is not None:
-> 4505         obj = obj._drop_axis(labels, axis, level=level, errors=errors)
   4507 if inplace:
   4508     self._update_inplace(obj)

File ~/.pyenv/versions/3.11.0rc2/envs/my3110/lib/python3.11/site-packages/pandas/core/generic.py:4546, in NDFrame._drop_axis(self, labels, axis, level, errors, only_slice)
   4544         new_axis = axis.drop(labels, level=level, errors=errors)
   4545     else:
-> 4546         new_axis = axis.drop(labels, errors=errors)
   4547     indexer = axis.get_indexer(new_axis)
   4549 # Case for non-unique axis
   4550 else:

File ~/.pyenv/versions/3.11.0rc2/envs/my3110/lib/python3.11/site-packages/pandas/core/indexes/base.py:6934, in Index.drop(self, labels, errors)
   6932 if mask.any():
   6933     if errors != "ignore":
-> 6934         raise KeyError(f"{list(labels[mask])} not found in axis")
   6935     indexer = indexer[~mask]
   6936 return self.delete(indexer)

KeyError: "['Stars'] not found in axis"

[48]:

df3

[48]:

	name	type	AvgBill	Day	AvgHalfBill	Michelin_Star	HOLA
0	Foreign Cinema	Restaurant	289.0	Monday	144.50	3	Foreign cinema
1	Liho Liho	Restaurant	224.0	Monday	112.00	3	Liho liho
2	500 Club	bar	80.5	Monday	40.25	3	500 club
3	The Square	bar	25.3	Monday	12.65	3	The square

Para poder eliminar filas, usamos la misma función, esta vez sin el parámetro que hemos usado anteriormente, simplemente indicamos los índices a eliminar:

[29]:

df_less_rows = df_no_michelin.drop([1,3])
df_less_rows

[29]:

	name	Stars	type	AvgBill	Day	AvgHalfBill
0	Foreign Cinema	2	Restaurant	289.0	Monday	144.50
2	500 Club	3	bar	80.5	Wednesday	40.25

[30]:

dfs1 = df3[df3.Stars>=3] # per una selecció?
dfs1

[30]:

	name	Stars	type	AvgBill	Day	AvgHalfBill	Michelin_Star
2	500 Club	3	bar	80.5	Wednesday	40.25	3
3	The Square	4	bar	25.3	Thursday	12.65	3

[31]:

df3[df3.Stars>=3].index

[31]:

Int64Index([2, 3], dtype='int64')

[32]:

df3.drop(df3[df3.Stars>=3].index)

[32]:

	name	Stars	type	AvgBill	Day	AvgHalfBill	Michelin_Star
0	Foreign Cinema	2	Restaurant	289.0	Monday	144.5	3
1	Liho Liho	2	Restaurant	224.0	Tuesday	112.0	3

[41]:

df3.drop(df3[df3.Stars>=3].index, inplace=True) #Alerta con la integración de los cambios
df3

[41]:

	name	Stars	type	AvgBill	Day	AvgHalfBill
0	Foreign Cinema	2	Restaurant	289.0	Monday	144.50
2	500 Club	3	bar	80.5	Wednesday	40.25

Concatenación y unión de dataframes

A veces, los datos vienen en diferentes archivos y necesitan ser combinados en un único archivo, este proceso implica la concatenación de dataframes. Otras veces, los datos son complementarios, es decir, hay nuevas columnas en un dataframe, y esto se conoce como realizar operaciones de unión (joins).

[50]:

df1 = pd.DataFrame([('Foreign Cinema', 'Restaurant', 289.0),
                   ('Liho Liho', 'Restaurant', 224.0),
                   ('500 Club', 'bar', 80.5),
                   ('The Square', 'bar', 25.30)],
           columns=('name', 'type', 'AvgBill')
                 )
df1

[50]:

	name	type	AvgBill
0	Foreign Cinema	Restaurant	289.0
1	Liho Liho	Restaurant	224.0
2	500 Club	bar	80.5
3	The Square	bar	25.3

[51]:

df2 = pd.DataFrame([('Biels', 'Quiosquet', 389.0),
                   ('Barkus', 'Bar', 24.0),
                   ('Blue Wall', 'bar', 80.5),
                   ('Bounty Hunters', 'Social Club', 125.30)],
           columns=('name2', 'type', 'AvgBill')
                 )
df2

[51]:

	name2	type	AvgBill
0	Biels	Quiosquet	389.0
1	Barkus	Bar	24.0
2	Blue Wall	bar	80.5
3	Bounty Hunters	Social Club	125.3

[53]:

dfAll = pd.concat([df1,df2])
dfAll

[53]:

	name	type	AvgBill	name2
0	Foreign Cinema	Restaurant	289.0	NaN
1	Liho Liho	Restaurant	224.0	NaN
2	500 Club	bar	80.5	NaN
3	The Square	bar	25.3	NaN
0	NaN	Quiosquet	389.0	Biels
1	NaN	Bar	24.0	Barkus
2	NaN	bar	80.5	Blue Wall
3	NaN	Social Club	125.3	Bounty Hunters

[42]:

dfAll = dfAll.reset_index()
dfAll

[42]:

	index	name	type	AvgBill
0	0	Foreign Cinema	Restaurant	289.0
1	1	Liho Liho	Restaurant	224.0
2	2	500 Club	bar	80.5
3	3	The Square	bar	25.3
4	0	Biels	Quiosquet	389.0
5	1	Barkus	Bar	24.0
6	2	Blue Wall	bar	80.5
7	3	Bounty Hunters	Social Club	125.3

[ ]:

dfAll.drop(columns=["index"])

[ ]:

# https://pandas.pydata.org/docs/reference/api/pandas.concat.html
dfAll = pd.concat([df1,df2],ignore_index=True)
dfAll

Más información: https://pandas.pydata.org/docs/user_guide/merging.html

[54]:

df1 = pd.DataFrame([('Jhon', 1, "Italy"),
                   ('Pep', 2, "Germany"),
                   ('William', 3, "Finland"),
                   ('Snake', 4, "Italy")],
           columns=('name', 'ID', 'Country')
                 )
df1

[54]:

	name	ID	Country
0	Jhon	1	Italy
1	Pep	2	Germany
2	William	3	Finland
3	Snake	4	Italy

[55]:

df2 = pd.DataFrame([(1, 145.0, 3000.1),
                   ( 2, 189.2, 2030.2),
                   ( 3, 129.0, 3000.0),
                   ( 4, 198.1, 4020.2)],
           columns=('DNI', 'Weight', 'Salary')
                 )
df2

[55]:

	DNI	Weight	Salary
0	1	145.0	3000.1
1	2	189.2	2030.2
2	3	129.0	3000.0
3	4	198.1	4020.2

[ ]:

# https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html

df1.merge(df2) ## ALERTA ! Necesita más parametros

---------------------------------------------------------------------------
MergeError                                Traceback (most recent call last)
Cell In[56], line 3
      1 # https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html
----> 3 df1.merge(df2)

File ~/.pyenv/versions/3.11.0rc2/envs/my3110/lib/python3.11/site-packages/pandas/core/frame.py:10093, in DataFrame.merge(self, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator, validate)
  10074 @Substitution("")
  10075 @Appender(_merge_doc, indents=2)
  10076 def merge(
   (...)
  10089     validate: str | None = None,
  10090 ) -> DataFrame:
  10091     from pandas.core.reshape.merge import merge
> 10093     return merge(
  10094         self,
  10095         right,
  10096         how=how,
  10097         on=on,
  10098         left_on=left_on,
  10099         right_on=right_on,
  10100         left_index=left_index,
  10101         right_index=right_index,
  10102         sort=sort,
  10103         suffixes=suffixes,
  10104         copy=copy,
  10105         indicator=indicator,
  10106         validate=validate,
  10107     )

File ~/.pyenv/versions/3.11.0rc2/envs/my3110/lib/python3.11/site-packages/pandas/core/reshape/merge.py:110, in merge(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator, validate)
     93 @Substitution("\nleft : DataFrame or named Series")
     94 @Appender(_merge_doc, indents=0)
     95 def merge(
   (...)
    108     validate: str | None = None,
    109 ) -> DataFrame:
--> 110     op = _MergeOperation(
    111         left,
    112         right,
    113         how=how,
    114         on=on,
    115         left_on=left_on,
    116         right_on=right_on,
    117         left_index=left_index,
    118         right_index=right_index,
    119         sort=sort,
    120         suffixes=suffixes,
    121         indicator=indicator,
    122         validate=validate,
    123     )
    124     return op.get_result(copy=copy)

File ~/.pyenv/versions/3.11.0rc2/envs/my3110/lib/python3.11/site-packages/pandas/core/reshape/merge.py:685, in _MergeOperation.__init__(self, left, right, how, on, left_on, right_on, axis, left_index, right_index, sort, suffixes, indicator, validate)
    681     # stacklevel chosen to be correct when this is reached via pd.merge
    682     # (and not DataFrame.join)
    683     warnings.warn(msg, FutureWarning, stacklevel=find_stack_level())
--> 685 self.left_on, self.right_on = self._validate_left_right_on(left_on, right_on)
    687 cross_col = None
    688 if self.how == "cross":

File ~/.pyenv/versions/3.11.0rc2/envs/my3110/lib/python3.11/site-packages/pandas/core/reshape/merge.py:1434, in _MergeOperation._validate_left_right_on(self, left_on, right_on)
   1432 common_cols = left_cols.intersection(right_cols)
   1433 if len(common_cols) == 0:
-> 1434     raise MergeError(
   1435         "No common columns to perform merge on. "
   1436         f"Merge options: left_on={left_on}, "
   1437         f"right_on={right_on}, "
   1438         f"left_index={self.left_index}, "
   1439         f"right_index={self.right_index}"
   1440     )
   1441 if (
   1442     not left_cols.join(common_cols, how="inner").is_unique
   1443     or not right_cols.join(common_cols, how="inner").is_unique
   1444 ):
   1445     raise MergeError(f"Data columns not unique: {repr(common_cols)}")

MergeError: No common columns to perform merge on. Merge options: left_on=None, right_on=None, left_index=False, right_index=False

[48]:

df1.merge(df2, left_on='ID', right_on='DNI')

[48]:

	name	ID	Country	DNI	Weight	Salary
0	Jhon	1	Italy	1	145.0	3000.1
1	Pep	2	Germany	2	189.2	2030.2
2	William	3	Finland	3	129.0	3000.0
3	Snake	4	Italy	4	198.1	4020.2

[57]:

df2 = pd.DataFrame([(1, 145.0, 3000.1), #No 2
                   ( 3, 129.0, 3000.0), # Multiples 3
                   ( 3, 159.0, 4000.0),
                   ( 3, 109.0, 5000.0),
                   ( 4, 198.1, 4020.2),
                   ( 5, 200.0, 5000.2)], #a new one
           columns=('DNI', 'Weight', 'Salary')
                 )
df2

[57]:

	DNI	Weight	Salary
0	1	145.0	3000.1
1	3	129.0	3000.0
2	3	159.0	4000.0
3	3	109.0	5000.0
4	4	198.1	4020.2
5	5	200.0	5000.2

[58]:

df1.merge(df2, left_on='ID', right_on='DNI')

[58]:

	name	ID	Country	DNI	Weight	Salary
0	Jhon	1	Italy	1	145.0	3000.1
1	William	3	Finland	3	129.0	3000.0
2	William	3	Finland	3	159.0	4000.0
3	William	3	Finland	3	109.0	5000.0
4	Snake	4	Italy	4	198.1	4020.2

[51]:

df1.merge(df2, left_on='ID', right_on='DNI',how="left")

[51]:

	name	ID	Country	DNI	Weight	Salary
0	Jhon	1	Italy	1.0	145.0	3000.1
1	Pep	2	Germany	NaN	NaN	NaN
2	William	3	Finland	3.0	129.0	3000.0
3	William	3	Finland	3.0	159.0	4000.0
4	William	3	Finland	3.0	109.0	5000.0
5	Snake	4	Italy	4.0	198.1	4020.2

[52]:

df1.merge(df2, left_on='ID', right_on='DNI',how="right")

[52]:

	name	ID	Country	DNI	Weight	Salary
0	Jhon	1.0	Italy	1	145.0	3000.1
1	William	3.0	Finland	3	129.0	3000.0
2	William	3.0	Finland	3	159.0	4000.0
3	William	3.0	Finland	3	109.0	5000.0
4	Snake	4.0	Italy	4	198.1	4020.2
5	NaN	NaN	NaN	5	200.0	5000.2

[53]:

df1.merge(df2, left_on='ID', right_on='DNI',how="inner")

[53]:

	name	ID	Country	DNI	Weight	Salary
0	Jhon	1	Italy	1	145.0	3000.1
1	William	3	Finland	3	129.0	3000.0
2	William	3	Finland	3	159.0	4000.0
3	William	3	Finland	3	109.0	5000.0
4	Snake	4	Italy	4	198.1	4020.2

[54]:

df1.merge(df2,how="cross")

[54]:

	name	ID	Country	DNI	Weight	Salary
0	Jhon	1	Italy	1	145.0	3000.1
1	Jhon	1	Italy	3	129.0	3000.0
2	Jhon	1	Italy	3	159.0	4000.0
3	Jhon	1	Italy	3	109.0	5000.0
4	Jhon	1	Italy	4	198.1	4020.2
5	Jhon	1	Italy	5	200.0	5000.2
6	Pep	2	Germany	1	145.0	3000.1
7	Pep	2	Germany	3	129.0	3000.0
8	Pep	2	Germany	3	159.0	4000.0
9	Pep	2	Germany	3	109.0	5000.0
10	Pep	2	Germany	4	198.1	4020.2
11	Pep	2	Germany	5	200.0	5000.2
12	William	3	Finland	1	145.0	3000.1
13	William	3	Finland	3	129.0	3000.0
14	William	3	Finland	3	159.0	4000.0
15	William	3	Finland	3	109.0	5000.0
16	William	3	Finland	4	198.1	4020.2
17	William	3	Finland	5	200.0	5000.2
18	Snake	4	Italy	1	145.0	3000.1
19	Snake	4	Italy	3	129.0	3000.0
20	Snake	4	Italy	3	159.0	4000.0
21	Snake	4	Italy	3	109.0	5000.0
22	Snake	4	Italy	4	198.1	4020.2
23	Snake	4	Italy	5	200.0	5000.2

Usando el fichero who.csv, se pide:

[62]:

# read file csv
import pandas as pd
df_who = pd.read_csv("data/WHO.csv")
df_who.head()

[62]:

	Country	CountryID	Continent	Adolescent fertility rate (%)	Adult literacy rate (%)	Gross national income per capita (PPP international $)	Net primary school enrolment ratio female (%)	Net primary school enrolment ratio male (%)	Population (in thousands) total	Population annual growth rate (%)	...	Total_CO2_emissions	Total_income	Total_reserves	Trade_balance_goods_and_services	Under_five_mortality_from_CME	Under_five_mortality_from_IHME	Under_five_mortality_rate	Urban_population	Urban_population_growth	Urban_population_pct_of_total
0	Afghanistan	1	1	151.0	28.0	NaN	NaN	NaN	26088.0	4.0	...	692.50	NaN	NaN	NaN	257.00	231.9	257.00	5740436.0	5.44	22.9
1	Albania	2	2	27.0	98.7	6000.0	93.0	94.0	3172.0	0.6	...	3499.12	4.790000e+09	78.14	-2.040000e+09	18.47	15.5	18.47	1431793.9	2.21	45.4
2	Algeria	3	3	6.0	69.9	5940.0	94.0	96.0	33351.0	1.5	...	137535.56	6.970000e+10	351.36	4.700000e+09	40.00	31.2	40.00	20800000.0	2.61	63.3
3	Andorra	4	2	NaN	NaN	NaN	83.0	83.0	74.0	1.0	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
4	Angola	5	3	146.0	67.4	3890.0	49.0	51.0	16557.0	2.8	...	8991.46	1.490000e+10	27.13	9.140000e+09	164.10	242.5	164.10	8578749.0	4.14	53.3

5 rows × 358 columns

1) ¿Cuál és la media de la población urbana (“Urban_population”) de todos los países? ¿Su desviación típica (std)?

[64]:

df_who["Urban_population"].mean()

[64]:

16657626.767446807

[65]:

df_who["Urban_population"].std()

[65]:

50948665.823935635

2) Consulta la fila del país: “Spain”

[ ]:

df_who[df_who["Country"]=="Spain"]

3a) ¿Qué país tiene una mayor población urbana?

[70]:

df_who[df_who["Urban_population"] == df_who["Urban_population"].max()]

[70]:

	Country	CountryID	Continent	Adolescent fertility rate (%)	Adult literacy rate (%)	Gross national income per capita (PPP international $)	Net primary school enrolment ratio female (%)	Net primary school enrolment ratio male (%)	Population (in thousands) total	Population annual growth rate (%)	...	Total_CO2_emissions	Total_income	Total_reserves	Trade_balance_goods_and_services	Under_five_mortality_from_CME	Under_five_mortality_from_IHME	Under_five_mortality_rate	Urban_population	Urban_population_growth	Urban_population_pct_of_total
36	China	37	7	3.0	90.9	4660.0	96.0	100.0	1328474.0	0.6	...	5547757.5	1.890000e+12	295.23	1.250000e+11	27.3	27.8	27.3	527000000.0	2.95	40.4

1 rows × 358 columns

3b) ¿Qué paises tienen una población urbana menor a 50000 ?

[71]:

df_who[df_who["Urban_population"] < 50000    ]

[71]:

	Country	CountryID	Continent	Adolescent fertility rate (%)	Adult literacy rate (%)	Gross national income per capita (PPP international $)	Net primary school enrolment ratio female (%)	Net primary school enrolment ratio male (%)	Population (in thousands) total	Population annual growth rate (%)	...	Total_CO2_emissions	Total_income	Total_reserves	Trade_balance_goods_and_services	Under_five_mortality_from_CME	Under_five_mortality_from_IHME	Under_five_mortality_rate	Urban_population	Urban_population_growth	Urban_population_pct_of_total
5	Antigua and Barbuda	6	4	NaN	NaN	15130.0	NaN	NaN	84.0	1.3	...	421.36	830000000.0	NaN	-102000000.0	12.60	NaN	12.60	32468.25	2.25	39.1
69	Grenada	70	5	53.0	NaN	8770.0	83.0	84.0	106.0	0.3	...	234.50	414000000.0	23.19	-220000000.0	22.00	NaN	22.00	32589.00	0.45	30.6
91	Kiribati	92	6	NaN	NaN	6230.0	98.0	96.0	94.0	1.7	...	25.65	51900000.0	NaN	-20800000.0	66.00	NaN	66.00	46926.00	3.08	47.4
151	Saint Kitts and Nevis	152	5	NaN	NaN	12440.0	78.0	64.0	50.0	1.3	...	135.57	387000000.0	24.28	-84300000.0	21.00	NaN	21.00	15456.00	1.77	32.2
152	Saint Lucia	153	5	51.0	NaN	8500.0	97.0	99.0	163.0	1.1	...	370.06	764000000.0	27.39	-119000000.0	17.70	14.0	17.70	45482.32	1.15	27.6
154	Samoa	155	6	45.0	98.6	5090.0	91.0	90.0	185.0	0.8	...	150.22	286000000.0	12.46	-116000000.0	29.98	NaN	29.98	41181.28	1.18	22.4
160	Seychelles	161	3	NaN	91.8	14360.0	100.0	99.0	86.0	0.7	...	578.91	563000000.0	8.34	-187000000.0	13.54	NaN	13.54	43854.10	1.20	52.9
182	Tonga	183	6	17.0	98.9	5470.0	94.0	97.0	100.0	0.5	...	117.25	167000000.0	57.30	-94600000.0	24.48	NaN	24.48	23846.64	1.04	24.0

8 rows × 358 columns

4) ¿El continente donde está situado Spain es el mismo que el de `United States of America?

Utiliza una condición para obtener un resultado Booleano (True o False)

[72]:

sample_spain = df_who[df_who["Country"]=="Spain"]
sample_spain

[72]:

	Country	CountryID	Continent	Adolescent fertility rate (%)	Adult literacy rate (%)	Gross national income per capita (PPP international $)	Net primary school enrolment ratio female (%)	Net primary school enrolment ratio male (%)	Population (in thousands) total	Population annual growth rate (%)	...	Total_CO2_emissions	Total_income	Total_reserves	Trade_balance_goods_and_services	Under_five_mortality_from_CME	Under_five_mortality_from_IHME	Under_five_mortality_rate	Urban_population	Urban_population_growth	Urban_population_pct_of_total
168	Spain	169	2	10.0	97.2	28200.0	99.0	100.0	43887.0	1.1	...	343701.53	6.780000e+11	NaN	-5.770000e+10	4.9	4.2	4.9	33300000.0	1.75	76.7

1 rows × 358 columns

[73]:

sample_eeuu = df_who[df_who["Country"]=="United States of America"]
sample_eeuu

[73]:

	Country	CountryID	Continent	Adolescent fertility rate (%)	Adult literacy rate (%)	Gross national income per capita (PPP international $)	Net primary school enrolment ratio female (%)	Net primary school enrolment ratio male (%)	Population (in thousands) total	Population annual growth rate (%)	...	Total_CO2_emissions	Total_income	Total_reserves	Trade_balance_goods_and_services	Under_five_mortality_from_CME	Under_five_mortality_from_IHME	Under_five_mortality_rate	Urban_population	Urban_population_growth	Urban_population_pct_of_total
192	United States of America	193	4	43.0	NaN	44070.0	93.0	91.0	302841.0	1.0	...	5776431.5	1.100000e+13	NaN	-7.140000e+11	8.0	7.1	8.0	240000000.0	1.39	80.8

1 rows × 358 columns

[79]:

sample_spain["Continent"].values[0] == sample_eeuu["Continent"].values[0]

[79]:

False

5) ¿Cuáles son los cinco paises más contaminantes (“Total_CO2_emissions”)?

Esta es mi pista para una solución elegante: http://pandas.pydata.org/pandas-docs/version/0.19.2/generated/pandas.DataFrame.sort_values.html

[ ]:

6) Observando algunas muestras del fichero puedes establecer la relación entre el identificador del continente y su nombre?

Es decir, sabemos que Spain está en el continente Europeo y el código del continente es el 2.

Existen los códigos de continentes: 1, 2, 3, 4, 5, 6, 7

Nota: Hay dos códigos asociados a Asia.

Haz las consultas pertinentes al dataframe para construir un diccionario con la siguiente estructura:

[ ]:

codigoContinentes = {1:"Asia",2:"Europa"} #Al menos hay 7!
print(codigoContinentes[2])

[ ]:

codigoContinentes = {1:"Asia",2:"Europa",3:"", 4:""} # TODO...

7) Una vez identificado el nombre de los continentes, ¿puedes cambiar la columna de identificadores de continentes por sus respectivos nombres?

Esta es es mi pista para una solución elegante: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.map.html

[ ]:

8) Puedes crear un nuevo dataframe con aquellos paises que sean de Europa?

[ ]:

df2 = df #Con una simple asignación ya creas una dataframe
type(df2)

9) ¿Cuáles son los paises más contaminantes de Europa?

[ ]:

10) Calcula la cantidad de ayuda recibida por cada municipio en función del númeto total de habitantes.

[7]:

import pandas as pd
import random

random.seed(0)

nombres = [f'Municipio{i}' for i in range(1, 11)]

data_municipios = {
    'Nombre': nombres,
    'Código Postal': [random.randint(10000, 99999) for _ in range(10)],
    'Población': [random.randint(1000, 50000) for _ in range(10)]  # Añadimos un atributo aleatorio, en este caso "Población"
}

df_municipios = pd.DataFrame(data_municipios)


data_ayudas = {
    'Nombre':  [f'Municipio{i}' for i in range(1, 11)],
    'Ayuda Económica (en euros)': [random.randint(1000, 10000) for _ in range(10)],
    'Número de Beneficiarios': [random.randint(10, 100) for _ in range(10)]
}

df_ayudas = pd.DataFrame(data_ayudas)

print("Dataframe de Municipios:")
print(df_municipios)

print("\nDataframe de Ayudas:")
print(df_ayudas)

Dataframe de Municipios:
        Nombre  Código Postal  Población
0   Municipio1          60494      39232
1   Municipio2          65125      15315
2   Municipio3          15306      34075
3   Municipio4          43936      10127
4   Municipio5          77013      19470
5   Municipio6          73691      10158
6   Municipio7          63075       7214
7   Municipio8          49755      41525
8   Municipio9          72468      17417
9  Municipio10          56930      35902

Dataframe de Ayudas:
        Nombre  Ayuda Económica (en euros)  Número de Beneficiarios
0   Municipio1                        3407                       88
1   Municipio2                        6081                       91
2   Municipio3                        2618                       36
3   Municipio4                        2208                       80
4   Municipio5                        6409                       71
5   Municipio6                        8735                       66
6   Municipio7                        2649                       76
7   Municipio8                        6796                       43
8   Municipio9                        8113                       17
9  Municipio10                        6180                       80

Usando el fichero climaMallorca.csv, se pide: 11) ¿Cual es la temperatura máxima cuando el viento es inferior a 10? ¿Cuántas muestras hay?

[ ]:

12) ¿Cual es la temperatura máxima cuando el viento es superior a 10 y inferior a 20? ¿Cuántas muestras hay?

[ ]:

Isaac Lera and Gabriel Moya Universitat de les Illes Balears isaac.lera@uib.edu, gabriel.moya@uib.edu