Pandas
Pandas es la librería por excelencia para el análisis de datos del lenguaje Python
. Su nombre proviene de “panel data” (término econométrico). Inspirada en las funcionalidades de R
, pero con el potencial de este lenguaje de propósito general.
Pandas incluye todas las funcionalidades necesarias para el proceso de análisis de datos: carga, filtrado, tratamiento, síntesis, agrupamiento, almacenamiento y visualización. Además, se integra con el resto de librerías de cálculo numérico como Numpy
, Matplotlib
, scikit-learn
, … y de despliegue: HPC
, Cloud
, etc.
En resumen, es como una hoja de cálculo -por ejemplo excel- pero con más mucho más potencial!!!
Todo el trabajo que realizaremos es sobre la estructura de datos básica: el dataFrame
.
Un dataFrame
es un objeto de dos dimensiones que contiene información. También puede verse como una hoja de cálculo, como una tabla de un modelo entidad-relación, o como una colección de una base de datos no relacional.
[2]:
%pip install pandas
Requirement already satisfied: pandas in /Users/isaac/.pyenv/versions/3.11.0rc2/envs/my3110/lib/python3.11/site-packages (1.5.3)
Requirement already satisfied: python-dateutil>=2.8.1 in /Users/isaac/.pyenv/versions/3.11.0rc2/envs/my3110/lib/python3.11/site-packages (from pandas) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /Users/isaac/.pyenv/versions/3.11.0rc2/envs/my3110/lib/python3.11/site-packages (from pandas) (2023.3)
Requirement already satisfied: numpy>=1.21.0 in /Users/isaac/.pyenv/versions/3.11.0rc2/envs/my3110/lib/python3.11/site-packages (from pandas) (1.23.5)
Requirement already satisfied: six>=1.5 in /Users/isaac/.pyenv/versions/3.11.0rc2/envs/my3110/lib/python3.11/site-packages (from python-dateutil>=2.8.1->pandas) (1.16.0)
[notice] A new release of pip is available: 24.0 -> 24.3.1
[notice] To update, run: pip install --upgrade pip
Note: you may need to restart the kernel to use updated packages.
Importación de la libreria
import pandas as pd
import pandas
from pandas import *
Por convención se hace de la siguiente manera, todas las funciones de la libreria se tienen que llamar con el prefijo pd.*:
[1]:
import pandas as pd
Carga de datos
Vamos a aprender Pandas a través de una serie de proyectos y ejemplos. En esta primera fase, vamos a cargar datos de un fichero CSV, recordad que son ficheros donde los atributos/valores de una observación están separados por una coma y las observaciones se separan mediante un salto de línea.
Podemos descargar los datos con los que trabajaremos del siguiente enlace WHO dataset
Empezamos viendo como se carga un dataframe a partir de un fichero en formato CSV.
[2]:
df = pd.read_csv("data/WHO.csv")
[6]:
df
[6]:
Country | CountryID | Continent | Adolescent fertility rate (%) | Adult literacy rate (%) | Gross national income per capita (PPP international $) | Net primary school enrolment ratio female (%) | Net primary school enrolment ratio male (%) | Population (in thousands) total | Population annual growth rate (%) | ... | Total_CO2_emissions | Total_income | Total_reserves | Trade_balance_goods_and_services | Under_five_mortality_from_CME | Under_five_mortality_from_IHME | Under_five_mortality_rate | Urban_population | Urban_population_growth | Urban_population_pct_of_total | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Afghanistan | 1 | 1 | 151.0 | 28.0 | NaN | NaN | NaN | 26088.0 | 4.0 | ... | 692.50 | NaN | NaN | NaN | 257.00 | 231.9 | 257.00 | 5740436.0 | 5.44 | 22.9 |
1 | Albania | 2 | 2 | 27.0 | 98.7 | 6000.0 | 93.0 | 94.0 | 3172.0 | 0.6 | ... | 3499.12 | 4.790000e+09 | 78.14 | -2.040000e+09 | 18.47 | 15.5 | 18.47 | 1431793.9 | 2.21 | 45.4 |
2 | Algeria | 3 | 3 | 6.0 | 69.9 | 5940.0 | 94.0 | 96.0 | 33351.0 | 1.5 | ... | 137535.56 | 6.970000e+10 | 351.36 | 4.700000e+09 | 40.00 | 31.2 | 40.00 | 20800000.0 | 2.61 | 63.3 |
3 | Andorra | 4 | 2 | NaN | NaN | NaN | 83.0 | 83.0 | 74.0 | 1.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | Angola | 5 | 3 | 146.0 | 67.4 | 3890.0 | 49.0 | 51.0 | 16557.0 | 2.8 | ... | 8991.46 | 1.490000e+10 | 27.13 | 9.140000e+09 | 164.10 | 242.5 | 164.10 | 8578749.0 | 4.14 | 53.3 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
197 | Vietnam | 198 | 6 | 25.0 | 90.3 | 2310.0 | 91.0 | 96.0 | 86206.0 | 1.4 | ... | 101826.23 | 4.480000e+10 | 47.11 | -1.940000e+09 | 20.20 | 23.4 | 20.20 | 21900000.0 | 2.90 | 26.4 |
198 | West Bank and Gaza | 199 | 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 655.86 | 3.780000e+09 | NaN | NaN | 28.00 | 25.8 | 28.00 | 2596216.0 | 3.33 | 71.6 |
199 | Yemen | 200 | 1 | 83.0 | 54.1 | 2090.0 | 65.0 | 85.0 | 21732.0 | 3.0 | ... | 20148.34 | 1.150000e+10 | 114.52 | 8.310000e+08 | 82.40 | 87.9 | 82.40 | 5759120.5 | 4.37 | 27.3 |
200 | Zambia | 201 | 3 | 161.0 | 68.0 | 1140.0 | 94.0 | 90.0 | 11696.0 | 1.9 | ... | 2366.94 | 4.090000e+09 | 10.41 | -4.470000e+08 | 175.30 | 163.8 | 175.30 | 4017411.0 | 1.95 | 35.0 |
201 | Zimbabwe | 202 | 3 | 101.0 | 89.5 | NaN | 88.0 | 87.0 | 13228.0 | 0.8 | ... | 11457.33 | 5.620000e+09 | 3.39 | -1.710000e+08 | 106.50 | 67.0 | 106.50 | 4709965.0 | 1.90 | 35.9 |
202 rows × 358 columns
A continuación se muestra la estructura interna del DataFrame. Se puede ver que és muy parecido a una tabla bidimensional:
[3]:
df
[3]:
Country | CountryID | Continent | Adolescent fertility rate (%) | Adult literacy rate (%) | Gross national income per capita (PPP international $) | Net primary school enrolment ratio female (%) | Net primary school enrolment ratio male (%) | Population (in thousands) total | Population annual growth rate (%) | ... | Total_CO2_emissions | Total_income | Total_reserves | Trade_balance_goods_and_services | Under_five_mortality_from_CME | Under_five_mortality_from_IHME | Under_five_mortality_rate | Urban_population | Urban_population_growth | Urban_population_pct_of_total | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Afghanistan | 1 | 1 | 151.0 | 28.0 | NaN | NaN | NaN | 26088.0 | 4.0 | ... | 692.50 | NaN | NaN | NaN | 257.00 | 231.9 | 257.00 | 5740436.0 | 5.44 | 22.9 |
1 | Albania | 2 | 2 | 27.0 | 98.7 | 6000.0 | 93.0 | 94.0 | 3172.0 | 0.6 | ... | 3499.12 | 4.790000e+09 | 78.14 | -2.040000e+09 | 18.47 | 15.5 | 18.47 | 1431793.9 | 2.21 | 45.4 |
2 | Algeria | 3 | 3 | 6.0 | 69.9 | 5940.0 | 94.0 | 96.0 | 33351.0 | 1.5 | ... | 137535.56 | 6.970000e+10 | 351.36 | 4.700000e+09 | 40.00 | 31.2 | 40.00 | 20800000.0 | 2.61 | 63.3 |
3 | Andorra | 4 | 2 | NaN | NaN | NaN | 83.0 | 83.0 | 74.0 | 1.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | Angola | 5 | 3 | 146.0 | 67.4 | 3890.0 | 49.0 | 51.0 | 16557.0 | 2.8 | ... | 8991.46 | 1.490000e+10 | 27.13 | 9.140000e+09 | 164.10 | 242.5 | 164.10 | 8578749.0 | 4.14 | 53.3 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
197 | Vietnam | 198 | 6 | 25.0 | 90.3 | 2310.0 | 91.0 | 96.0 | 86206.0 | 1.4 | ... | 101826.23 | 4.480000e+10 | 47.11 | -1.940000e+09 | 20.20 | 23.4 | 20.20 | 21900000.0 | 2.90 | 26.4 |
198 | West Bank and Gaza | 199 | 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 655.86 | 3.780000e+09 | NaN | NaN | 28.00 | 25.8 | 28.00 | 2596216.0 | 3.33 | 71.6 |
199 | Yemen | 200 | 1 | 83.0 | 54.1 | 2090.0 | 65.0 | 85.0 | 21732.0 | 3.0 | ... | 20148.34 | 1.150000e+10 | 114.52 | 8.310000e+08 | 82.40 | 87.9 | 82.40 | 5759120.5 | 4.37 | 27.3 |
200 | Zambia | 201 | 3 | 161.0 | 68.0 | 1140.0 | 94.0 | 90.0 | 11696.0 | 1.9 | ... | 2366.94 | 4.090000e+09 | 10.41 | -4.470000e+08 | 175.30 | 163.8 | 175.30 | 4017411.0 | 1.95 | 35.0 |
201 | Zimbabwe | 202 | 3 | 101.0 | 89.5 | NaN | 88.0 | 87.0 | 13228.0 | 0.8 | ... | 11457.33 | 5.620000e+09 | 3.39 | -1.710000e+08 | 106.50 | 67.0 | 106.50 | 4709965.0 | 1.90 | 35.9 |
202 rows × 358 columns
df.shape
[3]:
df.shape
[3]:
(202, 358)
[4]:
df.columns
[4]:
Index(['Country', 'CountryID', 'Continent', 'Adolescent fertility rate (%)',
'Adult literacy rate (%)',
'Gross national income per capita (PPP international $)',
'Net primary school enrolment ratio female (%)',
'Net primary school enrolment ratio male (%)',
'Population (in thousands) total', 'Population annual growth rate (%)',
...
'Total_CO2_emissions', 'Total_income', 'Total_reserves',
'Trade_balance_goods_and_services', 'Under_five_mortality_from_CME',
'Under_five_mortality_from_IHME', 'Under_five_mortality_rate',
'Urban_population', 'Urban_population_growth',
'Urban_population_pct_of_total'],
dtype='object', length=358)
[6]:
df.index
[6]:
RangeIndex(start=0, stop=202, step=1)
Atributos de un DataFrame
Un dataframe dispone de diferentes atributos con los que podemos obtener su información o metainformación. Los siguientes ejemplos muestran cómo se pueden consultar sus dimensiones o un listado del nombre de sus columnas:
[4]:
df.shape # Ver las dimensiones **
[4]:
(202, 358)
[11]:
df.columns[:5]
[11]:
Index(['Country', 'CountryID', 'Continent', 'Adolescent fertility rate (%)',
'Adult literacy rate (%)'],
dtype='object')
Podemos aplicar sobre el listado de columnas todas las operaciones sobre listas que hemos visto en la introducción del curso. A continuación tenemos dos ejemplos de indexación:
[12]:
df.columns[-2]
[12]:
'Urban_population_growth'
[13]:
df.columns[0:5]
[13]:
Index(['Country', 'CountryID', 'Continent', 'Adolescent fertility rate (%)',
'Adult literacy rate (%)'],
dtype='object')
[14]:
df.columns[-5:-1]
[14]:
Index(['Under_five_mortality_from_IHME', 'Under_five_mortality_rate',
'Urban_population', 'Urban_population_growth'],
dtype='object')
¿Cómo consultariais el nombre de la columna 10? ¿y los de las columnas de la 200 a la 225?
[12]:
len(df.columns)
[12]:
358
[8]:
df.columns[200:226]
df.columns[len(df.columns)-3:len(df.columns)]
[8]:
Index(['Urban_population', 'Urban_population_growth',
'Urban_population_pct_of_total'],
dtype='object')
Funciones descriptivas de un dataframe
Pandas
ofrece una colección de funciones que permiten realizar una inspección general de la tabla de datos:
describe: muestra estadísticas descriptivas básicas para todas las columnas numéricas.
info: muestra todas las columnas y sus tipos de datos.
head i tail: muestra las \(n\) primeras/últimas filas. El valor de \(n\) es un parámetro de este método.
[15]:
df.describe()
[15]:
CountryID | Continent | Adolescent fertility rate (%) | Adult literacy rate (%) | Gross national income per capita (PPP international $) | Net primary school enrolment ratio female (%) | Net primary school enrolment ratio male (%) | Population (in thousands) total | Population annual growth rate (%) | Population in urban areas (%) | ... | Total_CO2_emissions | Total_income | Total_reserves | Trade_balance_goods_and_services | Under_five_mortality_from_CME | Under_five_mortality_from_IHME | Under_five_mortality_rate | Urban_population | Urban_population_growth | Urban_population_pct_of_total | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 202.000000 | 202.000000 | 177.000000 | 131.000000 | 178.000000 | 179.000000 | 179.000000 | 1.930000e+02 | 193.000000 | 193.000000 | ... | 1.860000e+02 | 1.780000e+02 | 128.000000 | 1.710000e+02 | 181.000000 | 170.000000 | 181.000000 | 1.880000e+02 | 188.000000 | 188.000000 |
mean | 101.500000 | 3.579208 | 59.457627 | 78.871756 | 11250.112360 | 84.033520 | 85.698324 | 3.409805e+04 | 1.297927 | 54.911917 | ... | 1.483596e+05 | 2.015567e+11 | 57.253516 | 3.424012e+08 | 56.677624 | 54.356471 | 56.677624 | 1.665763e+07 | 2.165851 | 55.195213 |
std | 58.456537 | 1.808263 | 49.105286 | 20.415760 | 12586.753417 | 17.788047 | 15.451212 | 1.304957e+05 | 1.163864 | 23.554182 | ... | 6.133091e+05 | 9.400689e+11 | 138.669298 | 5.943043e+10 | 60.060929 | 61.160556 | 60.060929 | 5.094867e+07 | 1.596628 | 23.742122 |
min | 1.000000 | 1.000000 | 0.000000 | 23.600000 | 260.000000 | 6.000000 | 11.000000 | 2.000000e+00 | -2.500000 | 10.000000 | ... | 2.565000e+01 | 5.190000e+07 | 0.990000 | -7.140000e+11 | 2.900000 | 3.000000 | 2.900000 | 1.545600e+04 | -1.160000 | 10.000000 |
25% | 51.250000 | 2.000000 | 19.000000 | 68.400000 | 2112.500000 | 79.000000 | 79.500000 | 1.340000e+03 | 0.500000 | 36.000000 | ... | 1.672615e+03 | 3.317500e+09 | 16.292500 | -1.210000e+09 | 12.400000 | 8.475000 | 12.400000 | 9.171623e+05 | 1.105000 | 35.650000 |
50% | 101.500000 | 3.000000 | 46.000000 | 86.500000 | 6175.000000 | 90.000000 | 90.000000 | 6.762000e+03 | 1.300000 | 57.000000 | ... | 1.021157e+04 | 1.145000e+10 | 28.515000 | -2.240000e+08 | 29.980000 | 27.600000 | 29.980000 | 3.427661e+06 | 1.945000 | 57.300000 |
75% | 151.750000 | 5.000000 | 91.000000 | 95.300000 | 14502.500000 | 96.000000 | 96.000000 | 2.173200e+04 | 2.100000 | 73.000000 | ... | 6.549217e+04 | 8.680000e+10 | 55.310000 | 1.024000e+09 | 88.700000 | 82.900000 | 88.700000 | 9.837113e+06 | 3.252500 | 72.750000 |
max | 202.000000 | 7.000000 | 199.000000 | 99.800000 | 60870.000000 | 100.000000 | 100.000000 | 1.328474e+06 | 4.300000 | 100.000000 | ... | 5.776432e+06 | 1.100000e+13 | 1334.860000 | 1.390000e+11 | 267.000000 | 253.700000 | 267.000000 | 5.270000e+08 | 7.850000 | 100.000000 |
8 rows × 357 columns
[13]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 202 entries, 0 to 201
Columns: 358 entries, Country to Urban_population_pct_of_total
dtypes: float64(355), int64(2), object(1)
memory usage: 565.1+ KB
[17]:
df.head(2)
[17]:
Country | CountryID | Continent | Adolescent fertility rate (%) | Adult literacy rate (%) | Gross national income per capita (PPP international $) | Net primary school enrolment ratio female (%) | Net primary school enrolment ratio male (%) | Population (in thousands) total | Population annual growth rate (%) | ... | Total_CO2_emissions | Total_income | Total_reserves | Trade_balance_goods_and_services | Under_five_mortality_from_CME | Under_five_mortality_from_IHME | Under_five_mortality_rate | Urban_population | Urban_population_growth | Urban_population_pct_of_total | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Afghanistan | 1 | 1 | 151.0 | 28.0 | NaN | NaN | NaN | 26088.0 | 4.0 | ... | 692.50 | NaN | NaN | NaN | 257.00 | 231.9 | 257.00 | 5740436.0 | 5.44 | 22.9 |
1 | Albania | 2 | 2 | 27.0 | 98.7 | 6000.0 | 93.0 | 94.0 | 3172.0 | 0.6 | ... | 3499.12 | 4.790000e+09 | 78.14 | -2.040000e+09 | 18.47 | 15.5 | 18.47 | 1431793.9 | 2.21 | 45.4 |
2 rows × 358 columns
[21]:
df.tail(2)
[21]:
Country | CountryID | Continent | Adolescent fertility rate (%) | Adult literacy rate (%) | Gross national income per capita (PPP international $) | Net primary school enrolment ratio female (%) | Net primary school enrolment ratio male (%) | Population (in thousands) total | Population annual growth rate (%) | ... | Total_CO2_emissions | Total_income | Total_reserves | Trade_balance_goods_and_services | Under_five_mortality_from_CME | Under_five_mortality_from_IHME | Under_five_mortality_rate | Urban_population | Urban_population_growth | Urban_population_pct_of_total | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
200 | Zambia | 201 | 3 | 161.0 | 68.0 | 1140.0 | 94.0 | 90.0 | 11696.0 | 1.9 | ... | 2366.94 | 4.090000e+09 | 10.41 | -447000000.0 | 175.3 | 163.8 | 175.3 | 4017411.0 | 1.95 | 35.0 |
201 | Zimbabwe | 202 | 3 | 101.0 | 89.5 | NaN | 88.0 | 87.0 | 13228.0 | 0.8 | ... | 11457.33 | 5.620000e+09 | 3.39 | -171000000.0 | 106.5 | 67.0 | 106.5 | 4709965.0 | 1.90 | 35.9 |
2 rows × 358 columns
[20]:
df.tail(2)
[20]:
Country | CountryID | Continent | Adolescent fertility rate (%) | Adult literacy rate (%) | Gross national income per capita (PPP international $) | Net primary school enrolment ratio female (%) | Net primary school enrolment ratio male (%) | Population (in thousands) total | Population annual growth rate (%) | ... | Total_CO2_emissions | Total_income | Total_reserves | Trade_balance_goods_and_services | Under_five_mortality_from_CME | Under_five_mortality_from_IHME | Under_five_mortality_rate | Urban_population | Urban_population_growth | Urban_population_pct_of_total | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
200 | Zambia | 201 | 3 | 161.0 | 68.0 | 1140.0 | 94.0 | 90.0 | 11696.0 | 1.9 | ... | 2366.94 | 4.090000e+09 | 10.41 | -447000000.0 | 175.3 | 163.8 | 175.3 | 4017411.0 | 1.95 | 35.0 |
201 | Zimbabwe | 202 | 3 | 101.0 | 89.5 | NaN | 88.0 | 87.0 | 13228.0 | 0.8 | ... | 11457.33 | 5.620000e+09 | 3.39 | -171000000.0 | 106.5 | 67.0 | 106.5 | 4709965.0 | 1.90 | 35.9 |
2 rows × 358 columns
Carga de datos (segunda parte)
Desafortunadamente, la estructura y codificación de los datos en los archivos CSV varía según la herramienta o el sistema operativo. Por lo tanto, podemos encontrarnos con separadores entre columnas que no sean la típica coma (‘,’) o formatos de codificación de texto que no sean abiertos (por ejemplo, utf-8, ansi, etc.).
Es por esto que la función read_csv
es muy versátil. Puedes consultar su documentación.
Vamos a ver qué sucede cuando se obtienen datos de una administración pública. Puedes encontrar el archivo disponible en: ‘data/presupuesto_gastos_2023.csv’. Puedes acceder a estos datos a través del siguiente enlace.
[20]:
df_gastos = pd.read_csv("data/presupuesto_gastos_2023.csv",encoding="cp1250",sep=";") # quins errors genera ?
[21]:
df_gastos
[21]:
_id | SOZIETATEA_EU/SOCIEDAD_EU | SOZIETATEA_CAS/SOCIEDAD_CAS | EKITALDIA/EJERCICIO | SAILA/DEPARTAMENTO | SAILAREN DESKRIBAPENA_EU/DESCRIPCION DEPARTAMENTO_EU | SAILAREN DESKRIBAPENA_EU/DESCRIPCION DEPARTAMENTO_CAS | ZENTRU KUDEATZAILEA/CENTRO GESTOR | ZENTRO KUDEATZAILEAREN DESKR._EUS/DESCR. CENTRO GESTOR_EUS | ZENTRO KUDEATZAILEAREN DESKR._CAS/DESCR. CENTRO GESTOR_CAS | ... | ARTIKULUAREN DESKRIBAPENA_CAS/DESCRIPCION ARTICULO_CAS | KONTZEPTUA/CONCEPTO | KONTZEPTUAREN DESKRIBAPENA_EUS/DESCRIPCION CONCEPTO_EUS | KONTZEPTUAREN DESKRIBAPENA_CAS/DESCRIPCION CONCEPTO_CAS | AZPIKONTZEPTUA/SUBCONCEPTO | AZPIKONTZEPTUAREN DESKRIBAPENA_EUS/DESCRIPCION SUBCONCEPTO_EUS | AZPIKONTZEPTUAREN DESKRIBAPENA_CAS/DESCRIPCION SUBCONCEPTO_CAS | PROIEKTUA/PROYECTO | PROIEKTUAREN DESKRIBAPENA/DESCRIPCION PROYECTO | HASIERAKO KREDITUA/CREDITO INICIAL 2023 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | UDAL | UDAL | 2023 | 110 | KULTURA ETA GOBERNANTZA | CULTURA Y GOBERNANZA | 1100 | KULTURA ETA GOBERNANTZA | CULTURA Y GOBERNANZA | ... | RETRIBUCIONES DE ALTOS CARGOS | 100 | GOI KARGUEN OINARRIZKO SOLDATAK ETA BESTELAKO ... | RETRIBUCIONES BASICAS Y OTRAS REMUNERACIONES D... | 10001 | GOI KARGUEN OINARRIZKO SOLDATAK ETA BESTELAKO ... | RETRIBUCIONES BASICAS Y OTRAS REMUNERACIONES D... | 9999/99999 | GENERIKOA/ GENÉRICO | 220.619,00 |
1 | 2 | UDAL | UDAL | 2023 | 110 | KULTURA ETA GOBERNANTZA | CULTURA Y GOBERNANZA | 1100 | KULTURA ETA GOBERNANTZA | CULTURA Y GOBERNANZA | ... | RETRIBUCIONES DEL PERSONAL EVENTUAL DE GABINETES | 110 | KABINETEETAKO ALDI BAT. PERTSONALAREN OINARRIZ... | RETRIBUCIONES BASICAS Y OTRAS REMUNERACIONES D... | 11001 | KABINETEETAKO ALDI BAT. PERTSONALAREN OINARRIZ... | RETRIBUCIONES BASICAS Y OTRAS REMUNERACIONES D... | 9999/99999 | GENERIKOA/ GENÉRICO | 589.261,00 |
2 | 3 | UDAL | UDAL | 2023 | 110 | KULTURA ETA GOBERNANTZA | CULTURA Y GOBERNANZA | 1100 | KULTURA ETA GOBERNANTZA | CULTURA Y GOBERNANZA | ... | RETRIBUCIONES DEL PERSONAL FUNCIONARIO | 120 | FUNTZIONARIOEN OINARRIZKO SOLDATAK | RETRIBUCIONES BASICAS DEL PERSONAL FUNCIONARIO | 12001 | FUNTZIONARIOEN OINARRIZKO SOLDATAK | RETRIBUCIONES BASICAS DEL PERSONAL FUNCIONARIO | 9999/99999 | GENERIKOA/ GENÉRICO | 383.369,00 |
3 | 4 | UDAL | UDAL | 2023 | 110 | KULTURA ETA GOBERNANTZA | CULTURA Y GOBERNANZA | 1100 | KULTURA ETA GOBERNANTZA | CULTURA Y GOBERNANZA | ... | RETRIBUCIONES DEL PERSONAL FUNCIONARIO | 121 | FUNTZIONARIOEN ORDAINSARI OSAGARRIAK | RETRIBUCIONES COMPLEMENTARIAS DEL PERSONAL FUN... | 12101 | LANTOKI OSAGARRIA | COMPLEMENTO DE DESTINO | 9999/99999 | GENERIKOA/ GENÉRICO | 131.722,00 |
4 | 5 | UDAL | UDAL | 2023 | 110 | KULTURA ETA GOBERNANTZA | CULTURA Y GOBERNANZA | 1100 | KULTURA ETA GOBERNANTZA | CULTURA Y GOBERNANZA | ... | RETRIBUCIONES DEL PERSONAL FUNCIONARIO | 121 | FUNTZIONARIOEN ORDAINSARI OSAGARRIAK | RETRIBUCIONES COMPLEMENTARIAS DEL PERSONAL FUN... | 12102 | OSAGARRI BEREZIA | COMPLEMENTO ESPECIFICO | 9999/99999 | GENERIKOA/ GENÉRICO | 396.609,00 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2088 | 2089 | UDAL | UDAL | 2023 | A10 | ALKATETZAKO KABINETEA | GABINETE DE ALCALDÍA | A101 | PROTOKOLOKO ETA HARREMAN PUBLIKOEN BULEGOA | OFICINA PROTOCOLO, RR.PP. E NSTITUCIONALES | ... | MATERIAL, SUMINISTROS Y OTROS | 227 | KANPOKO ENPRESEK EGINDAKO LANAK | TRABAJOS REALIZADOS POR OTRAS EMPRESAS EXTERNAS | 22717 | BIDAI AGENTZIETAKO ZERBITZUAK | SERVICIOS DE AGENCIAS DE VIAJES | 2006/00184 | Harreman Publikoetako Kabineteari lotutako gas... | 500,00 |
2089 | 2090 | UDAL | UDAL | 2023 | A10 | ALKATETZAKO KABINETEA | GABINETE DE ALCALDÍA | A101 | PROTOKOLOKO ETA HARREMAN PUBLIKOEN BULEGOA | OFICINA PROTOCOLO, RR.PP. E NSTITUCIONALES | ... | MATERIAL, SUMINISTROS Y OTROS | 227 | KANPOKO ENPRESEK EGINDAKO LANAK | TRABAJOS REALIZADOS POR OTRAS EMPRESAS EXTERNAS | 22720 | ITZULPEN ETA INTERPRETARITZA ZERBITZUAK | SERVICIOS DE TRADUCCION E INTERPRETES | 2006/00182 | Protokoloko eta ordezkaritzako arreta/Harreman... | 2.000,00 |
2090 | 2091 | UDAL | UDAL | 2023 | A10 | ALKATETZAKO KABINETEA | GABINETE DE ALCALDÍA | A101 | PROTOKOLOKO ETA HARREMAN PUBLIKOEN BULEGOA | OFICINA PROTOCOLO, RR.PP. E NSTITUCIONALES | ... | MATERIAL, SUMINISTROS Y OTROS | 227 | KANPOKO ENPRESEK EGINDAKO LANAK | TRABAJOS REALIZADOS POR OTRAS EMPRESAS EXTERNAS | 22723 | ZERBITZUAK:ARGIAK ETA SOINUAK | SERVICIOS DE ILUMINACIÓN Y SONIDO | 2006/00182 | Protokoloko eta ordezkaritzako arreta/Harreman... | 10.000,00 |
2091 | 2092 | UDAL | UDAL | 2023 | A10 | ALKATETZAKO KABINETEA | GABINETE DE ALCALDÍA | A101 | PROTOKOLOKO ETA HARREMAN PUBLIKOEN BULEGOA | OFICINA PROTOCOLO, RR.PP. E NSTITUCIONALES | ... | MATERIAL, SUMINISTROS Y OTROS | 227 | KANPOKO ENPRESEK EGINDAKO LANAK | TRABAJOS REALIZADOS POR OTRAS EMPRESAS EXTERNAS | 22799 | KANPOKO ENPRESEK EGINDAKO BESTELAKO LANAK | OTROS TRABAJOS REALIZADOS POR EMPRESAS EXTERNAS | 2006/00182 | Protokoloko eta ordezkaritzako arreta/Harreman... | 48.400,00 |
2092 | 2093 | UDAL | UDAL | 2023 | A10 | ALKATETZAKO KABINETEA | GABINETE DE ALCALDÍA | A101 | PROTOKOLOKO ETA HARREMAN PUBLIKOEN BULEGOA | OFICINA PROTOCOLO, RR.PP. E NSTITUCIONALES | ... | INDEMNIZACIONES POR RAZON DEL SERVICIO | 230 | BIDAI SARIAK, PERTSONALAREN GARRAIOA ETA LOKOM... | DIETAS, LOCOMOCION Y TRASLADO DEL PERSONAL | 23001 | BIDAI SARIAK, PERTSONALAREN GARRAIOA ETA LOKOM... | DIETAS, LOCOMOCION Y TRASLADO DEL PERSONAL | 2006/00184 | Harreman Publikoetako Kabineteari lotutako gas... | 500,00 |
2093 rows × 37 columns
Codificación de caracteres
Python utilitza una representación basada en Unicode (https://home.unicode.org/). Otros sistemas operativos y programas utilizan otro tipo de representaciones.
var = "camión"
var = "lul·lià"
var = "Ζεύς"
var = "ประเทศไทย"
var = "日本語で"
Codificaciones: - listado de codificaciones - UTF-8
[ ]:
# cp1250 | windows-1250 | Central and Eastern Europe
df_gastos = pd.read_csv("data/presupuesto_gastos_2023.csv",sep=";") #Aun tenemos otro fallo, vamos a verlo
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
/Users/isaac/Projects/TxADM_notebooks/notebooks/Part2/00_Pandas/01_Introduccion.ipynb Cell 32 line <cell line: 3>()
<a href='vscode-notebook-cell:/Users/isaac/Projects/TxADM_notebooks/notebooks/Part2/00_Pandas/01_Introduccion.ipynb#X35sZmlsZQ%3D%3D?line=0'>1</a> # cp1250 | windows-1250 | Central and Eastern Europe
----> <a href='vscode-notebook-cell:/Users/isaac/Projects/TxADM_notebooks/notebooks/Part2/00_Pandas/01_Introduccion.ipynb#X35sZmlsZQ%3D%3D?line=2'>3</a> df_gastos = pd.read_csv("data/presupuesto_gastos_2023.csv",sep=";")
File ~/.pyenv/versions/3.9.7/lib/python3.9/site-packages/pandas/util/_decorators.py:211, in deprecate_kwarg.<locals>._deprecate_kwarg.<locals>.wrapper(*args, **kwargs)
209 else:
210 kwargs[new_arg_name] = new_arg_value
--> 211 return func(*args, **kwargs)
File ~/.pyenv/versions/3.9.7/lib/python3.9/site-packages/pandas/util/_decorators.py:331, in deprecate_nonkeyword_arguments.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
325 if len(args) > num_allow_args:
326 warnings.warn(
327 msg.format(arguments=_format_argument_list(allow_args)),
328 FutureWarning,
329 stacklevel=find_stack_level(),
330 )
--> 331 return func(*args, **kwargs)
File ~/.pyenv/versions/3.9.7/lib/python3.9/site-packages/pandas/io/parsers/readers.py:950, in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, encoding_errors, dialect, error_bad_lines, warn_bad_lines, on_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options)
935 kwds_defaults = _refine_defaults_read(
936 dialect,
937 delimiter,
(...)
946 defaults={"delimiter": ","},
947 )
948 kwds.update(kwds_defaults)
--> 950 return _read(filepath_or_buffer, kwds)
File ~/.pyenv/versions/3.9.7/lib/python3.9/site-packages/pandas/io/parsers/readers.py:605, in _read(filepath_or_buffer, kwds)
602 _validate_names(kwds.get("names", None))
604 # Create the parser.
--> 605 parser = TextFileReader(filepath_or_buffer, **kwds)
607 if chunksize or iterator:
608 return parser
File ~/.pyenv/versions/3.9.7/lib/python3.9/site-packages/pandas/io/parsers/readers.py:1442, in TextFileReader.__init__(self, f, engine, **kwds)
1439 self.options["has_index_names"] = kwds["has_index_names"]
1441 self.handles: IOHandles | None = None
-> 1442 self._engine = self._make_engine(f, self.engine)
File ~/.pyenv/versions/3.9.7/lib/python3.9/site-packages/pandas/io/parsers/readers.py:1753, in TextFileReader._make_engine(self, f, engine)
1750 raise ValueError(msg)
1752 try:
-> 1753 return mapping[engine](f, **self.options)
1754 except Exception:
1755 if self.handles is not None:
File ~/.pyenv/versions/3.9.7/lib/python3.9/site-packages/pandas/io/parsers/c_parser_wrapper.py:79, in CParserWrapper.__init__(self, src, **kwds)
76 kwds.pop(key, None)
78 kwds["dtype"] = ensure_dtype_objs(kwds.get("dtype", None))
---> 79 self._reader = parsers.TextReader(src, **kwds)
81 self.unnamed_cols = self._reader.unnamed_cols
83 # error: Cannot determine type of 'names'
File ~/.pyenv/versions/3.9.7/lib/python3.9/site-packages/pandas/_libs/parsers.pyx:547, in pandas._libs.parsers.TextReader.__cinit__()
File ~/.pyenv/versions/3.9.7/lib/python3.9/site-packages/pandas/_libs/parsers.pyx:636, in pandas._libs.parsers.TextReader._get_header()
File ~/.pyenv/versions/3.9.7/lib/python3.9/site-packages/pandas/_libs/parsers.pyx:852, in pandas._libs.parsers.TextReader._tokenize_rows()
File ~/.pyenv/versions/3.9.7/lib/python3.9/site-packages/pandas/_libs/parsers.pyx:1965, in pandas._libs.parsers.raise_parser_error()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc1 in position 1697: invalid start byte
[ ]:
df_gastos
[6]:
df_gastos = pd.read_csv("data/presupuesto_gastos_2023.csv",delimiter="\t",encoding="cp1250")
Otras maneras de cargar datos
[22]:
# por la dirección del fichero en web
df_who = pd.read_csv("http://www.exploredata.net/ftp/WHO.csv") #dataframe
[23]:
df_who.head(2)
[23]:
Country | CountryID | Continent | Adolescent fertility rate (%) | Adult literacy rate (%) | Gross national income per capita (PPP international $) | Net primary school enrolment ratio female (%) | Net primary school enrolment ratio male (%) | Population (in thousands) total | Population annual growth rate (%) | ... | Total_CO2_emissions | Total_income | Total_reserves | Trade_balance_goods_and_services | Under_five_mortality_from_CME | Under_five_mortality_from_IHME | Under_five_mortality_rate | Urban_population | Urban_population_growth | Urban_population_pct_of_total | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Afghanistan | 1 | 1 | 151.0 | 28.0 | NaN | NaN | NaN | 26088.0 | 4.0 | ... | 692.50 | NaN | NaN | NaN | 257.00 | 231.9 | 257.00 | 5740436.0 | 5.44 | 22.9 |
1 | Albania | 2 | 2 | 27.0 | 98.7 | 6000.0 | 93.0 | 94.0 | 3172.0 | 0.6 | ... | 3499.12 | 4.790000e+09 | 78.14 | -2.040000e+09 | 18.47 | 15.5 | 18.47 | 1431793.9 | 2.21 | 45.4 |
2 rows × 358 columns
[33]:
# extracción de contenido en una página web
url = "https://es.wikipedia.org/wiki/Anexo:Comunidades_y_ciudades_aut%C3%B3nomas_de_Espa%C3%B1a"
comunidades_esp = pd.io.html.read_html(url)
comunidades_esp[0] # Alerta!
[33]:
Unnamed: 0 | Nombre | Capital (de iure o en su defecto de facto) | Población (2022) | Porcentaje población | Densidad (hab./km²) | Superficie (km²) | Porcentaje superficie | Mapa | PIB per cápita en € (2021) | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Andalucía | Sevilla | NaN | 17,87 % | 9705 | NaN | 17,31 % | NaN | NaN |
1 | 2 | Cataluña | Barcelona | NaN | 16,38 % | 23884 | NaN | 6,35 % | NaN | NaN |
2 | 3 | Comunidad de Madrid | Madrid | NaN | 14,24 % | 84115 | NaN | 1,59 % | NaN | NaN |
3 | 4 | Comunidad Valenciana | Valencia | NaN | 10,67 % | 21698 | NaN | 4,60 % | NaN | NaN |
4 | 5 | Galicia | Santiago de Compostela | NaN | 5,68 % | 9119 | NaN | 5,84 % | NaN | NaN |
5 | 6 | Castilla y León | Valladolid[nota 1] | NaN | 5,02 % | 2534 | NaN | 18,62 % | NaN | NaN |
6 | 7 | País Vasco | Vitoria[nota 1] | NaN | 4,67 % | 30213 | NaN | 1,43 % | NaN | NaN |
7 | 8 | Canarias | Las Palmas de Gran Canaria y Santa Cruz de Ten... | NaN | 4,58 % | 30139 | [nota 3] | 1,47 % | NaN | NaN |
8 | 9 | Castilla-La Mancha | Toledo[nota 1] | NaN | 4,32 % | 2579 | NaN | 15,70 % | NaN | NaN |
9 | 10 | Región de Murcia | Murcia | NaN | 3,20 % | 13374 | NaN | 2,24 % | NaN | NaN |
10 | 11 | Aragón | Zaragoza | NaN | 2,79 % | 2790 | NaN | 9,43 % | NaN | NaN |
11 | 12 | Islas Baleares | Palma | NaN | 2,47 % | 24427 | NaN | 0,99 % | NaN | NaN |
12 | 13 | Extremadura | Mérida | NaN | 2,23 % | 2541 | NaN | 8,23 % | NaN | NaN |
13 | 14 | Principado de Asturias | Oviedo | NaN | 2,14 % | 9553 | NaN | 2,13 % | NaN | NaN |
14 | 15 | Comunidad Foral de Navarra | Pamplona | NaN | 1,39 % | 6330 | NaN | 2,05 % | NaN | NaN |
15 | 16 | Cantabria | Santander | NaN | 1,23 % | 10974 | NaN | 1,05 % | NaN | NaN |
16 | 17 | La Rioja | Logroño | NaN | 0,67 % | 6267 | NaN | 1,00 % | NaN | NaN |
17 | 18 | Melilla | Melilla | NaN | 0,18 % | 700158 | NaN | <0,01 % | NaN | NaN |
18 | 19 | Ceuta | Ceuta | NaN | 0,18 % | 417510 | NaN | <0,01 % | NaN | NaN |
19 | TOTAL | España | Madrid | NaN | 100 % | 9367 | NaN | 100 % | — | NaN |
[10]:
print(type(comunidades_esp[0]))
df_comunidades_esp = comunidades_esp[0]
df_comunidades_esp.head()
<class 'pandas.core.frame.DataFrame'>
[10]:
Unnamed: 0 | Nombre | Capital (de iure o en su defecto de facto) | Población (2022) | Porcentaje población | Densidad (hab./km²) | Superficie (km²) | Porcentaje superficie | Mapa | PIB per cápita en € (2021) | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Andalucía | Sevilla | NaN | 17,87 % | 9705 | NaN | 17,31 % | NaN | NaN |
1 | 2 | Cataluña | Barcelona | NaN | 16,38 % | 23884 | NaN | 6,35 % | NaN | NaN |
2 | 3 | Comunidad de Madrid | Madrid | NaN | 14,24 % | 84115 | NaN | 1,59 % | NaN | NaN |
3 | 4 | Comunidad Valenciana | Valencia | NaN | 10,67 % | 21698 | NaN | 4,60 % | NaN | NaN |
4 | 5 | Galicia | Santiago de Compostela | NaN | 5,68 % | 9119 | NaN | 5,84 % | NaN | NaN |
[34]:
# Contenido JSON
#Fuente: https://data.cityofnewyork.us/Environment/Urban-Park-Ranger-Animal-Condition-Response/fuhs-xmg2
url = 'https://data.cityofnewyork.us/resource/s3vf-x992.json'
df = pd.read_json(url)
df.head()
[34]:
date_and_time_of_initial | date_and_time_of_ranger | borough | property | location | species_description | call_source | species_status | animal_condition | duration_of_response | ... | _311sr_number | final_ranger_action | of_animals | pep_response | animal_monitored | police_response | esu_response | acc_intake_number | hours_spent_monitoring | rehabilitator | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2021-06-23T16:45:00.000 | 2021-06-24T08:00:00.000 | Brooklyn | Sternberg Park | Inside locked athletic field under construction | Chukar | Other | Exotic | Healthy | 6.00 | ... | 311-06712416 | ACC | 6 | False | False | False | False | 163537 | NaN | NaN |
1 | 2021-06-24T10:00:00.000 | 2021-06-24T11:00:00.000 | Bronx | Haffen Park | Haffen Pool | Sparrow | Central | Native | Healthy | 1.75 | ... | 311-06714879 | Rehabilitator | 4 | False | False | False | False | NaN | NaN | NaN |
2 | 2021-06-23T14:30:00.000 | 2021-06-23T14:30:00.000 | Bronx | Pelham Bay Park | Pelham Bay South | White-tailed Deer | Employee | Native | N/A | 1.00 | ... | NaN | Unfounded | 0 | False | False | False | False | NaN | NaN | NaN |
3 | 2021-06-23T13:00:00.000 | 2021-06-23T13:10:00.000 | Staten Island | Willowbrook Park | The carousel | Raccoon | Employee | Native | N/A | 2.00 | ... | NaN | Unfounded | 0 | False | False | False | False | NaN | NaN | NaN |
4 | 2021-06-23T09:20:00.000 | 2021-06-23T09:20:00.000 | Queens | Judge Moses Weinstein Playground | Garbage can | Virginia Opossum | Central | Native | Healthy | 2.25 | ... | 311-06699415 | ACC | 1 | False | False | False | False | 119833 | NaN | NaN |
5 rows × 22 columns
Actividades
En esta actividad practicaremos la carga de datos en diferentes formatos. En el mundo real, los datos no siempre tienen una estructura y un formato como desearíamos.
El objetivo es que analices la carga de estos datos con los datos originales: - ¿Qué dimensión tienen los datos reales y los cargados? - ¿Cuáles son las columnas? - ¿El concepto de columna como atributo o característica y el concepto de fila como muestra están presentes en la estructura de los datos? - ¿Coinciden con la información del archivo?
[13]:
# 1. Descarga y carga el siguiente fichero:
# https://data.cityofnewyork.us/Housing-Development/Speculation-Watch-List/adax-9mit
[14]:
# 2. Descarga y carga el siguiente fichero:
# https://ibestat.caib.es/ibestat/estadistiques/poblacio/naixements/414cab4f-b402-4cd1-af05-6617443de384
[15]:
# 3. Y ahora Descargalo y abrelo comprimido!!!. Es decir, sin descomprimir!
# https://ec.europa.eu/eurostat/databrowser/view/tin00171/default/table?lang=en
# Los archivos comprimidos en formato .gz se pueden abrir directamente como si fueran archivos de datos con Pandas, y en este caso, son archivos del tipo CSV. Es decir, no es necesario descomprimirlos.
Guardando un dataframe
Durante este curso aprenderemos a modificar los dataframes, agregaremos y eliminaremos columnas, y también modificaremos las que ya tenemos. Por lo tanto, después de realizar este trabajo, es necesario guardar los nuevos datos en un archivo.
[ ]:
df_gastos.to_csv('data/tmp_file.csv',encoding='utf-8') # guardant un dataframe en un fitxer, especificant el format
Estructura del dataframe
Ahora que ya sabemos cargar dataframes desde archivos, descubriremos cómo podemos acceder a la información que se encuentra dentro de los tipos de variables utilizados por Pandas.
Un dataframe tiene columnas y filas. Las filas son muestras y las columnas representan características de una muestra. Una columna es del tipo Serie.
El objetivo de esta unidad es adquirir herramientas para comprender y seleccionar los datos representados en un dataframe y una serie con Pandas.
Comenzaremos seleccionando columnas y obteniendo resúmenes estadísticos de ellas. Más adelante, pasaremos a realizar selecciones de filas en el dataframe. Finalmente, realizaremos selecciones combinadas creando nuestros propios dataframes a partir de los subconjuntos seleccionados.”
[3]:
import pandas as pd
df_who = pd.read_csv("http://www.exploredata.net/ftp/WHO.csv") #dataframe
[4]:
df_who.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 202 entries, 0 to 201
Columns: 358 entries, Country to Urban_population_pct_of_total
dtypes: float64(355), int64(2), object(1)
memory usage: 565.1+ KB
[5]:
df_who.describe()
[5]:
CountryID | Continent | Adolescent fertility rate (%) | Adult literacy rate (%) | Gross national income per capita (PPP international $) | Net primary school enrolment ratio female (%) | Net primary school enrolment ratio male (%) | Population (in thousands) total | Population annual growth rate (%) | Population in urban areas (%) | ... | Total_CO2_emissions | Total_income | Total_reserves | Trade_balance_goods_and_services | Under_five_mortality_from_CME | Under_five_mortality_from_IHME | Under_five_mortality_rate | Urban_population | Urban_population_growth | Urban_population_pct_of_total | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 202.000000 | 202.000000 | 177.000000 | 131.000000 | 178.000000 | 179.000000 | 179.000000 | 1.930000e+02 | 193.000000 | 193.000000 | ... | 1.860000e+02 | 1.780000e+02 | 128.000000 | 1.710000e+02 | 181.000000 | 170.000000 | 181.000000 | 1.880000e+02 | 188.000000 | 188.000000 |
mean | 101.500000 | 3.579208 | 59.457627 | 78.871756 | 11250.112360 | 84.033520 | 85.698324 | 3.409805e+04 | 1.297927 | 54.911917 | ... | 1.483596e+05 | 2.015567e+11 | 57.253516 | 3.424012e+08 | 56.677624 | 54.356471 | 56.677624 | 1.665763e+07 | 2.165851 | 55.195213 |
std | 58.456537 | 1.808263 | 49.105286 | 20.415760 | 12586.753417 | 17.788047 | 15.451212 | 1.304957e+05 | 1.163864 | 23.554182 | ... | 6.133091e+05 | 9.400689e+11 | 138.669298 | 5.943043e+10 | 60.060929 | 61.160556 | 60.060929 | 5.094867e+07 | 1.596628 | 23.742122 |
min | 1.000000 | 1.000000 | 0.000000 | 23.600000 | 260.000000 | 6.000000 | 11.000000 | 2.000000e+00 | -2.500000 | 10.000000 | ... | 2.565000e+01 | 5.190000e+07 | 0.990000 | -7.140000e+11 | 2.900000 | 3.000000 | 2.900000 | 1.545600e+04 | -1.160000 | 10.000000 |
25% | 51.250000 | 2.000000 | 19.000000 | 68.400000 | 2112.500000 | 79.000000 | 79.500000 | 1.340000e+03 | 0.500000 | 36.000000 | ... | 1.672615e+03 | 3.317500e+09 | 16.292500 | -1.210000e+09 | 12.400000 | 8.475000 | 12.400000 | 9.171623e+05 | 1.105000 | 35.650000 |
50% | 101.500000 | 3.000000 | 46.000000 | 86.500000 | 6175.000000 | 90.000000 | 90.000000 | 6.762000e+03 | 1.300000 | 57.000000 | ... | 1.021157e+04 | 1.145000e+10 | 28.515000 | -2.240000e+08 | 29.980000 | 27.600000 | 29.980000 | 3.427661e+06 | 1.945000 | 57.300000 |
75% | 151.750000 | 5.000000 | 91.000000 | 95.300000 | 14502.500000 | 96.000000 | 96.000000 | 2.173200e+04 | 2.100000 | 73.000000 | ... | 6.549217e+04 | 8.680000e+10 | 55.310000 | 1.024000e+09 | 88.700000 | 82.900000 | 88.700000 | 9.837113e+06 | 3.252500 | 72.750000 |
max | 202.000000 | 7.000000 | 199.000000 | 99.800000 | 60870.000000 | 100.000000 | 100.000000 | 1.328474e+06 | 4.300000 | 100.000000 | ... | 5.776432e+06 | 1.100000e+13 | 1334.860000 | 1.390000e+11 | 267.000000 | 253.700000 | 267.000000 | 5.270000e+08 | 7.850000 | 100.000000 |
8 rows × 357 columns
Columnes
Com hem comentat a l’introducció, començarem amb les columnes
[6]:
df_who.head()
[6]:
Country | CountryID | Continent | Adolescent fertility rate (%) | Adult literacy rate (%) | Gross national income per capita (PPP international $) | Net primary school enrolment ratio female (%) | Net primary school enrolment ratio male (%) | Population (in thousands) total | Population annual growth rate (%) | ... | Total_CO2_emissions | Total_income | Total_reserves | Trade_balance_goods_and_services | Under_five_mortality_from_CME | Under_five_mortality_from_IHME | Under_five_mortality_rate | Urban_population | Urban_population_growth | Urban_population_pct_of_total | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Afghanistan | 1 | 1 | 151.0 | 28.0 | NaN | NaN | NaN | 26088.0 | 4.0 | ... | 692.50 | NaN | NaN | NaN | 257.00 | 231.9 | 257.00 | 5740436.0 | 5.44 | 22.9 |
1 | Albania | 2 | 2 | 27.0 | 98.7 | 6000.0 | 93.0 | 94.0 | 3172.0 | 0.6 | ... | 3499.12 | 4.790000e+09 | 78.14 | -2.040000e+09 | 18.47 | 15.5 | 18.47 | 1431793.9 | 2.21 | 45.4 |
2 | Algeria | 3 | 3 | 6.0 | 69.9 | 5940.0 | 94.0 | 96.0 | 33351.0 | 1.5 | ... | 137535.56 | 6.970000e+10 | 351.36 | 4.700000e+09 | 40.00 | 31.2 | 40.00 | 20800000.0 | 2.61 | 63.3 |
3 | Andorra | 4 | 2 | NaN | NaN | NaN | 83.0 | 83.0 | 74.0 | 1.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | Angola | 5 | 3 | 146.0 | 67.4 | 3890.0 | 49.0 | 51.0 | 16557.0 | 2.8 | ... | 8991.46 | 1.490000e+10 | 27.13 | 9.140000e+09 | 164.10 | 242.5 | 164.10 | 8578749.0 | 4.14 | 53.3 |
5 rows × 358 columns
[8]:
df_who.shape
[8]:
(202, 358)
[9]:
# Columnas o características de cada muestra
print(df_who.columns)
Index(['Country', 'CountryID', 'Continent', 'Adolescent fertility rate (%)',
'Adult literacy rate (%)',
'Gross national income per capita (PPP international $)',
'Net primary school enrolment ratio female (%)',
'Net primary school enrolment ratio male (%)',
'Population (in thousands) total', 'Population annual growth rate (%)',
...
'Total_CO2_emissions', 'Total_income', 'Total_reserves',
'Trade_balance_goods_and_services', 'Under_five_mortality_from_CME',
'Under_five_mortality_from_IHME', 'Under_five_mortality_rate',
'Urban_population', 'Urban_population_growth',
'Urban_population_pct_of_total'],
dtype='object', length=358)
[11]:
for ix, col in enumerate(df_who.columns): #label
print("ix:%i\tlabel:%s"%(ix,col))
ix:0 label:Country
ix:1 label:CountryID
ix:2 label:Continent
ix:3 label:Adolescent fertility rate (%)
ix:4 label:Adult literacy rate (%)
ix:5 label:Gross national income per capita (PPP international $)
ix:6 label:Net primary school enrolment ratio female (%)
ix:7 label:Net primary school enrolment ratio male (%)
ix:8 label:Population (in thousands) total
ix:9 label:Population annual growth rate (%)
ix:10 label:Population in urban areas (%)
ix:11 label:Population living below the poverty line (% living on < US$1 per day)
ix:12 label:Population median age (years)
ix:13 label:Population proportion over 60 (%)
ix:14 label:Population proportion under 15 (%)
ix:15 label:Registration coverage of births (%)
ix:16 label:Total fertility rate (per woman)
ix:17 label:Antenatal care coverage - at least four visits (%)
ix:18 label:Antiretroviral therapy coverage among HIV-infected pregt women for PMTCT (%)
ix:19 label:Antiretroviral therapy coverage among people with advanced HIV infections (%)
ix:20 label:Births attended by skilled health personnel (%)
ix:21 label:Births by caesarean section (%)
ix:22 label:Children aged 6-59 months who received vitamin A supplementation (%)
ix:23 label:Children aged <5 years sleeping under insecticide-treated nets (%)
ix:24 label:Children aged <5 years who received any antimalarial treatment for fever (%)
ix:25 label:Children aged <5 years with ARI symptoms taken to facility (%)
ix:26 label:Children aged <5 years with diarrhoea receiving ORT (%)
ix:27 label:Contraceptive prevalence (%)
ix:28 label:Neonates protected at birth against neonatal tetanus (PAB) (%)
ix:29 label:One-year-olds immunized with MCV
ix:30 label:One-year-olds immunized with three doses of diphtheria tetanus toxoid and pertussis (DTP3) (%)
ix:31 label:One-year-olds immunized with three doses of Hepatitis B (HepB3) (%)
ix:32 label:One-year-olds immunized with three doses of Hib (Hib3) vaccine (%)
ix:33 label:Tuberculosis detection rate under DOTS (%)
ix:34 label:Tuberculosis treatment success under DOTS (%)
ix:35 label:Women who have had mammography (%)
ix:36 label:Women who have had PAP smear (%)
ix:37 label:Community and traditional health workers density (per 10 000 population)
ix:38 label:Dentistry personnel density (per 10 000 population)
ix:39 label:Environment and public health workers density (per 10 000 population)
ix:40 label:External resources for health as percentage of total expenditure on health
ix:41 label:General government expenditure on health as percentage of total expenditure on health
ix:42 label:General government expenditure on health as percentage of total government expenditure
ix:43 label:Hospital beds (per 10 000 population)
ix:44 label:Laboratory health workers density (per 10 000 population)
ix:45 label:Number of community and traditional health workers
ix:46 label:Number of dentistry personnel
ix:47 label:Number of environment and public health workers
ix:48 label:Number of laboratory health workers
ix:49 label:Number of nursing and midwifery personnel
ix:50 label:Number of other health service providers
ix:51 label:Number of pharmaceutical personnel
ix:52 label:Number of physicians
ix:53 label:Nursing and midwifery personnel density (per 10 000 population)
ix:54 label:Other health service providers density (per 10 000 population)
ix:55 label:Out-of-pocket expenditure as percentage of private expenditure on health
ix:56 label:Per capita government expenditure on health (PPP int. $)
ix:57 label:Per capita government expenditure on health at average exchange rate (US$)
ix:58 label:Per capita total expenditure on health (PPP int. $)
ix:59 label:Per capita total expenditure on health at average exchange rate (US$)
ix:60 label:Pharmaceutical personnel density (per 10 000 population)
ix:61 label:Physicians density (per 10 000 population)
ix:62 label:Private expenditure on health as percentage of total expenditure on health
ix:63 label:Private prepaid plans as percentage of private expenditure on health
ix:64 label:Ratio of health management and support workers to health service providers
ix:65 label:Ratio of nurses and midwives to physicians
ix:66 label:Social security expenditure on health as percentage of general government expenditure on health
ix:67 label:Total expenditure on health as percentage of gross domestic product
ix:68 label:Births attended by skilled health personnel (%) highest educational level of mother
ix:69 label:Births attended by skilled health personnel (%) highest wealth quintile
ix:70 label:Births attended by skilled health personnel (%) lowest educational level of mother
ix:71 label:Births attended by skilled health personnel (%) lowest wealth quintile
ix:72 label:Births attended by skilled health personnel (%) rural
ix:73 label:Births attended by skilled health personnel (%) urban
ix:74 label:Births attended by skilled health personnel difference highest lowest educational level of mother
ix:75 label:Births attended by skilled health personnel difference highest-lowest wealth quintile
ix:76 label:Births attended by skilled health personnel difference urban-rural
ix:77 label:Births attended by skilled health personnel ratio highest-lowest educational level of mother
ix:78 label:Births attended by skilled health personnel ratio highest-lowest wealth quintile
ix:79 label:Births attended by skilled health personnel ratio urban-rural
ix:80 label:Measles immunization coverage among one-year-olds (%) highest educational level of mother
ix:81 label:Measles immunization coverage among one-year-olds (%) highest wealth quintile
ix:82 label:Measles immunization coverage among one-year-olds (%) lowest educational level of mother
ix:83 label:Measles immunization coverage among one-year-olds (%) lowest wealth quintile
ix:84 label:Measles immunization coverage among one-year-olds (%) rural
ix:85 label:Measles immunization coverage among one-year-olds (%) urban
ix:86 label:Measles immunization coverage among one-year-olds difference highest-lowest educational level of mother
ix:87 label:Measles immunization coverage among one-year-olds difference highest-lowest wealth quintile
ix:88 label:Measles immunization coverage among one-year-olds difference urban-rural
ix:89 label:Measles immunization coverage among one-year-olds ratio highest-lowest educational level of mother
ix:90 label:Measles immunization coverage among one-year-olds ratio highest-lowest wealth quintile
ix:91 label:Measles immunization coverage among one-year-olds ratio urban-rural
ix:92 label:Under-5 mortality rate (Probability of dying aged < 5 years per 1 000 live births) difference lowest-highest educational level of mother
ix:93 label:Under-5 mortality rate (Probability of dying aged < 5 years per 1 000 live births) difference lowest-highest wealth quintile
ix:94 label:Under-5 mortality rate (Probability of dying aged < 5 years per 1 000 live births) difference rural-urban
ix:95 label:Under-5 mortality rate (Probability of dying aged < 5 years per 1 000 live births) highest educational level of mother
ix:96 label:Under-5 mortality rate (Probability of dying aged < 5 years per 1 000 live births) highest wealth quintile
ix:97 label:Under-5 mortality rate (Probability of dying aged < 5 years per 1 000 live births) lowest educational level of mother
ix:98 label:Under-5 mortality rate (Probability of dying aged < 5 years per 1 000 live births) lowest wealth quintile
ix:99 label:Under-5 mortality rate (Probability of dying aged < 5 years per 1 000 live births) ratio lowest-highest educational level of mother
ix:100 label:Under-5 mortality rate (Probability of dying aged < 5 years per 1 000 live births) ratio lowest-highest wealth quintile
ix:101 label:Under-5 mortality rate (Probability of dying aged < 5 years per 1 000 live births) ratio rural-urban
ix:102 label:Under-5 mortality rate (Probability of dying aged < 5 years per 1 000 live births) rural
ix:103 label:Under-5 mortality rate (Probability of dying aged < 5 years per 1 000 live births) urban
ix:104 label:Adult mortality rate (probability of dying between 15 to 60 years per 1000 population) both sexes
ix:105 label:Adult mortality rate (probability of dying between 15 to 60 years per 1000 population) female
ix:106 label:Adult mortality rate (probability of dying between 15 to 60 years per 1000 population) male
ix:107 label:Age-standardized mortality rate for cancer (per 100 000 population)
ix:108 label:Age-standardized mortality rate for cardiovascular diseases (per 100 000 population)
ix:109 label:Age-standardized mortality rate for injuries (per 100 000 population)
ix:110 label:Age-standardized mortality rate for non-communicable diseases (per 100 000 population)
ix:111 label:Deaths among children under five years of age due to diarrhoeal diseases (%)
ix:112 label:Deaths among children under five years of age due to HIV/AIDS (%)
ix:113 label:Deaths among children under five years of age due to injuries (%)
ix:114 label:Deaths among children under five years of age due to malaria (%)
ix:115 label:Deaths among children under five years of age due to measles (%)
ix:116 label:Deaths among children under five years of age due to neonatal causes (%)
ix:117 label:Deaths among children under five years of age due to other causes (%)
ix:118 label:Deaths among children under five years of age due to pneumonia (%)
ix:119 label:Deaths due to HIV/AIDS (per 100 000 population per year)
ix:120 label:Deaths due to tuberculosis among HIV-negative people (per 100 000 population)
ix:121 label:Deaths due to tuberculosis among HIV-positive people (per 100 000 population)
ix:122 label:Healthy life expectancy (HALE) at birth (years) both sexes
ix:123 label:Healthy life expectancy (HALE) at birth (years) female
ix:124 label:Healthy life expectancy (HALE) at birth (years) male
ix:125 label:Incidence of tuberculosis (per 100 000 population per year)
ix:126 label:Infant mortality rate (per 1 000 live births) both sexes
ix:127 label:Infant mortality rate (per 1 000 live births) female
ix:128 label:Infant mortality rate (per 1 000 live births) male
ix:129 label:Life expectancy at birth (years) both sexes
ix:130 label:Life expectancy at birth (years) female
ix:131 label:Life expectancy at birth (years) male
ix:132 label:Maternal mortality ratio (per 100 000 live births)
ix:133 label:Neonatal mortality rate (per 1 000 live births)
ix:134 label:Number of confirmed poliomyelitis cases
ix:135 label:Prevalence of HIV among adults aged >=15 years (per 100 000 population)
ix:136 label:Prevalence of tuberculosis (per 100 000 population)
ix:137 label:Under-5 mortality rate (probability of dying by age 5 per 1000 live births) both sexes
ix:138 label:Under-5 mortality rate (probability of dying by age 5 per 1000 live births) female
ix:139 label:Under-5 mortality rate (probability of dying by age 5 per 1000 live births) male
ix:140 label:Years of life lost to communicable diseases (%)
ix:141 label:Years of life lost to injuries (%)
ix:142 label:Years of life lost to non-communicable diseases (%)
ix:143 label:Children under five years of age overweight for age (%)
ix:144 label:Children under five years of age stunted for age (%)
ix:145 label:Children under five years of age underweight for age (%)
ix:146 label:Newborns with low birth weight (%)
ix:147 label:Per capita recorded alcohol consumption (litres of pure alcohol) among adults (>=15 years)
ix:148 label:Population using solid fuels (%) rural
ix:149 label:Population using solid fuels (%) urban
ix:150 label:Population with sustainable access to improved drinking water sources (%) rural
ix:151 label:Population with sustainable access to improved drinking water sources (%) total
ix:152 label:Population with sustainable access to improved drinking water sources (%) urban
ix:153 label:Population with sustainable access to improved sanitation (%) rural
ix:154 label:Population with sustainable access to improved sanitation (%) total
ix:155 label:Population with sustainable access to improved sanitation (%) urban
ix:156 label:Prevalence of adults (>=15 years) who are obese (%) female
ix:157 label:Prevalence of adults (>=15 years) who are obese (%) male
ix:158 label:Prevalence of condom use by young people (15-24 years) at higher risk sex (%) female
ix:159 label:Prevalence of condom use by young people (15-24 years) at higher risk sex (%) male
ix:160 label:Prevalence of current tobacco use among adolescents (13-15 years) (%) both sexes
ix:161 label:Prevalence of current tobacco use among adolescents (13-15 years) (%) female
ix:162 label:Prevalence of current tobacco use among adolescents (13-15 years) (%) male
ix:163 label:Prevalence of current tobacco use among adults (>=15 years) (%) both sexes
ix:164 label:Prevalence of current tobacco use among adults (>=15 years) (%) female
ix:165 label:Prevalence of current tobacco use among adults (>=15 years) (%) male
ix:166 label:Adolescent_fertility_rate
ix:167 label:Agricultural_land
ix:168 label:Agriculture_contribution_to_economy
ix:169 label:Aid_given
ix:170 label:Aid_received
ix:171 label:Aid_received_total
ix:172 label:All_forms_of_TB_new_cases_per_100_000_estimated
ix:173 label:All_forms_of_TB_new_cases_per_100_000_reported
ix:174 label:Annual_freshwater_withdrawals_total
ix:175 label:Arms_exports
ix:176 label:Arms_imports
ix:177 label:Bad_teeth_per_child
ix:178 label:Births_attended_by_skilled_health_staff
ix:179 label:Breast_cancer_deaths_per_100_000_women
ix:180 label:Breast_cancer_new_cases_per_100_000_women
ix:181 label:Breast_cancer_number_of_female_deaths
ix:182 label:Breast_cancer_number_of_new_female_cases
ix:183 label:Broadband_subscribers
ix:184 label:Broadband_subscribers_per_100_people
ix:185 label:CO2_emissions
ix:186 label:CO2_intensity_of_economic_output
ix:187 label:Capital_formation
ix:188 label:Cell_phones_per_100_people
ix:189 label:Cell_phones_total
ix:190 label:Central_bank_discount_rate
ix:191 label:Cervical_cancer_deaths_per_100_000_women
ix:192 label:Cervical_cancer_new_cases_per_100_000_women
ix:193 label:Cervical_cancer_number_of_female_deaths
ix:194 label:Cervical_cancer_number_of_new_female_cases
ix:195 label:Children_and_elderly
ix:196 label:Children_out_of_school_primary
ix:197 label:Children_out_of_school_primary_female
ix:198 label:Children_out_of_school_primary_male
ix:199 label:Children_per_woman
ix:200 label:Coal_consumption
ix:201 label:Coal_consumption_per_person
ix:202 label:Coal_production
ix:203 label:Coal_production_per_person
ix:204 label:Colon_and_Rectum_cancer_deaths_per_100_000_men
ix:205 label:Colon_and_Rectum_cancer_deaths_per_100_000_women
ix:206 label:Colon_and_Rectum_cancer_new_cases_per_100_000_men
ix:207 label:Colon_and_Rectum_cancer_new_cases_per_100_000_women
ix:208 label:Colon_and_Rectum_cancer_number_of_female_deaths
ix:209 label:Colon_and_Rectum_cancer_number_of_male_deaths
ix:210 label:Colon_and_Rectum_cancer_number_of_new_female_cases
ix:211 label:Colon_and_Rectum_cancer_number_of_new_male_cases
ix:212 label:Consumer_price_index
ix:213 label:Contraceptive_use
ix:214 label:Deaths_from_TB_per_100_000_estimated
ix:215 label:Debt_servicing_costs
ix:216 label:Democracy_score
ix:217 label:Electric_power_consumption
ix:218 label:Electricity_generation
ix:219 label:Electricity_generation_per_person
ix:220 label:Energy_use
ix:221 label:Expenditure_per_student_primary
ix:222 label:Expenditure_per_student_secondary
ix:223 label:Expenditure_per_student_tertiary
ix:224 label:Exports_of_goods_and_services
ix:225 label:Exports_unit_value
ix:226 label:External_debt_total_DOD_current_USdollars
ix:227 label:External_debt_total_pct_of_GNI
ix:228 label:Female_labour_force
ix:229 label:Fixed_line_and_mobile_phone_subscribers
ix:230 label:Foreign_direct_investment_net_inflows
ix:231 label:Foreign_direct_investment_net_outflows
ix:232 label:Forest_area
ix:233 label:Gross_capital_formation
ix:234 label:HIV_infected
ix:235 label:Health_expenditure_per_person
ix:236 label:Health_expenditure_private
ix:237 label:Health_expenditure_public_pct_of_GDP
ix:238 label:Health_expenditure_public_pct_of_government_expenditure
ix:239 label:Health_expenditure_public_pct_of_total_health_expenditure
ix:240 label:Health_expenditure_total
ix:241 label:High_technology_exports
ix:242 label:Hydroelectricity_consumption
ix:243 label:Hydroelectricity_consumption_per_person
ix:244 label:Imports_of_goods_and_services
ix:245 label:Imports_unit_value
ix:246 label:Improved_sanitation_facilities_urban
ix:247 label:Improved_water_source
ix:248 label:Income_growth
ix:249 label:Income_per_person
ix:250 label:Income_share_held_by_lowest_20pct
ix:251 label:Industry_contribution_to_economy
ix:252 label:Inequality_index
ix:253 label:Infant_mortality_rate
ix:254 label:Infectious_TB_new_cases_per_100_000_estimated
ix:255 label:Infectious_TB_new_cases_per_100_000_reported
ix:256 label:Infectious_TB_treatment_completeness
ix:257 label:Inflation_GDP_deflator
ix:258 label:Internet_users
ix:259 label:Life_expectancy_at_birth
ix:260 label:Literacy_rate_adult_female
ix:261 label:Literacy_rate_adult_male
ix:262 label:Literacy_rate_adult_total
ix:263 label:Literacy_rate_youth_female
ix:264 label:Literacy_rate_youth_male
ix:265 label:Literacy_rate_youth_total
ix:266 label:Liver_cancer_deaths_per_100_000_men
ix:267 label:Liver_cancer_deaths_per_100_000_women
ix:268 label:Liver_cancer_new_cases_per_100_000_men
ix:269 label:Liver_cancer_new_cases_per_100_000_women
ix:270 label:Liver_cancer_number_of_female_deaths
ix:271 label:Liver_cancer_number_of_male_deaths
ix:272 label:Liver_cancer_number_of_new_female_cases
ix:273 label:Liver_cancer_number_of_new_male_cases
ix:274 label:Lung_cancer_deaths_per_100_000_men
ix:275 label:Lung_cancer_deaths_per_100_000_women
ix:276 label:Lung_cancer_new_cases_per_100_000_men
ix:277 label:Lung_cancer_new_cases_per_100_000_women
ix:278 label:Lung_cancer_number_of_female_deaths
ix:279 label:Lung_cancer_number_of_male_deaths
ix:280 label:Lung_cancer_number_of_new_female_cases
ix:281 label:Lung_cancer_number_of_new_male_cases
ix:282 label:Malaria_prevention_insecticide_treated_bed_nets_usage
ix:283 label:Malaria_treatment
ix:284 label:Malnutrition_weight_for_age
ix:285 label:Market_value_of_listed_companies
ix:286 label:Maternal_mortality
ix:287 label:Math_achievement_4th_grade
ix:288 label:Math_achievement_8th_grade
ix:289 label:Measles_immunization
ix:290 label:Medical_Doctors
ix:291 label:Merchandise_trade
ix:292 label:Military_expenditure
ix:293 label:Natural_gas_consumption
ix:294 label:Natural_gas_consumption_per_person
ix:295 label:Natural_gas_production
ix:296 label:Natural_gas_production_per_person
ix:297 label:Natural_gas_proved_reserves
ix:298 label:Natural_gas_proven_reserves_per_person
ix:299 label:Net_barter_terms_of_trade
ix:300 label:Nuclear_consumption
ix:301 label:Nuclear_consumption_per_person
ix:302 label:Number_of_deaths_from_TB_estimated
ix:303 label:Number_of_existing_TB_cases_estimated
ix:304 label:Oil_consumption
ix:305 label:Oil_consumption_per_person
ix:306 label:Oil_production
ix:307 label:Oil_production_per_person
ix:308 label:Oil_proved_reserves
ix:309 label:Oil_proven_reserves_per_person
ix:310 label:Old_version_of_Income_per_person
ix:311 label:Patent_applications
ix:312 label:Patents_granted
ix:313 label:Patents_in_force
ix:314 label:People_living_with_HIV
ix:315 label:Personal_computers_per_100_people
ix:316 label:Personal_computers_total
ix:317 label:Population_growth
ix:318 label:Population_in_urban_agglomerations_more_than_1_million
ix:319 label:Population_total
ix:320 label:Poverty_headcount_ratio_at_national_poverty_line
ix:321 label:Present_value_of_debt
ix:322 label:Primary_completion_rate_total
ix:323 label:Primary_energy_consumption
ix:324 label:Primary_energy_consumption_per_person
ix:325 label:Primary_school_completion_pct_of_boys
ix:326 label:Primary_school_completion_pct_of_girls
ix:327 label:Prostate_cancer_deaths_per_100_000_men
ix:328 label:Prostate_cancer_new_cases_per_100_000_men
ix:329 label:Prostate_cancer_number_of_male_deaths
ix:330 label:Prostate_cancer_number_of_new_male_cases
ix:331 label:Pump_price_for_gasoline
ix:332 label:Ratio_of_girls_to_boys_in_primary_and_secondary_education
ix:333 label:Ratio_of_young_literate_females_to_males
ix:334 label:Roads_paved
ix:335 label:SO2_emissions_per_person
ix:336 label:Services_contribution_to_economy
ix:337 label:Stomach_cancer_deaths_per_100_000_men
ix:338 label:Stomach_cancer_deaths_per_100_000_women
ix:339 label:Stomach_cancer_new_cases_per_100_000_men
ix:340 label:Stomach_cancer_new_cases_per_100_000_women
ix:341 label:Stomach_cancer_number_of_female_deaths
ix:342 label:Stomach_cancer_number_of_male_deaths
ix:343 label:Stomach_cancer_number_of_new_female_cases
ix:344 label:Stomach_cancer_number_of_new_male_cases
ix:345 label:Sugar_per_person
ix:346 label:Surface_area
ix:347 label:Tax_revenue
ix:348 label:Total_CO2_emissions
ix:349 label:Total_income
ix:350 label:Total_reserves
ix:351 label:Trade_balance_goods_and_services
ix:352 label:Under_five_mortality_from_CME
ix:353 label:Under_five_mortality_from_IHME
ix:354 label:Under_five_mortality_rate
ix:355 label:Urban_population
ix:356 label:Urban_population_growth
ix:357 label:Urban_population_pct_of_total
Si inspeccionamos y comparamos los tipos del dataframe y de las columnas…
[12]:
type(df_who)
[12]:
pandas.core.frame.DataFrame
[13]:
type(df_who.columns)
[13]:
pandas.core.indexes.base.Index
Podemos utilizar el nombre de una columna para obtener los datos de dicha columna, tal como lo hacíamos con un diccionario python
. Veremos dos maneras diferentes de hacerlo:
[15]:
paises = df_who["Country"]
paises
[15]:
0 Afghanistan
1 Albania
2 Algeria
3 Andorra
4 Angola
...
197 Vietnam
198 West Bank and Gaza
199 Yemen
200 Zambia
201 Zimbabwe
Name: Country, Length: 202, dtype: object
¿Qué tipo de datos es una columna?
[45]:
type(df_who["Country"])
[45]:
pandas.core.series.Series
Las Series
són la otra estructura básica de Pandas. Las filas y las columnas se estructuran en Series
, se pueden ver cómo un tipo de lista que solamente puede contener un único tipo de datos, acepta operaciones vectoriales y se puede indexar de manera similar a un diccionario.
Una vez que seleccionamos una columna, podemos acceder a sus elementos como si fueran una lista:
[46]:
print(df_who["Country"][0])
print("-"*30)
print(df_who["Country"][:5]) # slicing
print("-"*30)
print(df_who["Country"].values)
Afghanistan
------------------------------
0 Afghanistan
1 Albania
2 Algeria
3 Andorra
4 Angola
Name: Country, dtype: object
------------------------------
['Afghanistan' 'Albania' 'Algeria' 'Andorra' 'Angola'
'Antigua and Barbuda' 'Argentina' 'Armenia' 'Australia' 'Austria'
'Azerbaijan' 'Bahamas' 'Bahrain' 'Bangladesh' 'Barbados' 'Belarus'
'Belgium' 'Belize' 'Benin' 'Bermuda' 'Bhutan' 'Bolivia'
'Bosnia and Herzegovina' 'Botswana' 'Brazil' 'Brunei Darussalam'
'Bulgaria' 'Burkina Faso' 'Burundi' 'Cambodia' 'Cameroon' 'Canada'
'Cape Verde' 'Central African Republic' 'Chad' 'Chile' 'China' 'Colombia'
'Comoros' 'Congo, Dem. Rep.' 'Congo, Rep.' 'Cook Islands' 'Costa Rica'
"Cote d'Ivoire" 'Croatia' 'Cuba' 'Cyprus' 'Czech Republic' 'Denmark'
'Djibouti' 'Dominica' 'Dominican Republic' 'Ecuador' 'Egypt'
'El Salvador' 'Equatorial Guinea' 'Eritrea' 'Estonia' 'Ethiopia' 'Fiji'
'Finland' 'France' 'French Polynesia' 'Gabon' 'Gambia' 'Georgia'
'Germany' 'Ghana' 'Greece' 'Grenada' 'Guatemala' 'Guinea' 'Guinea-Bissau'
'Guyana' 'Haiti' 'Honduras' 'Hong Kong, China' 'Hungary' 'Iceland'
'India' 'Indonesia' 'Iran (Islamic Republic of)' 'Iraq' 'Ireland'
'Israel' 'Italy' 'Jamaica' 'Japan' 'Jordan' 'Kazakhstan' 'Kenya'
'Kiribati' 'Korea, Dem. Rep.' 'Korea, Rep.' 'Kuwait' 'Kyrgyzstan'
"Lao People's Democratic Republic" 'Latvia' 'Lebanon' 'Lesotho' 'Liberia'
'Libyan Arab Jamahiriya' 'Lithuania' 'Luxembourg' 'Macao, China'
'Macedonia' 'Madagascar' 'Malawi' 'Malaysia' 'Maldives' 'Mali' 'Malta'
'Marshall Islands' 'Mauritania' 'Mauritius' 'Mexico'
'Micronesia (Federated States of)' 'Moldova' 'Monaco' 'Mongolia'
'Montenegro' 'Morocco' 'Mozambique' 'Myanmar' 'Namibia' 'Nauru' 'Nepal'
'Netherlands' 'Netherlands Antilles' 'New Caledonia' 'New Zealand'
'Nicaragua' 'Niger' 'Nigeria' 'Niue' 'Norway' 'Oman' 'Pakistan' 'Palau'
'Panama' 'Papua New Guinea' 'Paraguay' 'Peru' 'Philippines' 'Poland'
'Portugal' 'Puerto Rico' 'Qatar' 'Romania' 'Russia' 'Rwanda'
'Saint Kitts and Nevis' 'Saint Lucia' 'Saint Vincent and the Grenadines'
'Samoa' 'San Marino' 'Sao Tome and Principe' 'Saudi Arabia' 'Senegal'
'Serbia' 'Seychelles' 'Sierra Leone' 'Singapore' 'Slovakia' 'Slovenia'
'Solomon Islands' 'Somalia' 'South Africa' 'Spain' 'Sri Lanka' 'Sudan'
'Suriname' 'Swaziland' 'Sweden' 'Switzerland' 'Syria' 'Taiwan'
'Tajikistan' 'Tanzania' 'Thailand' 'Timor-Leste' 'Togo' 'Tonga'
'Trinidad and Tobago' 'Tunisia' 'Turkey' 'Turkmenistan' 'Tuvalu' 'Uganda'
'Ukraine' 'United Arab Emirates' 'United Kingdom'
'United States of America' 'Uruguay' 'Uzbekistan' 'Vanuatu' 'Venezuela'
'Vietnam' 'West Bank and Gaza' 'Yemen' 'Zambia' 'Zimbabwe']
Existe otra manera más sencilla de seleccionar una única columna, pero existen nombres de columna muy largos: ‘Children aged <5 years who received any antimalarial treatment for fever (%)’.
Nota: En la creación de documentos, es importante usar nombres adecuados para las columnas.
[47]:
df_who.Country
[47]:
0 Afghanistan
1 Albania
2 Algeria
3 Andorra
4 Angola
...
197 Vietnam
198 West Bank and Gaza
199 Yemen
200 Zambia
201 Zimbabwe
Name: Country, Length: 202, dtype: object
[21]:
print(df_who.columns[9])
print("-"*30)
print(df_who[df_who.columns[9]])
Population annual growth rate (%)
------------------------------
0 4.0
1 0.6
2 1.5
3 1.0
4 2.8
...
197 1.4
198 NaN
199 3.0
200 1.9
201 0.8
Name: Population annual growth rate (%), Length: 202, dtype: float64
[23]:
# Multiples columnas
df_who[df_who.columns[0:5]]
[23]:
Country | CountryID | Continent | Adolescent fertility rate (%) | Adult literacy rate (%) | |
---|---|---|---|---|---|
0 | Afghanistan | 1 | 1 | 151.0 | 28.0 |
1 | Albania | 2 | 2 | 27.0 | 98.7 |
2 | Algeria | 3 | 3 | 6.0 | 69.9 |
3 | Andorra | 4 | 2 | NaN | NaN |
4 | Angola | 5 | 3 | 146.0 | 67.4 |
... | ... | ... | ... | ... | ... |
197 | Vietnam | 198 | 6 | 25.0 | 90.3 |
198 | West Bank and Gaza | 199 | 1 | NaN | NaN |
199 | Yemen | 200 | 1 | 83.0 | 54.1 |
200 | Zambia | 201 | 3 | 161.0 | 68.0 |
201 | Zimbabwe | 202 | 3 | 101.0 | 89.5 |
202 rows × 5 columns
[24]:
# Dos columnas específicas
df_who[["CountryID","Continent"]]
[24]:
CountryID | Continent | |
---|---|---|
0 | 1 | 1 |
1 | 2 | 2 |
2 | 3 | 3 |
3 | 4 | 2 |
4 | 5 | 3 |
... | ... | ... |
197 | 198 | 6 |
198 | 199 | 1 |
199 | 200 | 1 |
200 | 201 | 3 |
201 | 202 | 3 |
202 rows × 2 columns
[17]:
# Existen dos métodos loc e iloc para acceder a sub-regiones de los datos
# funció: .iloc(filas, columnas)
df2 = df_who.iloc[:,3:5]
print(df2.shape)
print(df2.head())
(202, 2)
Adolescent fertility rate (%) Adult literacy rate (%)
0 151.0 28.0
1 27.0 98.7
2 6.0 69.9
3 NaN NaN
4 146.0 67.4
Sobre Series
Cuando seleccionamos una columna de un DataFrame, obtenemos una Serie. Las Series tienen ciertas características, como la capacidad de aplicar métodos estadísticos (si son Series numéricas).
[18]:
fertilitat = df_who[df_who.columns[3]]
fertilitat
[18]:
0 151.0
1 27.0
2 6.0
3 NaN
4 146.0
...
197 25.0
198 NaN
199 83.0
200 161.0
201 101.0
Name: Adolescent fertility rate (%), Length: 202, dtype: float64
[19]:
print("Min ", fertilitat.min()) # a Pandas el concepte d'iterar "no té sentit"
print("Max ", fertilitat.max())
print("Count ", fertilitat.count())
Min 0.0
Max 199.0
Count 177
Tabla con las funciones descriptivas
Veremos que obtener esta información estadística nos puede ayudar a extraer información muy concreta de la tabla, por ejemplo, si queremos saber:
¿Qué país tiene la mayor emisión de CO2?
[20]:
co2 = df_who["Total_CO2_emissions"]
co2.max()
[20]:
5776431.5
[28]:
df_who[co2==co2.max()]["Country"].values[0]
[28]:
'United States of America'
[50]:
co2 = df_who["Total_CO2_emissions"]
co2
[50]:
0 692.50
1 3499.12
2 137535.56
3 NaN
4 8991.46
...
197 101826.23
198 655.86
199 20148.34
200 2366.94
201 11457.33
Name: Total_CO2_emissions, Length: 202, dtype: float64
[51]:
co2.max()
[51]:
5776431.5
[55]:
co2==co2.max()
[55]:
0 False
1 False
2 False
3 False
4 False
...
197 False
198 False
199 False
200 False
201 False
Name: Total_CO2_emissions, Length: 202, dtype: bool
[56]:
df_who[co2==co2.max()]
[56]:
Country | CountryID | Continent | Adolescent fertility rate (%) | Adult literacy rate (%) | Gross national income per capita (PPP international $) | Net primary school enrolment ratio female (%) | Net primary school enrolment ratio male (%) | Population (in thousands) total | Population annual growth rate (%) | ... | Total_CO2_emissions | Total_income | Total_reserves | Trade_balance_goods_and_services | Under_five_mortality_from_CME | Under_five_mortality_from_IHME | Under_five_mortality_rate | Urban_population | Urban_population_growth | Urban_population_pct_of_total | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
192 | United States of America | 193 | 4 | 43.0 | NaN | 44070.0 | 93.0 | 91.0 | 302841.0 | 1.0 | ... | 5776431.5 | 1.100000e+13 | NaN | -7.140000e+11 | 8.0 | 7.1 | 8.0 | 240000000.0 | 1.39 | 80.8 |
1 rows × 358 columns
[57]:
df_who[co2==co2.max()]["Country"]
[57]:
192 United States of America
Name: Country, dtype: object
[58]:
df_who[co2==co2.max()]["Country"].values
[58]:
array(['United States of America'], dtype=object)
[59]:
df_who[co2==co2.max()]["Country"].values[0]
[59]:
'United States of America'
[64]:
# También existen métodos para seleccionar de manera aleatoria muestras dentro de una serie
# Tip: Métodos montecarlo
fertilidad = df_who[df_who.columns[3]]
some = fertilidad.sample(n=3)
print(some)
fertilidad.sample(n=3,random_state=2) #on random_state és la llavor/seed del aleatori
112 NaN
24 71.0
15 22.0
Name: Adolescent fertility rate (%), dtype: float64
[64]:
53 48.0
149 28.0
13 135.0
Name: Adolescent fertility rate (%), dtype: float64
Filas
Cada fila tiene un índice. El índice puede ser numérico, alfabético o temporal.
[65]:
df_who.index
[65]:
RangeIndex(start=0, stop=202, step=1)
[66]:
df_who.index.values
[66]:
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25,
26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38,
39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51,
52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64,
65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77,
78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90,
91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103,
104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116,
117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129,
130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142,
143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155,
156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168,
169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181,
182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194,
195, 196, 197, 198, 199, 200, 201])
Así como seleccionamos columnas, podemos seleccionar información con para obtener filas. Para realizar la consulta de una fila concreta usaremos el atributo loc
de los dataframes.
[30]:
import pandas as pd
df = pd.read_csv("http://www.exploredata.net/ftp/WHO.csv") #dataframe
[31]:
print(df.loc[0])
Country Afghanistan
CountryID 1
Continent 1
Adolescent fertility rate (%) 151.0
Adult literacy rate (%) 28.0
...
Under_five_mortality_from_IHME 231.9
Under_five_mortality_rate 257.0
Urban_population 5740436.0
Urban_population_growth 5.44
Urban_population_pct_of_total 22.9
Name: 0, Length: 358, dtype: object
Si lo que necesitamos es obtener son los valores, necesitaremos el atributo values
:
[3]:
df.loc[0].values
[3]:
array(['Afghanistan', 1, 1, 151.0, 28.0, nan, nan, nan, 26088.0, 4.0,
23.0, nan, 16.0, 4.0, 47.0, 6.0, 7.2, nan, nan, nan, 14.0, nan,
nan, nan, nan, nan, nan, 10.3, 73.0, 70.0, 83.0, 83.0, nan, 66.0,
90.0, nan, nan, nan, nan, nan, 20.1, 27.5, 4.4, 4.0, nan, nan,
900.0, nan, nan, 14930.0, nan, 900.0, 5970.0, 5.0, nan, 97.2, 8.0,
6.0, 29.0, 23.0, nan, 2.0, 72.5, 0.0, nan, 2.5, 0.0, 5.4, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, 473.0, 443.0, 500.0, 153.0,
706.0, 134.0, 1269.0, 18.9, 0.3, 1.1, 1.0, 5.9, 26.0, 22.1, 24.8,
10.0, 32.0, 0.0, 36.0, 36.0, 35.0, 161.0, 165.0, 154.0, 176.0,
42.0, 43.0, 42.0, 1800.0, 60.0, 16.0, 100.0, 231.0, 257.0, 254.0,
260.0, 76.0, 6.0, 18.0, 4.6, 59.3, 32.9, nan, 0.01, nan, nan, 17.0,
22.0, 37.0, 25.0, 30.0, 45.0, nan, nan, nan, nan, 9.8, 3.2, 13.1,
nan, nan, nan, nan, 58.35, 36.1, nan, 108.83, 2750000000.0, 168.0,
87.0, 42.29, 0.0, 28000000.0, 2.9, 14.3, 11.7, 26.8, 874.0, 2021.0,
220.0, 0.000878, 0.02, 0.04, 62300000000.0, 4.0, 600000.0, nan,
3.6, 6.9, 254.0, 511.0, 96.8, nan, nan, 1792633.0, 7.07, nan, nan,
nan, nan, 3.3, 2.8, 5.2, 4.5, 193.0, 236.0, 316.0, 377.0, nan,
10.3, 33.0, nan, -7.0, nan, nan, nan, nan, nan, nan, nan, 12.43,
nan, nan, nan, nan, 5.19, 0.0, nan, 8670.0, 25.05, nan, 20.0, 4.16,
1.04, 3.3, 20.0, 5.2, nan, nan, nan, 55.66, nan, 49.0, 39.0, nan,
874.0, nan, 24.48, nan, 165.0, 76.0, 40.0, 90.0, 11.9, 1.0, 43.4,
12.59, 43.14, 28.0, 18.39, 50.81, 34.26, 3.5, 2.3, 3.7, 2.5, 147.0,
218.0, 155.0, 233.0, 11.3, 2.7, 12.2, 2.9, 173.0, 675.0, 190.0,
732.0, nan, nan, nan, nan, 1900.0, nan, nan, 64.0, 0.19, 39.42,
9.93, nan, nan, nan, nan, nan, nan, nan, nan, nan, 8242.0, 66826.0,
nan, nan, nan, nan, nan, nan, 717.04, nan, nan, nan, nan, nan, nan,
nan, nan, 29900000.0, nan, nan, 37.73, nan, nan, nan, nan, 2.8,
4.5, 151.0, 249.0, 0.68, 55.57, 36.2, 23.66, 3.14, 39.42, 15.8,
8.3, 18.5, 9.7, 499.0, 936.0, 592.0, 1108.0, nan, 652090.0, nan,
692.5, nan, nan, nan, 257.0, 231.9, 257.0, 5740436.0, 5.44, 22.9],
dtype=object)
[32]:
print(df_who.loc[0].Country)
print(df_who.loc[0][0])
print(df_who.loc[0][3])
Afghanistan
Afghanistan
151.0
Utilizando el atributo loc
del dataframe podemos seleccionar y filtrar las filas (y columnas) mediante labels. En el caso de filas, el label es el índice y si éste es númerico podemos usar los slicing típicos de Python
.
.loc(row_labels,columns_labels)
Recordemos el slicing:
sublista = lista[start:stop:step]
Dónde: * start: Posición de la lista original dónde empieza la sublista. Si no se indica és 0. * stop: Posición de la lista original hasta donde seleccionar. Se selecciona hasta la posición stop - 1. * step: Incremento entre cada índice de la selección, por defecto 1.
Si entendemos el concepto para un array…
[22]:
array =[1,2,3,4,5,6,7,9,0]
print(array[0:2]) #**
print(array[3:]) #**
print(array[:3]) #**
[1, 2]
[4, 5, 6, 7, 9, 0]
[1, 2, 3]
… podemos hacer las mismas operaciones con las filas de un dataFrame.
[4]:
df.loc[4:10:2]
[4]:
Country | CountryID | Continent | Adolescent fertility rate (%) | Adult literacy rate (%) | Gross national income per capita (PPP international $) | Net primary school enrolment ratio female (%) | Net primary school enrolment ratio male (%) | Population (in thousands) total | Population annual growth rate (%) | ... | Total_CO2_emissions | Total_income | Total_reserves | Trade_balance_goods_and_services | Under_five_mortality_from_CME | Under_five_mortality_from_IHME | Under_five_mortality_rate | Urban_population | Urban_population_growth | Urban_population_pct_of_total | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
4 | Angola | 5 | 3 | 146.0 | 67.4 | 3890.0 | 49.0 | 51.0 | 16557.0 | 2.8 | ... | 8991.46 | 1.490000e+10 | 27.13 | 9.140000e+09 | 164.1 | 242.5 | 164.1 | 8578749.0 | 4.14 | 53.3 |
6 | Argentina | 7 | 5 | 62.0 | 97.2 | 11670.0 | 98.0 | 99.0 | 39134.0 | 1.0 | ... | 152711.86 | 3.140000e+11 | 21.11 | 1.190000e+10 | 18.1 | 16.7 | 18.1 | 34900000.0 | 1.17 | 90.1 |
8 | Australia | 9 | 6 | 16.0 | NaN | 33940.0 | 97.0 | 96.0 | 20530.0 | 1.1 | ... | 368858.53 | 4.680000e+11 | NaN | -1.280000e+10 | 5.9 | 5.1 | 5.9 | 18000000.0 | 1.54 | 88.2 |
10 | Azerbaijan | 11 | 2 | 31.0 | 98.8 | 5430.0 | 83.0 | 86.0 | 8406.0 | 0.6 | ... | 36629.01 | 9.930000e+09 | 64.89 | 1.330000e+09 | 50.0 | 63.7 | 50.0 | 4321803.0 | 1.26 | 51.5 |
4 rows × 358 columns
También se puede realizar una selección particular mediante una lista:
[5]:
df.loc[[3,10,29,34]]
[5]:
Country | CountryID | Continent | Adolescent fertility rate (%) | Adult literacy rate (%) | Gross national income per capita (PPP international $) | Net primary school enrolment ratio female (%) | Net primary school enrolment ratio male (%) | Population (in thousands) total | Population annual growth rate (%) | ... | Total_CO2_emissions | Total_income | Total_reserves | Trade_balance_goods_and_services | Under_five_mortality_from_CME | Under_five_mortality_from_IHME | Under_five_mortality_rate | Urban_population | Urban_population_growth | Urban_population_pct_of_total | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3 | Andorra | 4 | 2 | NaN | NaN | NaN | 83.0 | 83.0 | 74.0 | 1.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
10 | Azerbaijan | 11 | 2 | 31.0 | 98.8 | 5430.0 | 83.0 | 86.0 | 8406.0 | 0.6 | ... | 36629.01 | 9.930000e+09 | 64.89 | 1.330000e+09 | 50.0 | 63.7 | 50.0 | 4321803.0 | 1.26 | 51.5 |
29 | Cambodia | 30 | 7 | 52.0 | 73.6 | 1550.0 | 89.0 | 91.0 | 14197.0 | 1.7 | ... | 538.61 | 5.680000e+09 | 32.94 | -5.470000e+08 | 97.3 | 94.3 | 97.3 | 2749235.0 | 4.58 | 19.7 |
34 | Chad | 35 | 3 | 193.0 | 25.7 | 1170.0 | 49.0 | 71.0 | 10468.0 | 3.1 | ... | 139.23 | 2.790000e+09 | 14.17 | -2.210000e+08 | 208.2 | 180.1 | 208.2 | 2566839.0 | 4.88 | 25.3 |
4 rows × 358 columns
Selección de filas y columnas
Si seguimos con la misma lógica, usando el atributo loc
de los dataFrames.
[6]:
import pandas as pd
df = pd.read_csv("http://www.exploredata.net/ftp/WHO.csv") #dataframe
[11]:
df.loc[0:1]
[11]:
Country | CountryID | Continent | Adolescent fertility rate (%) | Adult literacy rate (%) | Gross national income per capita (PPP international $) | Net primary school enrolment ratio female (%) | Net primary school enrolment ratio male (%) | Population (in thousands) total | Population annual growth rate (%) | ... | Total_CO2_emissions | Total_income | Total_reserves | Trade_balance_goods_and_services | Under_five_mortality_from_CME | Under_five_mortality_from_IHME | Under_five_mortality_rate | Urban_population | Urban_population_growth | Urban_population_pct_of_total | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Afghanistan | 1 | 1 | 151.0 | 28.0 | NaN | NaN | NaN | 26088.0 | 4.0 | ... | 692.50 | NaN | NaN | NaN | 257.00 | 231.9 | 257.00 | 5740436.0 | 5.44 | 22.9 |
1 | Albania | 2 | 2 | 27.0 | 98.7 | 6000.0 | 93.0 | 94.0 | 3172.0 | 0.6 | ... | 3499.12 | 4.790000e+09 | 78.14 | -2.040000e+09 | 18.47 | 15.5 | 18.47 | 1431793.9 | 2.21 | 45.4 |
2 rows × 358 columns
Las columnas se deben seleccionar con una lista que debe contener el nombre de las columnas deseadas.
[12]:
df.loc[0:1,["Continent"]]
[12]:
Continent | |
---|---|
0 | 1 |
1 | 2 |
[13]:
df.loc[0:3,["Continent","Total_CO2_emissions"]]
[13]:
Continent | Total_CO2_emissions | |
---|---|---|
0 | 1 | 692.50 |
1 | 2 | 3499.12 |
2 | 3 | 137535.56 |
3 | 2 | NaN |
[15]:
df.loc[[0,3,20,100],["Country","Continent","Total_CO2_emissions"]]
[15]:
Country | Continent | Total_CO2_emissions | |
---|---|---|---|
0 | Afghanistan | 1 | 692.50 |
3 | Andorra | 2 | NaN |
20 | Bhutan | 7 | 414.03 |
100 | Liberia | 3 | 472.66 |
Alternativamente, con el atributo iloc podemos seleccionar las columnas con su índice numérico: su posicion en la lista de columnas.
[16]:
df.iloc[0]
[16]:
Country Afghanistan
CountryID 1
Continent 1
Adolescent fertility rate (%) 151.0
Adult literacy rate (%) 28.0
...
Under_five_mortality_from_IHME 231.9
Under_five_mortality_rate 257.0
Urban_population 5740436.0
Urban_population_growth 5.44
Urban_population_pct_of_total 22.9
Name: 0, Length: 358, dtype: object
[17]:
df.iloc[0:4, 3:7] # ídem a una matriz
[17]:
Adolescent fertility rate (%) | Adult literacy rate (%) | Gross national income per capita (PPP international $) | Net primary school enrolment ratio female (%) | |
---|---|---|---|---|
0 | 151.0 | 28.0 | NaN | NaN |
1 | 27.0 | 98.7 | 6000.0 | 93.0 |
2 | 6.0 | 69.9 | 5940.0 | 94.0 |
3 | NaN | NaN | NaN | 83.0 |
[18]:
df.iloc[0][0]
[18]:
'Afghanistan'
[19]:
df.iloc[0][0:4]
[19]:
Country Afghanistan
CountryID 1
Continent 1
Adolescent fertility rate (%) 151.0
Name: 0, dtype: object
[20]:
df.iloc[0][0:4].values
[20]:
array(['Afghanistan', 1, 1, 151.0], dtype=object)
[21]:
df.iloc[0][0:4].values[0]
[21]:
'Afghanistan'
Selección condicional
Además de la selección con base a índices, lo interesante es realizar selecciones mediante condiciones lógicas que permiten filtrar las filas del dataset. Por ejemplo:
[34]:
df_who = pd.read_csv("http://www.exploredata.net/ftp/WHO.csv") #dataframe
alfabetitzacio = df_who[df_who['Adult literacy rate (%)'] > 70][["Country","Adult literacy rate (%)"]]
alfabetitzacio[alfabetitzacio["Country"] == "Italy"]
[34]:
Country | Adult literacy rate (%) | |
---|---|---|
85 | Italy | 98.4 |
[8]:
df_who['Adult literacy rate (%)'] > 70
[8]:
0 False
1 True
2 False
3 False
4 False
...
197 True
198 False
199 False
200 False
201 True
Name: Adult literacy rate (%), Length: 202, dtype: bool
[37]:
seleccio = df_who['Adult literacy rate (%)'] > 70
[38]:
# En aquest codi, de les files on seleccio == True agafam les dues columnes que ens interessen
df_who[seleccio][["Country","Adult literacy rate (%)"]]
[38]:
Country | Adult literacy rate (%) | |
---|---|---|
1 | Albania | 98.7 |
6 | Argentina | 97.2 |
7 | Armenia | 99.4 |
10 | Azerbaijan | 98.8 |
12 | Bahrain | 86.5 |
... | ... | ... |
193 | Uruguay | 96.8 |
195 | Vanuatu | 75.5 |
196 | Venezuela | 93.0 |
197 | Vietnam | 90.3 |
201 | Zimbabwe | 89.5 |
93 rows × 2 columns
[39]:
# Multiples criterios
# CO_emisions y Fertilidad
import numpy as np
ix = np.where((df_who["Total_CO2_emissions"] > 10) & (df_who[df_who.columns[3]] <=0.6))
ix
[39]:
(array([92]),)
Estadísticas en un DataFrame
[35]:
# Creamos un dataframe con valores aleatorios
import numpy as np
np.random.seed(10)
df = pd.DataFrame({"one":np.random.randint(-10,10,5),
"two":np.random.random(5),
"three":np.random.randint(0,5,5)})
df
[35]:
one | two | three | |
---|---|---|---|
0 | -1 | 0.748804 | 0 |
1 | -6 | 0.498507 | 2 |
2 | 5 | 0.224797 | 0 |
3 | -10 | 0.198063 | 4 |
4 | 7 | 0.760531 | 3 |
[36]:
df.T
[36]:
0 | 1 | 2 | 3 | 4 | |
---|---|---|---|---|---|
one | -1.000000 | -6.000000 | 5.000000 | -10.000000 | 7.000000 |
two | 0.748804 | 0.498507 | 0.224797 | 0.198063 | 0.760531 |
three | 0.000000 | 2.000000 | 0.000000 | 4.000000 | 3.000000 |
[12]:
df.sum()
[12]:
one -5.000000
two 2.430701
three 9.000000
dtype: float64
[13]:
df.sum(axis=1) # concepto de axis
[13]:
0 -0.251196
1 -3.501493
2 5.224797
3 -5.801937
4 10.760531
dtype: float64
[14]:
df.cumsum()
[14]:
one | two | three | |
---|---|---|---|
0 | -1 | 0.748804 | 0 |
1 | -7 | 1.247311 | 2 |
2 | -2 | 1.472108 | 2 |
3 | -12 | 1.670170 | 6 |
4 | -5 | 2.430701 | 9 |
[15]:
df.cumsum(axis=1)
[15]:
one | two | three | |
---|---|---|---|
0 | -1.0 | -0.251196 | -0.251196 |
1 | -6.0 | -5.501493 | -3.501493 |
2 | 5.0 | 5.224797 | 5.224797 |
3 | -10.0 | -9.801937 | -5.801937 |
4 | 7.0 | 7.760531 | 10.760531 |
[16]:
df.apply(np.abs,axis=1).cumsum() # función apply
[16]:
one | two | three | |
---|---|---|---|
0 | 1 | 0.748804 | 0 |
1 | 7 | 1.247311 | 2 |
2 | 12 | 1.472108 | 2 |
3 | 22 | 1.670170 | 6 |
4 | 29 | 2.430701 | 9 |
[17]:
df.apply(np.abs).cumsum(axis=1)
[17]:
one | two | three | |
---|---|---|---|
0 | 1.0 | 1.748804 | 1.748804 |
1 | 6.0 | 6.498507 | 8.498507 |
2 | 5.0 | 5.224797 | 5.224797 |
3 | 10.0 | 10.198063 | 14.198063 |
4 | 7.0 | 7.760531 | 10.760531 |
[18]:
df.cumprod()
[18]:
one | two | three | |
---|---|---|---|
0 | -1 | 0.748804 | 0 |
1 | 6 | 0.373284 | 0 |
2 | 30 | 0.083913 | 0 |
3 | -300 | 0.016620 | 0 |
4 | -2100 | 0.012640 | 0 |
[19]:
df.cumprod(axis=1)
[19]:
one | two | three | |
---|---|---|---|
0 | -1.0 | -0.748804 | -0.000000 |
1 | -6.0 | -2.991042 | -5.982084 |
2 | 5.0 | 1.123983 | 0.000000 |
3 | -10.0 | -1.980629 | -7.922515 |
4 | 7.0 | 5.323715 | 15.971145 |
[21]:
ones = np.ones(5) # numpy arrays
print(ones)
[1. 1. 1. 1. 1.]
[22]:
df.sub(ones,axis=0)
[22]:
one | two | three | |
---|---|---|---|
0 | -2.0 | -0.251196 | -1.0 |
1 | -7.0 | -0.501493 | 1.0 |
2 | 4.0 | -0.775203 | -1.0 |
3 | -11.0 | -0.801937 | 3.0 |
4 | 6.0 | -0.239469 | 2.0 |
[23]:
df.sub(ones) #Alerta!
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/Users/isaac/Projects/TxADM_notebooks/notebooks/Part2/00_Pandas/01_Introduccion.ipynb Cell 131 line <cell line: 1>()
----> <a href='vscode-notebook-cell:/Users/isaac/Projects/TxADM_notebooks/notebooks/Part2/00_Pandas/01_Introduccion.ipynb#Y241sZmlsZQ%3D%3D?line=0'>1</a> df.sub(ones)
File ~/.pyenv/versions/3.9.7/lib/python3.9/site-packages/pandas/core/ops/__init__.py:436, in flex_arith_method_FRAME.<locals>.f(self, other, axis, level, fill_value)
433 axis = self._get_axis_number(axis) if axis is not None else 1
435 other = maybe_prepare_scalar_for_op(other, self.shape)
--> 436 self, other = align_method_FRAME(self, other, axis, flex=True, level=level)
438 if isinstance(other, ABCDataFrame):
439 # Another DataFrame
440 new_data = self._combine_frame(other, na_op, fill_value)
File ~/.pyenv/versions/3.9.7/lib/python3.9/site-packages/pandas/core/ops/__init__.py:248, in align_method_FRAME(left, right, axis, flex, level)
245 if isinstance(right, np.ndarray):
247 if right.ndim == 1:
--> 248 right = to_series(right)
250 elif right.ndim == 2:
251 if right.shape == left.shape:
File ~/.pyenv/versions/3.9.7/lib/python3.9/site-packages/pandas/core/ops/__init__.py:239, in align_method_FRAME.<locals>.to_series(right)
237 else:
238 if len(left.columns) != len(right):
--> 239 raise ValueError(
240 msg.format(req_len=len(left.columns), given_len=len(right))
241 )
242 right = left._constructor_sliced(right, index=left.columns)
243 return right
ValueError: Unable to coerce to Series, length must be 3: given 5
[48]:
# nornalitzación z-score (media a 0 y desviación a 1)
ts_stand = (df - df.mean()) / df.std()
ts_stand
[48]:
one | two | three | |
---|---|---|---|
0 | 0.000000 | 0.966021 | -1.006231 |
1 | -0.696733 | 0.045482 | 0.111803 |
2 | 0.836080 | -0.961166 | -1.006231 |
3 | -1.254119 | -1.059487 | 1.229837 |
4 | 1.114773 | 1.009150 | 0.670820 |
[49]:
print(ts_stand.mean())
print("----")
print(ts_stand.std())
one 0.000000e+00
two 2.664535e-16
three 6.661338e-17
dtype: float64
----
one 1.0
two 1.0
three 1.0
dtype: float64
Series no númericas
[24]:
df_who["Country"]
[24]:
0 Afghanistan
1 Albania
2 Algeria
3 Andorra
4 Angola
...
197 Vietnam
198 West Bank and Gaza
199 Yemen
200 Zambia
201 Zimbabwe
Name: Country, Length: 202, dtype: object
[53]:
#https://www.w3schools.com/python/python_ref_string.asp
df_who.Country.str.casefold()
[53]:
0 afghanistan
1 albania
2 algeria
3 andorra
4 angola
...
197 vietnam
198 west bank and gaza
199 yemen
200 zambia
201 zimbabwe
Name: Country, Length: 202, dtype: object
[25]:
df_who.Country.str.zfill(12)
[25]:
0 0Afghanistan
1 00000Albania
2 00000Algeria
3 00000Andorra
4 000000Angola
...
197 00000Vietnam
198 West Bank and Gaza
199 0000000Yemen
200 000000Zambia
201 0000Zimbabwe
Name: Country, Length: 202, dtype: object
Modificación del dataframe
Además de realizar selecciones, en algunos momentos necesitaremos incorporar nueva información a nuestras tablas de datos.
Vamos a crear un pequeño conjunto para practicar:
[37]:
df2 = pd.DataFrame([('Foreign Cinema', 'Restaurant', 289.0),
('Liho Liho', 'Restaurant', 224.0),
('500 Club', 'bar', 80.5),
('The Square', 'bar', 25.30)],
columns=('name', 'type', 'AvgBill')
)
df2
[37]:
name | type | AvgBill | |
---|---|---|---|
0 | Foreign Cinema | Restaurant | 289.0 |
1 | Liho Liho | Restaurant | 224.0 |
2 | 500 Club | bar | 80.5 |
3 | The Square | bar | 25.3 |
Tenemos diversas maneras de añadir columnas a un dataFrame:
Mediante el nombre de la columna que queremos añadir, tal como añadimos una nueva clave a un diccionario.
insert
: es un método que necesita 3 parámetros. La posición en la que queremos añadir la columna (loc
), su nombre (´column´) y la lista de valores (value
).assign
: muy similar a la anterior, pero permite añadir múltiples columnas.concat
: no se suele usar para concatenar columnas, en el caso que queramos usarlo para este caso, deberemos poner el parámetroaxis=1
.
Veamos algunos ejemplos:
[38]:
df2['Day'] = "Monday" # Como un diccionario
df2
[38]:
name | type | AvgBill | Day | |
---|---|---|---|---|
0 | Foreign Cinema | Restaurant | 289.0 | Monday |
1 | Liho Liho | Restaurant | 224.0 | Monday |
2 | 500 Club | bar | 80.5 | Monday |
3 | The Square | bar | 25.3 | Monday |
[39]:
df2['Day'] = ['Monday', 'Tuesday', 'Wednesday']
df2
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[39], line 1
----> 1 df2['Day'] = ['Monday', 'Tuesday', 'Wednesday']
2 df2
File ~/.pyenv/versions/3.11.0rc2/envs/my3110/lib/python3.11/site-packages/pandas/core/frame.py:3980, in DataFrame.__setitem__(self, key, value)
3977 self._setitem_array([key], value)
3978 else:
3979 # set column
-> 3980 self._set_item(key, value)
File ~/.pyenv/versions/3.11.0rc2/envs/my3110/lib/python3.11/site-packages/pandas/core/frame.py:4174, in DataFrame._set_item(self, key, value)
4164 def _set_item(self, key, value) -> None:
4165 """
4166 Add series to DataFrame in specified column.
4167
(...)
4172 ensure homogeneity.
4173 """
-> 4174 value = self._sanitize_column(value)
4176 if (
4177 key in self.columns
4178 and value.ndim == 1
4179 and not is_extension_array_dtype(value)
4180 ):
4181 # broadcast across multiple columns if necessary
4182 if not self.columns.is_unique or isinstance(self.columns, MultiIndex):
File ~/.pyenv/versions/3.11.0rc2/envs/my3110/lib/python3.11/site-packages/pandas/core/frame.py:4915, in DataFrame._sanitize_column(self, value)
4912 return _reindex_for_setitem(Series(value), self.index)
4914 if is_list_like(value):
-> 4915 com.require_length_match(value, self.index)
4916 return sanitize_array(value, self.index, copy=True, allow_2d=True)
File ~/.pyenv/versions/3.11.0rc2/envs/my3110/lib/python3.11/site-packages/pandas/core/common.py:571, in require_length_match(data, index)
567 """
568 Check the length of data matches the length of the index.
569 """
570 if len(data) != len(index):
--> 571 raise ValueError(
572 "Length of values "
573 f"({len(data)}) "
574 "does not match length of index "
575 f"({len(index)})"
576 )
ValueError: Length of values (3) does not match length of index (4)
[40]:
#Vamos a usar el método insert
df2.insert(loc=1, column="Stars", value=[2,2,3,4])
df2
[40]:
name | Stars | type | AvgBill | Day | |
---|---|---|---|---|---|
0 | Foreign Cinema | 2 | Restaurant | 289.0 | Monday |
1 | Liho Liho | 2 | Restaurant | 224.0 | Monday |
2 | 500 Club | 3 | bar | 80.5 | Monday |
3 | The Square | 4 | bar | 25.3 | Monday |
[42]:
df2["AvgBillIVA"] = df2["AvgBill"]*1.21
df2
[42]:
name | Stars | type | AvgBill | Day | AvgBillIVA | |
---|---|---|---|---|---|---|
0 | Foreign Cinema | 2 | Restaurant | 289.0 | Monday | 349.690 |
1 | Liho Liho | 2 | Restaurant | 224.0 | Monday | 271.040 |
2 | 500 Club | 3 | bar | 80.5 | Monday | 97.405 |
3 | The Square | 4 | bar | 25.3 | Monday | 30.613 |
[41]:
df3 = df2.assign(AvgHalfBill=df2.AvgBill / 2, Michelin_Star=3)
df3
df3["HOLA"] = df3.name.str.capitalize()
[34]:
df3
[34]:
name | Stars | type | AvgBill | Day | AvgHalfBill | Michelin_Star | HOLA | |
---|---|---|---|---|---|---|---|---|
0 | Foreign Cinema | 2 | Restaurant | 289.0 | Monday | 144.50 | 3 | Foreign cinema |
1 | Liho Liho | 2 | Restaurant | 224.0 | Tuesday | 112.00 | 3 | Liho liho |
2 | 500 Club | 3 | bar | 80.5 | Wednesday | 40.25 | 3 | 500 club |
3 | The Square | 4 | bar | 25.3 | Thursday | 12.65 | 3 | The square |
Para añadir filas a un dataframe tenemos dos métodos que realizan la tarea y devuelven un nuevo dataFrame:
append: añade una fila a un dataFrame. La fila puede ser un diccionario, una
Serie
o otroDataframe
. El parámetroignore_index = True
significa que se ignorará el índice de la serie o el dataFrame de origen. En su lugar, se utilizará el índice disponible en el dataFrame de destino. El valorFalse
significa lo contrario.concat: concatena dos o más dataFrames separados por comas.
Tenemos el método drop
que nos proporciona un nuevo dataFrame sin la(s) fila(s) o la(s) columna(s) que seleccionemos. Si queremos eliminar columnas podemos hacerlo especificando la lista de columnas en el parámetro columns
de la siguiente manera:
[43]:
df3
[43]:
name | Stars | type | AvgBill | Day | AvgHalfBill | Michelin_Star | HOLA | |
---|---|---|---|---|---|---|---|---|
0 | Foreign Cinema | 2 | Restaurant | 289.0 | Monday | 144.50 | 3 | Foreign cinema |
1 | Liho Liho | 2 | Restaurant | 224.0 | Monday | 112.00 | 3 | Liho liho |
2 | 500 Club | 3 | bar | 80.5 | Monday | 40.25 | 3 | 500 club |
3 | The Square | 4 | bar | 25.3 | Monday | 12.65 | 3 | The square |
[49]:
df3.drop(columns=["Stars"],inplace=True) # Eliminamos la última columna que hemos creado
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
Cell In[49], line 1
----> 1 df3.drop(columns=["Stars"],inplace=True) # Eliminamos la última columna que hemos creado
File ~/.pyenv/versions/3.11.0rc2/envs/my3110/lib/python3.11/site-packages/pandas/util/_decorators.py:331, in deprecate_nonkeyword_arguments.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
325 if len(args) > num_allow_args:
326 warnings.warn(
327 msg.format(arguments=_format_argument_list(allow_args)),
328 FutureWarning,
329 stacklevel=find_stack_level(),
330 )
--> 331 return func(*args, **kwargs)
File ~/.pyenv/versions/3.11.0rc2/envs/my3110/lib/python3.11/site-packages/pandas/core/frame.py:5399, in DataFrame.drop(self, labels, axis, index, columns, level, inplace, errors)
5251 @deprecate_nonkeyword_arguments(version=None, allowed_args=["self", "labels"])
5252 def drop( # type: ignore[override]
5253 self,
(...)
5260 errors: IgnoreRaise = "raise",
5261 ) -> DataFrame | None:
5262 """
5263 Drop specified labels from rows or columns.
5264
(...)
5397 weight 1.0 0.8
5398 """
-> 5399 return super().drop(
5400 labels=labels,
5401 axis=axis,
5402 index=index,
5403 columns=columns,
5404 level=level,
5405 inplace=inplace,
5406 errors=errors,
5407 )
File ~/.pyenv/versions/3.11.0rc2/envs/my3110/lib/python3.11/site-packages/pandas/util/_decorators.py:331, in deprecate_nonkeyword_arguments.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
325 if len(args) > num_allow_args:
326 warnings.warn(
327 msg.format(arguments=_format_argument_list(allow_args)),
328 FutureWarning,
329 stacklevel=find_stack_level(),
330 )
--> 331 return func(*args, **kwargs)
File ~/.pyenv/versions/3.11.0rc2/envs/my3110/lib/python3.11/site-packages/pandas/core/generic.py:4505, in NDFrame.drop(self, labels, axis, index, columns, level, inplace, errors)
4503 for axis, labels in axes.items():
4504 if labels is not None:
-> 4505 obj = obj._drop_axis(labels, axis, level=level, errors=errors)
4507 if inplace:
4508 self._update_inplace(obj)
File ~/.pyenv/versions/3.11.0rc2/envs/my3110/lib/python3.11/site-packages/pandas/core/generic.py:4546, in NDFrame._drop_axis(self, labels, axis, level, errors, only_slice)
4544 new_axis = axis.drop(labels, level=level, errors=errors)
4545 else:
-> 4546 new_axis = axis.drop(labels, errors=errors)
4547 indexer = axis.get_indexer(new_axis)
4549 # Case for non-unique axis
4550 else:
File ~/.pyenv/versions/3.11.0rc2/envs/my3110/lib/python3.11/site-packages/pandas/core/indexes/base.py:6934, in Index.drop(self, labels, errors)
6932 if mask.any():
6933 if errors != "ignore":
-> 6934 raise KeyError(f"{list(labels[mask])} not found in axis")
6935 indexer = indexer[~mask]
6936 return self.delete(indexer)
KeyError: "['Stars'] not found in axis"
[48]:
df3
[48]:
name | type | AvgBill | Day | AvgHalfBill | Michelin_Star | HOLA | |
---|---|---|---|---|---|---|---|
0 | Foreign Cinema | Restaurant | 289.0 | Monday | 144.50 | 3 | Foreign cinema |
1 | Liho Liho | Restaurant | 224.0 | Monday | 112.00 | 3 | Liho liho |
2 | 500 Club | bar | 80.5 | Monday | 40.25 | 3 | 500 club |
3 | The Square | bar | 25.3 | Monday | 12.65 | 3 | The square |
Para poder eliminar filas, usamos la misma función, esta vez sin el parámetro que hemos usado anteriormente, simplemente indicamos los índices a eliminar:
[29]:
df_less_rows = df_no_michelin.drop([1,3])
df_less_rows
[29]:
name | Stars | type | AvgBill | Day | AvgHalfBill | |
---|---|---|---|---|---|---|
0 | Foreign Cinema | 2 | Restaurant | 289.0 | Monday | 144.50 |
2 | 500 Club | 3 | bar | 80.5 | Wednesday | 40.25 |
[30]:
dfs1 = df3[df3.Stars>=3] # per una selecció?
dfs1
[30]:
name | Stars | type | AvgBill | Day | AvgHalfBill | Michelin_Star | |
---|---|---|---|---|---|---|---|
2 | 500 Club | 3 | bar | 80.5 | Wednesday | 40.25 | 3 |
3 | The Square | 4 | bar | 25.3 | Thursday | 12.65 | 3 |
[31]:
df3[df3.Stars>=3].index
[31]:
Int64Index([2, 3], dtype='int64')
[32]:
df3.drop(df3[df3.Stars>=3].index)
[32]:
name | Stars | type | AvgBill | Day | AvgHalfBill | Michelin_Star | |
---|---|---|---|---|---|---|---|
0 | Foreign Cinema | 2 | Restaurant | 289.0 | Monday | 144.5 | 3 |
1 | Liho Liho | 2 | Restaurant | 224.0 | Tuesday | 112.0 | 3 |
[41]:
df3.drop(df3[df3.Stars>=3].index, inplace=True) #Alerta con la integración de los cambios
df3
[41]:
name | Stars | type | AvgBill | Day | AvgHalfBill | |
---|---|---|---|---|---|---|
0 | Foreign Cinema | 2 | Restaurant | 289.0 | Monday | 144.50 |
2 | 500 Club | 3 | bar | 80.5 | Wednesday | 40.25 |
Concatenación y unión de dataframes
A veces, los datos vienen en diferentes archivos y necesitan ser combinados en un único archivo, este proceso implica la concatenación de dataframes. Otras veces, los datos son complementarios, es decir, hay nuevas columnas en un dataframe, y esto se conoce como realizar operaciones de unión (joins).
[50]:
df1 = pd.DataFrame([('Foreign Cinema', 'Restaurant', 289.0),
('Liho Liho', 'Restaurant', 224.0),
('500 Club', 'bar', 80.5),
('The Square', 'bar', 25.30)],
columns=('name', 'type', 'AvgBill')
)
df1
[50]:
name | type | AvgBill | |
---|---|---|---|
0 | Foreign Cinema | Restaurant | 289.0 |
1 | Liho Liho | Restaurant | 224.0 |
2 | 500 Club | bar | 80.5 |
3 | The Square | bar | 25.3 |
[51]:
df2 = pd.DataFrame([('Biels', 'Quiosquet', 389.0),
('Barkus', 'Bar', 24.0),
('Blue Wall', 'bar', 80.5),
('Bounty Hunters', 'Social Club', 125.30)],
columns=('name2', 'type', 'AvgBill')
)
df2
[51]:
name2 | type | AvgBill | |
---|---|---|---|
0 | Biels | Quiosquet | 389.0 |
1 | Barkus | Bar | 24.0 |
2 | Blue Wall | bar | 80.5 |
3 | Bounty Hunters | Social Club | 125.3 |
[53]:
dfAll = pd.concat([df1,df2])
dfAll
[53]:
name | type | AvgBill | name2 | |
---|---|---|---|---|
0 | Foreign Cinema | Restaurant | 289.0 | NaN |
1 | Liho Liho | Restaurant | 224.0 | NaN |
2 | 500 Club | bar | 80.5 | NaN |
3 | The Square | bar | 25.3 | NaN |
0 | NaN | Quiosquet | 389.0 | Biels |
1 | NaN | Bar | 24.0 | Barkus |
2 | NaN | bar | 80.5 | Blue Wall |
3 | NaN | Social Club | 125.3 | Bounty Hunters |
[42]:
dfAll = dfAll.reset_index()
dfAll
[42]:
index | name | type | AvgBill | |
---|---|---|---|---|
0 | 0 | Foreign Cinema | Restaurant | 289.0 |
1 | 1 | Liho Liho | Restaurant | 224.0 |
2 | 2 | 500 Club | bar | 80.5 |
3 | 3 | The Square | bar | 25.3 |
4 | 0 | Biels | Quiosquet | 389.0 |
5 | 1 | Barkus | Bar | 24.0 |
6 | 2 | Blue Wall | bar | 80.5 |
7 | 3 | Bounty Hunters | Social Club | 125.3 |
[ ]:
dfAll.drop(columns=["index"])
[ ]:
# https://pandas.pydata.org/docs/reference/api/pandas.concat.html
dfAll = pd.concat([df1,df2],ignore_index=True)
dfAll
Más información: https://pandas.pydata.org/docs/user_guide/merging.html
[54]:
df1 = pd.DataFrame([('Jhon', 1, "Italy"),
('Pep', 2, "Germany"),
('William', 3, "Finland"),
('Snake', 4, "Italy")],
columns=('name', 'ID', 'Country')
)
df1
[54]:
name | ID | Country | |
---|---|---|---|
0 | Jhon | 1 | Italy |
1 | Pep | 2 | Germany |
2 | William | 3 | Finland |
3 | Snake | 4 | Italy |
[55]:
df2 = pd.DataFrame([(1, 145.0, 3000.1),
( 2, 189.2, 2030.2),
( 3, 129.0, 3000.0),
( 4, 198.1, 4020.2)],
columns=('DNI', 'Weight', 'Salary')
)
df2
[55]:
DNI | Weight | Salary | |
---|---|---|---|
0 | 1 | 145.0 | 3000.1 |
1 | 2 | 189.2 | 2030.2 |
2 | 3 | 129.0 | 3000.0 |
3 | 4 | 198.1 | 4020.2 |
[ ]:
# https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html
df1.merge(df2) ## ALERTA ! Necesita más parametros
---------------------------------------------------------------------------
MergeError Traceback (most recent call last)
Cell In[56], line 3
1 # https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html
----> 3 df1.merge(df2)
File ~/.pyenv/versions/3.11.0rc2/envs/my3110/lib/python3.11/site-packages/pandas/core/frame.py:10093, in DataFrame.merge(self, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator, validate)
10074 @Substitution("")
10075 @Appender(_merge_doc, indents=2)
10076 def merge(
(...)
10089 validate: str | None = None,
10090 ) -> DataFrame:
10091 from pandas.core.reshape.merge import merge
> 10093 return merge(
10094 self,
10095 right,
10096 how=how,
10097 on=on,
10098 left_on=left_on,
10099 right_on=right_on,
10100 left_index=left_index,
10101 right_index=right_index,
10102 sort=sort,
10103 suffixes=suffixes,
10104 copy=copy,
10105 indicator=indicator,
10106 validate=validate,
10107 )
File ~/.pyenv/versions/3.11.0rc2/envs/my3110/lib/python3.11/site-packages/pandas/core/reshape/merge.py:110, in merge(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator, validate)
93 @Substitution("\nleft : DataFrame or named Series")
94 @Appender(_merge_doc, indents=0)
95 def merge(
(...)
108 validate: str | None = None,
109 ) -> DataFrame:
--> 110 op = _MergeOperation(
111 left,
112 right,
113 how=how,
114 on=on,
115 left_on=left_on,
116 right_on=right_on,
117 left_index=left_index,
118 right_index=right_index,
119 sort=sort,
120 suffixes=suffixes,
121 indicator=indicator,
122 validate=validate,
123 )
124 return op.get_result(copy=copy)
File ~/.pyenv/versions/3.11.0rc2/envs/my3110/lib/python3.11/site-packages/pandas/core/reshape/merge.py:685, in _MergeOperation.__init__(self, left, right, how, on, left_on, right_on, axis, left_index, right_index, sort, suffixes, indicator, validate)
681 # stacklevel chosen to be correct when this is reached via pd.merge
682 # (and not DataFrame.join)
683 warnings.warn(msg, FutureWarning, stacklevel=find_stack_level())
--> 685 self.left_on, self.right_on = self._validate_left_right_on(left_on, right_on)
687 cross_col = None
688 if self.how == "cross":
File ~/.pyenv/versions/3.11.0rc2/envs/my3110/lib/python3.11/site-packages/pandas/core/reshape/merge.py:1434, in _MergeOperation._validate_left_right_on(self, left_on, right_on)
1432 common_cols = left_cols.intersection(right_cols)
1433 if len(common_cols) == 0:
-> 1434 raise MergeError(
1435 "No common columns to perform merge on. "
1436 f"Merge options: left_on={left_on}, "
1437 f"right_on={right_on}, "
1438 f"left_index={self.left_index}, "
1439 f"right_index={self.right_index}"
1440 )
1441 if (
1442 not left_cols.join(common_cols, how="inner").is_unique
1443 or not right_cols.join(common_cols, how="inner").is_unique
1444 ):
1445 raise MergeError(f"Data columns not unique: {repr(common_cols)}")
MergeError: No common columns to perform merge on. Merge options: left_on=None, right_on=None, left_index=False, right_index=False
[48]:
df1.merge(df2, left_on='ID', right_on='DNI')
[48]:
name | ID | Country | DNI | Weight | Salary | |
---|---|---|---|---|---|---|
0 | Jhon | 1 | Italy | 1 | 145.0 | 3000.1 |
1 | Pep | 2 | Germany | 2 | 189.2 | 2030.2 |
2 | William | 3 | Finland | 3 | 129.0 | 3000.0 |
3 | Snake | 4 | Italy | 4 | 198.1 | 4020.2 |
[57]:
df2 = pd.DataFrame([(1, 145.0, 3000.1), #No 2
( 3, 129.0, 3000.0), # Multiples 3
( 3, 159.0, 4000.0),
( 3, 109.0, 5000.0),
( 4, 198.1, 4020.2),
( 5, 200.0, 5000.2)], #a new one
columns=('DNI', 'Weight', 'Salary')
)
df2
[57]:
DNI | Weight | Salary | |
---|---|---|---|
0 | 1 | 145.0 | 3000.1 |
1 | 3 | 129.0 | 3000.0 |
2 | 3 | 159.0 | 4000.0 |
3 | 3 | 109.0 | 5000.0 |
4 | 4 | 198.1 | 4020.2 |
5 | 5 | 200.0 | 5000.2 |
[58]:
df1.merge(df2, left_on='ID', right_on='DNI')
[58]:
name | ID | Country | DNI | Weight | Salary | |
---|---|---|---|---|---|---|
0 | Jhon | 1 | Italy | 1 | 145.0 | 3000.1 |
1 | William | 3 | Finland | 3 | 129.0 | 3000.0 |
2 | William | 3 | Finland | 3 | 159.0 | 4000.0 |
3 | William | 3 | Finland | 3 | 109.0 | 5000.0 |
4 | Snake | 4 | Italy | 4 | 198.1 | 4020.2 |
[51]:
df1.merge(df2, left_on='ID', right_on='DNI',how="left")
[51]:
name | ID | Country | DNI | Weight | Salary | |
---|---|---|---|---|---|---|
0 | Jhon | 1 | Italy | 1.0 | 145.0 | 3000.1 |
1 | Pep | 2 | Germany | NaN | NaN | NaN |
2 | William | 3 | Finland | 3.0 | 129.0 | 3000.0 |
3 | William | 3 | Finland | 3.0 | 159.0 | 4000.0 |
4 | William | 3 | Finland | 3.0 | 109.0 | 5000.0 |
5 | Snake | 4 | Italy | 4.0 | 198.1 | 4020.2 |
[52]:
df1.merge(df2, left_on='ID', right_on='DNI',how="right")
[52]:
name | ID | Country | DNI | Weight | Salary | |
---|---|---|---|---|---|---|
0 | Jhon | 1.0 | Italy | 1 | 145.0 | 3000.1 |
1 | William | 3.0 | Finland | 3 | 129.0 | 3000.0 |
2 | William | 3.0 | Finland | 3 | 159.0 | 4000.0 |
3 | William | 3.0 | Finland | 3 | 109.0 | 5000.0 |
4 | Snake | 4.0 | Italy | 4 | 198.1 | 4020.2 |
5 | NaN | NaN | NaN | 5 | 200.0 | 5000.2 |
[53]:
df1.merge(df2, left_on='ID', right_on='DNI',how="inner")
[53]:
name | ID | Country | DNI | Weight | Salary | |
---|---|---|---|---|---|---|
0 | Jhon | 1 | Italy | 1 | 145.0 | 3000.1 |
1 | William | 3 | Finland | 3 | 129.0 | 3000.0 |
2 | William | 3 | Finland | 3 | 159.0 | 4000.0 |
3 | William | 3 | Finland | 3 | 109.0 | 5000.0 |
4 | Snake | 4 | Italy | 4 | 198.1 | 4020.2 |
[54]:
df1.merge(df2,how="cross")
[54]:
name | ID | Country | DNI | Weight | Salary | |
---|---|---|---|---|---|---|
0 | Jhon | 1 | Italy | 1 | 145.0 | 3000.1 |
1 | Jhon | 1 | Italy | 3 | 129.0 | 3000.0 |
2 | Jhon | 1 | Italy | 3 | 159.0 | 4000.0 |
3 | Jhon | 1 | Italy | 3 | 109.0 | 5000.0 |
4 | Jhon | 1 | Italy | 4 | 198.1 | 4020.2 |
5 | Jhon | 1 | Italy | 5 | 200.0 | 5000.2 |
6 | Pep | 2 | Germany | 1 | 145.0 | 3000.1 |
7 | Pep | 2 | Germany | 3 | 129.0 | 3000.0 |
8 | Pep | 2 | Germany | 3 | 159.0 | 4000.0 |
9 | Pep | 2 | Germany | 3 | 109.0 | 5000.0 |
10 | Pep | 2 | Germany | 4 | 198.1 | 4020.2 |
11 | Pep | 2 | Germany | 5 | 200.0 | 5000.2 |
12 | William | 3 | Finland | 1 | 145.0 | 3000.1 |
13 | William | 3 | Finland | 3 | 129.0 | 3000.0 |
14 | William | 3 | Finland | 3 | 159.0 | 4000.0 |
15 | William | 3 | Finland | 3 | 109.0 | 5000.0 |
16 | William | 3 | Finland | 4 | 198.1 | 4020.2 |
17 | William | 3 | Finland | 5 | 200.0 | 5000.2 |
18 | Snake | 4 | Italy | 1 | 145.0 | 3000.1 |
19 | Snake | 4 | Italy | 3 | 129.0 | 3000.0 |
20 | Snake | 4 | Italy | 3 | 159.0 | 4000.0 |
21 | Snake | 4 | Italy | 3 | 109.0 | 5000.0 |
22 | Snake | 4 | Italy | 4 | 198.1 | 4020.2 |
23 | Snake | 4 | Italy | 5 | 200.0 | 5000.2 |
Usando el fichero who.csv, se pide:
[62]:
# read file csv
import pandas as pd
df_who = pd.read_csv("data/WHO.csv")
df_who.head()
[62]:
Country | CountryID | Continent | Adolescent fertility rate (%) | Adult literacy rate (%) | Gross national income per capita (PPP international $) | Net primary school enrolment ratio female (%) | Net primary school enrolment ratio male (%) | Population (in thousands) total | Population annual growth rate (%) | ... | Total_CO2_emissions | Total_income | Total_reserves | Trade_balance_goods_and_services | Under_five_mortality_from_CME | Under_five_mortality_from_IHME | Under_five_mortality_rate | Urban_population | Urban_population_growth | Urban_population_pct_of_total | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Afghanistan | 1 | 1 | 151.0 | 28.0 | NaN | NaN | NaN | 26088.0 | 4.0 | ... | 692.50 | NaN | NaN | NaN | 257.00 | 231.9 | 257.00 | 5740436.0 | 5.44 | 22.9 |
1 | Albania | 2 | 2 | 27.0 | 98.7 | 6000.0 | 93.0 | 94.0 | 3172.0 | 0.6 | ... | 3499.12 | 4.790000e+09 | 78.14 | -2.040000e+09 | 18.47 | 15.5 | 18.47 | 1431793.9 | 2.21 | 45.4 |
2 | Algeria | 3 | 3 | 6.0 | 69.9 | 5940.0 | 94.0 | 96.0 | 33351.0 | 1.5 | ... | 137535.56 | 6.970000e+10 | 351.36 | 4.700000e+09 | 40.00 | 31.2 | 40.00 | 20800000.0 | 2.61 | 63.3 |
3 | Andorra | 4 | 2 | NaN | NaN | NaN | 83.0 | 83.0 | 74.0 | 1.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | Angola | 5 | 3 | 146.0 | 67.4 | 3890.0 | 49.0 | 51.0 | 16557.0 | 2.8 | ... | 8991.46 | 1.490000e+10 | 27.13 | 9.140000e+09 | 164.10 | 242.5 | 164.10 | 8578749.0 | 4.14 | 53.3 |
5 rows × 358 columns
1) ¿Cuál és la media de la población urbana (“Urban_population”) de todos los países? ¿Su desviación típica (std)?
[64]:
df_who["Urban_population"].mean()
[64]:
16657626.767446807
[65]:
df_who["Urban_population"].std()
[65]:
50948665.823935635
2) Consulta la fila del país: “Spain”
[ ]:
df_who[df_who["Country"]=="Spain"]
3a) ¿Qué país tiene una mayor población urbana?
[70]:
df_who[df_who["Urban_population"] == df_who["Urban_population"].max()]
[70]:
Country | CountryID | Continent | Adolescent fertility rate (%) | Adult literacy rate (%) | Gross national income per capita (PPP international $) | Net primary school enrolment ratio female (%) | Net primary school enrolment ratio male (%) | Population (in thousands) total | Population annual growth rate (%) | ... | Total_CO2_emissions | Total_income | Total_reserves | Trade_balance_goods_and_services | Under_five_mortality_from_CME | Under_five_mortality_from_IHME | Under_five_mortality_rate | Urban_population | Urban_population_growth | Urban_population_pct_of_total | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
36 | China | 37 | 7 | 3.0 | 90.9 | 4660.0 | 96.0 | 100.0 | 1328474.0 | 0.6 | ... | 5547757.5 | 1.890000e+12 | 295.23 | 1.250000e+11 | 27.3 | 27.8 | 27.3 | 527000000.0 | 2.95 | 40.4 |
1 rows × 358 columns
3b) ¿Qué paises tienen una población urbana menor a 50000 ?
[71]:
df_who[df_who["Urban_population"] < 50000 ]
[71]:
Country | CountryID | Continent | Adolescent fertility rate (%) | Adult literacy rate (%) | Gross national income per capita (PPP international $) | Net primary school enrolment ratio female (%) | Net primary school enrolment ratio male (%) | Population (in thousands) total | Population annual growth rate (%) | ... | Total_CO2_emissions | Total_income | Total_reserves | Trade_balance_goods_and_services | Under_five_mortality_from_CME | Under_five_mortality_from_IHME | Under_five_mortality_rate | Urban_population | Urban_population_growth | Urban_population_pct_of_total | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
5 | Antigua and Barbuda | 6 | 4 | NaN | NaN | 15130.0 | NaN | NaN | 84.0 | 1.3 | ... | 421.36 | 830000000.0 | NaN | -102000000.0 | 12.60 | NaN | 12.60 | 32468.25 | 2.25 | 39.1 |
69 | Grenada | 70 | 5 | 53.0 | NaN | 8770.0 | 83.0 | 84.0 | 106.0 | 0.3 | ... | 234.50 | 414000000.0 | 23.19 | -220000000.0 | 22.00 | NaN | 22.00 | 32589.00 | 0.45 | 30.6 |
91 | Kiribati | 92 | 6 | NaN | NaN | 6230.0 | 98.0 | 96.0 | 94.0 | 1.7 | ... | 25.65 | 51900000.0 | NaN | -20800000.0 | 66.00 | NaN | 66.00 | 46926.00 | 3.08 | 47.4 |
151 | Saint Kitts and Nevis | 152 | 5 | NaN | NaN | 12440.0 | 78.0 | 64.0 | 50.0 | 1.3 | ... | 135.57 | 387000000.0 | 24.28 | -84300000.0 | 21.00 | NaN | 21.00 | 15456.00 | 1.77 | 32.2 |
152 | Saint Lucia | 153 | 5 | 51.0 | NaN | 8500.0 | 97.0 | 99.0 | 163.0 | 1.1 | ... | 370.06 | 764000000.0 | 27.39 | -119000000.0 | 17.70 | 14.0 | 17.70 | 45482.32 | 1.15 | 27.6 |
154 | Samoa | 155 | 6 | 45.0 | 98.6 | 5090.0 | 91.0 | 90.0 | 185.0 | 0.8 | ... | 150.22 | 286000000.0 | 12.46 | -116000000.0 | 29.98 | NaN | 29.98 | 41181.28 | 1.18 | 22.4 |
160 | Seychelles | 161 | 3 | NaN | 91.8 | 14360.0 | 100.0 | 99.0 | 86.0 | 0.7 | ... | 578.91 | 563000000.0 | 8.34 | -187000000.0 | 13.54 | NaN | 13.54 | 43854.10 | 1.20 | 52.9 |
182 | Tonga | 183 | 6 | 17.0 | 98.9 | 5470.0 | 94.0 | 97.0 | 100.0 | 0.5 | ... | 117.25 | 167000000.0 | 57.30 | -94600000.0 | 24.48 | NaN | 24.48 | 23846.64 | 1.04 | 24.0 |
8 rows × 358 columns
4) ¿El continente donde está situado Spain es el mismo que el de `United States of America?
Utiliza una condición para obtener un resultado Booleano (True o False)
[72]:
sample_spain = df_who[df_who["Country"]=="Spain"]
sample_spain
[72]:
Country | CountryID | Continent | Adolescent fertility rate (%) | Adult literacy rate (%) | Gross national income per capita (PPP international $) | Net primary school enrolment ratio female (%) | Net primary school enrolment ratio male (%) | Population (in thousands) total | Population annual growth rate (%) | ... | Total_CO2_emissions | Total_income | Total_reserves | Trade_balance_goods_and_services | Under_five_mortality_from_CME | Under_five_mortality_from_IHME | Under_five_mortality_rate | Urban_population | Urban_population_growth | Urban_population_pct_of_total | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
168 | Spain | 169 | 2 | 10.0 | 97.2 | 28200.0 | 99.0 | 100.0 | 43887.0 | 1.1 | ... | 343701.53 | 6.780000e+11 | NaN | -5.770000e+10 | 4.9 | 4.2 | 4.9 | 33300000.0 | 1.75 | 76.7 |
1 rows × 358 columns
[73]:
sample_eeuu = df_who[df_who["Country"]=="United States of America"]
sample_eeuu
[73]:
Country | CountryID | Continent | Adolescent fertility rate (%) | Adult literacy rate (%) | Gross national income per capita (PPP international $) | Net primary school enrolment ratio female (%) | Net primary school enrolment ratio male (%) | Population (in thousands) total | Population annual growth rate (%) | ... | Total_CO2_emissions | Total_income | Total_reserves | Trade_balance_goods_and_services | Under_five_mortality_from_CME | Under_five_mortality_from_IHME | Under_five_mortality_rate | Urban_population | Urban_population_growth | Urban_population_pct_of_total | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
192 | United States of America | 193 | 4 | 43.0 | NaN | 44070.0 | 93.0 | 91.0 | 302841.0 | 1.0 | ... | 5776431.5 | 1.100000e+13 | NaN | -7.140000e+11 | 8.0 | 7.1 | 8.0 | 240000000.0 | 1.39 | 80.8 |
1 rows × 358 columns
[79]:
sample_spain["Continent"].values[0] == sample_eeuu["Continent"].values[0]
[79]:
False
5) ¿Cuáles son los cinco paises más contaminantes (“Total_CO2_emissions”)?
Esta es mi pista para una solución elegante: http://pandas.pydata.org/pandas-docs/version/0.19.2/generated/pandas.DataFrame.sort_values.html
[ ]:
6) Observando algunas muestras del fichero puedes establecer la relación entre el identificador del continente y su nombre?
Es decir, sabemos que Spain está en el continente Europeo y el código del continente es el 2.
Existen los códigos de continentes: 1, 2, 3, 4, 5, 6, 7
Nota: Hay dos códigos asociados a Asia.
Haz las consultas pertinentes al dataframe para construir un diccionario con la siguiente estructura:
[ ]:
codigoContinentes = {1:"Asia",2:"Europa"} #Al menos hay 7!
print(codigoContinentes[2])
[ ]:
codigoContinentes = {1:"Asia",2:"Europa",3:"", 4:""} # TODO...
7) Una vez identificado el nombre de los continentes, ¿puedes cambiar la columna de identificadores de continentes por sus respectivos nombres?
Esta es es mi pista para una solución elegante: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.map.html
[ ]:
8) Puedes crear un nuevo dataframe con aquellos paises que sean de Europa?
[ ]:
df2 = df #Con una simple asignación ya creas una dataframe
type(df2)
9) ¿Cuáles son los paises más contaminantes de Europa?
[ ]:
10) Calcula la cantidad de ayuda recibida por cada municipio en función del númeto total de habitantes.
[7]:
import pandas as pd
import random
random.seed(0)
nombres = [f'Municipio{i}' for i in range(1, 11)]
data_municipios = {
'Nombre': nombres,
'Código Postal': [random.randint(10000, 99999) for _ in range(10)],
'Población': [random.randint(1000, 50000) for _ in range(10)] # Añadimos un atributo aleatorio, en este caso "Población"
}
df_municipios = pd.DataFrame(data_municipios)
data_ayudas = {
'Nombre': [f'Municipio{i}' for i in range(1, 11)],
'Ayuda Económica (en euros)': [random.randint(1000, 10000) for _ in range(10)],
'Número de Beneficiarios': [random.randint(10, 100) for _ in range(10)]
}
df_ayudas = pd.DataFrame(data_ayudas)
print("Dataframe de Municipios:")
print(df_municipios)
print("\nDataframe de Ayudas:")
print(df_ayudas)
Dataframe de Municipios:
Nombre Código Postal Población
0 Municipio1 60494 39232
1 Municipio2 65125 15315
2 Municipio3 15306 34075
3 Municipio4 43936 10127
4 Municipio5 77013 19470
5 Municipio6 73691 10158
6 Municipio7 63075 7214
7 Municipio8 49755 41525
8 Municipio9 72468 17417
9 Municipio10 56930 35902
Dataframe de Ayudas:
Nombre Ayuda Económica (en euros) Número de Beneficiarios
0 Municipio1 3407 88
1 Municipio2 6081 91
2 Municipio3 2618 36
3 Municipio4 2208 80
4 Municipio5 6409 71
5 Municipio6 8735 66
6 Municipio7 2649 76
7 Municipio8 6796 43
8 Municipio9 8113 17
9 Municipio10 6180 80
Usando el fichero climaMallorca.csv, se pide: 11) ¿Cual es la temperatura máxima cuando el viento es inferior a 10? ¿Cuántas muestras hay?
[ ]:
12) ¿Cual es la temperatura máxima cuando el viento es superior a 10 y inferior a 20? ¿Cuántas muestras hay?
[ ]:
Isaac Lera and Gabriel Moya Universitat de les Illes Balears isaac.lera@uib.edu, gabriel.moya@uib.edu