Pandas

Pandas es la librería por excelencia para el análisis de datos del lenguaje Python. Su nombre proviene de “panel data” (término econométrico). Inspirada en las funcionalidades de R, pero con el potencial de este lenguaje de propósito general.

Pandas incluye todas las funcionalidades necesarias para el proceso de análisis de datos: carga, filtrado, tratamiento, síntesis, agrupamiento, almacenamiento y visualización. Además, se integra con el resto de librerías de cálculo numérico como Numpy, Matplotlib, scikit-learn, … y de despliegue: HPC, Cloud, etc.

En resumen, es como una hoja de cálculo -por ejemplo excel- pero con más mucho más potencial!!!

Características principales

Todo el trabajo que realizaremos es sobre la estructura de datos básica: el dataFrame.

Un dataFrame es un objeto de dos dimensiones que contiene información. También puede verse como una hoja de cálculo, como una tabla de un modelo entidad-relación, o como una colección de una base de datos no relacional.

Documentación

[ ]:
%pip install pandas

Importación de la libreria

import pandas as pd
import pandas
from pandas import *

Por convención se hace de la siguiente manera, todas las funciones de la libreria se tienen que llamar con el prefijo pd.*:

[1]:
import pandas as pd

Carga de datos

Vamos a aprender Pandas a través de una serie de proyectos y ejemplos. En esta primera fase, vamos a cargar datos de un fichero CSV, recordad que son ficheros donde los atributos/valores de una observación están separados por una coma y las observaciones se separan mediante un salto de línea.

Podemos descargar los datos con los que trabajaremos del siguiente enlace WHO dataset

Empezamos viendo como se carga un dataframe a partir de un fichero en formato CSV.

[5]:
df = pd.read_csv("data/WHO.csv")
[6]:
df
[6]:
Country CountryID Continent Adolescent fertility rate (%) Adult literacy rate (%) Gross national income per capita (PPP international $) Net primary school enrolment ratio female (%) Net primary school enrolment ratio male (%) Population (in thousands) total Population annual growth rate (%) ... Total_CO2_emissions Total_income Total_reserves Trade_balance_goods_and_services Under_five_mortality_from_CME Under_five_mortality_from_IHME Under_five_mortality_rate Urban_population Urban_population_growth Urban_population_pct_of_total
0 Afghanistan 1 1 151.0 28.0 NaN NaN NaN 26088.0 4.0 ... 692.50 NaN NaN NaN 257.00 231.9 257.00 5740436.0 5.44 22.9
1 Albania 2 2 27.0 98.7 6000.0 93.0 94.0 3172.0 0.6 ... 3499.12 4.790000e+09 78.14 -2.040000e+09 18.47 15.5 18.47 1431793.9 2.21 45.4
2 Algeria 3 3 6.0 69.9 5940.0 94.0 96.0 33351.0 1.5 ... 137535.56 6.970000e+10 351.36 4.700000e+09 40.00 31.2 40.00 20800000.0 2.61 63.3
3 Andorra 4 2 NaN NaN NaN 83.0 83.0 74.0 1.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 Angola 5 3 146.0 67.4 3890.0 49.0 51.0 16557.0 2.8 ... 8991.46 1.490000e+10 27.13 9.140000e+09 164.10 242.5 164.10 8578749.0 4.14 53.3
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
197 Vietnam 198 6 25.0 90.3 2310.0 91.0 96.0 86206.0 1.4 ... 101826.23 4.480000e+10 47.11 -1.940000e+09 20.20 23.4 20.20 21900000.0 2.90 26.4
198 West Bank and Gaza 199 1 NaN NaN NaN NaN NaN NaN NaN ... 655.86 3.780000e+09 NaN NaN 28.00 25.8 28.00 2596216.0 3.33 71.6
199 Yemen 200 1 83.0 54.1 2090.0 65.0 85.0 21732.0 3.0 ... 20148.34 1.150000e+10 114.52 8.310000e+08 82.40 87.9 82.40 5759120.5 4.37 27.3
200 Zambia 201 3 161.0 68.0 1140.0 94.0 90.0 11696.0 1.9 ... 2366.94 4.090000e+09 10.41 -4.470000e+08 175.30 163.8 175.30 4017411.0 1.95 35.0
201 Zimbabwe 202 3 101.0 89.5 NaN 88.0 87.0 13228.0 0.8 ... 11457.33 5.620000e+09 3.39 -1.710000e+08 106.50 67.0 106.50 4709965.0 1.90 35.9

202 rows × 358 columns

A continuación se muestra la estructura interna del DataFrame. Se puede ver que és muy parecido a una tabla bidimensional:

[3]:
df
[3]:
Country CountryID Continent Adolescent fertility rate (%) Adult literacy rate (%) Gross national income per capita (PPP international $) Net primary school enrolment ratio female (%) Net primary school enrolment ratio male (%) Population (in thousands) total Population annual growth rate (%) ... Total_CO2_emissions Total_income Total_reserves Trade_balance_goods_and_services Under_five_mortality_from_CME Under_five_mortality_from_IHME Under_five_mortality_rate Urban_population Urban_population_growth Urban_population_pct_of_total
0 Afghanistan 1 1 151.0 28.0 NaN NaN NaN 26088.0 4.0 ... 692.50 NaN NaN NaN 257.00 231.9 257.00 5740436.0 5.44 22.9
1 Albania 2 2 27.0 98.7 6000.0 93.0 94.0 3172.0 0.6 ... 3499.12 4.790000e+09 78.14 -2.040000e+09 18.47 15.5 18.47 1431793.9 2.21 45.4
2 Algeria 3 3 6.0 69.9 5940.0 94.0 96.0 33351.0 1.5 ... 137535.56 6.970000e+10 351.36 4.700000e+09 40.00 31.2 40.00 20800000.0 2.61 63.3
3 Andorra 4 2 NaN NaN NaN 83.0 83.0 74.0 1.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 Angola 5 3 146.0 67.4 3890.0 49.0 51.0 16557.0 2.8 ... 8991.46 1.490000e+10 27.13 9.140000e+09 164.10 242.5 164.10 8578749.0 4.14 53.3
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
197 Vietnam 198 6 25.0 90.3 2310.0 91.0 96.0 86206.0 1.4 ... 101826.23 4.480000e+10 47.11 -1.940000e+09 20.20 23.4 20.20 21900000.0 2.90 26.4
198 West Bank and Gaza 199 1 NaN NaN NaN NaN NaN NaN NaN ... 655.86 3.780000e+09 NaN NaN 28.00 25.8 28.00 2596216.0 3.33 71.6
199 Yemen 200 1 83.0 54.1 2090.0 65.0 85.0 21732.0 3.0 ... 20148.34 1.150000e+10 114.52 8.310000e+08 82.40 87.9 82.40 5759120.5 4.37 27.3
200 Zambia 201 3 161.0 68.0 1140.0 94.0 90.0 11696.0 1.9 ... 2366.94 4.090000e+09 10.41 -4.470000e+08 175.30 163.8 175.30 4017411.0 1.95 35.0
201 Zimbabwe 202 3 101.0 89.5 NaN 88.0 87.0 13228.0 0.8 ... 11457.33 5.620000e+09 3.39 -1.710000e+08 106.50 67.0 106.50 4709965.0 1.90 35.9

202 rows × 358 columns

df.shape

[8]:
df.shape[0]
[8]:
202

Atributos de un DataFrame

Un dataframe dispone de diferentes atributos con los que podemos obtener su información o metainformación. Los siguientes ejemplos muestran cómo se pueden consultar sus dimensiones o un listado del nombre de sus columnas:

[4]:
df.shape # Ver las dimensiones **
[4]:
(202, 358)
[10]:
df.columns[0]
[10]:
'Country'

Podemos aplicar sobre el listado de columnas todas las operaciones sobre listas que hemos visto en la introducción del curso. A continuación tenemos dos ejemplos de indexación:

[12]:
df.columns[-2]
[12]:
'Urban_population_growth'
[13]:
df.columns[0:5]
[13]:
Index(['Country', 'CountryID', 'Continent', 'Adolescent fertility rate (%)',
       'Adult literacy rate (%)'],
      dtype='object')
[14]:
df.columns[-5:-1]
[14]:
Index(['Under_five_mortality_from_IHME', 'Under_five_mortality_rate',
       'Urban_population', 'Urban_population_growth'],
      dtype='object')

¿Cómo consultariais el nombre de la columna 10? ¿y los de las columnas de la 200 a la 225?

[8]:
df.columns[200:226]

df.columns[len(df.columns)-3:len(df.columns)]
[8]:
Index(['Urban_population', 'Urban_population_growth',
       'Urban_population_pct_of_total'],
      dtype='object')

Funciones descriptivas de un dataframe

Pandas ofrece una colección de funciones que permiten realizar una inspección general de la tabla de datos:

  • describe: muestra estadísticas descriptivas básicas para todas las columnas numéricas.

  • info: muestra todas las columnas y sus tipos de datos.

  • head i tail: muestra las \(n\) primeras/últimas filas. El valor de \(n\) es un parámetro de este método.

[15]:
df.describe()
[15]:
CountryID Continent Adolescent fertility rate (%) Adult literacy rate (%) Gross national income per capita (PPP international $) Net primary school enrolment ratio female (%) Net primary school enrolment ratio male (%) Population (in thousands) total Population annual growth rate (%) Population in urban areas (%) ... Total_CO2_emissions Total_income Total_reserves Trade_balance_goods_and_services Under_five_mortality_from_CME Under_five_mortality_from_IHME Under_five_mortality_rate Urban_population Urban_population_growth Urban_population_pct_of_total
count 202.000000 202.000000 177.000000 131.000000 178.000000 179.000000 179.000000 1.930000e+02 193.000000 193.000000 ... 1.860000e+02 1.780000e+02 128.000000 1.710000e+02 181.000000 170.000000 181.000000 1.880000e+02 188.000000 188.000000
mean 101.500000 3.579208 59.457627 78.871756 11250.112360 84.033520 85.698324 3.409805e+04 1.297927 54.911917 ... 1.483596e+05 2.015567e+11 57.253516 3.424012e+08 56.677624 54.356471 56.677624 1.665763e+07 2.165851 55.195213
std 58.456537 1.808263 49.105286 20.415760 12586.753417 17.788047 15.451212 1.304957e+05 1.163864 23.554182 ... 6.133091e+05 9.400689e+11 138.669298 5.943043e+10 60.060929 61.160556 60.060929 5.094867e+07 1.596628 23.742122
min 1.000000 1.000000 0.000000 23.600000 260.000000 6.000000 11.000000 2.000000e+00 -2.500000 10.000000 ... 2.565000e+01 5.190000e+07 0.990000 -7.140000e+11 2.900000 3.000000 2.900000 1.545600e+04 -1.160000 10.000000
25% 51.250000 2.000000 19.000000 68.400000 2112.500000 79.000000 79.500000 1.340000e+03 0.500000 36.000000 ... 1.672615e+03 3.317500e+09 16.292500 -1.210000e+09 12.400000 8.475000 12.400000 9.171623e+05 1.105000 35.650000
50% 101.500000 3.000000 46.000000 86.500000 6175.000000 90.000000 90.000000 6.762000e+03 1.300000 57.000000 ... 1.021157e+04 1.145000e+10 28.515000 -2.240000e+08 29.980000 27.600000 29.980000 3.427661e+06 1.945000 57.300000
75% 151.750000 5.000000 91.000000 95.300000 14502.500000 96.000000 96.000000 2.173200e+04 2.100000 73.000000 ... 6.549217e+04 8.680000e+10 55.310000 1.024000e+09 88.700000 82.900000 88.700000 9.837113e+06 3.252500 72.750000
max 202.000000 7.000000 199.000000 99.800000 60870.000000 100.000000 100.000000 1.328474e+06 4.300000 100.000000 ... 5.776432e+06 1.100000e+13 1334.860000 1.390000e+11 267.000000 253.700000 267.000000 5.270000e+08 7.850000 100.000000

8 rows × 357 columns

[16]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 202 entries, 0 to 201
Columns: 358 entries, Country to Urban_population_pct_of_total
dtypes: float64(355), int64(2), object(1)
memory usage: 565.1+ KB
[18]:
df.head(3)
[18]:
Country CountryID Continent Adolescent fertility rate (%) Adult literacy rate (%) Gross national income per capita (PPP international $) Net primary school enrolment ratio female (%) Net primary school enrolment ratio male (%) Population (in thousands) total Population annual growth rate (%) ... Total_CO2_emissions Total_income Total_reserves Trade_balance_goods_and_services Under_five_mortality_from_CME Under_five_mortality_from_IHME Under_five_mortality_rate Urban_population Urban_population_growth Urban_population_pct_of_total
0 Afghanistan 1 1 151.0 28.0 NaN NaN NaN 26088.0 4.0 ... 692.50 NaN NaN NaN 257.00 231.9 257.00 5740436.0 5.44 22.9
1 Albania 2 2 27.0 98.7 6000.0 93.0 94.0 3172.0 0.6 ... 3499.12 4.790000e+09 78.14 -2.040000e+09 18.47 15.5 18.47 1431793.9 2.21 45.4
2 Algeria 3 3 6.0 69.9 5940.0 94.0 96.0 33351.0 1.5 ... 137535.56 6.970000e+10 351.36 4.700000e+09 40.00 31.2 40.00 20800000.0 2.61 63.3

3 rows × 358 columns

[21]:
df.tail(2)
[21]:
Country CountryID Continent Adolescent fertility rate (%) Adult literacy rate (%) Gross national income per capita (PPP international $) Net primary school enrolment ratio female (%) Net primary school enrolment ratio male (%) Population (in thousands) total Population annual growth rate (%) ... Total_CO2_emissions Total_income Total_reserves Trade_balance_goods_and_services Under_five_mortality_from_CME Under_five_mortality_from_IHME Under_five_mortality_rate Urban_population Urban_population_growth Urban_population_pct_of_total
200 Zambia 201 3 161.0 68.0 1140.0 94.0 90.0 11696.0 1.9 ... 2366.94 4.090000e+09 10.41 -447000000.0 175.3 163.8 175.3 4017411.0 1.95 35.0
201 Zimbabwe 202 3 101.0 89.5 NaN 88.0 87.0 13228.0 0.8 ... 11457.33 5.620000e+09 3.39 -171000000.0 106.5 67.0 106.5 4709965.0 1.90 35.9

2 rows × 358 columns

[20]:
df.tail(2)

[20]:
Country CountryID Continent Adolescent fertility rate (%) Adult literacy rate (%) Gross national income per capita (PPP international $) Net primary school enrolment ratio female (%) Net primary school enrolment ratio male (%) Population (in thousands) total Population annual growth rate (%) ... Total_CO2_emissions Total_income Total_reserves Trade_balance_goods_and_services Under_five_mortality_from_CME Under_five_mortality_from_IHME Under_five_mortality_rate Urban_population Urban_population_growth Urban_population_pct_of_total
200 Zambia 201 3 161.0 68.0 1140.0 94.0 90.0 11696.0 1.9 ... 2366.94 4.090000e+09 10.41 -447000000.0 175.3 163.8 175.3 4017411.0 1.95 35.0
201 Zimbabwe 202 3 101.0 89.5 NaN 88.0 87.0 13228.0 0.8 ... 11457.33 5.620000e+09 3.39 -171000000.0 106.5 67.0 106.5 4709965.0 1.90 35.9

2 rows × 358 columns

Carga de datos (segunda parte)

Desafortunadamente, la estructura y codificación de los datos en los archivos CSV varía según la herramienta o el sistema operativo. Por lo tanto, podemos encontrarnos con separadores entre columnas que no sean la típica coma (‘,’) o formatos de codificación de texto que no sean abiertos (por ejemplo, utf-8, ansi, etc.).

Es por esto que la función read_csv es muy versátil. Puedes consultar su documentación.

Vamos a ver qué sucede cuando se obtienen datos de una administración pública. Puedes encontrar el archivo disponible en: ‘data/presupuesto_gastos_2023.csv’. Puedes acceder a estos datos a través del siguiente enlace.

[23]:
df_gastos = pd.read_csv("data/presupuesto_gastos_2023.csv") # quins errors genera ?
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
/Users/isaac/Projects/TxADM_notebooks/notebooks/Part2/00_Pandas/01_Introduccion.ipynb Cell 30 line <cell line: 1>()
----> <a href='vscode-notebook-cell:/Users/isaac/Projects/TxADM_notebooks/notebooks/Part2/00_Pandas/01_Introduccion.ipynb#X33sZmlsZQ%3D%3D?line=0'>1</a> df_gastos = pd.read_csv("data/presupuesto_gastos_2023.csv")

File ~/.pyenv/versions/3.9.7/lib/python3.9/site-packages/pandas/util/_decorators.py:211, in deprecate_kwarg.<locals>._deprecate_kwarg.<locals>.wrapper(*args, **kwargs)
    209     else:
    210         kwargs[new_arg_name] = new_arg_value
--> 211 return func(*args, **kwargs)

File ~/.pyenv/versions/3.9.7/lib/python3.9/site-packages/pandas/util/_decorators.py:331, in deprecate_nonkeyword_arguments.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
    325 if len(args) > num_allow_args:
    326     warnings.warn(
    327         msg.format(arguments=_format_argument_list(allow_args)),
    328         FutureWarning,
    329         stacklevel=find_stack_level(),
    330     )
--> 331 return func(*args, **kwargs)

File ~/.pyenv/versions/3.9.7/lib/python3.9/site-packages/pandas/io/parsers/readers.py:950, in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, encoding_errors, dialect, error_bad_lines, warn_bad_lines, on_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options)
    935 kwds_defaults = _refine_defaults_read(
    936     dialect,
    937     delimiter,
   (...)
    946     defaults={"delimiter": ","},
    947 )
    948 kwds.update(kwds_defaults)
--> 950 return _read(filepath_or_buffer, kwds)

File ~/.pyenv/versions/3.9.7/lib/python3.9/site-packages/pandas/io/parsers/readers.py:605, in _read(filepath_or_buffer, kwds)
    602 _validate_names(kwds.get("names", None))
    604 # Create the parser.
--> 605 parser = TextFileReader(filepath_or_buffer, **kwds)
    607 if chunksize or iterator:
    608     return parser

File ~/.pyenv/versions/3.9.7/lib/python3.9/site-packages/pandas/io/parsers/readers.py:1442, in TextFileReader.__init__(self, f, engine, **kwds)
   1439     self.options["has_index_names"] = kwds["has_index_names"]
   1441 self.handles: IOHandles | None = None
-> 1442 self._engine = self._make_engine(f, self.engine)

File ~/.pyenv/versions/3.9.7/lib/python3.9/site-packages/pandas/io/parsers/readers.py:1753, in TextFileReader._make_engine(self, f, engine)
   1750     raise ValueError(msg)
   1752 try:
-> 1753     return mapping[engine](f, **self.options)
   1754 except Exception:
   1755     if self.handles is not None:

File ~/.pyenv/versions/3.9.7/lib/python3.9/site-packages/pandas/io/parsers/c_parser_wrapper.py:79, in CParserWrapper.__init__(self, src, **kwds)
     76     kwds.pop(key, None)
     78 kwds["dtype"] = ensure_dtype_objs(kwds.get("dtype", None))
---> 79 self._reader = parsers.TextReader(src, **kwds)
     81 self.unnamed_cols = self._reader.unnamed_cols
     83 # error: Cannot determine type of 'names'

File ~/.pyenv/versions/3.9.7/lib/python3.9/site-packages/pandas/_libs/parsers.pyx:547, in pandas._libs.parsers.TextReader.__cinit__()

File ~/.pyenv/versions/3.9.7/lib/python3.9/site-packages/pandas/_libs/parsers.pyx:636, in pandas._libs.parsers.TextReader._get_header()

File ~/.pyenv/versions/3.9.7/lib/python3.9/site-packages/pandas/_libs/parsers.pyx:852, in pandas._libs.parsers.TextReader._tokenize_rows()

File ~/.pyenv/versions/3.9.7/lib/python3.9/site-packages/pandas/_libs/parsers.pyx:1965, in pandas._libs.parsers.raise_parser_error()

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc1 in position 1697: invalid start byte

Codificación de caracteres

Python utilitza una representación basada en Unicode (https://home.unicode.org/). Otros sistemas operativos y programas utilizan otro tipo de representaciones.

var = "camión"
var = "lul·lià"
var = "Ζεύς"
var = "ประเทศไทย"
var = "日本語で"

Codificaciones: - listado de codificaciones - UTF-8

[27]:
# cp1250 | windows-1250 | Central and Eastern Europe

df_gastos = pd.read_csv("data/presupuesto_gastos_2023.csv",sep=";")
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
/Users/isaac/Projects/TxADM_notebooks/notebooks/Part2/00_Pandas/01_Introduccion.ipynb Cell 32 line <cell line: 3>()
      <a href='vscode-notebook-cell:/Users/isaac/Projects/TxADM_notebooks/notebooks/Part2/00_Pandas/01_Introduccion.ipynb#X35sZmlsZQ%3D%3D?line=0'>1</a> # cp1250 | windows-1250 | Central and Eastern Europe
----> <a href='vscode-notebook-cell:/Users/isaac/Projects/TxADM_notebooks/notebooks/Part2/00_Pandas/01_Introduccion.ipynb#X35sZmlsZQ%3D%3D?line=2'>3</a> df_gastos = pd.read_csv("data/presupuesto_gastos_2023.csv",sep=";")

File ~/.pyenv/versions/3.9.7/lib/python3.9/site-packages/pandas/util/_decorators.py:211, in deprecate_kwarg.<locals>._deprecate_kwarg.<locals>.wrapper(*args, **kwargs)
    209     else:
    210         kwargs[new_arg_name] = new_arg_value
--> 211 return func(*args, **kwargs)

File ~/.pyenv/versions/3.9.7/lib/python3.9/site-packages/pandas/util/_decorators.py:331, in deprecate_nonkeyword_arguments.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
    325 if len(args) > num_allow_args:
    326     warnings.warn(
    327         msg.format(arguments=_format_argument_list(allow_args)),
    328         FutureWarning,
    329         stacklevel=find_stack_level(),
    330     )
--> 331 return func(*args, **kwargs)

File ~/.pyenv/versions/3.9.7/lib/python3.9/site-packages/pandas/io/parsers/readers.py:950, in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, encoding_errors, dialect, error_bad_lines, warn_bad_lines, on_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options)
    935 kwds_defaults = _refine_defaults_read(
    936     dialect,
    937     delimiter,
   (...)
    946     defaults={"delimiter": ","},
    947 )
    948 kwds.update(kwds_defaults)
--> 950 return _read(filepath_or_buffer, kwds)

File ~/.pyenv/versions/3.9.7/lib/python3.9/site-packages/pandas/io/parsers/readers.py:605, in _read(filepath_or_buffer, kwds)
    602 _validate_names(kwds.get("names", None))
    604 # Create the parser.
--> 605 parser = TextFileReader(filepath_or_buffer, **kwds)
    607 if chunksize or iterator:
    608     return parser

File ~/.pyenv/versions/3.9.7/lib/python3.9/site-packages/pandas/io/parsers/readers.py:1442, in TextFileReader.__init__(self, f, engine, **kwds)
   1439     self.options["has_index_names"] = kwds["has_index_names"]
   1441 self.handles: IOHandles | None = None
-> 1442 self._engine = self._make_engine(f, self.engine)

File ~/.pyenv/versions/3.9.7/lib/python3.9/site-packages/pandas/io/parsers/readers.py:1753, in TextFileReader._make_engine(self, f, engine)
   1750     raise ValueError(msg)
   1752 try:
-> 1753     return mapping[engine](f, **self.options)
   1754 except Exception:
   1755     if self.handles is not None:

File ~/.pyenv/versions/3.9.7/lib/python3.9/site-packages/pandas/io/parsers/c_parser_wrapper.py:79, in CParserWrapper.__init__(self, src, **kwds)
     76     kwds.pop(key, None)
     78 kwds["dtype"] = ensure_dtype_objs(kwds.get("dtype", None))
---> 79 self._reader = parsers.TextReader(src, **kwds)
     81 self.unnamed_cols = self._reader.unnamed_cols
     83 # error: Cannot determine type of 'names'

File ~/.pyenv/versions/3.9.7/lib/python3.9/site-packages/pandas/_libs/parsers.pyx:547, in pandas._libs.parsers.TextReader.__cinit__()

File ~/.pyenv/versions/3.9.7/lib/python3.9/site-packages/pandas/_libs/parsers.pyx:636, in pandas._libs.parsers.TextReader._get_header()

File ~/.pyenv/versions/3.9.7/lib/python3.9/site-packages/pandas/_libs/parsers.pyx:852, in pandas._libs.parsers.TextReader._tokenize_rows()

File ~/.pyenv/versions/3.9.7/lib/python3.9/site-packages/pandas/_libs/parsers.pyx:1965, in pandas._libs.parsers.raise_parser_error()

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc1 in position 1697: invalid start byte
[ ]:
df_gastos
[6]:
df_gastos = pd.read_csv("data/presupuesto_gastos_2023.csv",delimiter="\t",encoding="cp1250")

Otras maneras de cargar datos

[30]:
# por la dirección del fichero en web
df_who = pd.read_csv("http://www.exploredata.net/ftp/WHO.csv") #dataframe
[31]:
df_who.head(2)
[31]:
Country CountryID Continent Adolescent fertility rate (%) Adult literacy rate (%) Gross national income per capita (PPP international $) Net primary school enrolment ratio female (%) Net primary school enrolment ratio male (%) Population (in thousands) total Population annual growth rate (%) ... Total_CO2_emissions Total_income Total_reserves Trade_balance_goods_and_services Under_five_mortality_from_CME Under_five_mortality_from_IHME Under_five_mortality_rate Urban_population Urban_population_growth Urban_population_pct_of_total
0 Afghanistan 1 1 151.0 28.0 NaN NaN NaN 26088.0 4.0 ... 692.50 NaN NaN NaN 257.00 231.9 257.00 5740436.0 5.44 22.9
1 Albania 2 2 27.0 98.7 6000.0 93.0 94.0 3172.0 0.6 ... 3499.12 4.790000e+09 78.14 -2.040000e+09 18.47 15.5 18.47 1431793.9 2.21 45.4

2 rows × 358 columns

[33]:
# extracción de contenido en una página web
url = "https://es.wikipedia.org/wiki/Anexo:Comunidades_y_ciudades_aut%C3%B3nomas_de_Espa%C3%B1a"

comunidades_esp = pd.io.html.read_html(url)
comunidades_esp[0] # Alerta!

[33]:
Unnamed: 0 Nombre Capital (de iure o en su defecto de facto) Población (2022) Porcentaje población Densidad (hab./km²) Superficie (km²) Porcentaje superficie Mapa PIB per cápita en € (2021)
0 1 Andalucía Sevilla NaN 17,87 % 9705 NaN 17,31 % NaN NaN
1 2 Cataluña Barcelona NaN 16,38 % 23884 NaN 6,35 % NaN NaN
2 3 Comunidad de Madrid Madrid NaN 14,24 % 84115 NaN 1,59 % NaN NaN
3 4 Comunidad Valenciana Valencia NaN 10,67 % 21698 NaN 4,60 % NaN NaN
4 5 Galicia Santiago de Compostela NaN 5,68 % 9119 NaN 5,84 % NaN NaN
5 6 Castilla y León Valladolid[nota 1]​ NaN 5,02 % 2534 NaN 18,62 % NaN NaN
6 7 País Vasco Vitoria[nota 1]​ NaN 4,67 % 30213 NaN 1,43 % NaN NaN
7 8 Canarias Las Palmas de Gran Canaria y Santa Cruz de Ten... NaN 4,58 % 30139 [nota 3]​ 1,47 % NaN NaN
8 9 Castilla-La Mancha Toledo[nota 1]​ NaN 4,32 % 2579 NaN 15,70 % NaN NaN
9 10 Región de Murcia Murcia NaN 3,20 % 13374 NaN 2,24 % NaN NaN
10 11 Aragón Zaragoza NaN 2,79 % 2790 NaN 9,43 % NaN NaN
11 12 Islas Baleares Palma NaN 2,47 % 24427 NaN 0,99 % NaN NaN
12 13 Extremadura Mérida NaN 2,23 % 2541 NaN 8,23 % NaN NaN
13 14 Principado de Asturias Oviedo NaN 2,14 % 9553 NaN 2,13 % NaN NaN
14 15 Comunidad Foral de Navarra Pamplona NaN 1,39 % 6330 NaN 2,05 % NaN NaN
15 16 Cantabria Santander NaN 1,23 % 10974 NaN 1,05 % NaN NaN
16 17 La Rioja Logroño NaN 0,67 % 6267 NaN 1,00 % NaN NaN
17 18 Melilla Melilla NaN 0,18 % 700158 NaN <0,01 % NaN NaN
18 19 Ceuta Ceuta NaN 0,18 % 417510 NaN <0,01 % NaN NaN
19 TOTAL España Madrid NaN 100 % 9367 NaN 100 % NaN
[10]:
print(type(comunidades_esp[0]))
df_comunidades_esp = comunidades_esp[0]
df_comunidades_esp.head()
<class 'pandas.core.frame.DataFrame'>
[10]:
Unnamed: 0 Nombre Capital (de iure o en su defecto de facto) Población (2022) Porcentaje población Densidad (hab./km²) Superficie (km²) Porcentaje superficie Mapa PIB per cápita en € (2021)
0 1 Andalucía Sevilla NaN 17,87 % 9705 NaN 17,31 % NaN NaN
1 2 Cataluña Barcelona NaN 16,38 % 23884 NaN 6,35 % NaN NaN
2 3 Comunidad de Madrid Madrid NaN 14,24 % 84115 NaN 1,59 % NaN NaN
3 4 Comunidad Valenciana Valencia NaN 10,67 % 21698 NaN 4,60 % NaN NaN
4 5 Galicia Santiago de Compostela NaN 5,68 % 9119 NaN 5,84 % NaN NaN
[34]:
# Contenido JSON
#Fuente: https://data.cityofnewyork.us/Environment/Urban-Park-Ranger-Animal-Condition-Response/fuhs-xmg2
url = 'https://data.cityofnewyork.us/resource/s3vf-x992.json'
df = pd.read_json(url)
df.head()
[34]:
date_and_time_of_initial date_and_time_of_ranger borough property location species_description call_source species_status animal_condition duration_of_response ... _311sr_number final_ranger_action of_animals pep_response animal_monitored police_response esu_response acc_intake_number hours_spent_monitoring rehabilitator
0 2021-06-23T16:45:00.000 2021-06-24T08:00:00.000 Brooklyn Sternberg Park Inside locked athletic field under construction Chukar Other Exotic Healthy 6.00 ... 311-06712416 ACC 6 False False False False 163537 NaN NaN
1 2021-06-24T10:00:00.000 2021-06-24T11:00:00.000 Bronx Haffen Park Haffen Pool Sparrow Central Native Healthy 1.75 ... 311-06714879 Rehabilitator 4 False False False False NaN NaN NaN
2 2021-06-23T14:30:00.000 2021-06-23T14:30:00.000 Bronx Pelham Bay Park Pelham Bay South White-tailed Deer Employee Native N/A 1.00 ... NaN Unfounded 0 False False False False NaN NaN NaN
3 2021-06-23T13:00:00.000 2021-06-23T13:10:00.000 Staten Island Willowbrook Park The carousel Raccoon Employee Native N/A 2.00 ... NaN Unfounded 0 False False False False NaN NaN NaN
4 2021-06-23T09:20:00.000 2021-06-23T09:20:00.000 Queens Judge Moses Weinstein Playground Garbage can Virginia Opossum Central Native Healthy 2.25 ... 311-06699415 ACC 1 False False False False 119833 NaN NaN

5 rows × 22 columns

Actividades

En esta actividad practicaremos la carga de datos en diferentes formatos. En el mundo real, los datos no siempre tienen una estructura y un formato como desearíamos.

El objetivo es que analices la carga de estos datos con los datos originales: - ¿Qué dimensión tienen los datos reales y los cargados? - ¿Cuáles son las columnas? - ¿El concepto de columna como atributo o característica y el concepto de fila como muestra están presentes en la estructura de los datos? - ¿Coinciden con la información del archivo?

[13]:
# 1. Descarga y carga el siguiente fichero:
# https://data.cityofnewyork.us/Housing-Development/Speculation-Watch-List/adax-9mit
[14]:
# 2. Descarga y carga el siguiente fichero:
# https://ibestat.caib.es/ibestat/estadistiques/poblacio/naixements/414cab4f-b402-4cd1-af05-6617443de384


[15]:
# 3. Y ahora Descargalo y abrelo comprimido!!!. Es decir, sin descomprimir!
# https://ec.europa.eu/eurostat/databrowser/view/tin00171/default/table?lang=en
# Los archivos comprimidos en formato .gz se pueden abrir directamente como si fueran archivos de datos con Pandas, y en este caso, son archivos del tipo CSV. Es decir, no es necesario descomprimirlos.

Guardando un dataframe

Durante este curso aprenderemos a modificar los dataframes, agregaremos y eliminaremos columnas, y también modificaremos las que ya tenemos. Por lo tanto, después de realizar este trabajo, es necesario guardar los nuevos datos en un archivo.

[ ]:
df_gastos.to_csv('data/tmp_file.csv',encoding='utf-8') # guardant un dataframe en un fitxer, especificant el format

Estructura del dataframe

Ahora que ya sabemos cargar dataframes desde archivos, descubriremos cómo podemos acceder a la información que se encuentra dentro de los tipos de variables utilizados por Pandas.

Un dataframe tiene columnas y filas. Las filas son muestras y las columnas representan características de una muestra. Una columna es del tipo Serie.

El objetivo de esta unidad es adquirir herramientas para comprender y seleccionar los datos representados en un dataframe y una serie con Pandas.

Comenzaremos seleccionando columnas y obteniendo resúmenes estadísticos de ellas. Más adelante, pasaremos a realizar selecciones de filas en el dataframe. Finalmente, realizaremos selecciones combinadas creando nuestros propios dataframes a partir de los subconjuntos seleccionados.”

[3]:
import pandas as pd

df_who = pd.read_csv("http://www.exploredata.net/ftp/WHO.csv") #dataframe
[37]:
df_who.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 202 entries, 0 to 201
Columns: 358 entries, Country to Urban_population_pct_of_total
dtypes: float64(355), int64(2), object(1)
memory usage: 565.1+ KB
[38]:
df_who.describe()
[38]:
CountryID Continent Adolescent fertility rate (%) Adult literacy rate (%) Gross national income per capita (PPP international $) Net primary school enrolment ratio female (%) Net primary school enrolment ratio male (%) Population (in thousands) total Population annual growth rate (%) Population in urban areas (%) ... Total_CO2_emissions Total_income Total_reserves Trade_balance_goods_and_services Under_five_mortality_from_CME Under_five_mortality_from_IHME Under_five_mortality_rate Urban_population Urban_population_growth Urban_population_pct_of_total
count 202.000000 202.000000 177.000000 131.000000 178.000000 179.000000 179.000000 1.930000e+02 193.000000 193.000000 ... 1.860000e+02 1.780000e+02 128.000000 1.710000e+02 181.000000 170.000000 181.000000 1.880000e+02 188.000000 188.000000
mean 101.500000 3.579208 59.457627 78.871756 11250.112360 84.033520 85.698324 3.409805e+04 1.297927 54.911917 ... 1.483596e+05 2.015567e+11 57.253516 3.424012e+08 56.677624 54.356471 56.677624 1.665763e+07 2.165851 55.195213
std 58.456537 1.808263 49.105286 20.415760 12586.753417 17.788047 15.451212 1.304957e+05 1.163864 23.554182 ... 6.133091e+05 9.400689e+11 138.669298 5.943043e+10 60.060929 61.160556 60.060929 5.094867e+07 1.596628 23.742122
min 1.000000 1.000000 0.000000 23.600000 260.000000 6.000000 11.000000 2.000000e+00 -2.500000 10.000000 ... 2.565000e+01 5.190000e+07 0.990000 -7.140000e+11 2.900000 3.000000 2.900000 1.545600e+04 -1.160000 10.000000
25% 51.250000 2.000000 19.000000 68.400000 2112.500000 79.000000 79.500000 1.340000e+03 0.500000 36.000000 ... 1.672615e+03 3.317500e+09 16.292500 -1.210000e+09 12.400000 8.475000 12.400000 9.171623e+05 1.105000 35.650000
50% 101.500000 3.000000 46.000000 86.500000 6175.000000 90.000000 90.000000 6.762000e+03 1.300000 57.000000 ... 1.021157e+04 1.145000e+10 28.515000 -2.240000e+08 29.980000 27.600000 29.980000 3.427661e+06 1.945000 57.300000
75% 151.750000 5.000000 91.000000 95.300000 14502.500000 96.000000 96.000000 2.173200e+04 2.100000 73.000000 ... 6.549217e+04 8.680000e+10 55.310000 1.024000e+09 88.700000 82.900000 88.700000 9.837113e+06 3.252500 72.750000
max 202.000000 7.000000 199.000000 99.800000 60870.000000 100.000000 100.000000 1.328474e+06 4.300000 100.000000 ... 5.776432e+06 1.100000e+13 1334.860000 1.390000e+11 267.000000 253.700000 267.000000 5.270000e+08 7.850000 100.000000

8 rows × 357 columns

Columnes

Com hem comentat a l’introducció, començarem amb les columnes

[39]:
# Columnas o características de cada muestra
print(df_who.columns)
df.index
Index(['Country', 'CountryID', 'Continent', 'Adolescent fertility rate (%)',
       'Adult literacy rate (%)',
       'Gross national income per capita (PPP international $)',
       'Net primary school enrolment ratio female (%)',
       'Net primary school enrolment ratio male (%)',
       'Population (in thousands) total', 'Population annual growth rate (%)',
       ...
       'Total_CO2_emissions', 'Total_income', 'Total_reserves',
       'Trade_balance_goods_and_services', 'Under_five_mortality_from_CME',
       'Under_five_mortality_from_IHME', 'Under_five_mortality_rate',
       'Urban_population', 'Urban_population_growth',
       'Urban_population_pct_of_total'],
      dtype='object', length=358)
[39]:
RangeIndex(start=0, stop=1000, step=1)
[40]:
for ix, col in enumerate(df_who.columns): #label
    print("ix:%i\tlabel:%s"%(ix,col))
ix:0    label:Country
ix:1    label:CountryID
ix:2    label:Continent
ix:3    label:Adolescent fertility rate (%)
ix:4    label:Adult literacy rate (%)
ix:5    label:Gross national income per capita (PPP international $)
ix:6    label:Net primary school enrolment ratio female (%)
ix:7    label:Net primary school enrolment ratio male (%)
ix:8    label:Population (in thousands) total
ix:9    label:Population annual growth rate (%)
ix:10   label:Population in urban areas (%)
ix:11   label:Population living below the poverty line (% living on &lt; US$1 per day)
ix:12   label:Population median age (years)
ix:13   label:Population proportion over 60 (%)
ix:14   label:Population proportion under 15 (%)
ix:15   label:Registration coverage of births (%)
ix:16   label:Total fertility rate (per woman)
ix:17   label:Antenatal care coverage - at least four visits (%)
ix:18   label:Antiretroviral therapy coverage among HIV-infected pregt women for PMTCT (%)
ix:19   label:Antiretroviral therapy coverage among people with advanced HIV infections (%)
ix:20   label:Births attended by skilled health personnel (%)
ix:21   label:Births by caesarean section (%)
ix:22   label:Children aged 6-59 months who received vitamin A supplementation (%)
ix:23   label:Children aged &lt;5 years sleeping under insecticide-treated nets (%)
ix:24   label:Children aged &lt;5 years who received any antimalarial treatment for fever (%)
ix:25   label:Children aged &lt;5 years with ARI symptoms taken to facility (%)
ix:26   label:Children aged &lt;5 years with diarrhoea receiving ORT (%)
ix:27   label:Contraceptive prevalence (%)
ix:28   label:Neonates protected at birth against neonatal tetanus (PAB) (%)
ix:29   label:One-year-olds immunized with MCV
ix:30   label:One-year-olds immunized with three doses of diphtheria tetanus toxoid and pertussis (DTP3) (%)
ix:31   label:One-year-olds immunized with three doses of Hepatitis B (HepB3) (%)
ix:32   label:One-year-olds immunized with three doses of Hib (Hib3) vaccine (%)
ix:33   label:Tuberculosis detection rate under DOTS (%)
ix:34   label:Tuberculosis treatment success under DOTS (%)
ix:35   label:Women who have had mammography (%)
ix:36   label:Women who have had PAP smear (%)
ix:37   label:Community and traditional health workers density (per 10 000 population)
ix:38   label:Dentistry personnel density (per 10 000 population)
ix:39   label:Environment and public health workers density (per 10 000 population)
ix:40   label:External resources for health as percentage of total expenditure on health
ix:41   label:General government expenditure on health as percentage of total expenditure on health
ix:42   label:General government expenditure on health as percentage of total government expenditure
ix:43   label:Hospital beds (per 10 000 population)
ix:44   label:Laboratory health workers density (per 10 000 population)
ix:45   label:Number of community and traditional health workers
ix:46   label:Number of dentistry personnel
ix:47   label:Number of environment and public health workers
ix:48   label:Number of laboratory health workers
ix:49   label:Number of nursing and midwifery personnel
ix:50   label:Number of other health service providers
ix:51   label:Number of pharmaceutical personnel
ix:52   label:Number of physicians
ix:53   label:Nursing and midwifery personnel density (per 10 000 population)
ix:54   label:Other health service providers density (per 10 000 population)
ix:55   label:Out-of-pocket expenditure as percentage of private expenditure on health
ix:56   label:Per capita government expenditure on health (PPP int. $)
ix:57   label:Per capita government expenditure on health at average exchange rate (US$)
ix:58   label:Per capita total expenditure on health (PPP int. $)
ix:59   label:Per capita total expenditure on health at average exchange rate (US$)
ix:60   label:Pharmaceutical personnel density (per 10 000 population)
ix:61   label:Physicians density (per 10 000 population)
ix:62   label:Private expenditure on health as percentage of total expenditure on health
ix:63   label:Private prepaid plans as percentage of private expenditure on health
ix:64   label:Ratio of health management and support workers to health service providers
ix:65   label:Ratio of nurses and midwives to physicians
ix:66   label:Social security expenditure on health as percentage of general government expenditure on health
ix:67   label:Total expenditure on health as percentage of gross domestic product
ix:68   label:Births attended by skilled health personnel (%) highest educational level of mother
ix:69   label:Births attended by skilled health personnel (%) highest wealth quintile
ix:70   label:Births attended by skilled health personnel (%) lowest educational level of mother
ix:71   label:Births attended by skilled health personnel (%) lowest wealth quintile
ix:72   label:Births attended by skilled health personnel (%) rural
ix:73   label:Births attended by skilled health personnel (%) urban
ix:74   label:Births attended by skilled health personnel difference highest lowest educational level of mother
ix:75   label:Births attended by skilled health personnel difference highest-lowest wealth quintile
ix:76   label:Births attended by skilled health personnel difference urban-rural
ix:77   label:Births attended by skilled health personnel ratio highest-lowest educational level of mother
ix:78   label:Births attended by skilled health personnel ratio highest-lowest wealth quintile
ix:79   label:Births attended by skilled health personnel ratio urban-rural
ix:80   label:Measles immunization coverage among one-year-olds (%) highest educational level of mother
ix:81   label:Measles immunization coverage among one-year-olds (%) highest wealth quintile
ix:82   label:Measles immunization coverage among one-year-olds (%) lowest educational level of mother
ix:83   label:Measles immunization coverage among one-year-olds (%) lowest wealth quintile
ix:84   label:Measles immunization coverage among one-year-olds (%) rural
ix:85   label:Measles immunization coverage among one-year-olds (%) urban
ix:86   label:Measles immunization coverage among one-year-olds difference highest-lowest educational level of mother
ix:87   label:Measles immunization coverage among one-year-olds difference highest-lowest wealth quintile
ix:88   label:Measles immunization coverage among one-year-olds difference urban-rural
ix:89   label:Measles immunization coverage among one-year-olds ratio highest-lowest educational level of mother
ix:90   label:Measles immunization coverage among one-year-olds ratio highest-lowest wealth quintile
ix:91   label:Measles immunization coverage among one-year-olds ratio urban-rural
ix:92   label:Under-5 mortality rate (Probability of dying aged &lt; 5 years per 1 000 live births) difference lowest-highest educational level of mother
ix:93   label:Under-5 mortality rate (Probability of dying aged &lt; 5 years per 1 000 live births) difference lowest-highest wealth quintile
ix:94   label:Under-5 mortality rate (Probability of dying aged &lt; 5 years per 1 000 live births) difference rural-urban
ix:95   label:Under-5 mortality rate (Probability of dying aged &lt; 5 years per 1 000 live births) highest educational level of mother
ix:96   label:Under-5 mortality rate (Probability of dying aged &lt; 5 years per 1 000 live births) highest wealth quintile
ix:97   label:Under-5 mortality rate (Probability of dying aged &lt; 5 years per 1 000 live births) lowest educational level of mother
ix:98   label:Under-5 mortality rate (Probability of dying aged &lt; 5 years per 1 000 live births) lowest wealth quintile
ix:99   label:Under-5 mortality rate (Probability of dying aged &lt; 5 years per 1 000 live births) ratio lowest-highest educational level of mother
ix:100  label:Under-5 mortality rate (Probability of dying aged &lt; 5 years per 1 000 live births) ratio lowest-highest wealth quintile
ix:101  label:Under-5 mortality rate (Probability of dying aged &lt; 5 years per 1 000 live births) ratio rural-urban
ix:102  label:Under-5 mortality rate (Probability of dying aged &lt; 5 years per 1 000 live births) rural
ix:103  label:Under-5 mortality rate (Probability of dying aged &lt; 5 years per 1 000 live births) urban
ix:104  label:Adult mortality rate (probability of dying between 15 to 60 years per 1000 population) both sexes
ix:105  label:Adult mortality rate (probability of dying between 15 to 60 years per 1000 population) female
ix:106  label:Adult mortality rate (probability of dying between 15 to 60 years per 1000 population) male
ix:107  label:Age-standardized mortality rate for cancer (per 100 000 population)
ix:108  label:Age-standardized mortality rate for cardiovascular diseases (per 100 000 population)
ix:109  label:Age-standardized mortality rate for injuries (per 100 000 population)
ix:110  label:Age-standardized mortality rate for non-communicable diseases (per 100 000 population)
ix:111  label:Deaths among children under five years of age due to diarrhoeal diseases (%)
ix:112  label:Deaths among children under five years of age due to HIV/AIDS (%)
ix:113  label:Deaths among children under five years of age due to injuries (%)
ix:114  label:Deaths among children under five years of age due to malaria (%)
ix:115  label:Deaths among children under five years of age due to measles (%)
ix:116  label:Deaths among children under five years of age due to neonatal causes (%)
ix:117  label:Deaths among children under five years of age due to other causes (%)
ix:118  label:Deaths among children under five years of age due to pneumonia (%)
ix:119  label:Deaths due to HIV/AIDS (per 100 000 population per year)
ix:120  label:Deaths due to tuberculosis among HIV-negative people (per 100 000 population)
ix:121  label:Deaths due to tuberculosis among HIV-positive people (per 100 000 population)
ix:122  label:Healthy life expectancy (HALE) at birth (years) both sexes
ix:123  label:Healthy life expectancy (HALE) at birth (years) female
ix:124  label:Healthy life expectancy (HALE) at birth (years) male
ix:125  label:Incidence of tuberculosis (per 100 000 population per year)
ix:126  label:Infant mortality rate (per 1 000 live births) both sexes
ix:127  label:Infant mortality rate (per 1 000 live births) female
ix:128  label:Infant mortality rate (per 1 000 live births) male
ix:129  label:Life expectancy at birth (years) both sexes
ix:130  label:Life expectancy at birth (years) female
ix:131  label:Life expectancy at birth (years) male
ix:132  label:Maternal mortality ratio (per 100 000 live births)
ix:133  label:Neonatal mortality rate (per 1 000 live births)
ix:134  label:Number of confirmed poliomyelitis cases
ix:135  label:Prevalence of HIV among adults aged &gt;=15 years (per 100 000 population)
ix:136  label:Prevalence of tuberculosis (per 100 000 population)
ix:137  label:Under-5 mortality rate (probability of dying by age 5 per 1000 live births) both sexes
ix:138  label:Under-5 mortality rate (probability of dying by age 5 per 1000 live births) female
ix:139  label:Under-5 mortality rate (probability of dying by age 5 per 1000 live births) male
ix:140  label:Years of life lost to communicable diseases (%)
ix:141  label:Years of life lost to injuries (%)
ix:142  label:Years of life lost to non-communicable diseases (%)
ix:143  label:Children under five years of age overweight for age (%)
ix:144  label:Children under five years of age stunted for age (%)
ix:145  label:Children under five years of age underweight for age (%)
ix:146  label:Newborns with low birth weight (%)
ix:147  label:Per capita recorded alcohol consumption (litres of pure alcohol) among adults (&gt;=15 years)
ix:148  label:Population using solid fuels (%) rural
ix:149  label:Population using solid fuels (%) urban
ix:150  label:Population with sustainable access to improved drinking water sources (%) rural
ix:151  label:Population with sustainable access to improved drinking water sources (%) total
ix:152  label:Population with sustainable access to improved drinking water sources (%) urban
ix:153  label:Population with sustainable access to improved sanitation (%) rural
ix:154  label:Population with sustainable access to improved sanitation (%) total
ix:155  label:Population with sustainable access to improved sanitation (%) urban
ix:156  label:Prevalence of adults (&gt;=15 years) who are obese (%) female
ix:157  label:Prevalence of adults (&gt;=15 years) who are obese (%) male
ix:158  label:Prevalence of condom use by young people (15-24 years) at higher risk sex (%) female
ix:159  label:Prevalence of condom use by young people (15-24 years) at higher risk sex (%) male
ix:160  label:Prevalence of current tobacco use among adolescents (13-15 years) (%) both sexes
ix:161  label:Prevalence of current tobacco use among adolescents (13-15 years) (%) female
ix:162  label:Prevalence of current tobacco use among adolescents (13-15 years) (%) male
ix:163  label:Prevalence of current tobacco use among adults (&gt;=15 years) (%) both sexes
ix:164  label:Prevalence of current tobacco use among adults (&gt;=15 years) (%) female
ix:165  label:Prevalence of current tobacco use among adults (&gt;=15 years) (%) male
ix:166  label:Adolescent_fertility_rate
ix:167  label:Agricultural_land
ix:168  label:Agriculture_contribution_to_economy
ix:169  label:Aid_given
ix:170  label:Aid_received
ix:171  label:Aid_received_total
ix:172  label:All_forms_of_TB_new_cases_per_100_000_estimated
ix:173  label:All_forms_of_TB_new_cases_per_100_000_reported
ix:174  label:Annual_freshwater_withdrawals_total
ix:175  label:Arms_exports
ix:176  label:Arms_imports
ix:177  label:Bad_teeth_per_child
ix:178  label:Births_attended_by_skilled_health_staff
ix:179  label:Breast_cancer_deaths_per_100_000_women
ix:180  label:Breast_cancer_new_cases_per_100_000_women
ix:181  label:Breast_cancer_number_of_female_deaths
ix:182  label:Breast_cancer_number_of_new_female_cases
ix:183  label:Broadband_subscribers
ix:184  label:Broadband_subscribers_per_100_people
ix:185  label:CO2_emissions
ix:186  label:CO2_intensity_of_economic_output
ix:187  label:Capital_formation
ix:188  label:Cell_phones_per_100_people
ix:189  label:Cell_phones_total
ix:190  label:Central_bank_discount_rate
ix:191  label:Cervical_cancer_deaths_per_100_000_women
ix:192  label:Cervical_cancer_new_cases_per_100_000_women
ix:193  label:Cervical_cancer_number_of_female_deaths
ix:194  label:Cervical_cancer_number_of_new_female_cases
ix:195  label:Children_and_elderly
ix:196  label:Children_out_of_school_primary
ix:197  label:Children_out_of_school_primary_female
ix:198  label:Children_out_of_school_primary_male
ix:199  label:Children_per_woman
ix:200  label:Coal_consumption
ix:201  label:Coal_consumption_per_person
ix:202  label:Coal_production
ix:203  label:Coal_production_per_person
ix:204  label:Colon_and_Rectum_cancer_deaths_per_100_000_men
ix:205  label:Colon_and_Rectum_cancer_deaths_per_100_000_women
ix:206  label:Colon_and_Rectum_cancer_new_cases_per_100_000_men
ix:207  label:Colon_and_Rectum_cancer_new_cases_per_100_000_women
ix:208  label:Colon_and_Rectum_cancer_number_of_female_deaths
ix:209  label:Colon_and_Rectum_cancer_number_of_male_deaths
ix:210  label:Colon_and_Rectum_cancer_number_of_new_female_cases
ix:211  label:Colon_and_Rectum_cancer_number_of_new_male_cases
ix:212  label:Consumer_price_index
ix:213  label:Contraceptive_use
ix:214  label:Deaths_from_TB_per_100_000_estimated
ix:215  label:Debt_servicing_costs
ix:216  label:Democracy_score
ix:217  label:Electric_power_consumption
ix:218  label:Electricity_generation
ix:219  label:Electricity_generation_per_person
ix:220  label:Energy_use
ix:221  label:Expenditure_per_student_primary
ix:222  label:Expenditure_per_student_secondary
ix:223  label:Expenditure_per_student_tertiary
ix:224  label:Exports_of_goods_and_services
ix:225  label:Exports_unit_value
ix:226  label:External_debt_total_DOD_current_USdollars
ix:227  label:External_debt_total_pct_of_GNI
ix:228  label:Female_labour_force
ix:229  label:Fixed_line_and_mobile_phone_subscribers
ix:230  label:Foreign_direct_investment_net_inflows
ix:231  label:Foreign_direct_investment_net_outflows
ix:232  label:Forest_area
ix:233  label:Gross_capital_formation
ix:234  label:HIV_infected
ix:235  label:Health_expenditure_per_person
ix:236  label:Health_expenditure_private
ix:237  label:Health_expenditure_public_pct_of_GDP
ix:238  label:Health_expenditure_public_pct_of_government_expenditure
ix:239  label:Health_expenditure_public_pct_of_total_health_expenditure
ix:240  label:Health_expenditure_total
ix:241  label:High_technology_exports
ix:242  label:Hydroelectricity_consumption
ix:243  label:Hydroelectricity_consumption_per_person
ix:244  label:Imports_of_goods_and_services
ix:245  label:Imports_unit_value
ix:246  label:Improved_sanitation_facilities_urban
ix:247  label:Improved_water_source
ix:248  label:Income_growth
ix:249  label:Income_per_person
ix:250  label:Income_share_held_by_lowest_20pct
ix:251  label:Industry_contribution_to_economy
ix:252  label:Inequality_index
ix:253  label:Infant_mortality_rate
ix:254  label:Infectious_TB_new_cases_per_100_000_estimated
ix:255  label:Infectious_TB_new_cases_per_100_000_reported
ix:256  label:Infectious_TB_treatment_completeness
ix:257  label:Inflation_GDP_deflator
ix:258  label:Internet_users
ix:259  label:Life_expectancy_at_birth
ix:260  label:Literacy_rate_adult_female
ix:261  label:Literacy_rate_adult_male
ix:262  label:Literacy_rate_adult_total
ix:263  label:Literacy_rate_youth_female
ix:264  label:Literacy_rate_youth_male
ix:265  label:Literacy_rate_youth_total
ix:266  label:Liver_cancer_deaths_per_100_000_men
ix:267  label:Liver_cancer_deaths_per_100_000_women
ix:268  label:Liver_cancer_new_cases_per_100_000_men
ix:269  label:Liver_cancer_new_cases_per_100_000_women
ix:270  label:Liver_cancer_number_of_female_deaths
ix:271  label:Liver_cancer_number_of_male_deaths
ix:272  label:Liver_cancer_number_of_new_female_cases
ix:273  label:Liver_cancer_number_of_new_male_cases
ix:274  label:Lung_cancer_deaths_per_100_000_men
ix:275  label:Lung_cancer_deaths_per_100_000_women
ix:276  label:Lung_cancer_new_cases_per_100_000_men
ix:277  label:Lung_cancer_new_cases_per_100_000_women
ix:278  label:Lung_cancer_number_of_female_deaths
ix:279  label:Lung_cancer_number_of_male_deaths
ix:280  label:Lung_cancer_number_of_new_female_cases
ix:281  label:Lung_cancer_number_of_new_male_cases
ix:282  label:Malaria_prevention_insecticide_treated_bed_nets_usage
ix:283  label:Malaria_treatment
ix:284  label:Malnutrition_weight_for_age
ix:285  label:Market_value_of_listed_companies
ix:286  label:Maternal_mortality
ix:287  label:Math_achievement_4th_grade
ix:288  label:Math_achievement_8th_grade
ix:289  label:Measles_immunization
ix:290  label:Medical_Doctors
ix:291  label:Merchandise_trade
ix:292  label:Military_expenditure
ix:293  label:Natural_gas_consumption
ix:294  label:Natural_gas_consumption_per_person
ix:295  label:Natural_gas_production
ix:296  label:Natural_gas_production_per_person
ix:297  label:Natural_gas_proved_reserves
ix:298  label:Natural_gas_proven_reserves_per_person
ix:299  label:Net_barter_terms_of_trade
ix:300  label:Nuclear_consumption
ix:301  label:Nuclear_consumption_per_person
ix:302  label:Number_of_deaths_from_TB_estimated
ix:303  label:Number_of_existing_TB_cases_estimated
ix:304  label:Oil_consumption
ix:305  label:Oil_consumption_per_person
ix:306  label:Oil_production
ix:307  label:Oil_production_per_person
ix:308  label:Oil_proved_reserves
ix:309  label:Oil_proven_reserves_per_person
ix:310  label:Old_version_of_Income_per_person
ix:311  label:Patent_applications
ix:312  label:Patents_granted
ix:313  label:Patents_in_force
ix:314  label:People_living_with_HIV
ix:315  label:Personal_computers_per_100_people
ix:316  label:Personal_computers_total
ix:317  label:Population_growth
ix:318  label:Population_in_urban_agglomerations_more_than_1_million
ix:319  label:Population_total
ix:320  label:Poverty_headcount_ratio_at_national_poverty_line
ix:321  label:Present_value_of_debt
ix:322  label:Primary_completion_rate_total
ix:323  label:Primary_energy_consumption
ix:324  label:Primary_energy_consumption_per_person
ix:325  label:Primary_school_completion_pct_of_boys
ix:326  label:Primary_school_completion_pct_of_girls
ix:327  label:Prostate_cancer_deaths_per_100_000_men
ix:328  label:Prostate_cancer_new_cases_per_100_000_men
ix:329  label:Prostate_cancer_number_of_male_deaths
ix:330  label:Prostate_cancer_number_of_new_male_cases
ix:331  label:Pump_price_for_gasoline
ix:332  label:Ratio_of_girls_to_boys_in_primary_and_secondary_education
ix:333  label:Ratio_of_young_literate_females_to_males
ix:334  label:Roads_paved
ix:335  label:SO2_emissions_per_person
ix:336  label:Services_contribution_to_economy
ix:337  label:Stomach_cancer_deaths_per_100_000_men
ix:338  label:Stomach_cancer_deaths_per_100_000_women
ix:339  label:Stomach_cancer_new_cases_per_100_000_men
ix:340  label:Stomach_cancer_new_cases_per_100_000_women
ix:341  label:Stomach_cancer_number_of_female_deaths
ix:342  label:Stomach_cancer_number_of_male_deaths
ix:343  label:Stomach_cancer_number_of_new_female_cases
ix:344  label:Stomach_cancer_number_of_new_male_cases
ix:345  label:Sugar_per_person
ix:346  label:Surface_area
ix:347  label:Tax_revenue
ix:348  label:Total_CO2_emissions
ix:349  label:Total_income
ix:350  label:Total_reserves
ix:351  label:Trade_balance_goods_and_services
ix:352  label:Under_five_mortality_from_CME
ix:353  label:Under_five_mortality_from_IHME
ix:354  label:Under_five_mortality_rate
ix:355  label:Urban_population
ix:356  label:Urban_population_growth
ix:357  label:Urban_population_pct_of_total

Si inspeccionamos y comparamos los tipos del dataframe y de las columnas…

[41]:
type(df_who)
[41]:
pandas.core.frame.DataFrame
[42]:
type(df_who.columns)
[42]:
pandas.core.indexes.base.Index

Podemos utilizar el nombre de una columna para obtener los datos de dicha columna, tal como lo hacíamos con un diccionario python. Veremos dos maneras diferentes de hacerlo:

[44]:
paises = df_who["Country"]
print(paises[0:10])
0            Afghanistan
1                Albania
2                Algeria
3                Andorra
4                 Angola
5    Antigua and Barbuda
6              Argentina
7                Armenia
8              Australia
9                Austria
Name: Country, dtype: object

¿Qué tipo de datos es una columna?

[45]:
type(df_who["Country"])
[45]:
pandas.core.series.Series

Las Series són la otra estructura básica de Pandas. Las filas y las columnas se estructuran en Series, se pueden ver cómo un tipo de lista que solamente puede contener un único tipo de datos, acepta operaciones vectoriales y se puede indexar de manera similar a un diccionario.

Una vez que seleccionamos una columna, podemos acceder a sus elementos como si fueran una lista:

[46]:
print(df_who["Country"][0])
print("-"*30)
print(df_who["Country"][:5]) # slicing
print("-"*30)
print(df_who["Country"].values)
Afghanistan
------------------------------
0    Afghanistan
1        Albania
2        Algeria
3        Andorra
4         Angola
Name: Country, dtype: object
------------------------------
['Afghanistan' 'Albania' 'Algeria' 'Andorra' 'Angola'
 'Antigua and Barbuda' 'Argentina' 'Armenia' 'Australia' 'Austria'
 'Azerbaijan' 'Bahamas' 'Bahrain' 'Bangladesh' 'Barbados' 'Belarus'
 'Belgium' 'Belize' 'Benin' 'Bermuda' 'Bhutan' 'Bolivia'
 'Bosnia and Herzegovina' 'Botswana' 'Brazil' 'Brunei Darussalam'
 'Bulgaria' 'Burkina Faso' 'Burundi' 'Cambodia' 'Cameroon' 'Canada'
 'Cape Verde' 'Central African Republic' 'Chad' 'Chile' 'China' 'Colombia'
 'Comoros' 'Congo, Dem. Rep.' 'Congo, Rep.' 'Cook Islands' 'Costa Rica'
 "Cote d'Ivoire" 'Croatia' 'Cuba' 'Cyprus' 'Czech Republic' 'Denmark'
 'Djibouti' 'Dominica' 'Dominican Republic' 'Ecuador' 'Egypt'
 'El Salvador' 'Equatorial Guinea' 'Eritrea' 'Estonia' 'Ethiopia' 'Fiji'
 'Finland' 'France' 'French Polynesia' 'Gabon' 'Gambia' 'Georgia'
 'Germany' 'Ghana' 'Greece' 'Grenada' 'Guatemala' 'Guinea' 'Guinea-Bissau'
 'Guyana' 'Haiti' 'Honduras' 'Hong Kong, China' 'Hungary' 'Iceland'
 'India' 'Indonesia' 'Iran (Islamic Republic of)' 'Iraq' 'Ireland'
 'Israel' 'Italy' 'Jamaica' 'Japan' 'Jordan' 'Kazakhstan' 'Kenya'
 'Kiribati' 'Korea, Dem. Rep.' 'Korea, Rep.' 'Kuwait' 'Kyrgyzstan'
 "Lao People's Democratic Republic" 'Latvia' 'Lebanon' 'Lesotho' 'Liberia'
 'Libyan Arab Jamahiriya' 'Lithuania' 'Luxembourg' 'Macao, China'
 'Macedonia' 'Madagascar' 'Malawi' 'Malaysia' 'Maldives' 'Mali' 'Malta'
 'Marshall Islands' 'Mauritania' 'Mauritius' 'Mexico'
 'Micronesia (Federated States of)' 'Moldova' 'Monaco' 'Mongolia'
 'Montenegro' 'Morocco' 'Mozambique' 'Myanmar' 'Namibia' 'Nauru' 'Nepal'
 'Netherlands' 'Netherlands Antilles' 'New Caledonia' 'New Zealand'
 'Nicaragua' 'Niger' 'Nigeria' 'Niue' 'Norway' 'Oman' 'Pakistan' 'Palau'
 'Panama' 'Papua New Guinea' 'Paraguay' 'Peru' 'Philippines' 'Poland'
 'Portugal' 'Puerto Rico' 'Qatar' 'Romania' 'Russia' 'Rwanda'
 'Saint Kitts and Nevis' 'Saint Lucia' 'Saint Vincent and the Grenadines'
 'Samoa' 'San Marino' 'Sao Tome and Principe' 'Saudi Arabia' 'Senegal'
 'Serbia' 'Seychelles' 'Sierra Leone' 'Singapore' 'Slovakia' 'Slovenia'
 'Solomon Islands' 'Somalia' 'South Africa' 'Spain' 'Sri Lanka' 'Sudan'
 'Suriname' 'Swaziland' 'Sweden' 'Switzerland' 'Syria' 'Taiwan'
 'Tajikistan' 'Tanzania' 'Thailand' 'Timor-Leste' 'Togo' 'Tonga'
 'Trinidad and Tobago' 'Tunisia' 'Turkey' 'Turkmenistan' 'Tuvalu' 'Uganda'
 'Ukraine' 'United Arab Emirates' 'United Kingdom'
 'United States of America' 'Uruguay' 'Uzbekistan' 'Vanuatu' 'Venezuela'
 'Vietnam' 'West Bank and Gaza' 'Yemen' 'Zambia' 'Zimbabwe']

Existe otra manera más sencilla de seleccionar una única columna, pero existen nombres de columna muy largos: ‘Children aged <5 years who received any antimalarial treatment for fever (%)’.

Nota: En la creación de documentos, es importante usar nombres adecuados para las columnas.

[47]:
df_who.Country
[47]:
0             Afghanistan
1                 Albania
2                 Algeria
3                 Andorra
4                  Angola
              ...
197               Vietnam
198    West Bank and Gaza
199                 Yemen
200                Zambia
201              Zimbabwe
Name: Country, Length: 202, dtype: object
[21]:
print(df_who.columns[9])
print("-"*30)
print(df_who[df_who.columns[9]])
Population annual growth rate (%)
------------------------------
0      4.0
1      0.6
2      1.5
3      1.0
4      2.8
      ...
197    1.4
198    NaN
199    3.0
200    1.9
201    0.8
Name: Population annual growth rate (%), Length: 202, dtype: float64
[23]:
# Multiples columnas
df_who[df_who.columns[0:5]]
[23]:
Country CountryID Continent Adolescent fertility rate (%) Adult literacy rate (%)
0 Afghanistan 1 1 151.0 28.0
1 Albania 2 2 27.0 98.7
2 Algeria 3 3 6.0 69.9
3 Andorra 4 2 NaN NaN
4 Angola 5 3 146.0 67.4
... ... ... ... ... ...
197 Vietnam 198 6 25.0 90.3
198 West Bank and Gaza 199 1 NaN NaN
199 Yemen 200 1 83.0 54.1
200 Zambia 201 3 161.0 68.0
201 Zimbabwe 202 3 101.0 89.5

202 rows × 5 columns

[24]:
# Dos columnas específicas
df_who[["CountryID","Continent"]]
[24]:
CountryID Continent
0 1 1
1 2 2
2 3 3
3 4 2
4 5 3
... ... ...
197 198 6
198 199 1
199 200 1
200 201 3
201 202 3

202 rows × 2 columns

[48]:
# Existen dos métodos loc e iloc para acceder a sub-regiones de los datos
# funció: .iloc(filas, columnas)
df2 = df_who.iloc[:,3:5]
print(df2.shape)
print(df2.head())
(202, 2)
   Adolescent fertility rate (%)  Adult literacy rate (%)
0                          151.0                     28.0
1                           27.0                     98.7
2                            6.0                     69.9
3                            NaN                      NaN
4                          146.0                     67.4

Sobre Series

Cuando seleccionamos una columna de un DataFrame, obtenemos una Serie. Las Series tienen ciertas características, como la capacidad de aplicar métodos estadísticos (si son Series numéricas).

[6]:
fertilitat = df_who[df_who.columns[3]]
print("Min ", fertilitat.min()) # a Pandas el concepte d'iterar "no té sentit"
print("Max ", fertilitat.max())
print("Count ", fertilitat.count())

Min  0.0
Max  199.0
Count  177

Tabla con las funciones descriptivas 069a78d3e8b641b1a5fdb990b1fb9e02

Veremos que obtener esta información estadística nos puede ayudar a extraer información muy concreta de la tabla, por ejemplo, si queremos saber:

¿Qué país tiene la mayor emisión de CO2?

[50]:
co2 = df_who["Total_CO2_emissions"]
co2
[50]:
0         692.50
1        3499.12
2      137535.56
3            NaN
4        8991.46
         ...
197    101826.23
198       655.86
199     20148.34
200      2366.94
201     11457.33
Name: Total_CO2_emissions, Length: 202, dtype: float64
[51]:
co2.max()
[51]:
5776431.5
[55]:
co2==co2.max()
[55]:
0      False
1      False
2      False
3      False
4      False
       ...
197    False
198    False
199    False
200    False
201    False
Name: Total_CO2_emissions, Length: 202, dtype: bool
[56]:
df_who[co2==co2.max()]
[56]:
Country CountryID Continent Adolescent fertility rate (%) Adult literacy rate (%) Gross national income per capita (PPP international $) Net primary school enrolment ratio female (%) Net primary school enrolment ratio male (%) Population (in thousands) total Population annual growth rate (%) ... Total_CO2_emissions Total_income Total_reserves Trade_balance_goods_and_services Under_five_mortality_from_CME Under_five_mortality_from_IHME Under_five_mortality_rate Urban_population Urban_population_growth Urban_population_pct_of_total
192 United States of America 193 4 43.0 NaN 44070.0 93.0 91.0 302841.0 1.0 ... 5776431.5 1.100000e+13 NaN -7.140000e+11 8.0 7.1 8.0 240000000.0 1.39 80.8

1 rows × 358 columns

[57]:
df_who[co2==co2.max()]["Country"]
[57]:
192    United States of America
Name: Country, dtype: object
[58]:
df_who[co2==co2.max()]["Country"].values
[58]:
array(['United States of America'], dtype=object)
[59]:
df_who[co2==co2.max()]["Country"].values[0]
[59]:
'United States of America'
[64]:
# También existen métodos para seleccionar de manera aleatoria muestras dentro de una serie
# Tip: Métodos montecarlo
fertilidad = df_who[df_who.columns[3]]
some = fertilidad.sample(n=3)
print(some)
fertilidad.sample(n=3,random_state=2) #on random_state és la llavor/seed del aleatori
112     NaN
24     71.0
15     22.0
Name: Adolescent fertility rate (%), dtype: float64
[64]:
53      48.0
149     28.0
13     135.0
Name: Adolescent fertility rate (%), dtype: float64

Filas

Cada fila tiene un índice. El índice puede ser numérico, alfabético o temporal.

[65]:
df_who.index
[65]:
RangeIndex(start=0, stop=202, step=1)
[66]:
df_who.index.values
[66]:
array([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,
        13,  14,  15,  16,  17,  18,  19,  20,  21,  22,  23,  24,  25,
        26,  27,  28,  29,  30,  31,  32,  33,  34,  35,  36,  37,  38,
        39,  40,  41,  42,  43,  44,  45,  46,  47,  48,  49,  50,  51,
        52,  53,  54,  55,  56,  57,  58,  59,  60,  61,  62,  63,  64,
        65,  66,  67,  68,  69,  70,  71,  72,  73,  74,  75,  76,  77,
        78,  79,  80,  81,  82,  83,  84,  85,  86,  87,  88,  89,  90,
        91,  92,  93,  94,  95,  96,  97,  98,  99, 100, 101, 102, 103,
       104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116,
       117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129,
       130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142,
       143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155,
       156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168,
       169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181,
       182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194,
       195, 196, 197, 198, 199, 200, 201])

Así como seleccionamos columnas, podemos seleccionar información con para obtener filas. Para realizar la consulta de una fila concreta usaremos el atributo loc de los dataframes.

[1]:
import pandas as pd

df = pd.read_csv("http://www.exploredata.net/ftp/WHO.csv") #dataframe
[2]:

print(df.loc[0])
Country                           Afghanistan
CountryID                                   1
Continent                                   1
Adolescent fertility rate (%)           151.0
Adult literacy rate (%)                  28.0
                                     ...
Under_five_mortality_from_IHME          231.9
Under_five_mortality_rate               257.0
Urban_population                    5740436.0
Urban_population_growth                  5.44
Urban_population_pct_of_total            22.9
Name: 0, Length: 358, dtype: object

Si lo que necesitamos es obtener son los valores, necesitaremos el atributo values:

[3]:
df.loc[0].values
[3]:
array(['Afghanistan', 1, 1, 151.0, 28.0, nan, nan, nan, 26088.0, 4.0,
       23.0, nan, 16.0, 4.0, 47.0, 6.0, 7.2, nan, nan, nan, 14.0, nan,
       nan, nan, nan, nan, nan, 10.3, 73.0, 70.0, 83.0, 83.0, nan, 66.0,
       90.0, nan, nan, nan, nan, nan, 20.1, 27.5, 4.4, 4.0, nan, nan,
       900.0, nan, nan, 14930.0, nan, 900.0, 5970.0, 5.0, nan, 97.2, 8.0,
       6.0, 29.0, 23.0, nan, 2.0, 72.5, 0.0, nan, 2.5, 0.0, 5.4, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, 473.0, 443.0, 500.0, 153.0,
       706.0, 134.0, 1269.0, 18.9, 0.3, 1.1, 1.0, 5.9, 26.0, 22.1, 24.8,
       10.0, 32.0, 0.0, 36.0, 36.0, 35.0, 161.0, 165.0, 154.0, 176.0,
       42.0, 43.0, 42.0, 1800.0, 60.0, 16.0, 100.0, 231.0, 257.0, 254.0,
       260.0, 76.0, 6.0, 18.0, 4.6, 59.3, 32.9, nan, 0.01, nan, nan, 17.0,
       22.0, 37.0, 25.0, 30.0, 45.0, nan, nan, nan, nan, 9.8, 3.2, 13.1,
       nan, nan, nan, nan, 58.35, 36.1, nan, 108.83, 2750000000.0, 168.0,
       87.0, 42.29, 0.0, 28000000.0, 2.9, 14.3, 11.7, 26.8, 874.0, 2021.0,
       220.0, 0.000878, 0.02, 0.04, 62300000000.0, 4.0, 600000.0, nan,
       3.6, 6.9, 254.0, 511.0, 96.8, nan, nan, 1792633.0, 7.07, nan, nan,
       nan, nan, 3.3, 2.8, 5.2, 4.5, 193.0, 236.0, 316.0, 377.0, nan,
       10.3, 33.0, nan, -7.0, nan, nan, nan, nan, nan, nan, nan, 12.43,
       nan, nan, nan, nan, 5.19, 0.0, nan, 8670.0, 25.05, nan, 20.0, 4.16,
       1.04, 3.3, 20.0, 5.2, nan, nan, nan, 55.66, nan, 49.0, 39.0, nan,
       874.0, nan, 24.48, nan, 165.0, 76.0, 40.0, 90.0, 11.9, 1.0, 43.4,
       12.59, 43.14, 28.0, 18.39, 50.81, 34.26, 3.5, 2.3, 3.7, 2.5, 147.0,
       218.0, 155.0, 233.0, 11.3, 2.7, 12.2, 2.9, 173.0, 675.0, 190.0,
       732.0, nan, nan, nan, nan, 1900.0, nan, nan, 64.0, 0.19, 39.42,
       9.93, nan, nan, nan, nan, nan, nan, nan, nan, nan, 8242.0, 66826.0,
       nan, nan, nan, nan, nan, nan, 717.04, nan, nan, nan, nan, nan, nan,
       nan, nan, 29900000.0, nan, nan, 37.73, nan, nan, nan, nan, 2.8,
       4.5, 151.0, 249.0, 0.68, 55.57, 36.2, 23.66, 3.14, 39.42, 15.8,
       8.3, 18.5, 9.7, 499.0, 936.0, 592.0, 1108.0, nan, 652090.0, nan,
       692.5, nan, nan, nan, 257.0, 231.9, 257.0, 5740436.0, 5.44, 22.9],
      dtype=object)
[35]:
print(df_who.loc[0].Country)
print(df_who.loc[0][0])
print(df_who.loc[0][3])
Afghanistan
Afghanistan
151.0

Utilizando el atributo loc del dataframe podemos seleccionar y filtrar las filas (y columnas) mediante labels. En el caso de filas, el label es el índice y si éste es númerico podemos usar los slicing típicos de Python.

.loc(row_labels,columns_labels)

Recordemos el slicing:

sublista = lista[start:stop:step]

Dónde: * start: Posición de la lista original dónde empieza la sublista. Si no se indica és 0. * stop: Posición de la lista original hasta donde seleccionar. Se selecciona hasta la posición stop - 1. * step: Incremento entre cada índice de la selección, por defecto 1.

Si entendemos el concepto para un array…

[22]:
array =[1,2,3,4,5,6,7,9,0]
print(array[0:2]) #**
print(array[3:]) #**
print(array[:3]) #**
[1, 2]
[4, 5, 6, 7, 9, 0]
[1, 2, 3]

… podemos hacer las mismas operaciones con las filas de un dataFrame.

[4]:
df.loc[4:10:2]
[4]:
Country CountryID Continent Adolescent fertility rate (%) Adult literacy rate (%) Gross national income per capita (PPP international $) Net primary school enrolment ratio female (%) Net primary school enrolment ratio male (%) Population (in thousands) total Population annual growth rate (%) ... Total_CO2_emissions Total_income Total_reserves Trade_balance_goods_and_services Under_five_mortality_from_CME Under_five_mortality_from_IHME Under_five_mortality_rate Urban_population Urban_population_growth Urban_population_pct_of_total
4 Angola 5 3 146.0 67.4 3890.0 49.0 51.0 16557.0 2.8 ... 8991.46 1.490000e+10 27.13 9.140000e+09 164.1 242.5 164.1 8578749.0 4.14 53.3
6 Argentina 7 5 62.0 97.2 11670.0 98.0 99.0 39134.0 1.0 ... 152711.86 3.140000e+11 21.11 1.190000e+10 18.1 16.7 18.1 34900000.0 1.17 90.1
8 Australia 9 6 16.0 NaN 33940.0 97.0 96.0 20530.0 1.1 ... 368858.53 4.680000e+11 NaN -1.280000e+10 5.9 5.1 5.9 18000000.0 1.54 88.2
10 Azerbaijan 11 2 31.0 98.8 5430.0 83.0 86.0 8406.0 0.6 ... 36629.01 9.930000e+09 64.89 1.330000e+09 50.0 63.7 50.0 4321803.0 1.26 51.5

4 rows × 358 columns

También se puede realizar una selección particular mediante una lista:

[5]:
df.loc[[3,10,29,34]]
[5]:
Country CountryID Continent Adolescent fertility rate (%) Adult literacy rate (%) Gross national income per capita (PPP international $) Net primary school enrolment ratio female (%) Net primary school enrolment ratio male (%) Population (in thousands) total Population annual growth rate (%) ... Total_CO2_emissions Total_income Total_reserves Trade_balance_goods_and_services Under_five_mortality_from_CME Under_five_mortality_from_IHME Under_five_mortality_rate Urban_population Urban_population_growth Urban_population_pct_of_total
3 Andorra 4 2 NaN NaN NaN 83.0 83.0 74.0 1.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
10 Azerbaijan 11 2 31.0 98.8 5430.0 83.0 86.0 8406.0 0.6 ... 36629.01 9.930000e+09 64.89 1.330000e+09 50.0 63.7 50.0 4321803.0 1.26 51.5
29 Cambodia 30 7 52.0 73.6 1550.0 89.0 91.0 14197.0 1.7 ... 538.61 5.680000e+09 32.94 -5.470000e+08 97.3 94.3 97.3 2749235.0 4.58 19.7
34 Chad 35 3 193.0 25.7 1170.0 49.0 71.0 10468.0 3.1 ... 139.23 2.790000e+09 14.17 -2.210000e+08 208.2 180.1 208.2 2566839.0 4.88 25.3

4 rows × 358 columns

Selección de filas y columnas

Si seguimos con la misma lógica, usando el atributo loc de los dataFrames.

[6]:
import pandas as pd

df = pd.read_csv("http://www.exploredata.net/ftp/WHO.csv") #dataframe
[11]:
df.loc[0:1]
[11]:
Country CountryID Continent Adolescent fertility rate (%) Adult literacy rate (%) Gross national income per capita (PPP international $) Net primary school enrolment ratio female (%) Net primary school enrolment ratio male (%) Population (in thousands) total Population annual growth rate (%) ... Total_CO2_emissions Total_income Total_reserves Trade_balance_goods_and_services Under_five_mortality_from_CME Under_five_mortality_from_IHME Under_five_mortality_rate Urban_population Urban_population_growth Urban_population_pct_of_total
0 Afghanistan 1 1 151.0 28.0 NaN NaN NaN 26088.0 4.0 ... 692.50 NaN NaN NaN 257.00 231.9 257.00 5740436.0 5.44 22.9
1 Albania 2 2 27.0 98.7 6000.0 93.0 94.0 3172.0 0.6 ... 3499.12 4.790000e+09 78.14 -2.040000e+09 18.47 15.5 18.47 1431793.9 2.21 45.4

2 rows × 358 columns

Las columnas se deben seleccionar con una lista que debe contener el nombre de las columnas deseadas.

[12]:
df.loc[0:1,["Continent"]]
[12]:
Continent
0 1
1 2
[13]:
df.loc[0:3,["Continent","Total_CO2_emissions"]]
[13]:
Continent Total_CO2_emissions
0 1 692.50
1 2 3499.12
2 3 137535.56
3 2 NaN
[15]:
df.loc[[0,3,20,100],["Country","Continent","Total_CO2_emissions"]]
[15]:
Country Continent Total_CO2_emissions
0 Afghanistan 1 692.50
3 Andorra 2 NaN
20 Bhutan 7 414.03
100 Liberia 3 472.66

Alternativamente, con el atributo iloc podemos seleccionar las columnas con su índice numérico: su posicion en la lista de columnas.

[16]:
df.iloc[0]
[16]:
Country                           Afghanistan
CountryID                                   1
Continent                                   1
Adolescent fertility rate (%)           151.0
Adult literacy rate (%)                  28.0
                                     ...
Under_five_mortality_from_IHME          231.9
Under_five_mortality_rate               257.0
Urban_population                    5740436.0
Urban_population_growth                  5.44
Urban_population_pct_of_total            22.9
Name: 0, Length: 358, dtype: object
[17]:
df.iloc[0:4, 3:7] # ídem a una matriz
[17]:
Adolescent fertility rate (%) Adult literacy rate (%) Gross national income per capita (PPP international $) Net primary school enrolment ratio female (%)
0 151.0 28.0 NaN NaN
1 27.0 98.7 6000.0 93.0
2 6.0 69.9 5940.0 94.0
3 NaN NaN NaN 83.0
[18]:
df.iloc[0][0]
[18]:
'Afghanistan'
[19]:
df.iloc[0][0:4]
[19]:
Country                          Afghanistan
CountryID                                  1
Continent                                  1
Adolescent fertility rate (%)          151.0
Name: 0, dtype: object
[20]:
df.iloc[0][0:4].values
[20]:
array(['Afghanistan', 1, 1, 151.0], dtype=object)
[21]:
df.iloc[0][0:4].values[0]
[21]:
'Afghanistan'

Selección condicional

Además de la selección con base a índices, lo interesante es realizar selecciones mediante condiciones lógicas que permiten filtrar las filas del dataset. Por ejemplo:

[7]:
df_who = pd.read_csv("http://www.exploredata.net/ftp/WHO.csv") #dataframe

alfabetitzacio = df_who[df_who['Adult literacy rate (%)'] > 70][["Country","Adult literacy rate (%)"]]

alfabetitzacio[alfabetitzacio["Country"] == "Italy"]
[7]:
Country Adult literacy rate (%)
85 Italy 98.4
[8]:
df_who['Adult literacy rate (%)'] > 70
[8]:
0      False
1       True
2      False
3      False
4      False
       ...
197     True
198    False
199    False
200    False
201     True
Name: Adult literacy rate (%), Length: 202, dtype: bool
[37]:
seleccio = df_who['Adult literacy rate (%)'] > 70
[38]:
# En aquest codi, de les files on seleccio == True agafam les dues columnes que ens interessen
df_who[seleccio][["Country","Adult literacy rate (%)"]]
[38]:
Country Adult literacy rate (%)
1 Albania 98.7
6 Argentina 97.2
7 Armenia 99.4
10 Azerbaijan 98.8
12 Bahrain 86.5
... ... ...
193 Uruguay 96.8
195 Vanuatu 75.5
196 Venezuela 93.0
197 Vietnam 90.3
201 Zimbabwe 89.5

93 rows × 2 columns

[39]:
# Multiples criterios
# CO_emisions y Fertilidad
import numpy as np
ix = np.where((df_who["Total_CO2_emissions"] > 10) & (df_who[df_who.columns[3]] <=0.6))
ix
[39]:
(array([92]),)

Estadísticas en un DataFrame

[10]:
# Creamos un dataframe  con valores aleatorios
import numpy as np

np.random.seed(10)

df = pd.DataFrame({"one":np.random.randint(-10,10,5),
                  "two":np.random.random(5),
                  "three":np.random.randint(0,5,5)})
df
[10]:
one two three
0 -1 0.748804 0
1 -6 0.498507 2
2 5 0.224797 0
3 -10 0.198063 4
4 7 0.760531 3
[12]:
df.sum()
[12]:
one     -5.000000
two      2.430701
three    9.000000
dtype: float64
[13]:
df.sum(axis=1) # concepto de axis
[13]:
0    -0.251196
1    -3.501493
2     5.224797
3    -5.801937
4    10.760531
dtype: float64
[14]:
df.cumsum()
[14]:
one two three
0 -1 0.748804 0
1 -7 1.247311 2
2 -2 1.472108 2
3 -12 1.670170 6
4 -5 2.430701 9
[15]:
df.cumsum(axis=1)
[15]:
one two three
0 -1.0 -0.251196 -0.251196
1 -6.0 -5.501493 -3.501493
2 5.0 5.224797 5.224797
3 -10.0 -9.801937 -5.801937
4 7.0 7.760531 10.760531
[16]:
df.apply(np.abs,axis=1).cumsum() # función apply
[16]:
one two three
0 1 0.748804 0
1 7 1.247311 2
2 12 1.472108 2
3 22 1.670170 6
4 29 2.430701 9
[17]:
df.apply(np.abs).cumsum(axis=1)
[17]:
one two three
0 1.0 1.748804 1.748804
1 6.0 6.498507 8.498507
2 5.0 5.224797 5.224797
3 10.0 10.198063 14.198063
4 7.0 7.760531 10.760531
[18]:
df.cumprod()
[18]:
one two three
0 -1 0.748804 0
1 6 0.373284 0
2 30 0.083913 0
3 -300 0.016620 0
4 -2100 0.012640 0
[19]:
df.cumprod(axis=1)
[19]:
one two three
0 -1.0 -0.748804 -0.000000
1 -6.0 -2.991042 -5.982084
2 5.0 1.123983 0.000000
3 -10.0 -1.980629 -7.922515
4 7.0 5.323715 15.971145
[21]:
ones = np.ones(5) # numpy arrays
print(ones)

[1. 1. 1. 1. 1.]
[22]:
df.sub(ones,axis=0)
[22]:
one two three
0 -2.0 -0.251196 -1.0
1 -7.0 -0.501493 1.0
2 4.0 -0.775203 -1.0
3 -11.0 -0.801937 3.0
4 6.0 -0.239469 2.0
[23]:
df.sub(ones) #Alerta!
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/Users/isaac/Projects/TxADM_notebooks/notebooks/Part2/00_Pandas/01_Introduccion.ipynb Cell 131 line <cell line: 1>()
----> <a href='vscode-notebook-cell:/Users/isaac/Projects/TxADM_notebooks/notebooks/Part2/00_Pandas/01_Introduccion.ipynb#Y241sZmlsZQ%3D%3D?line=0'>1</a> df.sub(ones)

File ~/.pyenv/versions/3.9.7/lib/python3.9/site-packages/pandas/core/ops/__init__.py:436, in flex_arith_method_FRAME.<locals>.f(self, other, axis, level, fill_value)
    433 axis = self._get_axis_number(axis) if axis is not None else 1
    435 other = maybe_prepare_scalar_for_op(other, self.shape)
--> 436 self, other = align_method_FRAME(self, other, axis, flex=True, level=level)
    438 if isinstance(other, ABCDataFrame):
    439     # Another DataFrame
    440     new_data = self._combine_frame(other, na_op, fill_value)

File ~/.pyenv/versions/3.9.7/lib/python3.9/site-packages/pandas/core/ops/__init__.py:248, in align_method_FRAME(left, right, axis, flex, level)
    245 if isinstance(right, np.ndarray):
    247     if right.ndim == 1:
--> 248         right = to_series(right)
    250     elif right.ndim == 2:
    251         if right.shape == left.shape:

File ~/.pyenv/versions/3.9.7/lib/python3.9/site-packages/pandas/core/ops/__init__.py:239, in align_method_FRAME.<locals>.to_series(right)
    237 else:
    238     if len(left.columns) != len(right):
--> 239         raise ValueError(
    240             msg.format(req_len=len(left.columns), given_len=len(right))
    241         )
    242     right = left._constructor_sliced(right, index=left.columns)
    243 return right

ValueError: Unable to coerce to Series, length must be 3: given 5
[48]:
# nornalitzación z-score (media a 0 y desviación a 1)
ts_stand = (df - df.mean()) / df.std()
ts_stand
[48]:
one two three
0 0.000000 0.966021 -1.006231
1 -0.696733 0.045482 0.111803
2 0.836080 -0.961166 -1.006231
3 -1.254119 -1.059487 1.229837
4 1.114773 1.009150 0.670820
[49]:
print(ts_stand.mean())
print("----")
print(ts_stand.std())
one      0.000000e+00
two      2.664535e-16
three    6.661338e-17
dtype: float64
----
one      1.0
two      1.0
three    1.0
dtype: float64

Series no númericas

[24]:
df_who["Country"]
[24]:
0             Afghanistan
1                 Albania
2                 Algeria
3                 Andorra
4                  Angola
              ...
197               Vietnam
198    West Bank and Gaza
199                 Yemen
200                Zambia
201              Zimbabwe
Name: Country, Length: 202, dtype: object
[53]:
#https://www.w3schools.com/python/python_ref_string.asp
df_who.Country.str.casefold()

[53]:
0             afghanistan
1                 albania
2                 algeria
3                 andorra
4                  angola
              ...
197               vietnam
198    west bank and gaza
199                 yemen
200                zambia
201              zimbabwe
Name: Country, Length: 202, dtype: object
[25]:
df_who.Country.str.zfill(12)
[25]:
0            0Afghanistan
1            00000Albania
2            00000Algeria
3            00000Andorra
4            000000Angola
              ...
197          00000Vietnam
198    West Bank and Gaza
199          0000000Yemen
200          000000Zambia
201          0000Zimbabwe
Name: Country, Length: 202, dtype: object

Modificación del dataframe

Además de realizar selecciones, en algunos momentos necesitaremos incorporar nueva información a nuestras tablas de datos.

Vamos a crear un pequeño conjunto para practicar:

[26]:
df2 = pd.DataFrame([('Foreign Cinema', 'Restaurant', 289.0),
                   ('Liho Liho', 'Restaurant', 224.0),
                   ('500 Club', 'bar', 80.5),
                   ('The Square', 'bar', 25.30)],
           columns=('name', 'type', 'AvgBill')
                 )
df2
[26]:
name type AvgBill
0 Foreign Cinema Restaurant 289.0
1 Liho Liho Restaurant 224.0
2 500 Club bar 80.5
3 The Square bar 25.3

Tenemos diversas maneras de añadir columnas a un dataFrame:

  • Mediante el nombre de la columna que queremos añadir, tal como añadimos una nueva clave a un diccionario.

  • insert: es un método que necesita 3 parámetros. La posición en la que queremos añadir la columna (loc), su nombre (´column´) y la lista de valores (value).

  • assign: muy similar a la anterior, pero permite añadir múltiples columnas.

  • concat: no se suele usar para concatenar columnas, en el caso que queramos usarlo para este caso, deberemos poner el parámetro axis=1.

Veamos algunos ejemplos:

[24]:
df2['Day'] = "Monday" # Como un diccionario
df2
[24]:
name type AvgBill Day
0 Foreign Cinema Restaurant 289.0 Monday
1 Liho Liho Restaurant 224.0 Monday
2 500 Club bar 80.5 Monday
3 The Square bar 25.3 Monday
[28]:
df2['Day'] = ['Monday', 'Tuesday', 'Wednesday']
df2
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/Users/isaac/Projects/TxADM_notebooks/notebooks/Part2/00_Pandas/01_Introduccion.ipynb Cell 142 line <cell line: 1>()
----> <a href='vscode-notebook-cell:/Users/isaac/Projects/TxADM_notebooks/notebooks/Part2/00_Pandas/01_Introduccion.ipynb#Y316sZmlsZQ%3D%3D?line=0'>1</a> df2['Day'] = ['Monday', 'Tuesday', 'Wednesday']
      <a href='vscode-notebook-cell:/Users/isaac/Projects/TxADM_notebooks/notebooks/Part2/00_Pandas/01_Introduccion.ipynb#Y316sZmlsZQ%3D%3D?line=1'>2</a> df2

File ~/.pyenv/versions/3.9.7/lib/python3.9/site-packages/pandas/core/frame.py:3980, in DataFrame.__setitem__(self, key, value)
   3977     self._setitem_array([key], value)
   3978 else:
   3979     # set column
-> 3980     self._set_item(key, value)

File ~/.pyenv/versions/3.9.7/lib/python3.9/site-packages/pandas/core/frame.py:4174, in DataFrame._set_item(self, key, value)
   4164 def _set_item(self, key, value) -> None:
   4165     """
   4166     Add series to DataFrame in specified column.
   4167
   (...)
   4172     ensure homogeneity.
   4173     """
-> 4174     value = self._sanitize_column(value)
   4176     if (
   4177         key in self.columns
   4178         and value.ndim == 1
   4179         and not is_extension_array_dtype(value)
   4180     ):
   4181         # broadcast across multiple columns if necessary
   4182         if not self.columns.is_unique or isinstance(self.columns, MultiIndex):

File ~/.pyenv/versions/3.9.7/lib/python3.9/site-packages/pandas/core/frame.py:4915, in DataFrame._sanitize_column(self, value)
   4912     return _reindex_for_setitem(Series(value), self.index)
   4914 if is_list_like(value):
-> 4915     com.require_length_match(value, self.index)
   4916 return sanitize_array(value, self.index, copy=True, allow_2d=True)

File ~/.pyenv/versions/3.9.7/lib/python3.9/site-packages/pandas/core/common.py:571, in require_length_match(data, index)
    567 """
    568 Check the length of data matches the length of the index.
    569 """
    570 if len(data) != len(index):
--> 571     raise ValueError(
    572         "Length of values "
    573         f"({len(data)}) "
    574         "does not match length of index "
    575         f"({len(index)})"
    576     )

ValueError: Length of values (3) does not match length of index (4)
[29]:
#Vamos a usar el método insert
df2.insert(loc=1, column="Stars", value=[2,2,3,4])
df2
[29]:
name Stars type AvgBill Day
0 Foreign Cinema 2 Restaurant 289.0 Monday
1 Liho Liho 2 Restaurant 224.0 Tuesday
2 500 Club 3 bar 80.5 Wednesday
3 The Square 4 bar 25.3 Thursday
[33]:
df3 = df2.assign(AvgHalfBill=df2.AvgBill / 2, Michelin_Star=3)
df3

df3["HOLA"] = df3.name.str.capitalize()
[34]:
df3
[34]:
name Stars type AvgBill Day AvgHalfBill Michelin_Star HOLA
0 Foreign Cinema 2 Restaurant 289.0 Monday 144.50 3 Foreign cinema
1 Liho Liho 2 Restaurant 224.0 Tuesday 112.00 3 Liho liho
2 500 Club 3 bar 80.5 Wednesday 40.25 3 500 club
3 The Square 4 bar 25.3 Thursday 12.65 3 The square

Para añadir filas a un dataframe tenemos dos métodos que realizan la tarea y devuelven un nuevo dataFrame:

  • append: añade una fila a un dataFrame. La fila puede ser un diccionario, una Serie o otro Dataframe. El parámetro ignore_index = True significa que se ignorará el índice de la serie o el dataFrame de origen. En su lugar, se utilizará el índice disponible en el dataFrame de destino. El valor False significa lo contrario.

  • concat: concatena dos o más dataFrames separados por comas.

Tenemos el método drop que nos proporciona un nuevo dataFrame sin la(s) fila(s) o la(s) columna(s) que seleccionemos. Si queremos eliminar columnas podemos hacerlo especificando la lista de columnas en el parámetro columns de la siguiente manera:

[37]:
df3.drop(columns=["Michelin_Star"],inplace=True) # Eliminamos la última columna que hemos creado
df3
[37]:
name Stars type AvgBill Day AvgHalfBill HOLA
0 Foreign Cinema 2 Restaurant 289.0 Monday 144.50 Foreign cinema
1 Liho Liho 2 Restaurant 224.0 Tuesday 112.00 Liho liho
2 500 Club 3 bar 80.5 Wednesday 40.25 500 club
3 The Square 4 bar 25.3 Thursday 12.65 The square

Para poder eliminar filas, usamos la misma función, esta vez sin el parámetro que hemos usado anteriormente, simplemente indicamos los índices a eliminar:

[29]:
df_less_rows = df_no_michelin.drop([1,3])
df_less_rows
[29]:
name Stars type AvgBill Day AvgHalfBill
0 Foreign Cinema 2 Restaurant 289.0 Monday 144.50
2 500 Club 3 bar 80.5 Wednesday 40.25
[30]:
dfs1 = df3[df3.Stars>=3] # per una selecció?
dfs1
[30]:
name Stars type AvgBill Day AvgHalfBill Michelin_Star
2 500 Club 3 bar 80.5 Wednesday 40.25 3
3 The Square 4 bar 25.3 Thursday 12.65 3
[31]:
df3[df3.Stars>=3].index
[31]:
Int64Index([2, 3], dtype='int64')
[32]:
df3.drop(df3[df3.Stars>=3].index)
[32]:
name Stars type AvgBill Day AvgHalfBill Michelin_Star
0 Foreign Cinema 2 Restaurant 289.0 Monday 144.5 3
1 Liho Liho 2 Restaurant 224.0 Tuesday 112.0 3
[41]:
df3.drop(df3[df3.Stars>=3].index, inplace=True) #Alerta con la integración de los cambios
df3
[41]:
name Stars type AvgBill Day AvgHalfBill
0 Foreign Cinema 2 Restaurant 289.0 Monday 144.50
2 500 Club 3 bar 80.5 Wednesday 40.25

Concatenación y unión de dataframes

A veces, los datos vienen en diferentes archivos y necesitan ser combinados en un único archivo, este proceso implica la concatenación de dataframes. Otras veces, los datos son complementarios, es decir, hay nuevas columnas en un dataframe, y esto se conoce como realizar operaciones de unión (joins).

[38]:
df1 = pd.DataFrame([('Foreign Cinema', 'Restaurant', 289.0),
                   ('Liho Liho', 'Restaurant', 224.0),
                   ('500 Club', 'bar', 80.5),
                   ('The Square', 'bar', 25.30)],
           columns=('name', 'type', 'AvgBill')
                 )
df1
[38]:
name type AvgBill
0 Foreign Cinema Restaurant 289.0
1 Liho Liho Restaurant 224.0
2 500 Club bar 80.5
3 The Square bar 25.3
[43]:
df2 = pd.DataFrame([('Biels', 'Quiosquet', 389.0),
                   ('Barkus', 'Bar', 24.0),
                   ('Blue Wall', 'bar', 80.5),
                   ('Bounty Hunters', 'Social Club', 125.30)],
           columns=('name2', 'type', 'AvgBill')
                 )
df2
[43]:
name2 type AvgBill
0 Biels Quiosquet 389.0
1 Barkus Bar 24.0
2 Blue Wall bar 80.5
3 Bounty Hunters Social Club 125.3
[44]:
dfAll = pd.concat([df1,df2])
dfAll
[44]:
name type AvgBill name2
0 Foreign Cinema Restaurant 289.0 NaN
1 Liho Liho Restaurant 224.0 NaN
2 500 Club bar 80.5 NaN
3 The Square bar 25.3 NaN
0 NaN Quiosquet 389.0 Biels
1 NaN Bar 24.0 Barkus
2 NaN bar 80.5 Blue Wall
3 NaN Social Club 125.3 Bounty Hunters
[42]:
dfAll = dfAll.reset_index()
dfAll
[42]:
index name type AvgBill
0 0 Foreign Cinema Restaurant 289.0
1 1 Liho Liho Restaurant 224.0
2 2 500 Club bar 80.5
3 3 The Square bar 25.3
4 0 Biels Quiosquet 389.0
5 1 Barkus Bar 24.0
6 2 Blue Wall bar 80.5
7 3 Bounty Hunters Social Club 125.3
[ ]:
dfAll.drop(columns=["index"])
[ ]:
# https://pandas.pydata.org/docs/reference/api/pandas.concat.html
dfAll = pd.concat([df1,df2],ignore_index=True)
dfAll

Más información: https://pandas.pydata.org/docs/user_guide/merging.html

[45]:
df1 = pd.DataFrame([('Jhon', 1, "Italy"),
                   ('Pep', 2, "Germany"),
                   ('William', 3, "Finland"),
                   ('Snake', 4, "Italy")],
           columns=('name', 'ID', 'Country')
                 )
df1
[45]:
name ID Country
0 Jhon 1 Italy
1 Pep 2 Germany
2 William 3 Finland
3 Snake 4 Italy
[46]:
df2 = pd.DataFrame([(1, 145.0, 3000.1),
                   ( 2, 189.2, 2030.2),
                   ( 3, 129.0, 3000.0),
                   ( 4, 198.1, 4020.2)],
           columns=('DNI', 'Weight', 'Salary')
                 )
df2
[46]:
DNI Weight Salary
0 1 145.0 3000.1
1 2 189.2 2030.2
2 3 129.0 3000.0
3 4 198.1 4020.2
[47]:
# https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html

df1.merge(df2)
---------------------------------------------------------------------------
MergeError                                Traceback (most recent call last)
/Users/isaac/Projects/TxADM_notebooks/notebooks/Part2/00_Pandas/01_Introduccion.ipynb Cell 167 line <cell line: 3>()
      <a href='vscode-notebook-cell:/Users/isaac/Projects/TxADM_notebooks/notebooks/Part2/00_Pandas/01_Introduccion.ipynb#Y352sZmlsZQ%3D%3D?line=0'>1</a> # https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html
----> <a href='vscode-notebook-cell:/Users/isaac/Projects/TxADM_notebooks/notebooks/Part2/00_Pandas/01_Introduccion.ipynb#Y352sZmlsZQ%3D%3D?line=2'>3</a> df1.merge(df2)

File ~/.pyenv/versions/3.9.7/lib/python3.9/site-packages/pandas/core/frame.py:10093, in DataFrame.merge(self, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator, validate)
  10074 @Substitution("")
  10075 @Appender(_merge_doc, indents=2)
  10076 def merge(
   (...)
  10089     validate: str | None = None,
  10090 ) -> DataFrame:
  10091     from pandas.core.reshape.merge import merge
> 10093     return merge(
  10094         self,
  10095         right,
  10096         how=how,
  10097         on=on,
  10098         left_on=left_on,
  10099         right_on=right_on,
  10100         left_index=left_index,
  10101         right_index=right_index,
  10102         sort=sort,
  10103         suffixes=suffixes,
  10104         copy=copy,
  10105         indicator=indicator,
  10106         validate=validate,
  10107     )

File ~/.pyenv/versions/3.9.7/lib/python3.9/site-packages/pandas/core/reshape/merge.py:110, in merge(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator, validate)
     93 @Substitution("\nleft : DataFrame or named Series")
     94 @Appender(_merge_doc, indents=0)
     95 def merge(
   (...)
    108     validate: str | None = None,
    109 ) -> DataFrame:
--> 110     op = _MergeOperation(
    111         left,
    112         right,
    113         how=how,
    114         on=on,
    115         left_on=left_on,
    116         right_on=right_on,
    117         left_index=left_index,
    118         right_index=right_index,
    119         sort=sort,
    120         suffixes=suffixes,
    121         indicator=indicator,
    122         validate=validate,
    123     )
    124     return op.get_result(copy=copy)

File ~/.pyenv/versions/3.9.7/lib/python3.9/site-packages/pandas/core/reshape/merge.py:685, in _MergeOperation.__init__(self, left, right, how, on, left_on, right_on, axis, left_index, right_index, sort, suffixes, indicator, validate)
    681     # stacklevel chosen to be correct when this is reached via pd.merge
    682     # (and not DataFrame.join)
    683     warnings.warn(msg, FutureWarning, stacklevel=find_stack_level())
--> 685 self.left_on, self.right_on = self._validate_left_right_on(left_on, right_on)
    687 cross_col = None
    688 if self.how == "cross":

File ~/.pyenv/versions/3.9.7/lib/python3.9/site-packages/pandas/core/reshape/merge.py:1434, in _MergeOperation._validate_left_right_on(self, left_on, right_on)
   1432 common_cols = left_cols.intersection(right_cols)
   1433 if len(common_cols) == 0:
-> 1434     raise MergeError(
   1435         "No common columns to perform merge on. "
   1436         f"Merge options: left_on={left_on}, "
   1437         f"right_on={right_on}, "
   1438         f"left_index={self.left_index}, "
   1439         f"right_index={self.right_index}"
   1440     )
   1441 if (
   1442     not left_cols.join(common_cols, how="inner").is_unique
   1443     or not right_cols.join(common_cols, how="inner").is_unique
   1444 ):
   1445     raise MergeError(f"Data columns not unique: {repr(common_cols)}")

MergeError: No common columns to perform merge on. Merge options: left_on=None, right_on=None, left_index=False, right_index=False
[48]:
df1.merge(df2, left_on='ID', right_on='DNI')
[48]:
name ID Country DNI Weight Salary
0 Jhon 1 Italy 1 145.0 3000.1
1 Pep 2 Germany 2 189.2 2030.2
2 William 3 Finland 3 129.0 3000.0
3 Snake 4 Italy 4 198.1 4020.2
[49]:
df2 = pd.DataFrame([(1, 145.0, 3000.1), #No 2
                   ( 3, 129.0, 3000.0), # Multiples 3
                   ( 3, 159.0, 4000.0),
                   ( 3, 109.0, 5000.0),
                   ( 4, 198.1, 4020.2),
                   ( 5, 200.0, 5000.2)], #a new one
           columns=('DNI', 'Weight', 'Salary')
                 )
df2
[49]:
DNI Weight Salary
0 1 145.0 3000.1
1 3 129.0 3000.0
2 3 159.0 4000.0
3 3 109.0 5000.0
4 4 198.1 4020.2
5 5 200.0 5000.2
[50]:
df1.merge(df2, left_on='ID', right_on='DNI')
[50]:
name ID Country DNI Weight Salary
0 Jhon 1 Italy 1 145.0 3000.1
1 William 3 Finland 3 129.0 3000.0
2 William 3 Finland 3 159.0 4000.0
3 William 3 Finland 3 109.0 5000.0
4 Snake 4 Italy 4 198.1 4020.2
[51]:
df1.merge(df2, left_on='ID', right_on='DNI',how="left")
[51]:
name ID Country DNI Weight Salary
0 Jhon 1 Italy 1.0 145.0 3000.1
1 Pep 2 Germany NaN NaN NaN
2 William 3 Finland 3.0 129.0 3000.0
3 William 3 Finland 3.0 159.0 4000.0
4 William 3 Finland 3.0 109.0 5000.0
5 Snake 4 Italy 4.0 198.1 4020.2
[52]:
df1.merge(df2, left_on='ID', right_on='DNI',how="right")
[52]:
name ID Country DNI Weight Salary
0 Jhon 1.0 Italy 1 145.0 3000.1
1 William 3.0 Finland 3 129.0 3000.0
2 William 3.0 Finland 3 159.0 4000.0
3 William 3.0 Finland 3 109.0 5000.0
4 Snake 4.0 Italy 4 198.1 4020.2
5 NaN NaN NaN 5 200.0 5000.2
[53]:
df1.merge(df2, left_on='ID', right_on='DNI',how="inner")
[53]:
name ID Country DNI Weight Salary
0 Jhon 1 Italy 1 145.0 3000.1
1 William 3 Finland 3 129.0 3000.0
2 William 3 Finland 3 159.0 4000.0
3 William 3 Finland 3 109.0 5000.0
4 Snake 4 Italy 4 198.1 4020.2
[54]:
df1.merge(df2,how="cross")
[54]:
name ID Country DNI Weight Salary
0 Jhon 1 Italy 1 145.0 3000.1
1 Jhon 1 Italy 3 129.0 3000.0
2 Jhon 1 Italy 3 159.0 4000.0
3 Jhon 1 Italy 3 109.0 5000.0
4 Jhon 1 Italy 4 198.1 4020.2
5 Jhon 1 Italy 5 200.0 5000.2
6 Pep 2 Germany 1 145.0 3000.1
7 Pep 2 Germany 3 129.0 3000.0
8 Pep 2 Germany 3 159.0 4000.0
9 Pep 2 Germany 3 109.0 5000.0
10 Pep 2 Germany 4 198.1 4020.2
11 Pep 2 Germany 5 200.0 5000.2
12 William 3 Finland 1 145.0 3000.1
13 William 3 Finland 3 129.0 3000.0
14 William 3 Finland 3 159.0 4000.0
15 William 3 Finland 3 109.0 5000.0
16 William 3 Finland 4 198.1 4020.2
17 William 3 Finland 5 200.0 5000.2
18 Snake 4 Italy 1 145.0 3000.1
19 Snake 4 Italy 3 129.0 3000.0
20 Snake 4 Italy 3 159.0 4000.0
21 Snake 4 Italy 3 109.0 5000.0
22 Snake 4 Italy 4 198.1 4020.2
23 Snake 4 Italy 5 200.0 5000.2

Usando el fichero who.csv, se pide:

1) ¿Cuál és la media de la población urbana (“Urban_population”) de todos los países? ¿Su desviación típica (std)?

[ ]:

2) Consulta la fila del país: “Spain”

[ ]:

3a) ¿Qué país tiene una mayor población urbana?

[ ]:

3b) ¿Qué paises tienen una población urbana menor a 50000 ?

[ ]:

4) ¿El continente donde está situado Spain es el mismo que el de `United States of America?

Utiliza una condición para obtener un resultado Booleano (True o False)

[ ]:

5) ¿Cuáles son los cinco paises más contaminantes (“Total_CO2_emissions”)?

Esta es mi pista para una solución elegante: http://pandas.pydata.org/pandas-docs/version/0.19.2/generated/pandas.DataFrame.sort_values.html

[ ]:

6) Observando algunas muestras del fichero puedes establecer la relación entre el identificador del continente y su nombre?

Es decir, sabemos que Spain está en el continente Europeo y el código del continente es el 2.

Existen los códigos de continentes: 1, 2, 3, 4, 5, 6, 7

Nota: Hay dos códigos asociados a Asia.

Haz las consultas pertinentes al dataframe para construir un diccionario con la siguiente estructura:

[ ]:
codigoContinentes = {1:"Asia",2:"Europa"} #Al menos hay 7!
print(codigoContinentes[2])
[ ]:
codigoContinentes = {1:"Asia",2:"Europa",3:"", 4:""} # TODO...

7) Una vez identificado el nombre de los continentes, ¿puedes cambiar la columna de identificadores de continentes por sus respectivos nombres?

Esta es es mi pista para una solución elegante: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.map.html

[ ]:

8) Puedes crear un nuevo dataframe con aquellos paises que sean de Europa?

[ ]:
df2 = df #Con una simple asignación ya creas una dataframe
type(df2)

9) ¿Cuáles son los paises más contaminantes de Europa?

[ ]:

10) Calcula la cantidad de ayuda recibida por cada municipio en función del númeto total de habitantes.

[7]:
import pandas as pd
import random

random.seed(0)

nombres = [f'Municipio{i}' for i in range(1, 11)]

data_municipios = {
    'Nombre': nombres,
    'Código Postal': [random.randint(10000, 99999) for _ in range(10)],
    'Población': [random.randint(1000, 50000) for _ in range(10)]  # Añadimos un atributo aleatorio, en este caso "Población"
}

df_municipios = pd.DataFrame(data_municipios)


data_ayudas = {
    'Nombre':  [f'Municipio{i}' for i in range(1, 11)],
    'Ayuda Económica (en euros)': [random.randint(1000, 10000) for _ in range(10)],
    'Número de Beneficiarios': [random.randint(10, 100) for _ in range(10)]
}

df_ayudas = pd.DataFrame(data_ayudas)

print("Dataframe de Municipios:")
print(df_municipios)

print("\nDataframe de Ayudas:")
print(df_ayudas)
Dataframe de Municipios:
        Nombre  Código Postal  Población
0   Municipio1          60494      39232
1   Municipio2          65125      15315
2   Municipio3          15306      34075
3   Municipio4          43936      10127
4   Municipio5          77013      19470
5   Municipio6          73691      10158
6   Municipio7          63075       7214
7   Municipio8          49755      41525
8   Municipio9          72468      17417
9  Municipio10          56930      35902

Dataframe de Ayudas:
        Nombre  Ayuda Económica (en euros)  Número de Beneficiarios
0   Municipio1                        3407                       88
1   Municipio2                        6081                       91
2   Municipio3                        2618                       36
3   Municipio4                        2208                       80
4   Municipio5                        6409                       71
5   Municipio6                        8735                       66
6   Municipio7                        2649                       76
7   Municipio8                        6796                       43
8   Municipio9                        8113                       17
9  Municipio10                        6180                       80

Usando el fichero climaMallorca.csv, se pide: 11) ¿Cual es la temperatura máxima cuando el viento es inferior a 10? ¿Cuántas muestras hay?

[ ]:

12) ¿Cual es la temperatura máxima cuando el viento es superior a 10 y inferior a 20? ¿Cuántas muestras hay?

[ ]:

License: CC BY 4.0 Isaac Lera and Gabriel Moya Universitat de les Illes Balears isaac.lera@uib.edu, gabriel.moya@uib.edu