My code starts this way: it takes data from HERE, and I want to extract all the rows where "fascia_anagrafica" equals "20-29". In Italian, "fascia_anagrafica" means "age range". That was relatively simple, as you can see below, and I dropped some columns I don't need.
import pandas as pd
import json
import numpy
import sympy
from numpy import arange,exp
from scipy.optimize import curve_fit
from matplotlib import pyplot
import math
import decimal
df = pd.read_csv('https://raw.githubusercontent.com/italia/covid19-opendata-vaccini/master/dati/somministrazioni-vaccini-latest.csv')
df = df[df["fascia_anagrafica"] == "20-29"]
df01=df.drop(columns= ["fornitore","area","sesso_maschile","sesso_femminile","seconda_dose","pregressa_infezione","dose_aggiuntiva","codice_NUTS1","codice_NUTS2","codice_regione_ISTAT","nome_area"])
Now the dataframe looks like this: IMAGE
As you can see, every date has rows for the "20-29" age range, and every row has the value "prima_dose", which stands for "first dose".
Now the problem: if you look at the date "2020-12-27", you will notice that it is repeated about 20 times (with 20 different values), because Italy has 21 regions; the same applies to the other dates. Unfortunately, there are not always 21 rows per date, because some regions reported no values on certain days, so the dataframe is NOT periodic.
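To show what I mean by "not periodic", here is a minimal sketch on toy data that counts how many rows each date has. The column name "date" and the numbers are made up for illustration; the real file may name its date column differently.

```python
import pandas as pd

# Toy data standing in for the real file.
# "date" is a hypothetical column name; the values are invented.
toy = pd.DataFrame({
    "date": ["2020-12-27"] * 3 + ["2020-12-28"] * 2,
    "prima_dose": [10, 4, 7, 3, 5],
})

# Count the rows (regions that reported) for each date.
rows_per_date = toy.groupby("date").size()
print(rows_per_date)
```

If the counts differ between dates, a fixed-size slicing approach will not work and the rows have to be grouped by date instead.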
I want to add a column to the dataframe that holds, for each date, the sum of the values that share that date, for all dates in the dataframe. An example:
Date         prima_dose   sum_column
2020-8-9     1            13   <---- this is 1+3+4+5 on the day 2020-8-9
2020-8-9     3            8    <---- this is 2+5+1 on the day 2020-8-10
2020-8-9     4            and so on...
2020-8-9     5
2020-8-10    2
2020-8-10    5
2020-8-10    1
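For illustration, one possible way to get such a column is pandas `groupby` combined with `transform("sum")`, which writes each date's total back onto every row of that date. This is only a sketch on the toy numbers above, not tested on the real file; "date" is a hypothetical column name, and this variant repeats the daily total on every row of the same date rather than listing each total once.

```python
import pandas as pd

# Toy data matching the example above; "date" is a hypothetical column name.
toy = pd.DataFrame({
    "date": ["2020-8-9"] * 4 + ["2020-8-10"] * 3,
    "prima_dose": [1, 3, 4, 5, 2, 5, 1],
})

# For every row, store the sum of prima_dose over all rows with the same date.
# Groups may have different sizes, so missing regions on some days are fine.
toy["sum_column"] = toy.groupby("date")["prima_dose"].transform("sum")
print(toy)
```

If only one total per date is needed, `toy.groupby("date")["prima_dose"].sum()` would give a shorter result with one row per date.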
Thanks!