I am trying to complete the below question and am running into some creating this function and assigning string values to the various dates. How would I go about creating this function to return the various string values.
1 Answer
Here is my suggestion :
EDIT : using this question for the regex to check time validity
import pandas as pd
import numpy as np
import re
def compute_time_day_year(data_dict):
"""
returns: pandas DataFrame with variables weekday (Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday),
time_of_day (Morning - 06:00-11:59, Afternoon - 12:00-17:59, Evening - 18:00-23:59, Night - 00:00 - 05:59),
and season (Summer for June, July and August; Autumn for September, October, November; Winter for
December, January, February; Spring for March, April, May). If either of the input parameters is in incorrect
form, the function returns INVALID in all outputs
"""
df = pd.DataFrame({'year': (data_dict['Year']),
'month': (data_dict['Crash_Month']),
'day': (data_dict['Crash_Day']),
'time': (data_dict['Crash_Time'])})
df["invalid_time"] = df["time"].apply(lambda x: not re.match("^([0-1]?[0-9]|2[0-3]):[0-5][0-9]:[0-5][0-9]$", x))
df.loc[~df["invalid_time"], 'date'] = pd.to_datetime(df.loc[~df["invalid_time"],'year'].astype(str)
+ df.loc[~df["invalid_time"],'month'].astype(str).str.zfill(2)
+ df.loc[~df["invalid_time"],'day'].astype(str).str.zfill(2)
+ ' '
+ df.loc[~df["invalid_time"],'time'].astype(str), format='%Y%m%d %H:%M:%S')
df.loc[~df["invalid_time"],'weekday'] = df.loc[~df["invalid_time"],"date"].dt.day_name(locale="english")
df.loc[~df["invalid_time"],'season'] = (df.loc[~df["invalid_time"],'date'] - pd.DateOffset(months=1))\
.dt.quarter
df.loc[~df["invalid_time"],'season'] = df.loc[~df["invalid_time"],'season']\
.map({1: 'Winter', 2: 'Spring', 3: 'Summer', 4: 'Autumn'})
df.loc[~df["invalid_time"],"time_of_day"] = df.loc[~df["invalid_time"],"date"].dt.hour\
.apply(lambda x: np.select(
[0 < x <= 6,
6 < x <= 12,
12 < x <= 18,
18 < x <= 24],
["Night", "Morning", "Afternoon", "Evening"]
))
df.loc[df["invalid_time"],["weekday", "season", "time_of_day"]] = "Invalid"
return df.loc[:, ["weekday", "season", "time_of_day"]]
data_dict = {'Year': [2018, 2019, 2020],
'Crash_Month': [1, 2, 3],
'Crash_Day': [4, 5, 6],
'Crash_Time': ["8:00:00", '26:22:00', '8:12:00']}
compute_time_day_year(data_dict)
And it returns for this example:
weekday season time_of_day
0 Thursday Autumn Morning
1 Invalid Invalid Invalid
2 Friday Winter Morning
Hope it helps.
7 Comments
nickp
thanks a lot for that. I actually imported the data from a CSV, hence why I reference fields as data_dict[...]. This coding doesn't seem to work for the data_dict CSV, do you have any other ideas as to how to code it
Raphaele Adjerad
@nickp I changed my answer to match your question.
nickp
Thanks again. Just to clarify, I want an invalid output to come up if one of the values in Crash_Time is not a time. How would I code this? I have tried if type != time, which does not work
nickp
thanks so much! just one more thing, I only want the invalid output to come up for those entries that are invalid, and not the whole data frame being invalid. Is there a way to do this? At the moment, if I add in the If statement, it only returns 1 row saying INVALID, however, I would still like to be able to see all the other rows if they are not INVALID. Also, I would like to print the first 10 records from the data frame, I assume there is a way to do that?
Raphaele Adjerad
You can filter the resulting df based on values for instance if
df is the output of my example, so see only valid use df.loc[df["time_of_day"] != "invalid", :] and to see only invalid df.loc[df["time_of_day"] == "invalid", :]. If the answer suits you could you mark it as accepted? There is no obligation to do so :). |