How would I normalize dates in a csv file? python

Question

I have a CSV file with a field named start_date that contains data in a variety of formats.

The formats include, for example, June 23, 1912 and 5/11/1930 (month/day/year), but not all values are valid dates.

I want to add a start_date_description field adjacent to the start_date column and move the invalid date values into it. Then I want to normalize all valid date values in start_date to ISO 8601 (i.e., YYYY-MM-DD).

So far I have only been able to load the start_date column from my file. I am stuck and would appreciate any help. A solution that does not use a library would be especially welcome!

import csv

date_column = "start_date"

headers = None
results = []
with open("test.csv", "r") as f:
    csv_reader = csv.reader(f)
    for row in csv_reader:
        if headers is None:
            # First row: record the index of every column named start_date
            headers = [i for i, col in enumerate(row) if col == date_column]
        else:
            results.append([row[i] for i in headers])

print(results)

Perhaps the dateparser module could help here if you don't know the exact formats of the dates you're receiving.
– Tim Pietzcker, Commented Jul 8, 2017 at 7:20

titipata · Accepted Answer · 2017-07-11 21:25:28Z

4

One way is to use the dateutil module. You can parse dates as follows:

from dateutil import parser
parser.parse('3/16/78')
parser.parse('4-Apr') # this will give current year i.e. 2017

The parsed result can then be formatted as ISO 8601 with strftime:

dt = parser.parse('3/16/78')
dt.strftime('%Y-%m-%d')

Suppose your table is in a pandas DataFrame. You can then define a parsing function and apply it to the column as follows:

def parse_date(value):
    try:
        return parser.parse(value).strftime('%Y-%m-%d')
    except (ValueError, OverflowError, TypeError):
        # Unparseable or non-string value: leave the normalized date empty
        return ''

df['parse_date'] = df.start_date.map(parse_date)
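
For the library-free route the question asks about, a minimal sketch using only datetime.strptime might look like the following. The names KNOWN_FORMATS and split_date are hypothetical, and the two formats are assumptions taken from the examples in the question (June 23, 1912 and 5/11/1930); real data would likely need more entries. A value that matches no format is returned as the description, so it can be written to start_date_description.

```python
from datetime import datetime

# Assumed formats, based only on the two examples in the question.
# Note: %B matches month names in the current locale (English assumed here).
KNOWN_FORMATS = ("%B %d, %Y", "%m/%d/%Y")


def split_date(value):
    """Return (iso_date, description) for one raw start_date value.

    Valid dates come back as (YYYY-MM-DD, ""); anything else comes back
    as ("", original_text) so the caller can store it separately.
    """
    text = value.strip()
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat(), ""
        except ValueError:
            # This format did not match; try the next one
            continue
    return "", text


for raw in ["June 23, 1912", "5/11/1930", "Facere voluptas"]:
    print(raw, "->", split_date(raw))
```

Unlike dateutil, this only accepts the formats listed explicitly, which is stricter but avoids silently guessing at ambiguous values.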

edited Jul 11, 2017 at 21:25

answered Jul 8, 2017 at 8:24

titipata


2 Comments

Vash Over a year ago

How would I run your example against the whole CSV file?

titipata Over a year ago

I updated my solution. Let me know if it works for you. I assume that your dataframe has start_date as a column.

stovfl · Accepted Answer · 2017-07-11 20:50:23Z

1

Question: ... add a start_date_description ... normalize ... to ISO 8601

This reads the file test.csv and validates the date string in the start_date column against a list of strptime format patterns, returning a dict with the keys start_date_description and ISO. The returned dict is used to update the current row dict, and the updated row is written to the file test_update.csv.

Put this in a NEW Python File and run it!

A missing valid date pattern can simply be added to the list.

Python » 3.6 Documentation: 8.1.8. strftime() and strptime() Behavior

import csv
import re
from datetime import datetime as dt

def validate(date):
    def _dict(desc, iso):
        return {'start_date_description': desc, 'ISO': iso}

    # (strptime pattern, description) pairs, tried in order
    for pattern, desc in [('%m/%d/%y', 'Valid'), ('%b-%y', 'Short, missing Day'),
                          ('%d-%b-%y', 'Valid'),
                          ('%d-%b', 'Short, missing Year')]: #, ('%B %d. %Y','Valid')]:
        try:
            _dt = dt.strptime(date, pattern)
            return _dict(desc, _dt.strftime('%Y-%m-%d'))
        except ValueError:
            # The pattern did not match; try the next one
            continue

    if not re.search(r'\d+', date):
        return _dict('No Digit', None)

    return _dict('Unknown Pattern', None)

# newline='' lets the csv module handle line endings itself
with open('test.csv', newline='') as fh_in, \
        open('test_update.csv', 'w', newline='') as fh_out:
    csv_reader = csv.DictReader(fh_in)
    csv_writer = csv.DictWriter(fh_out,
                                fieldnames=csv_reader.fieldnames +
                                           ['start_date_description', 'ISO'])
    csv_writer.writeheader()

    # Start counting at 2 because line 1 of the file is the header
    for row, values in enumerate(csv_reader, 2):
        values.update(validate(values['start_date']))

        # Show only Invalid Dates
        if any(w in values['start_date_description']
               for w in ['Unknown', 'No Digit', 'missing']):

            print('{:>3}: {v[start_date]:13.13} {v[start_date_description]:<22} {v[ISO]}'.
                  format(row, v=values))

        csv_writer.writerow(values)

Output:

start_date    start_date_description ISO
June 23. 1912 Valid                  1912-06-23
12/31/91      Valid                  1991-12-31
Oct-84        Short, missing Day     1984-10-01
Feb-09        Short, missing Day     2009-02-01
10-Dec-80     Valid                  1980-12-10
10/7/81       Valid                  1981-10-07
Facere volupt No Digit               None
... (omitted for brevity)
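
One caveat with the %y patterns used above: strptime maps two-digit years 69-99 to 1969-1999 and 00-68 to 2000-2068, so a value such as 5/11/30 would come out as 2030 rather than 1930. If the data is known to contain no future dates, a sketch like the following could shift such results back a century. The function name parse_two_digit_year and the cutoff date are assumptions for illustration, not part of the answer's code.

```python
from datetime import date, datetime


def parse_two_digit_year(text, fmt, latest=date(2017, 12, 31)):
    """Parse text with a %y-based format and return an ISO 8601 string.

    `latest` is an assumed cutoff: any parsed date after it is taken to
    belong to the previous century and is moved back 100 years.
    """
    parsed = datetime.strptime(text, fmt).date()
    if parsed > latest:
        # With the default cutoff, a 29 February always lands on another
        # leap year; other cutoffs may need extra handling here.
        parsed = parsed.replace(year=parsed.year - 100)
    return parsed.isoformat()


print(parse_two_digit_year("5/11/30", "%m/%d/%y"))
print(parse_two_digit_year("12/31/91", "%m/%d/%y"))
print(parse_two_digit_year("Feb-09", "%b-%y"))
```

Whether a cutoff like this is appropriate depends on the data; values that could legitimately fall in either century are better flagged in start_date_description than guessed.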

Tested with Python: 3.4.2
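
The code above appends the new columns at the end of each row, whereas the question asks for start_date_description to sit next to start_date. A minimal sketch of one way to do that is to build the fieldnames list with the new name inserted right after start_date. The helper adjacent_fieldnames and the in-memory sample data are hypothetical, and the 'Valid' value is a placeholder for the result of validate().

```python
import csv
import io


def adjacent_fieldnames(fieldnames, after="start_date",
                        new="start_date_description"):
    """Return a copy of fieldnames with `new` inserted right after `after`."""
    names = list(fieldnames)
    names.insert(names.index(after) + 1, new)
    return names


# In-memory stand-in for test.csv (made-up sample row)
sample = "id,start_date,city\n1,12/31/91,Oslo\n"
reader = csv.DictReader(io.StringIO(sample))

out = io.StringIO()
writer = csv.DictWriter(out,
                        fieldnames=adjacent_fieldnames(reader.fieldnames),
                        lineterminator="\n")
writer.writeheader()
for row in reader:
    # Placeholder description; in practice this would come from validate()
    row["start_date_description"] = "Valid"
    writer.writerow(row)

print(out.getvalue())
```

DictWriter writes columns in the order given by fieldnames, so the position of the key in each row dict does not matter.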

edited Jul 11, 2017 at 20:50

answered Jul 8, 2017 at 20:42

stovfl


3 Comments

Vash Over a year ago

I ran your code and it didn't work; it keeps saying "_data" is not defined. I placed your code directly below mine and then ran it. Any suggestions?

Vash Over a year ago

The file name is in the image I uploaded. Should I change _data to the file name?

Vash Over a year ago

I ran it again and only what you placed in sDate was printed to the console; the actual file wasn't evaluated.
