
I have a CSV file with a field named start_date that contains data in a variety of formats.

The formats include, for example, June 23, 1912 and 5/11/1930 (month, day, year), but not all values are valid dates.

I want to add a start_date_description field next to the start_date column that flags invalid date values, and then normalize all valid date values in start_date to ISO 8601 (i.e., YYYY-MM-DD).

So far I have only been able to load the start_date column from my file. I am stuck and would appreciate any help. A solution that does not use a third-party library would be especially welcome.

import csv

date_column = "start_date"

results = []
# newline="" lets the csv module handle line endings itself
with open("test.csv", "r", newline="") as f:
    csv_reader = csv.reader(f)
    headers = next(csv_reader)
    # Match the whole column name; `in` on a string would match substrings
    date_index = headers.index(date_column)
    for row in csv_reader:
        results.append(row[date_index])

print(results)


  • Perhaps the dateparser module could help here if you don't know the exact formats of the dates you're receiving. (Commented Jul 8, 2017 at 7:20)

2 Answers


One way is to use the dateutil module. You can parse values as follows:

from dateutil import parser
parser.parse('3/16/78')
parser.parse('4-Apr') # this will give current year i.e. 2017

The parsed value can then be formatted as ISO 8601:

dt = parser.parse('3/16/78')
dt.strftime('%Y-%m-%d')
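
Since the question asks for a solution without a third-party library, here is a minimal standard-library sketch of the same parse-then-format step. It assumes the value is known to follow the month/day/two-digit-year pattern; unlike dateutil, strptime does not guess the format.

```python
from datetime import datetime

# Parse with an explicit pattern instead of letting a library guess it.
dt = datetime.strptime('3/16/78', '%m/%d/%y')

# date.isoformat() produces the YYYY-MM-DD form directly.
iso = dt.date().isoformat()
print(iso)
```

Here isoformat() gives the same string as strftime('%Y-%m-%d') would for this value.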

If your table is loaded into a pandas DataFrame, you can define a parsing function and apply it to the column as follows:

def parse_date(start_date):
    try:
        return parser.parse(start_date).strftime('%Y-%m-%d')
    except (ValueError, OverflowError, TypeError):
        # The value could not be parsed as a date
        return ''

df['parse_date'] = df.start_date.map(parse_date)

2 Comments

How would I run your example against the whole CSV file?
I updated my solution. Let me know if it works for you. I assume that your dataframe has start_date as a column.

Question: ... add a start_date_description ... normalize ... to ISO 8601

This reads the file test.csv, validates the date string in the start_date column against a list of date directive patterns, and returns a dict {description, ISO}. The returned dict is used to update the current row dict, and the updated row dict is written to the file test_update.csv.

Put this in a NEW Python File and run it!

A missing valid date directive pattern can simply be added to the list.

Python » 3.6 Documentation: 8.1.8. strftime() and strptime() Behavior

from datetime import datetime as dt
import csv
import re

def validate(date):
    def _dict(desc, date):
        return {'start_date_description':desc, 'ISO':date}

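    # Each entry is (strptime pattern, description); `fmt` avoids shadowing the builtin format()
    for fmt in [('%m/%d/%y','Valid'), ('%b-%y','Short, missing Day'), ('%d-%b-%y','Valid'),
                ('%d-%b','Short, missing Year')]: #, ('%B %d. %Y','Valid')]:
        try:
            _dt = dt.strptime(date, fmt[0])
            return _dict(fmt[1], _dt.strftime('%Y-%m-%d'))
        except ValueError:
            # Pattern did not match, try the next one
            continue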

    if not re.search(r'\d+', date):
        return _dict('No Digit', None)

    return _dict('Unknown Pattern', None)

with open('test.csv', newline='') as fh_in, open('test_update.csv', 'w', newline='') as fh_out:
    csv_reader = csv.DictReader(fh_in)
    csv_writer = csv.DictWriter(fh_out,
                                fieldnames=csv_reader.fieldnames +
                                           ['start_date_description', 'ISO'] )
    csv_writer.writeheader()

    for row, values in enumerate(csv_reader,2):
        values.update(validate(values['start_date']))

        # Show only Invalid Dates
        if any(w in values['start_date_description'] 
               for w in ['Unknown', 'No Digit', 'missing']):

            print('{:>3}: {v[start_date]:13.13} {v[start_date_description]:<22} {v[ISO]}'.
                  format(row, v=values))

        csv_writer.writerow(values)

Output:

start_date    start_date_description ISO
June 23. 1912 Valid                  1912-06-23
12/31/91      Valid                  1991-12-31
Oct-84        Short, missing Day     1984-10-01
Feb-09        Short, missing Day     2009-02-01
10-Dec-80     Valid                  1980-12-10
10/7/81       Valid                  1981-10-07
Facere volupt No Digit               None
... (omitted for brevity)
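
One thing to watch in output like the above: with %y, Python maps two-digit years 69 to 99 onto 1969 to 1999 and 0 to 68 onto 2000 to 2068, so a value such as 5/11/30 would come out as 2030. The following is only a sketch of one possible correction, assuming no start date can lie after a chosen cutoff. The helper name parse_two_digit_year and the cutoff value are hypothetical, not part of the answer's code.

```python
from datetime import datetime, date

def parse_two_digit_year(text, pattern, latest=date(2017, 12, 31)):
    """Parse text with a %y pattern; move dates after `latest` back 100 years.

    `latest` is an assumed cutoff: no valid start date is expected after it.
    """
    parsed = datetime.strptime(text, pattern).date()
    if parsed > latest:
        parsed = parsed.replace(year=parsed.year - 100)
    return parsed

print(parse_two_digit_year('12/31/91', '%m/%d/%y'))  # already in the past, unchanged
print(parse_two_digit_year('5/11/30', '%m/%d/%y'))   # would be 2030, shifted back
```

Whether such a shift is right depends on the data. If the file mixes centuries, a two-digit year cannot be resolved reliably and is better flagged in start_date_description.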

Tested with Python: 3.4.2
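
The two formats named in the question (June 23, 1912 and 5/11/1930) use a four-digit year, which the pattern list above does not cover. As a rough sketch of how such patterns could be added, the snippet below uses a hypothetical PATTERNS list and normalize function. It assumes English month names (the default C locale) and is not the code that produced the output above.

```python
from datetime import datetime

# Hypothetical pattern list for the two formats named in the question.
PATTERNS = [
    ('%B %d, %Y', 'Valid'),  # e.g. June 23, 1912 (month name is locale-dependent)
    ('%m/%d/%Y', 'Valid'),   # e.g. 5/11/1930
]

def normalize(text):
    """Return (description, ISO date or None) for one start_date value."""
    for pattern, description in PATTERNS:
        try:
            parsed = datetime.strptime(text.strip(), pattern)
        except ValueError:
            continue  # pattern did not match, try the next one
        return description, parsed.date().isoformat()
    return 'Unknown Pattern', None

for sample in ['June 23, 1912', '5/11/1930', 'not a date']:
    print(sample, '->', normalize(sample))
```

Order matters when patterns overlap: the first pattern that matches wins, so more specific patterns should come first.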

3 Comments

I ran your code and it didn't work; it keeps saying "_data" is not defined. I placed your code directly below mine, then ran it. Any suggestions?
The file name is on the image I uploaded. Should I change _data to the file name?
I ran it again and only what you placed in sDate was printed to the console; the actual file wasn't evaluated.
