0

I have a column of strings/Nones that I'd like to convert to date. But, if there's an element that I can't convert to date (in my format), I'd like None there, but I also need to know it failed to convert(i.e. it's not None because it was there as None originally, but was incorrect entry to start with)

I could, of course, return say an error string in the conversion function, like:

import pandas as pd
import datetime

def convert(x):
    if x is None:
        return None
    try:
        d = datetime.datetime.strptime(x, '%Y%m%d')
    except:
        d = f"Error converting {x}"
    return d

s1 = pd.Series(["2021228", "2021228", "2021228", None, "x"])
s2 = s1.apply(convert)
print(s2)

0    2021-02-28 00:00:00
1    2021-02-28 00:00:00
2    2021-02-28 00:00:00
3                   None
4     Error converting x
dtype: object

But then I need to check for all the elements of the column that are error string, and, because I need them listed, would need to separate them into their own column/series.

Which TBH seems like a major pain to me, as it means I'd have to run map/apply or something similar again, so two passes for each column I need to verify.

Edit: And, in a more generic way, I don't like checking for a specific string in the first place. If it was an operation on strings, not dates, it'd be rather unreliable.

I have tried to return a Series from the apply, with labels value and error, but if there is an error (i.e. a None value in the value label) , all datetimes get converted into a float, and I can't figure out why (I looked into pandas' code, it marks it as mixed type, and I suspect does some conversion somewhere).

import pandas as pd
import datetime

def convert(x):
    v = None
    e = None
    if x is not None:
        try:
            v = datetime.datetime.strptime(x, '%Y%m%d')
        except:
            e = f"Error converting {x}"
    return pd.Series({"value":v, "error":e})

s1 = pd.Series(["2021228", "2021228", "2021228", None, "x"])
s2 = s1.apply(convert)
print(s2["value"])

0    1.614470e+18
1    1.614470e+18
2    1.614470e+18
3             NaN
4             NaN

Any ideas?

1 Answer 1

1

Actually, the easiest way seem to be to use transform(), with two functions, one which does the transformation and the other produces the error. I don't know whether it's the most efficient, but it works ok on the relatively small datasets I have.

something like

def value(x):
  return convert(x) if valid(x) else None

def error(x):
  return None if valid(x) else f"{x} is causing an error!"

s = s1.transform([value, error])

s["value"] then has the right values (or Nones), s["error"] has the errors.

Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.