I have a pandas data frame in which one column contains time intervals as strings, such as 'P1Y4M1D'.

An example of the CSV:

oci,citing,cited,creation,timespan,journal_sc,author_sc
0200100000236252421370109080537010700020300040001-020010000073609070863016304060103630305070563074902,"10.1002/pol.1985.170230401","10.1007/978-1-4613-3575-7_2",1985-04,P2Y,no,no
...

I wrote a parsing function that takes a string like 'P1Y4M1D' and returns an integer. How can I replace all the values in the column with the parsed values using that function?

import re
import pandas as pd


def do_process_citation_data(f_path):
    global my_ocan

    # Read from the path passed in. 'timespan' holds durations such as 'P2Y',
    # not dates, so it is left out of parse_dates.
    my_ocan = pd.read_csv(f_path,
                          names=['oci', 'citing', 'cited', 'creation', 'timespan', 'journal_sc', 'author_sc'],
                          parse_dates=['creation'])
    my_ocan = my_ocan.iloc[1:]  # drop the first (header) row; iloc selects rows by position
    my_ocan['creation'] = pd.to_datetime(my_ocan['creation'], format="%Y-%m-%d", yearfirst=True)

    return my_ocan


def parse():
    # Map each OCI to its timespan string. The index starts at 1 because the
    # header row was dropped with iloc[1:].
    mydict = dict()
    mydict2 = dict()
    i = 1
    for x in my_ocan['oci']:
        mydict[x] = str(my_ocan['timespan'][i])
        i += 1
    print(mydict)
    for key, value in mydict.items():
        is_negative = value.startswith('-')
        # Strip a leading '-' before matching, then capture years, months and days.
        date_info = re.findall(r"P(?:(\d+)Y)?(?:(\d+)M)?(?:(\d+)D)?$",
                               value[1:] if is_negative else value)
        year, month, day = [int(num) if num else 0 for num in date_info[0]] if date_info else [0, 0, 0]
        daystotal = (year * 365) + (month * 30) + day
        # Note: returning here exits on the first item, so only one value is ever parsed.
        if not is_negative:
            #mydict2[key] = daystotal
            return daystotal
        else:
            #mydict2[key] = -daystotal
            return -daystotal
    #print(mydict2)
    #return mydict2

I probably do not even need to replace the whole column with parsed values. The final goal is to write a new function that returns the average ['timespan'] of the documents created in a particular year. Since I need the parsed values for that, I thought it would be easier to convert the whole column and work with the new data frame.

I am also curious how to apply the parsing function to each ['timespan'] row without modifying the data frame. I assume it could be something like this, but I do not fully understand how to do it:

      for x in my_ocan['timespan']:
          x = parse(str(x))

How can I get a column with new values? Thank you! Peace :)

  • df['timespan'].apply(parse)? You should change your parse function to work on a single value, though, i.e. take a timespan string like 'P1Y4M1D' as its input. Commented May 19, 2020 at 11:11
  • @Dan Thank you!:) Commented May 19, 2020 at 13:48

1 Answer

A df['timespan'].apply(parse) (as mentioned by @Dan) should work. You only need to modify the parse function so that it receives a single string as its argument and returns the parsed value at the end. The same pattern on a different column looks like this:

import pandas as pd

def parse_postal_code(postal_code):
    # Splitting postal code and getting first letters
    letters = postal_code.split('_')[0]
    return letters


# Example dataframe with three columns and three rows
df = pd.DataFrame({'Age': [20, 21, 22], 'Name': ['John', 'Joe', 'Carla'], 'Postal Code': ['FF_222', 'AA_555', 'BB_111']})

# This returns a new pd.Series
print(df['Postal Code'].apply(parse_postal_code))

# Can also be assigned to another column
df['Postal Code Letter'] = df['Postal Code'].apply(parse_postal_code)

print(df['Postal Code Letter'])
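
Applied to the data in the question, the same idea could look like the sketch below. It is only an illustration under a few assumptions: parse_timespan and the three sample rows are made up for this example, the 365-day year and 30-day month approximation is taken from the question's own function, and the year is read from the first four characters of 'creation' so that values like '1985-04' do not need a full date format.

```python
import re
import pandas as pd

# Hypothetical helper: handles durations such as 'P1Y4M1D' or '-P2Y'.
DURATION_RE = re.compile(r"^(-)?P(?:(\d+)Y)?(?:(\d+)M)?(?:(\d+)D)?$")


def parse_timespan(value):
    """Return the duration as an approximate number of days."""
    match = DURATION_RE.match(str(value))
    if match is None:
        return 0  # same fallback as in the question: unparsable means 0 days
    sign, years, months, days = match.groups()
    total = int(years or 0) * 365 + int(months or 0) * 30 + int(days or 0)
    return -total if sign else total


# Made-up sample rows in the shape of the question's CSV.
df = pd.DataFrame({
    "creation": ["1985-04", "1985-11", "1990-01"],
    "timespan": ["P2Y", "P1Y4M1D", "-P3M"],
})

# apply returns a new Series, so df itself is left unchanged.
days = df["timespan"].apply(parse_timespan)

# Average timespan (in days) per creation year.
years = df["creation"].str[:4].astype(int)
avg = days.groupby(years).mean()
print(avg)
```

Assigning the result back with df['timespan_days'] = days keeps the original strings next to the parsed values, which can help when checking the parser against the raw data.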