0

I have a CSV file which looks like below. Right now it doesn't have any columns and it contains some fields I actually don't need so I need to write it into a new file based on some conditions.

!PROJECT1, OBJECT1
2020-09-10+02:00,100,HHH,SAS,RM$20,1,1
2020-09-16+02:00,200,GGG,SAS,TAKE,2020-09-16+02:00
2020-09-13+02:00,300,TTT,SAS,TAKE,2020-09-13+02:00
2020-09-11+02:00,100,HHH,SAS,RM$20,1,1

These are the conditions:

  1. I will only write the record if index[4] contains the word TAKE. If so, take index[0], [4] and [5].
  2. Index[0] and [5] need to be spitted and named in YEAR, MONTH and DAY and TD. index[4] needs to be named TYPE

I want my new file to look like this:

YEAR    MONTH    DAY    TD    TYPE    YEAR    MONTH    DAY    TD 
2020    09       16     2     TAKE     2020    09       16     2
2020    09       13     2     TAKE     2020    09       13     2

This is my code:

def filter_row(r):
    condition_1 = r[4] == 'TAKE' #<-- take only the TAKE's

with open(file_path, 'r') as my_file, open('outputfile.txt', 'w') as outer:
        reader = csv.reader(my_file, delimiter = ',')
        next(reader) #Skip the first row because it's just the header
        writer = csv.writer(outer, delimiter = '\t')
        for row in reader:            
                if filter_row(row):
                writer.writerow(row)

Right now my output file looks like this:

2020-09-16+02:00,  200,  GGG,   SAS,  TAKE,  2020-09-16+02:00
2020-09-13+02:00,  300,  TTT,   SAS,  TAKE,  2020-09-13+02:00
4
  • 1
    use pandas. much easier. Commented Sep 17, 2020 at 13:44
  • @YashShah have been told to use CSV because I have more then 35000 rows and more then 25 columns Commented Sep 17, 2020 at 13:45
  • 1
    r[4] is 'TAKE' - you should use == instead of is here. stackoverflow.com/q/1504717/3282436 Commented Sep 17, 2020 at 13:51
  • @0x5453 True, I've changed it but what about the rest? Commented Sep 17, 2020 at 13:54

2 Answers 2

1

There are two tasks here. Firstly, filter the data, secondly parse the data to get the required values.

Filtering can be done with the built-in filter function (or a list comprehension or generator expression if you prefer). For the parsing a small function can be used as we need to process two dates per row.

operator.itemgetter is used to extract data from rows efficiently.

import csv
import operator

# Get the value of the 'take' column
type_getter = operator.itemgetter(4)
# Extract the columns we want to work with
columns_getter = operator.itemgetter(0, 4, 5)


def process_datestring(datestring):
    # Assumes offset is always positive; consider
    # using re.split if it could be positive or negative.
    date, _, offset = datestring.partition('+')
    dateparts = date.split('-')
    td, *_ = offset.partition(':')
    dateparts.append(td)
    return dateparts


headers = ['YEAR', 'MONTH', 'DAY', 'TD', 'TYPE', 'YEAR', 'MONTH', 'DAY', 'TD']
with open(file_path, 'r') as my_file, open('outputfile.txt', 'w') as outer:
    reader = csv.reader(my_file, delimiter = ',')
    next(reader) #Skip the first row because it's just the header
    writer = csv.writer(outer, delimiter = '\t')

    writer.writerow(headers)

    filtered = filter(lambda r: type_getter(r) == 'TAKE', reader)
    for row in filtered:
        date1, type_, date2 = columns_getter(row)
        out_row = process_datestring(date1)
        out_row.append(type_)
        out_row.extend(process_datestring(date2))
        writer.writerow(out_row)
Sign up to request clarification or add additional context in comments.

Comments

0

You could write a parse function. Something like this:

def parser(row):
    if "TAKE" in row[4]: #Checks for your 'TAKE' statement
        year = row[0][0:4]
        month = row[0][5:7]
        ...
        year_2 = row[5][0:4]
        ...
        return [year, month, day, td, row[4], year_2, month_2,...]

And just give this whole thing to your writer:

for row in reader:
    converted = parser(row)
    writer.writerow(converted)

This should work. If you want to be more carefull, you will have to convert the string to a datetime.datetime object using strptime and then extract the dates, but if chopping the string is good enough, you could do this.

1 Comment

Its not giving the desired output. I still get all the rows and nothing is splitted.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.