
I have a large piece of data that is problematic to load entirely into memory, so I have decided to read it row by row, pick the desired data, make transformations, etc., then clear the variables and move on to the next row.

It works fine when I am using csv.reader.

    import csv

    source_file = open(path_to_source_file, encoding="utf8", newline='')
    spamreader = csv.reader(source_file, delimiter=";", quotechar='"')

    for id, row in enumerate(spamreader):
        variable1 = row[0]
        variable2 = row[1]
        variable3 = row[2]
        variable4 = row[3]
        ...
        variable25 = row[26]
        ...
        if id%1000000==0: print(f"Number of processed rows: {id}")

However, I am afraid that in the future some columns may be added/removed/swapped, and addressing them by position (row[0], row[1], ...) is not a "safe" option.

I was wondering whether it is possible to load the data with pandas, because a DataFrame allows addressing columns by name:

    for id, row in df.iterrows():
        variable1 = row['column0']
        variable2 = row['column1']
        variable3 = row['column2']
        variable4 = row['column3']
        ...
        variable25 = row['column25']
        ...
        if id%1000000==0: print(f"Number of processed rows: {id}")
        

However, I wasn't able to make it work at a comparable speed.

I have tried using chunksize=1, but even a bare pass took a very long time to execute:

with pd.read_csv(path_to_source_file, sep=";", quotechar='"', chunksize=1) as reader:
    for chunk in reader:
        pass

I also decided to try nested for loops, reading bigger chunks and then iterating over their rows:

with pd.read_csv(path_to_source_file, sep=";", quotechar='"', chunksize=10000) as reader:
    for chunk in reader:
        for id, row in chunk.iterrows():
            pass

but this is also really slow compared to the csv.reader version.

So, to summarize my question: is there any way to load data from a CSV file with pandas row by row at a speed comparable to csv.reader?

  • I hope you have seen this answer stackoverflow.com/a/55557758 Commented Mar 7, 2022 at 13:15
  • I have seen it some time ago. However, I am not trying to iterate over a dataframe to make changes to the data; instead, I am trying to prevent that big dataframe from being loaded into memory entirely in the first place. It's true that I have iterrows() in my last piece of code, but only because loading data in chunks of size 1 is even slower than that. Commented Mar 7, 2022 at 13:25

2 Answers


The csv module also provides the DictReader class.

    reader = csv.DictReader(csv_file)
    print(reader.fieldnames)

By default, column names are inferred from the first row; alternatively, you can specify them by passing a sequence as the fieldnames parameter.
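
For example, a minimal sketch of how the question's loop could look with DictReader, reusing path_to_source_file, the ";" delimiter and the placeholder column names from the question:

    import csv

    with open(path_to_source_file, encoding="utf8", newline='') as source_file:
        # DictReader reads the header row itself and yields each data row as a dict
        reader = csv.DictReader(source_file, delimiter=";", quotechar='"')
        for id, row in enumerate(reader):
            variable1 = row['column0']  # addressed by name, so column order no longer matters
            variable2 = row['column1']
            if id % 1000000 == 0:
                print(f"Number of processed rows: {id}")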

python csv module


1 Comment

Thanks for your suggestion. It works just as intended - I can assign data by column name without thinking about the order. However, sadly, it is still a little slower than csv.reader. It is still faster than anything I tried in pandas (pandas took over 5x more time), but it still costs some performance, so I have decided to read the column names manually and stick with csv.reader.

After checking @el_oso's answer suggesting DictReader, I tested how fast it works.

To summarize my test:

  • csv.reader took 1min 57s to load part of my data.
  • csv.DictReader took 2min 32s to load the same data.

Because I don't want to lose performance, I've decided on a workaround with csv.reader that reads the column names from the first row:

    import csv

    source_file = open(path_to_source_file, encoding="utf8", newline='')
    spamreader = csv.reader(source_file, delimiter=";", quotechar='"')

    for id, row in enumerate(spamreader):
        if id==0:
            # the first row is the header: map each column name to its index
            cols={}
            for col_id, column in enumerate(row):
                cols[column]=col_id
        else:
            variable1 = row[cols['column0']]
            variable2 = row[cols['column1']]
            variable3 = row[cols['column2']]
            variable4 = row[cols['column3']]
            ...
            variable25 = row[cols['column24']]
            ...
            if id%1000000==0: print(f"Number of processed rows: {id}")
  • Using this method on the same data took 1min 54s (which is similar to csv.reader).
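
For completeness, the same workaround can be written a bit more compactly by consuming the header with next() before the main loop. This is just a sketch of the identical idea (with the question's placeholder column names), not a faster method:

    import csv

    with open(path_to_source_file, encoding="utf8", newline='') as source_file:
        spamreader = csv.reader(source_file, delimiter=";", quotechar='"')
        # read the header once and map every column name to its index
        cols = {column: col_id for col_id, column in enumerate(next(spamreader))}

        for id, row in enumerate(spamreader):  # id now counts only data rows
            variable1 = row[cols['column0']]
            variable2 = row[cols['column1']]
            if id % 1000000 == 0:
                print(f"Number of processed rows: {id}")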

However, I am still open to any faster solutions if anyone has ideas.

