
I am reading the same text file twice via read_csv. The first pass collects the keys from 'Col1' for rows whose 'Col6' matches a specific string (MSG), giving me a dataframe with only those entries. The second pass reads the same file again (with read_csv) and prints more columns for every row whose 'Col1' value (key2) equals one of those keys (key1).

I basically have two questions: 1. Can I combine both reads (read_csv calls) into one? 2. Even if I keep the two read_csv calls separate, how can I read multiple files? Right now I am reading only one file (firstFile.txt), but I would like to replace the file name with '*.txt' so that the read_csv operations are performed on all *.txt files in the directory.

The data file looks like the following. I want to print all rows with Col1=12345, because that key appears in a row where Col6 has the value 'This is a test'.

Col1  Col2    Col3    Col4    Col5    Col6
-       -       -       -       -       -
54321 544     657     888     4476    -
12345 345     456     789     1011    'This is a test'
54321 644     857     788     736     -
54321 744     687     898     7436    -
12345 365     856     789     1020    -
12345 385     956     689     1043    -
12345 385     556     889     1055    -
65432 444     676     876     4554    -
-     -       -       -       -       -
54321 544     657     888     776     -
12345 345     456     789     1011    -
54321 587     677     856     7076    -
12345 345     456     789     1011    -
65432 444     676     876     455     -
12345 345     456     789     1011    -
65432 447     776     576     4055    -
-     -       -       -       -       -   
65432 434     376     576     4155    -

The script that I used is:

import glob
import pandas as pd

# Col6 has to be among the columns read in the first pass, since the filter tests it
DL_fields1 = ['Col1', 'Col6']
DL_fields2 = ['Col1', 'Col2', 'Col3', 'Col4', 'Col5', 'Col6']

MSG = 'This is a test'

# The file is whitespace-delimited and Col6 values are quoted with single quotes;
# skiprows=[1] skips the dashed line below the header (skiprows=1 would skip the header itself)
iter_csv = pd.read_csv('firstFile.txt', delim_whitespace=True, quotechar="'",
                       chunksize=1000, usecols=DL_fields1, skiprows=[1])
df = pd.concat([chunk[chunk['Col6'] == MSG] for chunk in iter_csv])

for i, row in df.iterrows():
    key1 = row['Col1']
    # second pass: re-read the file one row at a time and compare keys
    for line in pd.read_csv('firstFile.txt', delim_whitespace=True, quotechar="'",
                            chunksize=1, usecols=DL_fields2, skiprows=[1]):
        key2 = line.iloc[0]['Col1']
        if key2 == '-':  # skip the dashed separator rows
            continue
        elif int(key1) == int(key2):
            print(line)
  • I think you should simply read the entire CSV at the beginning and use pandas operations to filter out the rows you need thereafter. It does not make much sense to read the same file again, let alone line by line. Commented Jan 31, 2019 at 7:05
  • I don't quite get what you're looking for; can you clarify? Commented Jan 31, 2019 at 7:16
  • I want the following output: Commented Jan 31, 2019 at 15:02
    12345 345 456 789 1011 'This is a test'
    12345 365 856 789 1020 -
    12345 345 456 789 1011 -
    12345 345 456 789 1011 -
    12345 365 856 789 1020 -
    12345 385 956 689 1043 -
    12345 385 556 889 1055 -
    12345 345 456 789 1011 -

1 Answer


As I understand it, you do not need to read the CSV file twice. You essentially want all the rows where MSG occurs in Col6, and you can get them with a one-line filter -

MSG = 'This is a test'
# Read the whole file once (no chunksize, so read_csv returns a DataFrame, not an iterator).
# delim_whitespace/quotechar match the file format, skiprows=[1] drops the dashed line
# under the header, and na_values='-' turns the dash placeholders into NaN so Col1 is numeric.
df_all = pd.read_csv('firstFile.txt', delim_whitespace=True, quotechar="'",
                     usecols=DL_fields2, skiprows=[1], na_values='-')
# this gives you all the rows where MSG occurs in Col6
df = df_all.loc[df_all['Col6'] == MSG, :]
# this gives you all the rows where 12345 is in Col1
df_12345 = df_all.loc[df_all['Col1'] == 12345, :]

You can create multiple subsets of the data this way.
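For example, a minimal sketch (reusing df_all and MSG from the snippet above; the key values come from the sample data): boolean masks can be combined with & and |, and isin tests membership in a set of keys -

# rows where Col1 is 12345 AND Col6 carries the message; parentheses are
# required around each condition because & binds tighter than ==
both = df_all.loc[(df_all['Col1'] == 12345) & (df_all['Col6'] == MSG), :]

# rows where Col1 is any of several keys
either_key = df_all.loc[df_all['Col1'].isin([12345, 65432]), :]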


To answer the second part of your question, you can loop over all text files like so -

import glob
txt_files = glob.glob("test/*.txt")
for file in txt_files:
    # read_csv takes the path directly, so no explicit open() is needed
    some_df = pd.read_csv(file)
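Each some_df is overwritten on the next pass, so in practice you would either process it inside the loop or collect the frames in a list; the edit below does the latter.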

EDIT: This is how you loop over the files and find, in each one, all the rows with Col1=12345 and Col6=MSG -

import glob

results_list = []
MSG = 'This is a test'

txt_files = glob.glob("test/*.txt")
for file in txt_files:
    some_df = pd.read_csv(file, delim_whitespace=True, quotechar="'",
                          usecols=DL_fields2, skiprows=[1], na_values='-')
    # rows of this file where MSG occurs in Col6
    df = some_df.loc[some_df['Col6'] == MSG, :]
    # keep the ones that also have 12345 in Col1;
    # results_list is a list of one such dataframe per file
    results_list.append(df.loc[df['Col1'] == 12345, :])

# All results in one big dataframe (pd.concat takes the whole list,
# so functools.reduce is not needed)
result_df = pd.concat(results_list)
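
If you would rather not hard-code 12345, here is a small sketch of one way to generalize (using df_all from the first snippet; the same idea works per file inside the loop): first collect every Col1 key that occurs together with MSG, then use isin to pull all rows carrying any of those keys -

# every key whose row carries the message
keys = df_all.loc[df_all['Col6'] == MSG, 'Col1'].unique()
# every row whose Col1 appeared at least once together with MSG;
# this is exactly the output listed in the comments above
matching_rows = df_all.loc[df_all['Col1'].isin(keys), :]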

2 Comments

Please feel free to accept this answer if this solution works for you.
Can the first snippet you suggested be put in a loop so that it is executed for multiple files instead of just one file? I mean, can I replace 'firstFile.txt' with '*.txt' in that read_csv call?
