I am reading the same text file twice via read_csv. The first pass builds a DataFrame containing only those rows whose 'Col6' matches a specific string (MSG). The second pass reads the same file again (also with read_csv) and prints additional columns whenever key1 == key2, where both keys come from 'Col1'.
I have basically two questions:
1. Can I combine both searches (read_csv) together?
2. Even if I keep the two read_csv calls separate, how can I read multiple files? Right now I read only one file (firstFile.txt), but I would like to replace the file name with '*.txt' so that the read_csv operations run over every *.txt file in the directory.
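For question 2, the glob module does exactly this: it expands a shell-style wildcard into a list of paths that can each be handed to read_csv in turn. A minimal sketch (run from the directory holding the .txt files):

```python
import glob

# glob.glob expands the wildcard into a list of matching file paths;
# sorted() just makes the processing order deterministic
for path in sorted(glob.glob('*.txt')):
    print(path)  # here you would call pd.read_csv(path, ...) instead
```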
The data file looks like the following. I want to print all rows with Col1=12345, since Col6 has the value 'This is a test' in one of those rows.
Col1  Col2 Col3 Col4 Col5  Col6
-     -    -    -    -     -
54321 544  657  888  4476  -
12345 345  456  789  1011  'This is a test'
54321 644  857  788  736   -
54321 744  687  898  7436  -
12345 365  856  789  1020  -
12345 385  956  689  1043  -
12345 385  556  889  1055  -
65432 444  676  876  4554  -
-     -    -    -    -     -
54321 544  657  888  776   -
12345 345  456  789  1011  -
54321 587  677  856  7076  -
12345 345  456  789  1011  -
65432 444  676  876  455   -
12345 345  456  789  1011  -
65432 447  776  576  4055  -
-     -    -    -    -     -
65432 434  376  576  4155  -
The script that I used is:
import csv
import pandas as pd
import os
import glob
DL_fields1 = ['Col1', 'Col2', 'Col6']  # 'Col6' must be read so it can be filtered on
DL_fields2 = ['Col1', 'Col2', 'Col3', 'Col4', 'Col5', 'Col6']
MSG = 'This is a test'

# First pass: keep only the chunks' rows whose Col6 carries the message
iter_csv = pd.read_csv('firstFile.txt', chunksize=1000, usecols=DL_fields1, skiprows=1)
df = pd.concat([chunk[chunk['Col6'] == MSG] for chunk in iter_csv])

# Second pass: re-read the whole file, one row per chunk, and print every
# row whose Col1 matches a key found in the first pass
for i, row in df.iterrows():
    key1 = row['Col1']
    j = 0
    for line in pd.read_csv('firstFile.txt', chunksize=1, usecols=DL_fields2, skiprows=1):
        key2 = line.loc[j, 'Col1']
        j = j + 1
        if key2 == '-':  # separator row, skip it
            continue
        elif int(key1) == int(key2):
            print(line)
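For question 1, both searches can be collapsed into a single read per file: read the file once, collect the Col1 keys of the rows where Col6 equals MSG, and then select every row sharing one of those keys with isin. A minimal sketch, assuming a comma-separated file with a header row (the separator and header handling would need to match the real file format, and the file names here are hypothetical):

```python
import glob
import pandas as pd

MSG = 'This is a test'
COLS = ['Col1', 'Col2', 'Col3', 'Col4', 'Col5', 'Col6']

def find_matches(path):
    # Single pass: read the file once, then filter the DataFrame in memory.
    # na_values='-' turns the '-' placeholders into NaN.
    df = pd.read_csv(path, usecols=COLS, na_values='-')
    # Col1 keys of every row whose Col6 carries the message
    keys = df.loc[df['Col6'] == MSG, 'Col1'].dropna().unique()
    # All rows, anywhere in the file, that share one of those keys
    return df[df['Col1'].isin(keys)]

# Process every .txt file in the directory instead of a single hard-coded name
for path in glob.glob('*.txt'):
    print(find_matches(path))
```

This avoids re-reading the file once per matched row, which in the original script makes the second pass quadratic in the number of rows.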