5

If I have an input.txt file:

apples    grapes    alpha   pears
chicago paris london 
yellow    blue      red
+++++++++++++++++++++
apples    grapes    beta   pears
chicago paris london 
car   truck  van
+++++++++++++++++++
apples    grapes    gamma   pears
chicago paris london 
white  purple   black
+++++++++++++++++++
apples    grapes    delta   pears
chicago paris london 
car   truck  van

I want to find all rows containing truck as the 2nd string, then return the 3rd string from the row two lines above.

Output would be:

beta
delta

So far, I have this code, which finds the rows I'd like and then creates a DataFrame from the list. What is the best way to continue using pandas and get the value I need from two rows above (row -2)?

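import pandas as pd

data_list = []

with open('input.txt', 'r') as data:
    for line in data:
        split_row = line.split()
        # keep only rows whose 2nd field is "truck"
        if len(split_row) > 1 and split_row[1] == "truck":
            data_list.append(split_row)

df = pd.DataFrame(data_list)

# to_string is a method, so it has to be called
print(df.to_string())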
2
  • Use iter() and next() to select (up to) 4 rows at a time. Inspect the 3rd row to decide whether to append the 1st to a list. Commented Aug 20 at 20:57
  • What do you expect the output to be from your supplied input? Also, are the '+++++++++++++++++++' lines actual input lines? Commented Aug 20 at 22:16
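
The suggestion in the first comment could be sketched roughly as follows. This is only an illustration, and it assumes the file really consists of fixed 4-line blocks (three data lines plus a separator), which the asker later says is not always the case. The function name `third_field_before_truck` and the `block_size` parameter are made up for this example, and `itertools.islice` stands in for explicit `next()` calls.

```python
from itertools import islice

SAMPLE = """\
apples    grapes    alpha   pears
chicago paris london
yellow    blue      red
+++++++++++++++++++++
apples    grapes    beta   pears
chicago paris london
car   truck  van
+++++++++++++++++++
apples    grapes    gamma   pears
chicago paris london
white  purple   black
+++++++++++++++++++
apples    grapes    delta   pears
chicago paris london
car   truck  van
"""


def third_field_before_truck(lines, block_size=4):
    """Hypothetical helper: read block_size lines at a time and, when the
    3rd line of a block has "truck" as its 2nd field, keep the 3rd field
    of the block's 1st line."""
    it = iter(lines)
    results = []
    while True:
        block = list(islice(it, block_size))  # islice pulls items via next()
        if not block:
            break  # iterator exhausted
        if len(block) < 3:
            continue  # incomplete trailing block, nothing to inspect
        first, third = block[0].split(), block[2].split()
        if len(third) > 1 and third[1] == "truck" and len(first) > 2:
            results.append(first[2])
    return results


print(third_field_before_truck(SAMPLE.splitlines()))
```

On the sample input from the question this prints ['beta', 'delta']. If the blocks are not all the same length, the chunks drift out of alignment, so a sliding-window approach is the safer choice there.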

3 Answers

3

Here's a version that uses collections.deque to handle the look-behind window. I have tried it on a variety of inputs. I didn't bother with the pandas conversion, since I didn't see the point, but the resulting list can easily be converted to a DataFrame.

I'd originally written this in gawk, as it's only a few lines there, but in Python it takes a few more.

cat yodish.py
#!/usr/bin/env python
import sys
from collections import deque


def main():
    if len(sys.argv) < 2:
        print("Usage: script_name.py inputFile", file=sys.stderr)
        sys.exit(1)

    MAXWINDOW = 3   # how many rows to hold for the look-behind
    window = deque(maxlen=MAXWINDOW)
    results = []
    with open(sys.argv[1], 'r') as inputF:
        for record in inputF:
            fields = record.strip().split()
            if len(fields) < 2:  # skip this record as not enough fields
                continue
    
            window.append(record)
    
            # second field is "truck"
            if len(window) == MAXWINDOW and fields[1] == "truck":
                record_to_process = window[0]
    
                fields_to_process = record_to_process.split()
    
                if len(fields_to_process) >= 3:
                    results.append(fields_to_process[2])

    if results:
        print(results)
        # ... can convert this list to a df if you need to ... 
        # df = pd.DataFrame(results, columns=['lookbehind'])


if __name__ == "__main__":
    main()

# (chmod +x) to make the script 'runnable' 

./yodish.py input.txt  #using your input .... 
['beta', 'delta']

1 Comment

I stripped this down a bit, to get it to work; but, it's got me on the right track; thanks!
3

Maybe this variant with grouping will do the job?

df = df.assign(grp=df[0].str.contains(r"\++").cumsum())
res = df.groupby("grp").apply(lambda x: x.iloc[-3,2] 
                              if "truck"  in x[1].values
                              else None,
                              include_groups=False).dropna()

UPDATE

It occurred to me that the word "truck" is not guaranteed to be on the last row of a block, so a more versatile solution would be:

idx = []
df = df.assign(grp=df[0].str.contains(r"\++").cumsum())
df.groupby("grp").apply(lambda x: idx.extend((x.index[x.iloc[:,1].eq("truck")]-2).tolist()),
                        include_groups=False) 
res = df.iloc[idx, 2].values
print(res)

1 Comment

I appreciate the solution!
2

A possible solution, based on pandas, which requires reading the text into a DataFrame:

# from io import StringIO
#
# text = """
# apples    grapes    alpha   pears
# chicago paris london 
# yellow    blue      red
# +++++++++++++++++++++
# apples    grapes    beta   pears
# chicago paris london 
# car   truck  van
# +++++++++++++++++++
# apples    grapes    gamma   pears
# chicago paris london 
# white  purple   black
# +++++++++++++++++++
# apples    grapes    delta   pears
# chicago paris london 
# car   truck  van
# """

# df = pd.read_csv(StringIO(text), sep=r'\s+', header=None)

out = df.set_axis(df.index % 4)
out.loc[0, 2].loc[out.loc[2, 1].eq('truck').values].to_list()

This works by first relabeling the DataFrame's index with a repeating 0–3 pattern, using set_axis, so that each 4-line block is labeled consistently (row label 0 for the header line, and so on). It then selects the candidate lines with out.loc[2, 1] and uses eq to check which of them have truck as the second entry. The resulting boolean mask is aligned positionally via .values and applied to the corresponding entries two lines above, obtained as out.loc[0, 2]. Finally, the matches are converted to a list with to_list.

Output:

['beta', 'delta']
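
To make the intermediate steps visible, here is a minimal sketch of the same idea that prints each piece. It builds the DataFrame from split lines instead of read_csv (an assumption made here so the example is self-contained), and it still relies on strict 4-line blocks. The names `candidates`, `mask` and `targets` are illustrative only.

```python
import pandas as pd

text = """\
apples    grapes    alpha   pears
chicago paris london
yellow    blue      red
+++++++++++++++++++++
apples    grapes    beta   pears
chicago paris london
car   truck  van
+++++++++++++++++++
apples    grapes    gamma   pears
chicago paris london
white  purple   black
+++++++++++++++++++
apples    grapes    delta   pears
chicago paris london
car   truck  van
"""

# Split every non-blank line into fields; shorter rows are padded with None.
df = pd.DataFrame([line.split() for line in text.splitlines() if line.strip()])

out = df.set_axis(df.index % 4)        # row labels repeat 0, 1, 2, 3, 0, 1, ...
candidates = out.loc[2, 1]             # 2nd field of the 3rd line of each block
mask = candidates.eq("truck").values   # plain boolean array, used positionally
targets = out.loc[0, 2]                # 3rd field of the 1st line of each block
result = targets.loc[mask].to_list()

print(mask.tolist())
print(result)
```

The mask has one entry per block, which is why it can be applied directly to `targets`. That one-to-one correspondence is exactly what breaks when blocks are not all 4 lines long.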

7 Comments

I'll give it a shot, thanks! Is there any chance this could be implemented without read_csv? I am unable to use that and must loop through the input to create the DataFrame and/or list.
Also, the larger dataset that I'm working with does not consistently come in blocks of 4.
Maybe my code can be adapted to your setting. How do you get the data? Do the data come in slots?
It's fairly structured; but, there are instances where it is not (and there are multiple 'truck' rows instead of 1)
I suspect my approach still works in that setting. I can have a look if you post a realistic dataset.
line 391, in get_loc return self._range.index(new_key) ValueError: 2 is not in range (I'm still looking, I appreciate your solution!)
Without a real piece of your dataset, it is impossible to adapt my code to your needs. But maybe the other solutions here will work for you!
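
For the constraints raised in these comments (no read_csv, blocks that are not always 4 lines long), one alternative that none of the answers above uses is Series.shift, which looks a fixed number of rows back regardless of block boundaries. The following is only a sketch: the asker's real data is not shown, so it assumes that "two lines above" means two non-blank physical lines above, separators included. The names `lookbehind_with_shift` and `offset` are hypothetical.

```python
import pandas as pd


def lookbehind_with_shift(lines, offset=2):
    """Hypothetical helper: for each line whose 2nd field is "truck",
    return the 3rd field of the line `offset` rows earlier."""
    # Build the rows by looping over the input, without read_csv.
    rows = [line.split() for line in lines if line.strip()]
    # reindex makes sure columns 0..2 exist even if every row is short.
    df = pd.DataFrame(rows).reindex(columns=range(3))
    is_truck = df[1].eq("truck")
    above = df[2].shift(offset)  # value from `offset` rows earlier
    # dropna discards matches with no usable value two rows up.
    return above[is_truck].dropna().tolist()


# Blocks of different lengths (3 lines, then 4 lines after the separator).
lines = [
    "apples grapes beta pears",
    "chicago paris london",
    "car truck van",
    "+++++++++++++++++++",
    "extra line here",
    "apples grapes delta pears",
    "chicago paris london",
    "car truck van",
]
print(lookbehind_with_shift(lines))
```

With this input the call prints ['beta', 'delta'], even though the second block has an extra line. If a block contains several truck rows, each one is looked up independently, so the result is whatever sits two lines above each of them.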
