Pandas - return the -2 row

Question

If I have an input.txt file:

apples    grapes    alpha   pears
chicago paris london 
yellow    blue      red
+++++++++++++++++++++
apples    grapes    beta   pears
chicago paris london 
car   truck  van
+++++++++++++++++++
apples    grapes    gamma   pears
chicago paris london 
white  purple   black
+++++++++++++++++++
apples    grapes    delta   pears
chicago paris london 
car   truck  van

I want to find all rows containing truck as the 2nd string, then return the 3rd string from the row two lines above.

Output would be:

beta
delta

So far, I have this code that finds the row I'd like, then creates a dataframe from the list. What is the best way to continue using Pandas, and get the -2 row/value that I need?

import pandas as pd

data_list = []

with open('input.txt', 'r') as data:

    for line in data:
        split_row = line.split()
        if len(split_row) > 1 and split_row[1] == "truck":
            data_list.append(split_row)

df = pd.DataFrame(data_list)

print(df.to_string())

Use iter() and next() to select (up to) 4 rows at a time. Inspect the 3rd row to determine whether to append the first to a list.
– JonSG, Commented Aug 20 at 20:57
What do you expect the output to be from your supplied input? Also, are the '+++++++++++++++++++' lines actual input lines?
– ticktalk, Commented Aug 20 at 22:16

ticktalk · Accepted Answer · 2025-08-21 10:02:04Z

Here's a version using collections.deque to handle the look-behind window. I have tried it on a variety of inputs. I didn't bother with the pandas conversion, as I didn't see the point, but the list produced can easily be converted to a dataframe.

I'd originally written this in gawk, as it's only a few lines there; the Python version takes a few more.

cat yodish.py
#!/usr/bin/env python
import sys
from collections import deque


def main():
    if len(sys.argv) < 2:
        print("Usage: script_name.py inputFile", file=sys.stderr)
        sys.exit(1)

    MAXWINDOW = 3   # how many rows to hold for the look-behind
    window = deque(maxlen=MAXWINDOW)
    results = []
    with open(sys.argv[1], 'r') as inputF:
        for record in inputF:
            fields = record.strip().split()
            if len(fields) < 2:  # skip this record as not enough fields
                continue
    
            window.append(record)
    
            # second field is "truck"
            if len(window) == MAXWINDOW and fields[1] == "truck":
                record_to_process = window[0]
    
                fields_to_process = record_to_process.split()
    
                if len(fields_to_process) >= 3:
                    results.append(fields_to_process[2])

    if results:
        print(results)
        # ... can convert this list to a df if you need to ... 
        # df = pd.DataFrame(results, columns=['lookbehind'])


if __name__ == "__main__":
    main()

# (chmod +x) to make the script 'runnable' 

./yodish.py input.txt  #using your input .... 
['beta', 'delta']
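
The same look-behind idea can be wrapped in a function so it can be reused on any iterable of lines. This is only a minimal sketch of the deque technique above: the names lookbehind_values, needle and back are hypothetical, and, like the script, it counts only lines with at least two fields, so separator lines do not take up a slot in the window.

```python
from collections import deque


def lookbehind_values(lines, needle="truck", back=2):
    """Return the 3rd field of the data row `back` rows before each match."""
    window = deque(maxlen=back + 1)   # current row plus `back` earlier rows
    found = []
    for line in lines:
        fields = line.split()
        if len(fields) < 2:           # separators and blank lines are skipped
            continue
        window.append(fields)
        # Only look back once the window is full and the 2nd field matches.
        if len(window) == window.maxlen and fields[1] == needle:
            target = window[0]
            if len(target) >= 3:
                found.append(target[2])
    return found


sample = [
    "apples    grapes    alpha   pears",
    "chicago paris london ",
    "yellow    blue      red",
    "+++++++++++++++++++++",
    "apples    grapes    beta   pears",
    "chicago paris london ",
    "car   truck  van",
    "+++++++++++++++++++",
    "apples    grapes    gamma   pears",
    "chicago paris london ",
    "white  purple   black",
    "+++++++++++++++++++",
    "apples    grapes    delta   pears",
    "chicago paris london ",
    "car   truck  van",
]

values = lookbehind_values(sample)
print(values)  # ['beta', 'delta']
```

Because the function takes any iterable, an open file object can be passed in place of the sample list.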

I stripped this down a bit, to get it to work; but, it's got me on the right track; thanks!

strawdog · Accepted Answer · 2025-08-21 09:31:29Z


Maybe this variant with grouping will do the job?

df = df.assign(grp=df[0].str.contains(r"\++").cumsum())
res = df.groupby("grp").apply(lambda x: x.iloc[-3,2] 
                              if "truck"  in x[1].values
                              else None,
                              include_groups=False).dropna()

UPDATE

It occurred to me that the word "truck" is not guaranteed to be on the last row of a block, so a more versatile solution would be

idx = []
df = df.assign(grp=df[0].str.contains(r"\++").cumsum())
df.groupby("grp").apply(lambda x: idx.extend((x.index[x.iloc[:,1].eq("truck")]-2).tolist()),
                        include_groups=False) 
res = df.iloc[idx, 2].values
print(res)
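
The row arithmetic in the update (position of the match minus two) can also be tried without groupby. This is a rough sketch rather than the answer's own method. It assumes the dataframe is built by splitting every input line, separator lines included, so that it has a default 0..n-1 index. The names hits and targets are illustrative. It does not check that the row two above belongs to the same block.

```python
import pandas as pd

lines = [
    "apples    grapes    alpha   pears",
    "chicago paris london ",
    "yellow    blue      red",
    "+++++++++++++++++++++",
    "apples    grapes    beta   pears",
    "chicago paris london ",
    "car   truck  van",
    "+++++++++++++++++++",
    "apples    grapes    gamma   pears",
    "chicago paris london ",
    "white  purple   black",
    "+++++++++++++++++++",
    "apples    grapes    delta   pears",
    "chicago paris london ",
    "car   truck  van",
]

# Rows of unequal length are padded with missing values by the constructor.
df = pd.DataFrame([ln.split() for ln in lines])

hits = df.index[df[1].eq("truck")]          # positions of the matching rows
targets = [i - 2 for i in hits if i >= 2]   # two rows above; skip underflow
res = df.iloc[targets, 2].tolist()
print(res)  # ['beta', 'delta']
```

Since the dataframe is assembled from a plain loop over the lines, this also fits the case where read_csv is not available.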

edited Aug 21 at 9:31

answered Aug 21 at 9:06


1 Comment

yodish Aug 21 at 14:06

I appreciate the solution!

PaulS · Accepted Answer · 2025-08-20 22:54:02Z


A possible solution, based on pandas, which requires reading the text into a dataframe:

# from io import StringIO
#
# text = """
# apples    grapes    alpha   pears
# chicago paris london 
# yellow    blue      red
# +++++++++++++++++++++
# apples    grapes    beta   pears
# chicago paris london 
# car   truck  van
# +++++++++++++++++++
# apples    grapes    gamma   pears
# chicago paris london 
# white  purple   black
# +++++++++++++++++++
# apples    grapes    delta   pears
# chicago paris london 
# car   truck  van
# """

# df = pd.read_csv(StringIO(text), sep=r'\s+', header=None)

out = df.set_axis(df.index % 4)
out.loc[0, 2].loc[out.loc[2, 1].eq('truck').values].to_list()

This works by first re-labeling the dataframe’s index with a repeating pattern 0–3, using set_axis, so that each 4-line block is consistently marked (with row label 0 for the header line, etc.). Then, it selects the candidate lines with out.loc[2, 1] and checks which ones have the second entry equal to truck by applying eq. The boolean mask generated is aligned positionally using .values and applied to the corresponding entries two lines above, obtained from loc as out.loc[0, 2]. Finally, the matches are converted into a list with to_list.

Output:

['beta', 'delta']
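
The intermediate steps described above can be inspected one at a time. This is a sketch under two assumptions: the sample is split line by line instead of being read with read_csv, and every block is exactly 4 lines long, which the approach relies on. The variable names headers and mask are illustrative.

```python
import pandas as pd

lines = [
    "apples    grapes    alpha   pears",
    "chicago paris london ",
    "yellow    blue      red",
    "+++++++++++++++++++++",
    "apples    grapes    beta   pears",
    "chicago paris london ",
    "car   truck  van",
    "+++++++++++++++++++",
    "apples    grapes    gamma   pears",
    "chicago paris london ",
    "white  purple   black",
    "+++++++++++++++++++",
    "apples    grapes    delta   pears",
    "chicago paris london ",
    "car   truck  van",
]

# Rows of unequal length are padded with missing values by the constructor.
df = pd.DataFrame([ln.split() for ln in lines])

out = df.set_axis(df.index % 4)     # labels repeat 0, 1, 2, 3, 0, 1, ...
headers = out.loc[0, 2]             # 3rd string of each block's first line
mask = out.loc[2, 1].eq("truck")    # is the 2nd string of the 3rd line "truck"?
result = headers.loc[mask.values].to_list()
print(result)  # ['beta', 'delta']
```

If the blocks are not all 4 lines long, the labels drift out of step with the data and the selection no longer lines up.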

edited Aug 20 at 22:54

answered Aug 20 at 22:30


7 Comments

yodish Aug 21 at 0:48

I'll give it a shot, thanks! Is there any chance this could be implemented without read_csv? I am unable to use that and must loop through the input to create the dataframe and/or list

yodish Aug 21 at 0:55

also, the larger dataset that i'm working against is not consistently blocks of 4

PaulS Aug 21 at 10:17

Maybe my code can be adapted to your setting. How do you get the data? Do the data come in slots?

yodish Aug 21 at 12:42

It's fairly structured; but, there are instances where it is not (and there are multiple 'truck' rows instead of 1)

PaulS Aug 21 at 12:50

I suspect my approach still works in that setting. I can have a look if you post a realistic dataset.

yodish Aug 21 at 12:54

line 391, in get_loc return self._range.index(new_key) ^^^^^^^^^^^^^^^^^^^^^^^^^^ ValueError: 2 is not in range (i'm still looking, I appreciate your solution!)

PaulS Aug 21 at 12:58

Without a real piece of your dataset, it is impossible to adapt my code to your needs. But maybe the other solutions here will work for you!
