
I have two CSV files imported as dataframes A and C. I want to match each string in column content of A with the entry in C's data.data column that contains that string.

A  time_a  content       C  time_c  data.data
   100     f00              400     otherf00other
   101     ba7              402     onlyrandom
   102     4242             407     otherba7other
                            409     other4242other

Should become:
time_a time_c content
100    400    f00
101    407    ba7
102    409    4242
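
For reference, a minimal reconstruction of the two sample frames (hypothetical; column names and values taken from the tables above), which the snippets below assume:

import pandas as pd

# Sample frames as shown above; content and data.data are strings.
A = pd.DataFrame({'time_a': [100, 101, 102],
                  'content': ['f00', 'ba7', '4242']})
C = pd.DataFrame({'time_c': [400, 402, 407, 409],
                  'data.data': ['otherf00other', 'onlyrandom',
                                'otherba7other', 'other4242other']})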

My solution below uses iterators, but it is too slow. This answer explains why and suggests methods to improve it, but I struggle to implement any of them.

How can I do this with the optimized methods from pandas?

import pandas as pd

# reset_index() was called on both frames beforehand
df_alsa_copy = df_alsa.copy()  # never modify the frame you iterate over
df_alsa_copy['cap_fno'] = -1

# For every row of A, scan all of C for a data.data entry that
# contains the content string (nested loops, hence the slowness)
for aIndex, aRow in df_alsa.iterrows():
    for cIndex, cRow in df_c.iterrows():
        if str(aRow['content']) in str(cRow['data.data']):
            df_alsa_copy.loc[aIndex, 'cap_fno'] = df_c.loc[cIndex, 'frame.number']

# https://stackoverflow.com/questions/31528819/using-merge-on-a-column-and-index-in-pandas
# Merge on the frame.number column (bc I chose it to be included in alsa_copy as a column)
df_ltnc = pd.merge(df_alsa_copy, df_c, left_on='cap_fno', right_on='frame.number')

Also tried:

  • Would work if there were an exact match: Pandas: Join dataframe with condition.
  • I also managed to match my second frame against a known string with series.str.contains (see the sketch after this list).
  • The problem is that I cannot pass a dataframe column as the value to match in merge's on=; I can only pass a known string.
  • The same problem arose when I used apply.
  • I did not succeed with isin or similar.
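
A minimal sketch of the single-known-string matching that did work (the literal 'f00' here is a stand-in for the column values I actually want to use):

# Matching one known string works; the open question is how to do
# this for every value in A['content'] at once.
mask = C['data.data'].str.contains('f00', regex=False)
print(C[mask])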

More info:

A holds timestamped content I fed into the program. C is a network capture. I want to know the time between feeding and capture (a sketch of that final step follows the list below). I assume:

  • The strings occur in the same order in A and C.
  • But in C there might be lines in between.
  • The strings represent hex values.
  • data.data contains other characters as well as the string I look for.
  • Maybe I lack the pandas vocabulary to look for the correct method.
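
Given the matched frame (like the expected output above), the latency itself is presumably just a column difference; a minimal sketch under that assumption:

# out: merged frame with columns time_a, time_c, content
# (assumption: both timestamps share the same clock and unit)
out['latency'] = out['time_c'] - out['time_a']
print(out[['content', 'latency']])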
  • Nvm, had no luck on Google. But I found a solution which is a bit faster through SO related questions: stackoverflow.com/a/54768170/17834402. Not sure if it is seriously that hard to find or if it was just bad luck. Any advice: delete this question or leave as duplicate, so others can find the original faster? Commented Mar 9, 2023 at 11:53
  • Which of the approaches from the link are you referring to? Commented Mar 9, 2023 at 12:25
  • The new one with apply, bfill, lambda expression etc. Unfortunately, I do not fully understand how it works (a rough sketch follows this thread). Commented Mar 9, 2023 at 12:27
  • If it is performant, happy to see it. I could compare the performance of both solutions then. Commented Mar 9, 2023 at 12:33
  • I'll post an answer in a moment. Commented Mar 9, 2023 at 12:40
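
For reference, a rough, hypothetical sketch of what such an apply-based lookup could look like (an assumption on my part; not necessarily the linked answer's exact bfill approach):

import pandas as pd

# For each content string, take the time_c of the first row in C
# whose data.data contains it (NA if there is no match).
def first_match_time(s):
    hits = C.loc[C['data.data'].str.contains(s, regex=False), 'time_c']
    return hits.iloc[0] if not hits.empty else pd.NA

A['time_c'] = A['content'].astype(str).apply(first_match_time)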

2 Answers


Try this approach using pandas.unique(), pandas.Series.str.contains, and pandas.DataFrame.merge.

# Build one regex alternation out of all unique content strings
unique_str = A['content'].unique()

# Keep only the rows of C whose data.data contains any of them
matching_rows = C[C['data.data'].str.contains('|'.join(unique_str))]

# Extract the matched string as the join key and merge with A
out = pd.merge(matching_rows, A,
               left_on=matching_rows['data.data']
                   .str.extract(f'({"|".join(unique_str)})')[0],
               right_on='content')[['time_a', 'time_c', 'content']]
print(out)

   time_a  time_c content
0     100     400     f00
1     101     407     ba7
2     102     409    4242
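
One caveat the answer does not mention: if the content values can contain regex metacharacters, the alternation should be escaped first, e.g.:

import re

# Escape each content string so it is matched literally rather
# than interpreted as a regular expression.
pattern = '|'.join(map(re.escape, unique_str))
matching_rows = C[C['data.data'].str.contains(pattern)]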

4 Comments

Works on the sample data provided. Still issues with my real data. Will report on performance, once sorted out.
This approach is expected to be more efficient, but because I don't know the structure of your real data, it becomes almost impossible to help you resolve those issues.
Thanks, but I think I figured it out. TLDR: My data has a problem, not the answer. LONG: You answered what I asked for. Your answer (correctly) assumes that 'unique_str' does not occur in multiple rows in C. However, in my data it does. My other coding approach just took the first appearance of the unique_str, which made it look correct. But your code actually gets the correct result and showed a flaw in my data (one way to handle such duplicates is sketched below).
Good to know that @PaulSmith
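
Following up on that last point, a hedged sketch of one way to keep only the first capture per string when data.data contains the same content in several rows (assumption: the earliest capture is the one wanted):

# Extract the matched content string as an explicit key column,
# then keep only the first C row per key before merging.
key = matching_rows['data.data'].str.extract(f'({"|".join(unique_str)})')[0]
first_hits = matching_rows.assign(key=key).drop_duplicates('key', keep='first')
out = pd.merge(first_hits, A, left_on='key',
               right_on='content')[['time_a', 'time_c', 'content']]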

If you want to improve speed, a different option you might consider is using Polars (https://www.pola.rs/). You can install Polars with pip install polars.

The solution is the same as what @Jamiu proposes; I think he has the right approach. The only difference is Polars instead of Pandas.

I tested the two solutions by multiplying the number of rows by 1000. The Pandas solution takes 400 ms while the Polars one takes 92 ms.
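
For orientation, a sketch of how such a timing could be reproduced for the pandas variant (assumed setup; the answer does not show its benchmark code):

import time

# Rough single-run timing of the pandas solution from above.
t0 = time.perf_counter()
unique_str = A['content'].unique()
matching_rows = C[C['data.data'].str.contains('|'.join(unique_str))]
out = pd.merge(matching_rows, A,
               left_on=matching_rows['data.data']
                   .str.extract(f'({"|".join(unique_str)})')[0],
               right_on='content')[['time_a', 'time_c', 'content']]
print(f"pandas: {time.perf_counter() - t0:.3f} s")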

import polars as pl

# Convert the pandas frames to Polars
a, c = pl.from_pandas(A), pl.from_pandas(C)

# Build one regex alternation out of all unique content strings
unique_values = f"({a['content'].unique().str.concat('|').item()})"

# Filter C to the rows containing any content string, extract the
# matched string, and join it back onto A
out = a.join(
    c.filter(pl.col('data.data').str.contains(unique_values)),
    left_on='content',
    right_on=pl.col('data.data').str.extract(unique_values),
)

# Convert back to pandas if needed
out_pandas = out.to_pandas()

1 Comment

Thanks for the hint, I will keep it in mind when I need more performance. Right now, speed is okay. Jamiu helped get rid of the iterator; that increased performance by some orders of magnitude already.
