Join big dataframes based on partial string-match between columns

Asked 3 years, 9 months ago

Modified 3 years, 9 months ago

Viewed 90 times

Two DataFrames have gene and isoform names that are not formatted the same way. I'd like to join them and add the df2 columns gene and isoform to df1 for every partial string match between the isoform (df2) and the Name (df1). df2 is a key for the isoforms/genes, where a gene can have many isoforms. In df1, which is basically the output of a gene-quantification tool (SALMON), the Name field contains both the gene and the isoform. I can't use a regex because isoforms have variable suffixes, such as ".", "_", "-", and many others. Another important piece of information is that each df1["Name"] cell has a unique isoform.

Sample of the DataFrames to merge:

import pandas as pd

df1 = pd.DataFrame({'Name': {0: 'AT1G01010;AT1G01010.1;Isoseq::Chr1:3616-5846', 1: 'AT1G01010;AT1G01010_2;Isoseq::Chr1:3630-5894', 2: 'AT1G01010;AT1G01010.3;Isoseq::Chr1:3635-5849', 3: 'AT1G01020;AT1G01020.11;Isoseq::Chr1:6803-8713', 4: 'AT1G01020;AT1G01020.13;Isoseq::Chr1:6811-8713'}, 'Length': {0: 2230, 1: 2264, 2: 2214, 3: 1910, 4: 1902}, 'EffectiveLength': {0: 1980.0, 1: 2014.0, 2: 1964.0, 3: 1660.0, 4: 1652.0}, 'TPM': {0: 2.997776, 1: 1.58178, 2: 0.0, 3: 4.317311, 4: 0.0}, 'NumReads': {0: 154.876, 1: 83.124, 2: 0.0, 3: 187.0, 4: 0.0}})
df2 = pd.DataFrame({'gene': {0: 'AT1G01010', 14: 'AT1G01010', 30: 'AT1G01010', 46: 'AT1G01020', 62: 'AT1G01020', 80: 'AT1G01020', 100: 'AT1G01020', 116: 'AT1G01020', 138: 'AT1G01020', 156: 'AT1G01020'}, 'isoform': {0: 'AT1G01010.1', 14: 'AT1G01010_2', 30: 'AT1G01010.3', 46: 'AT1G01020.1', 62: 'AT1G01020.10', 80: 'AT1G01020.11', 100: 'AT1G01020.12', 116: 'AT1G01020.13', 138: 'AT1G01020.14', 156: 'AT1G01020.15'}})
display(df1)
display(df2)

Desired output:

df3 = pd.DataFrame({'gene': {0: 'AT1G01010', 1:"AT1G01010", 2:"AT1G01010", 3:"AT1G01020", 4:"AT1G01020"},'isoform': {0: 'AT1G01010.1',1:"AT1G01010_2", 2:"AT1G01010.3", 3:"AT1G01020.11", 4:"AT1G01020.13"}, 'Length': {0: 2230, 1: 2264, 2: 2214, 3: 1910, 4: 1902}, 'EffectiveLength': {0: 1980.0, 1: 2014.0, 2: 1964.0, 3: 1660.0, 4: 1652.0}, 'TPM': {0: 2.997776, 1: 1.58178, 2: 0.0, 3: 4.317311, 4: 0.0}, 'NumReads': {0: 154.876, 1: 83.124, 2: 0.0, 3: 187.0, 4: 0.0}})
# The "Name" column from df1 is no longer needed (the idea is to replace it with gene and isoform)
display(df3)

Real DataFrame sizes:

df1 = 143646 rows × 5 columns

df2 = 169499 rows × 2 columns

(since df1 may not have all the isoforms detected, it's always smaller than df2)

I tried some answers I found online, but because these DataFrames are so large, many of them need 50 GB of RAM or so...
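
A rough back-of-the-envelope check of why that happens, assuming those answers pair every df1 row with every df2 row before filtering (an assumption about how they work, not something measured here):

```python
# Row counts quoted in the question
n_df1 = 143_646
n_df2 = 169_499

# Assumption: a cross join pairs every df1 row with every df2 row
n_pairs = n_df1 * n_df2
print(f"{n_pairs:,} candidate pairs")

# Lower bound on memory: one byte per pair for a boolean match mask
print(f"{n_pairs / 2**30:.1f} GiB at 1 byte per pair")
```

Any approach that materializes the paired strings themselves would need far more than this lower bound, so a lookup that touches each row once looks preferable.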

Already checked: Merge Dataframes Based on Partial Substrings Match, Join to Dataframes based on partial string matches in python, Join dataframes based on partial string-match between columns

Thanks for the help!

edited Mar 2, 2022 at 14:08

asked Mar 2, 2022 at 13:01

Lucas Servi


Is it expected that AT1G01020.1 matches AT1G01020.11? And that AT1G01020.10 matches AT1G01020.13?

– mozway
Commented Mar 2, 2022 at 13:08

No, sorry (edited): AT1G01020.11 and AT1G01020.1 are different isoforms. I added more rows to df2 to clarify this example. Thanks!

– Lucas Servi
Commented Mar 2, 2022 at 13:12

So, to be clear, you want a full match on gene and isoform?

– mozway
Commented Mar 2, 2022 at 13:13

Yes. A partial match is requested because df1 has a long field that may not always be separated by ";". Isoform suffixes may also contain letters, such as "_ID2".

– Lucas Servi
Commented Mar 2, 2022 at 13:16

How else can it be separated? You need to come up with a logic at some point ;)

– mozway
Commented Mar 2, 2022 at 13:17
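
One possible direction, given the clarification that a full match on the isoform is wanted: split each Name into tokens and look each token up in a set of the known df2 isoforms, then do an ordinary merge. This is only a sketch. The helper `find_isoform` and the separator pattern `sep` are hypothetical, and the pattern in particular is a guess, since the question says the field is not always separated by ";". The frames below are cut-down copies of the example data.

```python
import re
import pandas as pd

# Cut-down stand-ins for the question's frames (values copied from the example)
df1 = pd.DataFrame({
    "Name": [
        "AT1G01010;AT1G01010.1;Isoseq::Chr1:3616-5846",
        "AT1G01010;AT1G01010_2;Isoseq::Chr1:3630-5894",
        "AT1G01020;AT1G01020.11;Isoseq::Chr1:6803-8713",
    ],
    "TPM": [2.997776, 1.58178, 4.317311],
})
df2 = pd.DataFrame({
    "gene": ["AT1G01010", "AT1G01010", "AT1G01020", "AT1G01020"],
    "isoform": ["AT1G01010.1", "AT1G01010_2", "AT1G01020.1", "AT1G01020.11"],
})

# Set membership is a constant-time check, so each Name is processed once
known = set(df2["isoform"])

def find_isoform(name, sep=r"[;|,\s]+"):
    """Hypothetical helper: first token of `name` that is a known isoform.

    `sep` is an assumed separator pattern; adjust it to the real data.
    """
    for token in re.split(sep, name):
        if token in known:  # whole-token match, so ".1" never matches ".11"
            return token
    return None

# Extract the isoform key, then join on it like any other column
df1["isoform"] = df1["Name"].map(find_isoform)
df3 = df2.merge(df1, on="isoform", how="inner").drop(columns="Name")
print(df3)
```

Because tokens are compared whole, AT1G01020.1 and AT1G01020.11 stay distinct, and the work grows with the number of rows rather than with the product of the two frame sizes. Rows whose Name contains no known token get no isoform and drop out of the inner merge, which is worth checking on the real data before trusting the result.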
