1

I'm trying to read HTML from the following URL into a pandas dataframe:

https://antismash-db.secondarymetabolites.org/output/GCF_006385935.1/

The rendered HTML tables look like the following where there are N tables I'm interested in and 1 (the last one) that I'm not (i.e., I'm interested in the ones that don't start with "No secondary metabolite"): enter image description here

When I read HTML via pandas I get 3 tables. Note, the last table from pd.read_html isn't the "No secondary metabolite" table but a concatenated table of the ones I'm interested in prefixed with "NZ_" in the header.

enter image description here

My question is if there is a way to include the headers of the rendered table as a multiindex?

For instance, I'm looking for a resulting table that looks like this:

# Read HTML Tables
dataframes = pd.read_html("https://antismash-db.secondarymetabolites.org/output/GCF_006385935.1/")

# Set Region as the index
dataframes = list(map(lambda df: df.set_index("Region"), dataframes))

# Manual prepending of title and table headers, respectively
dataframes[0].index = dataframes[0].index.map(lambda x: ("GCF_006385935.1", "NZ_CP041066.1", x))
dataframes[1].index = dataframes[1].index.map(lambda x: ("GCF_006385935.1", "NZ_CP041065.1", x))

# Concatenate tables
df_concat = pd.concat(dataframes[:-1], axis=0)

# Replace &nbsp characters with _
df_concat.index = df_concat.index.map(lambda x: (x[0], x[1], x[2].replace("&nbsp","_")))

# Multiindex labels
df_concat.index.names = ["level_0", "level_1", "level_2"]
df_concat

enter image description here

1 Answer 1

2

Try beautifulsoup to parse the HTML and construct the final dataframe:

import requests
import pandas as pd
from bs4 import BeautifulSoup


id_ = "GCF_006385935.1"
url = f"https://antismash-db.secondarymetabolites.org/output/{id_}/"

soup = BeautifulSoup(requests.get(url).content, "html.parser")

dfs = []
for table in soup.select(".record-overview-details table"):
    header = table.find_previous(class_="record-overview-header").text.split()[
        0
    ]
    df = pd.read_html(str(table))[0].assign(level_1=header, level_0=id_)
    dfs.append(df)

final_df = pd.concat(dfs)
final_df = final_df.set_index(["level_0", "level_1", "Region"])
print(final_df)

Prints:

                                                       Type     From       To Most similar known cluster Most similar known cluster.1 Similarity
level_0         level_1       Region                                                                                                            
GCF_006385935.1 NZ_CP041066.1 Region&nbsp1.1        terpene  1123901  1143342                 carotenoid                      Terpene        50%
                              Region&nbsp1.2    phosphonate  1252463  1293980                        NaN                          NaN        NaN
                              Region&nbsp1.3          T3PKS  1944360  1985445                        NaN                          NaN        NaN
                              Region&nbsp1.4        terpene  2690187  2709232                        NaN                          NaN        NaN
                              Region&nbsp1.5        terpene  4260236  4281054                  surfactin              NRP:Lipopeptide        13%
                              Region&nbsp1.6    siderophore  4446861  4463436                        NaN                          NaN        NaN
                NZ_CP041065.1 Region&nbsp3.1  lanthipeptide    98352   124802                        NaN                          NaN        NaN
Sign up to request clarification or add additional context in comments.

1 Comment

Wow, I really need to step up my understanding of bs4. This is incredible! I was trying to do a bunch of str.split methods and it got very messy.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.