I have a list of XML documents and a for loop that flattens each one into a pandas DataFrame.

The loop works, but it takes a long time to run, and the list of XML keeps growing.

How do I wrap the for loop below in executor.map to spread the workload across cores? I am following this article: https://medium.com/@ageitgey/quick-tip-speed-up-your-python-data-processing-scripts-with-process-pools-cf275350163a

The for loop that flattens the XML:

df1 = pd.DataFrame()
for i in lst:
    print('i am working')
    soup = BeautifulSoup(i, "xml")
    # Get Attributes from all nodes
    attrs = []
    for elm in soup():  # soup() is equivalent to soup.find_all()
        attrs.append(elm.attrs)

    # Since you want the data in a dataframe, it makes sense for each field to be a new row consisting of all the other node attributes
    fields_attribute_list= [x for x in attrs if 'Id' in x.keys()]
    other_attribute_list = [x for x in attrs if 'Id' not in x.keys() and x != {}]

    # Make a single dictionary with the attributes of all nodes except for the `Field` nodes.
    attribute_dict = {}
    for d in other_attribute_list:
        for k, v in d.items():  
            attribute_dict.setdefault(k, v)

    # Update each field row with attributes from all other nodes.
    full_list = []
    for field in fields_attribute_list:
        field.update(attribute_dict)
        full_list.append(field)

    # Make Dataframe
    df = pd.DataFrame(full_list)
    df1 = df1.append(df)

Does the for loop need to be transformed into a function?

1 Answer

Yes, you need to turn the loop body into a function. It is simplest if that function takes just one argument, because executor.map calls it once for each item of the iterable you pass in. That one argument can be anything: a list, a tuple, a dictionary, and so on. Functions with several parameters take a little more work to use with the concurrent.futures.*Executor methods.
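If the worker does need extra parameters, one common approach is to fix them ahead of time with functools.partial so that executor.map still sees a one-argument callable. This is only a minimal sketch: tag_item, its prefix and suffix parameters, and run are hypothetical names made up for illustration and are not part of the question's code.

```python
from concurrent import futures
from functools import partial


def tag_item(item, prefix, suffix):
    # Hypothetical worker that needs two settings besides the item itself.
    return f"{prefix}{item}{suffix}"


def run(items):
    # partial() fixes the extra arguments, leaving a one-argument callable
    # that executor.map can call once per item.
    worker = partial(tag_item, prefix="<", suffix=">")
    with futures.ThreadPoolExecutor() as executor:
        # map() yields results in the same order as the input items.
        return list(executor.map(worker, items))


if __name__ == "__main__":
    print(run(["a", "b", "c"]))
```

executor.map also accepts several iterables, in which case it passes one item from each to the function on every call, much like the built-in map.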

The example below should work for you.

from bs4 import BeautifulSoup
import pandas as pd
from concurrent import futures


def create_dataframe(xml):
    soup = BeautifulSoup(xml, "xml")
    # Get Attributes from all nodes
    attrs = []
    for elm in soup():  # soup() is equivalent to soup.find_all()
        attrs.append(elm.attrs)

    # Since you want the data in a dataframe, it makes sense for each field to be a new row consisting of all the other node attributes
    fields_attribute_list = [x for x in attrs if 'FieldId' in x.keys()]
    other_attribute_list = [x for x in attrs if 'FieldId' not in x.keys() and x != {}]

    # Make a single dictionary with the attributes of all nodes except for the `Field` nodes.
    attribute_dict = {}
    for d in other_attribute_list:
        for k, v in d.items():
            attribute_dict.setdefault(k, v)

    # Update each field row with attributes from all other nodes.
    full_list = []
    for field in fields_attribute_list:
        field.update(attribute_dict)
        full_list.append(field)
    print(len(full_list))
    # Make Dataframe
    df = pd.DataFrame(full_list)
    # print(df)
    return df


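with futures.ThreadPoolExecutor() as executor:  # Or use ProcessPoolExecutor to spread the parsing across cores
    # executor.map returns a lazy iterator, so collect the results into a list
    df_list = list(executor.map(create_dataframe, lst))

# Combine the per-document DataFrames into one
full_df = pd.concat(df_list)
print(full_df)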
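Since the question asks about spreading the work across cores, here is a rough sketch of the ProcessPoolExecutor variant. Threads in standard CPython generally do not run CPU-bound Python code in parallel, so a process pool is usually the better fit for parsing. To keep the sketch standard-library only, extract_rows is a hypothetical stand-in for create_dataframe that uses xml.etree.ElementTree and returns plain dicts. flatten_all, the sample XML, and chunksize=8 are also assumptions for illustration.

```python
import xml.etree.ElementTree as ET
from concurrent import futures
from itertools import chain


def extract_rows(xml):
    # Stand-in for create_dataframe: one dict per element with a FieldId attribute.
    # It must live at module top level so worker processes can import it.
    root = ET.fromstring(xml)
    return [dict(elm.attrib) for elm in root.iter() if "FieldId" in elm.attrib]


def flatten_all(xml_list, executor_cls=futures.ProcessPoolExecutor):
    with executor_cls() as executor:
        # chunksize sends several documents per task, which can reduce
        # inter-process overhead when there are many small documents.
        per_doc = list(executor.map(extract_rows, xml_list, chunksize=8))
    # Flatten the per-document row lists into a single list of rows.
    return list(chain.from_iterable(per_doc))


if __name__ == "__main__":
    # The guard matters on platforms that start workers by spawning a fresh
    # interpreter (Windows, for example); without it each worker would
    # re-run the pool setup on import.
    sample = ['<Form Name="a"><Field FieldId="1"/><Field FieldId="2"/></Form>'] * 4
    rows = flatten_all(sample)
    print(len(rows))
```

With the real create_dataframe as the worker, the same structure applies: keep the function at module level, put the pool under the __main__ guard, and call pd.concat on the collected results.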
3 Comments

Thank you for the answer. It works; however, when I run the for loop by itself I get 5066 rows with 57 columns, but with your function I get 159 rows and 57 columns. I have 159 XML objects in the list to unpack. I can't figure out what is causing the difference.
Well, I checked back against the old answer I had given. It turns out that the code in your question has the line fields_attribute_list= [x for x in attrs if 'Id' in x.keys()], with Id instead of FieldId. That results in each XML giving just 1 result. When you replace Id with FieldId for the XML you provided earlier, it seems to give you the result you need. I've updated the answer to reflect that. Can you check and tell me if it works?
Beautiful, I forgot to change that as well.