
I have about 10,000 XML files with a similar structure that I wish to convert into a single CSV file. Each XML file looks like this:

<?xml version='1.0' encoding='UTF-8'?>
<S:Envelope xmlns:S="http://schemas.xmlsoap.org/soap/envelope/">
    <S:Body>
        <ns7:GetStopMonitoringServiceResponse xmlns:ns3="http://www.siri.org.uk/siri" xmlns:ns4="http://www.ifopt.org.uk/acsb" xmlns:ns5="http://www.ifopt.org.uk/ifopt" xmlns:ns6="http://datex2.eu/schema/1_0/1_0" xmlns:ns7="http://new.webservice.namespace">
            <Answer>
                <ns3:ResponseTimestamp>2019-03-31T09:00:52.912+03:00</ns3:ResponseTimestamp>
                <ns3:ProducerRef>ISR Siri Server (141.10)</ns3:ProducerRef>
                <ns3:ResponseMessageIdentifier>276480603</ns3:ResponseMessageIdentifier>
                <ns3:RequestMessageRef>0100700:1351669188:4684</ns3:RequestMessageRef>
                <ns3:Status>true</ns3:Status>
                <ns3:StopMonitoringDelivery version="IL2.71">
                    <ns3:ResponseTimestamp>2019-03-31T09:00:52.912+03:00</ns3:ResponseTimestamp>
                    <ns3:Status>true</ns3:Status>
                    <ns3:MonitoredStopVisit>
                        <ns3:RecordedAtTime>2019-03-31T09:00:52.000+03:00</ns3:RecordedAtTime>
                        <ns3:ItemIdentifier>-881202701</ns3:ItemIdentifier>
                        <ns3:MonitoringRef>20902</ns3:MonitoringRef>
                        <ns3:MonitoredVehicleJourney>
                            <ns3:LineRef>23925</ns3:LineRef>
                            <ns3:DirectionRef>2</ns3:DirectionRef>
                            <ns3:FramedVehicleJourneyRef>
                                <ns3:DataFrameRef>2019-03-31</ns3:DataFrameRef>
                                <ns3:DatedVehicleJourneyRef>36962685</ns3:DatedVehicleJourneyRef>
                            </ns3:FramedVehicleJourneyRef>
                            <ns3:PublishedLineName>15</ns3:PublishedLineName>
                            <ns3:OperatorRef>15</ns3:OperatorRef>
                            <ns3:DestinationRef>26020</ns3:DestinationRef>
                            <ns3:OriginAimedDepartureTime>2019-03-31T08:35:00.000+03:00</ns3:OriginAimedDepartureTime>
                            <ns3:VehicleLocation>
                                <ns3:Longitude>34.78000259399414</ns3:Longitude>
                                <ns3:Latitude>32.042293548583984</ns3:Latitude>
                            </ns3:VehicleLocation>
                            <ns3:VehicleRef>37629301</ns3:VehicleRef>
                            <ns3:MonitoredCall>
                                <ns3:StopPointRef>20902</ns3:StopPointRef>
                                <ns3:ExpectedArrivalTime>2019-03-31T09:03:00.000+03:00</ns3:ExpectedArrivalTime>
                            </ns3:MonitoredCall>
                        </ns3:MonitoredVehicleJourney>
                    </ns3:MonitoredStopVisit>
                </ns3:StopMonitoringDelivery>
            </Answer>
        </ns7:GetStopMonitoringServiceResponse>
    </S:Body>
</S:Envelope>

The example above shows one MonitoredStopVisit nested tag, but every XML file has about 4,000 of them. A full XML example can be found here.

I want to convert all 10K files into one CSV where each record corresponds to a MonitoredStopVisit tag, so the CSV should look like this: generated CSV

Currently this is my architecture:

  • Split the 10K files into 8 chunks (one per PC core).
  • Each sub-process iterates through its XML files and objectifies the XML.
  • The object is then iterated, and for each element I use conditions to exclude/include data using an array.
  • When the tag is /ns3:MonitoredStopVisit, the array is appended to a pandas DataFrame as a Series.
  • When all sub-processes are done, the DataFrames are merged and saved as a CSV.
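The fan-out step above can be sketched roughly as follows. This is a minimal, illustrative sketch only: `parse_one` is a stand-in for the real per-file parser, and the file/output names are made up.

```python
import glob
from multiprocessing import Pool

import pandas as pd

N_WORKERS = 8  # one worker per core, as described above


def parse_one(path):
    # stand-in for the real per-file XML parser; here it just records
    # the file name so the chunking machinery itself is runnable
    return pd.DataFrame({'file': [path]})


def process_chunk(paths):
    # each sub-process parses its chunk and returns one partial DataFrame
    frames = [parse_one(p) for p in paths]
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()


if __name__ == '__main__':
    files = sorted(glob.glob('*.xml'))
    # round-robin split into one chunk per worker
    chunks = [files[i::N_WORKERS] for i in range(N_WORKERS)]
    with Pool(N_WORKERS) as pool:
        parts = pool.map(process_chunk, chunks)
    merged = pd.concat(parts, ignore_index=True)
    if not merged.empty:
        merged.to_csv('out.csv', index=False)
```
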

This is the xml to df code:

import pandas as pd
from lxml import objectify


def xml_to_df(xml_file):
    xml_content = xml_file.read()
    obj = objectify.fromstring(xml_content)
    df_cols=[
        'RecordedAtTime',
        'MonitoringRef',
        'LineRef',
        'DirectionRef',
        'PublishedLineName',
        'OperatorRef',
        'DestinationRef',
        'OriginAimedDepartureTime',
        'Longitude',
        'Latitude',
        'VehicleRef',
        'StopPointRef',
        'ExpectedArrivalTime',
        'AimedArrivalTime'
        ]
    tempdf = pd.DataFrame(columns=df_cols)
    arr_of_vals = [""] * 14

    for i in obj.iter():
        # flush the record at the end of a MonitoredStopVisit,
        # or when a Status tag reports "false"
        if "MonitoredStopVisit" in i.tag or ("Status" in i.tag and "false" in str(i)):
            if arr_of_vals[0] != "" and (arr_of_vals[8] and arr_of_vals[9]):
                s = pd.Series(arr_of_vals, index=df_cols)
                if tempdf[(tempdf==s).all(axis=1)].empty:
                    tempdf = tempdf.append(s, ignore_index=True)
                    arr_of_vals =  [""] * 14
        elif "RecordedAtTime" in i.tag:
            arr_of_vals[0] = str(i)
        elif "MonitoringRef" in i.tag:
            arr_of_vals[1] = str(i)
        elif "LineRef" in i.tag:
            arr_of_vals[2] = str(i)
        elif "DestinationRef" in i.tag:
            arr_of_vals[6] = str(i)
        elif "OriginAimedDepartureTime" in i.tag:
            arr_of_vals[7] = str(i)
        elif "Longitude" in i.tag:
            arr_of_vals[8] = str(i)
        elif "Latitude" in i.tag:
            arr_of_vals[9] = str(i)
        elif "VehicleRef" in i.tag:
            arr_of_vals[10] = str(i)
        elif "ExpectedArrivalTime" in i.tag:
            arr_of_vals[12] = str(i)

    if arr_of_vals[0] != "" and (arr_of_vals[8] and arr_of_vals[9]):  
        s = pd.Series(arr_of_vals, index=df_cols)
        if tempdf[(tempdf == s).all(axis=1)].empty:
            tempdf = tempdf.append(s, ignore_index=True)
    return tempdf

The problem is that for 10K files this takes about 10 hours with 8 sub-processes. When checking CPU/memory usage, I can see they are not fully utilized.

Any idea how this can be improved? My next step is threading, but maybe there are other applicable approaches. Just as a note, the order of records isn't important - I can sort them later.

  • post the structure of an XML with more than one tag Commented Jun 4, 2019 at 12:34
  • Added a link to a full XML: wetransfer.com/downloads/… Commented Jun 4, 2019 at 12:58
  • The link to the XML example is broken. Commented Jun 4, 2019 at 15:54

3 Answers


Here is my solution with pandas:

Computation time for each 5 MB file is about 0.4 s.

import xml.etree.ElementTree as ET
import re
import pandas as pd
import os



def collect_data(xml_file):
    # create xml object
    root = ET.parse(xml_file).getroot()

    # collect raw data, adding a 'break' marker at the start of each record
    out_data = []
    for element in root.iter():
        # get tag name without its namespace
        tag = re.sub('{.*?}', '', element.tag)
        # add break segment element
        if tag == 'RecordedAtTime':
            out_data.append('break')

        if tag in tag_list:
            out_data.append((tag, element.text))

    # get break indexes; include the end of the list so the last record is kept
    break_index = [i for i, x in enumerate(out_data) if x == 'break']
    break_index.append(len(out_data))

    # break list into parts
    list_data = []
    for i in range(len(break_index) - 1):
        list_data.append(out_data[break_index[i]:break_index[i + 1]])

    # build one row per record
    final_data = []
    for item in list_data:
        # delete break element and convert list into dictionary
        item.remove('break')
        data_dictionary = dict(item)
        # look up every expected tag, defaulting to '' when it is missing
        final_data.append(tuple(data_dictionary.get(tag, '') for tag in tag_list))

    return final_data


# setup tags list for checking
tag_list = ['RecordedAtTime', 'MonitoringRef', 'LineRef', 'DirectionRef', 'PublishedLineName', 'OperatorRef',
            'DestinationRef', 'OriginAimedDepartureTime', 'Longitude', 'Latitude', 'VehicleRef', 'StopPointRef',
            'ExpectedArrivalTime', 'AimedArrivalTime']

# collect data from each file
save_data = []
for file_name in os.listdir(os.getcwd()):
    if file_name.endswith('.xml'):
        save_data.append(collect_data(file_name))

# merge list of lists
flat_list = []
for sublist in save_data:
    for item in sublist:
        flat_list.append(item)

# load data into data frame
data = pd.DataFrame(flat_list, columns=tag_list)

# save data to file
data.to_csv('data.csv', index=False)

2 Comments

Did you happen to try it for 10K files? Just out of curiosity
@Shakedk I tested on 1,000 files and the average time for one file is 0.59 sec. I think you can modify my code to get a faster time.

So it seems the issue is the use of the pandas DataFrame and Series. Using the code above, processing one XML file with ~4,000 records took 4-120 seconds, and the time increased as the program kept running.

Using Python lists or NumPy matrices (more convenient for writing to a CSV) decreased the running time significantly - processing each XML file now takes 0.1-0.5 seconds at most.

I used the following code to append the newly processed records each time:

records = np.append(records, new_records, axis=0)

This is equivalent to:

tempdf = tempdf.append(s, ignore_index=True)

but significantly faster.
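Note that `np.append` also copies the whole array on every call; the pattern that avoids repeated copying entirely is to accumulate rows in a plain Python list and build the DataFrame (or array) once at the end. A minimal sketch, with made-up records standing in for the parsed XML values:

```python
import pandas as pd

df_cols = ['RecordedAtTime', 'MonitoringRef', 'LineRef']  # abbreviated column list

# made-up parsed records standing in for the real XML values
parsed = [('2019-03-31T09:00:52', '20902', '23925'),
          ('2019-03-31T09:01:12', '20903', '23926')]

rows = []
for record in parsed:
    rows.append(record)                       # list append is amortized O(1), no copying
df = pd.DataFrame(rows, columns=df_cols)      # single DataFrame construction at the end
```
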

Hope this helps anyone who might encounter similar issues!

3 Comments

Of course your pandas code slows down over time: you are appending to a data frame in a loop, which you should avoid since it leads to quadratic copying.
I'm doing the same with numpy arrays: `bs_records = np.append(bs_records, records, axis=0)` and it is faster by far, as I mentioned... Maybe for some people it's obvious, but for people not used to pandas/numpy this can be helpful to know.
Look into XSLT which can transform XML to CSV. No need for appending lists, matrices, or data frames.

Actually, consider XSLT, the special-purpose language designed to transform XML files into other XML or even text files such as CSV. The only third-party library needed is Python's lxml, which can run XSLT 1.0 scripts, leaving out the heavier, extensive analytical tools such as pandas and NumPy.

In fact, because XSLT is a separate, industry-standard language, it is portable and can be run in any language with an XSLT library (e.g., Java, PHP, Perl, C#, VB) or by standalone 1.0, 2.0, or 3.0 processors (e.g., Xalan, Saxon), all of which Python can call as a command-line subprocess.
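Calling a standalone processor from Python might look like the following sketch. The Saxon-HE jar path and the input/output file names here are assumptions and depend on your install:

```python
import shutil
import subprocess

# Hypothetical Saxon-HE invocation; the jar path and file names are
# assumptions and must be adapted to your environment.
cmd = ['java', '-jar', 'saxon-he.jar',
       '-s:input.xml',           # source XML document
       '-xsl:XSLTScript.xsl',    # stylesheet to apply
       '-o:output.csv']          # file to write the result to

if shutil.which('java'):
    # check=False: inspect returncode yourself rather than raising on failure
    subprocess.run(cmd, check=False)
```
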

XSLT (save below as a .xsl file, a special .xml file)

<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                              xmlns:S="http://schemas.xmlsoap.org/soap/envelope/"
                              xmlns:ns3="http://www.siri.org.uk/siri" 
                              xmlns:ns4="http://www.ifopt.org.uk/acsb" 
                              xmlns:ns5="http://www.ifopt.org.uk/ifopt" 
                              xmlns:ns6="http://datex2.eu/schema/1_0/1_0" 
                              xmlns:ns7="http://new.webservice.namespace">

   <xsl:output method="text" indent="yes" omit-xml-declaration="yes"/>
   <xsl:strip-space elements="*"/>

   <xsl:template match ="/S:Envelope/S:Body/ns7:GetStopMonitoringServiceResponse/Answer">
       <xsl:apply-templates select="ns3:StopMonitoringDelivery"/>
   </xsl:template>

   <xsl:template match="ns3:StopMonitoringDelivery">
        <!-- HEADERS -->
        <!-- <xsl:text>RecordedAtTime,MonitoringRef,LineRef,DirectionRef,PublishedLineName,OperatorRef,DestinationRef,OriginAimedDepartureTime,Longitude,Latitude,VehicleRef,StopPointRef,ExpectedArrivalTime,AimedArrivalTime&#xa;</xsl:text> -->
        <xsl:apply-templates select="ns3:MonitoredStopVisit"/>
        <xsl:text>&#xa;</xsl:text>
   </xsl:template>

   <xsl:template match="ns3:MonitoredStopVisit">
       <xsl:variable name="delim">,</xsl:variable>
       <xsl:variable name="quote">&quot;</xsl:variable>
       <!-- DATA ROWS -->
       <xsl:value-of select="concat($quote, ns3:RecordedAtTime, $quote, $delim,
                                    $quote, ns3:MonitoringRef, $quote, $delim,
                                    $quote, ns3:MonitoredVehicleJourney/ns3:LineRef, $quote, $delim,
                                    $quote, ns3:MonitoredVehicleJourney/ns3:DirectionRef, $quote, $delim,
                                    $quote, ns3:MonitoredVehicleJourney/ns3:PublishedLineName, $quote, $delim,
                                    $quote, ns3:MonitoredVehicleJourney/ns3:OperatorRef, $quote, $delim,
                                    $quote, ns3:MonitoredVehicleJourney/ns3:DestinationRef, $quote, $delim,
                                    $quote, ns3:MonitoredVehicleJourney/ns3:OriginAimedDepartureTime, $quote, $delim,
                                    $quote, ns3:MonitoredVehicleJourney/ns3:VehicleLocation/ns3:Longitude, $quote, $delim,
                                    $quote, ns3:MonitoredVehicleJourney/ns3:VehicleLocation/ns3:Latitude, $quote, $delim,
                                    $quote, ns3:MonitoredVehicleJourney/ns3:VehicleRef, $quote, $delim,
                                    $quote, ns3:MonitoredVehicleJourney/ns3:MonitoredCall/ns3:StopPointRef, $quote, $delim,
                                    $quote, ns3:MonitoredVehicleJourney/ns3:MonitoredCall/ns3:ExpectedArrivalTime, $quote, $delim,
                                     $quote, ns3:MonitoredVehicleJourney/ns3:MonitoredCall/ns3:AimedArrivalTime, $quote, '&#xa;'
                                     )"/>
   </xsl:template>

</xsl:stylesheet>

Online Demo

Python (no appending lists, arrays, or dataframes)

import glob                 # TO RETRIEVE ALL XML FILES
import lxml.etree as et     # TO PARSE XML AND RUN XSLT

xml_path = "/path/to/xml/files"

# PARSE XSLT AND COMPILE THE TRANSFORM ONCE, OUTSIDE THE LOOP
xsl = et.parse('XSLTScript.xsl')
transform = et.XSLT(xsl)

# BUILD CSV
with open("MonitoredStopVisits.csv", 'w') as csv_file:
    # HEADER
    csv_file.write('RecordedAtTime,MonitoringRef,LineRef,DirectionRef,PublishedLineName,'
                   'OperatorRef,DestinationRef,OriginAimedDepartureTime,Longitude,Latitude,'
                   'VehicleRef,StopPointRef,ExpectedArrivalTime,AimedArrivalTime\n')

    # DATA ROWS
    for xml_file in glob.glob(xml_path + "/**/*.xml", recursive=True):
        # LOAD XML AND TRANSFORM TO A STRING RESULT TREE
        xml = et.parse(xml_file)
        result = str(transform(xml))

        # WRITE TO CSV
        csv_file.write(result)
