Split a large xml file into multiple based on tag in Python

Question

I have a very large xml file which I need to split into several based on a particular tag. The XML file is something like this:

<xml>
<file id="13">
  <head>
    <talkid>2458</talkid>
    <transcription>
      <seekvideo id="645">So in college,</seekvideo>
      ...
    </transcription>
  </head>
  <content> *** This is the content I am trying to save *** </content>
</file>
<file>
      ... 
</file>
</xml>

I want to extract the content of each file and save it to a text file named after its talkid.

Here is the code I have tried:

import xml.etree.ElementTree as ET

all_talks = 'path\\to\\big\\file'

context = ET.iterparse(all_talks, events=('end', ))
for event, elem in context:
    if elem.tag == 'file':
        content = elem.find('content').text
        title = elem.find('talkid').text
        filename = format(title + ".txt")
        with open(filename, 'wb', encoding='utf-8') as f:
            f.write(ET.tostring(content), encoding='utf-8')

But I get the following error:

AttributeError: 'NoneType' object has no attribute 'text'

Martijn Pieters · Accepted Answer · 2022-12-24 12:21:43Z


If you're already using .iterparse(), it's more general to rely on the events alone:

import xml.etree.ElementTree as ET
from pathlib import Path

all_talks = Path(r'file.xml')
context = ET.iterparse(all_talks, events=('start', 'end'))

for event, element in context:
    if event == 'end':
        if element.tag == 'talkid':
            title = element.text
        elif element.tag == 'content':
            content = element.text
        elif element.tag == 'file' and title and content:
            with open(all_talks.with_name(title + '.txt'), 'w', encoding='utf-8') as f:
                f.write(content)
    elif element.tag == 'file':
        content = title = None
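
Since the input is described as very large, it may also help to release each <file> subtree once it has been written. Below is a minimal sketch of the same event-driven approach wrapped in a function. The name split_talks and the out_dir parameter are made up for illustration, and it assumes every talkid is usable as a file name.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def split_talks(xml_path, out_dir):
    """Write each <file>'s <content> text to <out_dir>/<talkid>.txt.

    Hypothetical helper: the name and signature are not from the answer.
    Returns the list of paths that were written.
    """
    out_dir = Path(out_dir)
    written = []
    title = content = None
    for event, element in ET.iterparse(xml_path, events=('start', 'end')):
        if event == 'start':
            if element.tag == 'file':
                title = content = None  # reset state for each new <file>
            continue
        # From here on we only handle 'end' events, where .text is complete.
        if element.tag == 'talkid':
            title = element.text
        elif element.tag == 'content':
            content = element.text
        elif element.tag == 'file':
            if title and content:
                target = out_dir / (title.strip() + '.txt')
                target.write_text(content, encoding='utf-8')
                written.append(target)
            element.clear()  # drop the finished subtree to limit memory use
    return written
```

Calling split_talks('file.xml', '.') would write the output next to the script. A <file> without a talkid or without content is skipped silently, as in the loop above.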

Update: in a similar question, @Leila asked how to write the text from all <seekvideo> tags to the file instead of the <content> text, so here is a solution:

import xml.etree.ElementTree as ET
from pathlib import Path

all_talks = Path(r'file.xml')
context = ET.iterparse(all_talks, events=('start', 'end'))

for event, element in context:
    if event == 'end':
        if element.tag == 'file' and title and parts:
            with open(all_talks.with_name(title + '.txt'), 'w', encoding='utf-8') as f:
                f.write('\n'.join(parts))
        elif element.text:
            if element.tag == 'talkid':
                title = element.text
            elif element.tag == 'seekvideo':
                parts.append(element.text)
    elif element.tag == 'file':
        title = None
        parts = []
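
If the goal is the transcription text without the <seekvideo> markup, note that the <transcription> element itself holds only whitespace: the words live in its children, so element.text on it looks blank. A small sketch of one way around that, using itertext() on an inline sample (the second <seekvideo> line is invented for the example):

```python
import xml.etree.ElementTree as ET

# Inline sample; the second <seekvideo> line is made up for illustration.
sample = """<transcription>
  <seekvideo id="645">So in college,</seekvideo>
  <seekvideo id="646">I studied physics.</seekvideo>
</transcription>"""

transcription = ET.fromstring(sample)

# itertext() walks the element and all its descendants, yielding every
# text fragment; the whitespace-only pieces between tags are filtered out.
lines = [text.strip() for text in transcription.itertext() if text.strip()]

print('\n'.join(lines))
```

Inside the iterparse loop, this would go in the branch that handles the 'end' event of 'transcription', before the element is cleared.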

edited Dec 24, 2022 at 12:21

Martijn Pieters


answered Oct 24, 2022 at 11:01

Olvin Roght


7 Comments

Leila Over a year ago

What if I want to extract transcription lines, however, without <seekvideo> tags? Could you please help me with that?

Olvin Roght Over a year ago

@Leila, add one more condition elif element.tag == 'transcription':

Leila Over a year ago

That doesn't work. The output is blank. I also tried another way with findall(), but again didn't work. Added it as a new question. stackoverflow.com/questions/74182062/…

Olvin Roght Over a year ago

@Leila, try this.

Olvin Roght Over a year ago

@Leila, I've edited code to avoid tags with no text.


Alexander · Accepted Answer · 2022-10-24 10:47:11Z


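Try it this way.

The issue is that talkid is a child of the head tag, not of the file tag. elem.find('talkid') only searches the direct children of elem, so it returns None, and calling .text on None raises the AttributeError.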

import xml.etree.ElementTree as ET

all_talks = 'file.xml'

context = ET.iterparse(all_talks, events=('end', ))
for event, elem in context:
    if elem.tag == 'file':
        head = elem.find('head')
        content = elem.find('content').text
        title = head.find('talkid').text
        filename = title + ".txt"
        with open(filename, 'wb') as f:  # 'wt' or just 'w' if you want to write text instead of bytes
            f.write(content.encode())    # in which case you would remove the .encode()
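
To see the difference between the lookups, here is a small sketch on an inline sample trimmed from the question's XML. It compares a direct-child search, an explicit path, and a descendant search with './/'. The variable names are only for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed version of one <file> element from the question.
sample = """<file id="13">
  <head>
    <talkid>2458</talkid>
  </head>
  <content>text to save</content>
</file>"""

elem = ET.fromstring(sample)

direct = elem.find('talkid')            # direct children of <file> only
nested = elem.find('head/talkid')       # explicit path through <head>
anywhere = elem.findtext('.//talkid')   # any descendant, returns its text

print(direct)       # None, which is what caused the AttributeError
print(nested.text)  # 2458
print(anywhere)     # 2458
```

The './/talkid' form avoids hard-coding the <head> level, at the cost of matching a talkid anywhere below the element.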

edited Oct 24, 2022 at 10:47

answered Oct 24, 2022 at 10:42

Alexander


7 Comments

Leila Over a year ago

This eliminated the error, but it doesn't work. There's no output.

Alexander Over a year ago

@Leila When I ran this code on the example xml in your question it created a file called 2458.txt and it had the *** This is the content I am trying to save *** contents

Leila Over a year ago

That's strange! I even tried it with a smaller .xml file to check if it was affected by the large size, but again no output! Thanks anyway. I'll try to figure it out.

Alexander Over a year ago

@Leila make sure you are looking in the right directory.

Leila Over a year ago

Sure, but still the same.


Boris Gezkovski · Accepted Answer · 2022-10-24 15:13:10Z

You can use Beautiful Soup to parse XML.

It would look like this (I added a second talkid to the XML to demonstrate finding multiple tags):

xml_file = '''<xml>
<file id="13">
  <head>
    <talkid>2458</talkid>
    <transcription>
      <seekvideo id="645">So in college,</seekvideo>
      ...
    </transcription>
     <talkid>second talk id</talkid>
  </head>
  <content> *** This is the content I am trying to save *** </content>
</file>
<file>
      ... 
</file>
</xml>'''

from bs4 import BeautifulSoup
soup = BeautifulSoup(xml_file, "xml")

first_talk_id = soup.find('talkid').get_text()
talk_ids = soup.find_all('talkid')

print(first_talk_id)
# prints 2458


for talk in talk_ids:
    print(talk.get_text())

# prints 
# 2458
# second talk id

NOTE: Beautiful Soup needs a parser installed to work with XML, for instance lxml (pip install lxml).
