
I have a very large XML file that I need to split into several files based on a particular tag. The XML file looks something like this:

<xml>
<file id="13">
  <head>
    <talkid>2458</talkid>
    <transcription>
      <seekvideo id="645">So in college,</seekvideo>
      ...
    </transcription>
  </head>
  <content> *** This is the content I am trying to save *** </content>
</file>
<file>
      ... 
</file>
</xml>

I want to extract the content of each file and save based on the talkid.

Here is the code I have tried:

import xml.etree.ElementTree as ET

all_talks = 'path\\to\\big\\file'

context = ET.iterparse(all_talks, events=('end', ))
for event, elem in context:
    if elem.tag == 'file':
        content = elem.find('content').text
        title = elem.find('talkid').text
        filename = format(title + ".txt")
        with open(filename, 'wb', encoding='utf-8') as f:
            f.write(ET.tostring(content), encoding='utf-8')

But I get the following error:

AttributeError: 'NoneType' object has no attribute 'text'

3 Answers

1

If you're already using .iterparse(), a more general approach is to rely on the events alone and track the values you need as they arrive:

import xml.etree.ElementTree as ET
from pathlib import Path

all_talks = Path(r'file.xml')
context = ET.iterparse(all_talks, events=('start', 'end'))

for event, element in context:
    if event == 'end':
        if element.tag == 'talkid':
            title = element.text
        elif element.tag == 'content':
            content = element.text
        elif element.tag == 'file' and title and content:
            with open(all_talks.with_name(title + '.txt'), 'w', encoding='utf-8') as f:
                f.write(content)
    elif element.tag == 'file':
        content = title = None
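
Since the file in the question is very large, it may also help to release each <file> subtree once it has been written. Below is a minimal sketch of the same event loop wrapped in a function; the names split_by_talkid and out_dir are hypothetical, and it assumes the structure shown in the question.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def split_by_talkid(xml_path, out_dir):
    """Write each <file>'s <content> text to <out_dir>/<talkid>.txt."""
    out_dir = Path(out_dir)
    written = []
    title = content = None
    for event, element in ET.iterparse(xml_path, events=('start', 'end')):
        if event == 'start':
            if element.tag == 'file':
                title = content = None  # reset state for the new <file>
        elif element.tag == 'talkid':
            title = element.text
        elif element.tag == 'content':
            content = element.text
        elif element.tag == 'file':
            if title and content:
                target = out_dir / (title + '.txt')
                target.write_text(content, encoding='utf-8')
                written.append(target)
            # Drop the finished subtree so memory use does not grow
            # with the number of <file> elements already processed.
            element.clear()
    return written
```

The function returns the paths it wrote, which makes it easy to check whether anything was produced and where.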

Update: in a similar question, @Leila asked how to write the text from all <seekvideo> tags to the file instead of <content>, so here is a solution:

import xml.etree.ElementTree as ET
from pathlib import Path

all_talks = Path(r'file.xml')
context = ET.iterparse(all_talks, events=('start', 'end'))

for event, element in context:
    if event == 'end':
        if element.tag == 'file' and title and parts:
            with open(all_talks.with_name(title + '.txt'), 'w', encoding='utf-8') as f:
                f.write('\n'.join(parts))
        elif element.text:
            if element.tag == 'talkid':
                title = element.text
            elif element.tag == 'seekvideo':
                parts.append(element.text)
    elif element.tag == 'file':
        title = None
        parts = []
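
If the goal is the plain text of the whole <transcription> element with the tags stripped, Element.itertext() is one option once the 'end' event for the enclosing <file> has fired. A minimal sketch, assuming the layout from the question; the helper name transcription_text is hypothetical, and the second <seekvideo> line in the sample is invented for illustration.

```python
import xml.etree.ElementTree as ET


def transcription_text(file_elem):
    """Return the text of <transcription> with child tags stripped, or None."""
    transcription = file_elem.find('head/transcription')
    if transcription is None:
        return None
    # itertext() yields the text of the element and of all its descendants,
    # in document order; whitespace-only pieces are skipped here.
    pieces = (text.strip() for text in transcription.itertext())
    return '\n'.join(piece for piece in pieces if piece)


# Sample data: the second <seekvideo> is made up for this example.
sample = ET.fromstring(
    '<file id="13"><head><talkid>2458</talkid><transcription>'
    '<seekvideo id="645">So in college,</seekvideo>'
    '<seekvideo id="646">I studied.</seekvideo>'
    '</transcription></head><content>x</content></file>'
)
print(transcription_text(sample))
```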

7 Comments

What if I want to extract transcription lines, however, without <seekvideo> tags? Could you please help me with that?
@Leila, add one more condition elif element.tag == 'transcription':
That doesn't work. The output is blank. I also tried another way with findall(), but again didn't work. Added it as a new question. stackoverflow.com/questions/74182062/…
@Leila, try this.
@Leila, I've edited code to avoid tags with no text.
1

The issue is that talkid is a child of the head tag, not of the file tag, so elem.find('talkid') returns None. Try it this way:

import xml.etree.ElementTree as ET

all_talks = 'file.xml'

context = ET.iterparse(all_talks, events=('end', ))
for event, elem in context:
    if elem.tag == 'file':
        head = elem.find('head')
        content = elem.find('content').text
        title = head.find('talkid').text
        filename = title + ".txt"
        with open(filename, 'w', encoding='utf-8') as f:  # text mode, so no .encode() is needed
            f.write(content)
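
To see why the original lookup failed: find() with a bare tag name only searches direct children. A small sketch on a trimmed copy of the XML from the question (the 'untitled' default is a hypothetical placeholder):

```python
import xml.etree.ElementTree as ET

file_elem = ET.fromstring(
    '<file id="13"><head><talkid>2458</talkid></head>'
    '<content>saved text</content></file>'
)

# A bare tag name only matches direct children of <file>.
direct = file_elem.find('talkid')        # None: talkid sits under <head>
by_path = file_elem.find('head/talkid')  # explicit path through <head>
anywhere = file_elem.find('.//talkid')   # any descendant named talkid

# findtext() returns a default instead of raising when nothing matches.
title = file_elem.findtext('head/talkid', default='untitled')
```

Either 'head/talkid' or './/talkid' can replace the two-step lookup above; findtext() additionally avoids the AttributeError when a <file> has no talkid at all.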

7 Comments

This eliminated the error, but it doesn't work. There's no output.
@Leila When I ran this code on the example xml in your question it created a file called 2458.txt and it had the *** This is the content I am trying to save *** contents
That's strange! I even tried it with a smaller .xml file to check if it was affected by the large size, but again no output! Thanks anyway. I'll try to figure it out.
@Leila make sure you are looking in the right directory.
Sure, but still the same.
1

You can use Beautiful Soup to parse XML.

It would look like this (I added a second talkid to the XML to demonstrate finding multiple tags):

xml_file = '''<xml>
<file id="13">
  <head>
    <talkid>2458</talkid>
    <transcription>
      <seekvideo id="645">So in college,</seekvideo>
      ...
    </transcription>
     <talkid>second talk id</talkid>
  </head>
  <content> *** This is the content I am trying to save *** </content>
</file>
<file>
      ... 
</file>
</xml>'''

from bs4 import BeautifulSoup
soup = BeautifulSoup(xml_file, "xml")

first_talk_id = soup.find('talkid').get_text()
talk_ids = soup.find_all('talkid')

print(first_talk_id)
# prints 2458


for talk in talk_ids:
    print(talk.get_text())

# prints 
# 2458
# second talk id 

NOTE: bs4 needs a separate parser installed to work with XML; for instance, pip install lxml.
