How to parse a list of messy strings in Python

Question

I'm trying to extract some IDs and their statuses from an XML file. I have reached the point where I have a list of strings containing this info, and I just need to extract the codes and the status (pass or fail). The problem is that the strings are extremely messy, and I'm new to Python, so I'm not sure how to do this. The relevant piece of code:

l = len(res_smaller)
print(res_smaller)
for field in range(1, l, 2):
    aux = res_smaller[field]
    print(aux)

Output:

<big class="Heading3">1.2 <a name="i__786909744_34">Test Case CODE_2096: RANDOMTEXT</a>: Failed</big>
<big class="Heading3">1.3 <a name="i__786909744_1424">Test Case CODE_2101: RANDOMTEXT</a>: Failed</big>
<big class="Heading3">1.4 <a name="i__786909744_2814">Test Case CODE_2111: RANDOMTEXT</a>: Failed</big>   
<big class="Heading3">1.5 <a name="i__786909744_2850">Test Case CODE_2098: RANDOMTEXT</a>: Failed</big>

I used the BeautifulSoup library to find_all Heading3 classes, parsed a bit more, and now I have a list from which I printed the lines that are of interest to me (this is why I step by 2 starting from index 1). My idea is to create a dictionary of the form CODE_NUMBER: STATUS, but I don't know how to extract these from each field. I thought of using aux.split(" ") to split each field on whitespace and take the 5th and 7th elements, but this gives me an error, so I'm not sure if this is possible in Python. Any ideas?

EDIT: Here's the code with the aux.split, I've also added the list printed as a whole:

l = len(res_smaller)
print(res_smaller)

for field in range(1, l, 2):
    aux = res_smaller[field]
    print(aux.split(" "))

Output:

[<big class="Heading3">1.1 <a name="i__786909744_13">RANDOMTEXT</a>: Passed</big>, <big class="Heading3">1.2 <a name="i__786909744_34">Test Case CODE_2096: RANDOMTEXT</a>: Failed</big>, <big class="Heading3">Main Part of Test Case</big>, <big class="Heading3">1.3 <a name="i__786909744_1424">Test Case CODE_2101: RANDOMTEXT</a>: Failed</big>, <big class="Heading3">Main Part of Test Case</big>, <big class="Heading3">1.4 <a name="i__786909744_2814">Test Case CODE_2111: RANDOMTEXT</a>: Failed</big>, 
<big class="Heading3">Main Part of Test Case</big>, <big class="Heading3">1.5 <a name="i__786909744_2850">Test Case CODE_2098: RANDOMTEXT</a>: Failed</big>, <big class="Heading3">Main Part of Test Case</big>]
Traceback (most recent call last):
  File "D:\Code\Python\Projects\HTML_parser.py", line 43, in <module>
    print(aux.split(" "))
TypeError: 'NoneType' object is not callable

Please show your code using aux.split(), because that should work.
– Erik McKelvey, Commented Dec 1, 2021 at 21:33
@PM77-1 Yes, I would need the 4th and 6th elements; I wrote it like that for better understanding.
– Andrei0408, Commented Dec 1, 2021 at 21:36
@ErikMcKelvey If I try print(aux.split(" ")), I get the error TypeError: 'NoneType' object is not callable.
– Andrei0408, Commented Dec 1, 2021 at 21:38
You're not iterating over a list of strings, but over a list of bs4 objects. I think that's why the calls to split are failing. You can use the methods and attributes from that library to access the data, or you can convert them to strings with aux = str(res_smaller[field]).
– Willow, Commented Dec 1, 2021 at 23:48
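
To illustrate the last comment, here is a minimal sketch that assumes the beautifulsoup4 package is installed and uses one heading from the question as hypothetical sample input. On a bs4 Tag, an unknown attribute such as split is treated as a child-tag lookup, which returns None when no such child exists, and calling None produces the TypeError from the traceback. Taking the text with get_text() first gives a plain string that can be split.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: one heading copied from the question's output.
html = (
    '<big class="Heading3">1.2 <a name="i__786909744_34">'
    'Test Case CODE_2096: RANDOMTEXT</a>: Failed</big>'
)
soup = BeautifulSoup(html, "html.parser")
tag = soup.find("big", class_="Heading3")

# tag.split is looked up as a child tag named "split"; none exists,
# so the lookup yields None and tag.split(" ") would raise TypeError.
print(tag.split)

# get_text() returns a plain string with the markup stripped.
text = tag.get_text()
parts = text.split()

# The 4th element is the code (with a trailing colon), the last is the status.
code = parts[3].rstrip(":")
status = parts[-1]
print(code, status)
```

Splitting by position only works while the placeholder RANDOMTEXT contains no spaces; with real descriptions, a regular expression is likely the safer choice.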

Luv_Python · Accepted Answer · 2021-12-01 23:30:50Z

2

I highly suggest using findall from the re module. Since the input is not included, I am working with what I have:

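import re

my_dict = {}
# The test-case headings sit at the odd indices of res_smaller.
for field in range(1, len(res_smaller), 2):
    # res_smaller holds bs4 Tag objects; re.findall needs a string.
    aux = str(res_smaller[field])
    # The status is the text between "</a>: " and "</big>".
    status = re.findall(r'</a>: (.*?)</big>', aux, re.DOTALL)
    # The code is the text between "Case " and the next colon.
    code = re.findall(r'Case (.*?):', aux, re.DOTALL)
    my_dict[code[0]] = status[0]
print(my_dict)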

Output:

{'CODE_2096': 'Failed', 'CODE_2101': 'Failed', 'CODE_2111': 'Failed', 'CODE_2098': 'Failed'}
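
A possible variation on the same idea, sketched under assumptions: the rows list below is hypothetical sample data standing in for [str(tag) for tag in res_smaller], and parse_results is an illustrative helper name, not part of the original answer. It uses one compiled pattern with named groups and skips any entry that does not match, so it does not depend on the test cases sitting at odd indices.

```python
import re

# Hypothetical sample data standing in for [str(tag) for tag in res_smaller].
rows = [
    '<big class="Heading3">1.1 <a name="i__786909744_13">RANDOMTEXT</a>: Passed</big>',
    '<big class="Heading3">1.2 <a name="i__786909744_34">Test Case CODE_2096: RANDOMTEXT</a>: Failed</big>',
    '<big class="Heading3">Main Part of Test Case</big>',
    '<big class="Heading3">1.3 <a name="i__786909744_1424">Test Case CODE_2101: RANDOMTEXT</a>: Failed</big>',
]

# Capture the code after "Test Case " and the status word after "</a>:".
PATTERN = re.compile(
    r"Test Case (?P<code>\w+):.*?</a>:\s*(?P<status>\w+)", re.DOTALL
)


def parse_results(rows):
    """Map each test-case code to its status, ignoring rows without a match."""
    results = {}
    for row in rows:
        match = PATTERN.search(row)
        if match is None:
            continue  # e.g. "Main Part of Test Case" headings
        results[match.group("code")] = match.group("status")
    return results


print(parse_results(rows))
```

Skipping non-matching rows also avoids the IndexError that code[0] would raise if an entry had no "Case ...:" part.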
