python text parsing to get filtered output

Question

My goal is to search file.txt for an identifying string and then output the words that follow it between the quotation marks.

The identifier would be data-default-alt= and the name of the item is "Ford Truck" in quotes. I would like to output the name of the item and the price so that I can open the result in Excel.

data-default-alt="Ford Truck">       </h3>     </a>           </div>     <div class="tileInfo">                <div class="swatchesBox--empty"></div>                                                     <div class="promo-msg-text">           <span class="calloutMsg-promo-msg-text"></span>         </div>                              <div class="pricecontainer" data-pricetype="Stand Alone">               <p id="price_206019013" class="price price-label ">                  $1,000.00               </p>

Desired Output would be

Ford Truck 1000.00

I am not sure how to go about this task.

Have you tried regular expressions?

– jacob, Commented Mar 24, 2016 at 17:51

Yavar · Accepted Answer · 2016-03-24 18:31:35Z

1

You will want to construct more robust regular expressions for matching the cost and the brand, but here is some code to get you started.

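import re

# Sample input. It is named "text" rather than "str" so the built-in str type is not shadowed.
text = '<data-default-alt="Ford Truck"></h3></a></div><div class="tileInfo"><div class="swatchesBox--empty"></div><div class="promo-msg-text"> <span class="calloutMsg-promo-msg-text"></span> </div><div class="pricecontainer" data-pricetype="Stand Alone"><p id="price_206019013" class="price price-label ">$1,000.00</p>'

# Raw strings keep the backslashes intact for the regex engine.
# The pattern does not require a leading "<", so it also matches data-default-alt= used as an attribute.
brand = re.search(r'data-default-alt="(.*?)"', text)
# \s* tolerates whitespace between the amount and the closing tag.
cost = re.search(r'\$(\d+,?\d*\.\d+)\s*</p>', text)
if brand:
    print(brand.group(1))
if cost:
    print(cost.group(1))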
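
To collect every item rather than only the first, and to produce something Excel can open, one option is re.finditer combined with the csv module. The following is a minimal sketch in Python 3 syntax, not a definitive implementation. The names ITEM_RE, extract_items and write_csv are hypothetical, the "Chevy Van" entry is made-up sample data, and the pattern assumes that each name is followed by its own price before the next name appears.

```python
import csv
import io
import re

# One pattern per item: the name inside data-default-alt="...", then (lazily)
# the first dollar amount that follows. re.DOTALL lets ".*?" cross line breaks.
ITEM_RE = re.compile(
    r'data-default-alt="(?P<name>[^"]*)".*?\$\s*(?P<price>[\d,]+\.\d{2})',
    re.DOTALL,
)


def extract_items(text):
    """Return (name, price) tuples; thousands separators are removed."""
    return [
        (m.group("name"), m.group("price").replace(",", ""))
        for m in ITEM_RE.finditer(text)
    ]


def write_csv(items, out):
    """Write a header row and the items to an open text file object."""
    writer = csv.writer(out)
    writer.writerow(["name", "price"])
    writer.writerows(items)


# Made-up sample with two items; real input would come from file.txt.
sample = (
    'data-default-alt="Ford Truck"> </h3> <p class="price price-label "> $1,000.00 </p>'
    'data-default-alt="Chevy Van"> </h3> <p class="price price-label "> $2.84 </p>'
)
items = extract_items(sample)
print(items)

# Demonstrate the CSV output in memory instead of touching the disk.
buffer = io.StringIO()
write_csv(items, buffer)
print(buffer.getvalue())
```

For a real file, the text could be read with open("file.txt") and read(), and the rows written to a file opened with open("items.csv", "w", newline=""). If an item has no price, the lazy match would pick up the next item's price, so the pattern would need tightening for such input.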

edited Mar 24, 2016 at 18:31

answered Mar 24, 2016 at 18:25


6 Comments

turtle02 Over a year ago

Thanks, this gave me the output Ford Truck 1,000.00. How do I read from a text file?

turtle02 Over a year ago

There will be multiple of these in each file. How do I get all of them?

turtle02 Over a year ago

I am in the EST time zone.

turtle02 Over a year ago

Getting this error now. brand=re.search('<data-default-alt=\"(.*?)">',str) File "C:\Users\turtle02\Anaconda2\lib\re.py", line 146, in search return _compile(pattern, flags).search(string) TypeError: expected string or buffer

turtle02 Over a year ago

All code:

with open("file.txt") as str:
    st = str.read()
import re
brand=re.search('<data-default-alt=\"(.*?)">',str)
cost=re.search('\$(\d+,?\d*\.\d+)</p>',str)
if brand:
    print brand.group(1)
if cost:
    print cost.group(1)
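
The TypeError in the traceback appears to come from passing the file object (named str in the snippet above) to re.search instead of the text returned by read() (stored in st). Here is a minimal sketch of the fix in Python 3 syntax. It writes a temporary file as a stand-in for file.txt so that it runs on its own, and the names path, f and text are chosen for illustration only.

```python
import os
import re
import tempfile

# Stand-in for file.txt so the sketch is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write('data-default-alt="Ford Truck"> <p class="price price-label "> $1,000.00 </p>')
    path = tmp.name

with open(path) as f:   # f is the file object; re.search cannot take it
    text = f.read()     # text is the string that re.search expects
os.remove(path)

# Search the string, not the file object.
brand = re.search(r'data-default-alt="(.*?)"', text)
cost = re.search(r'\$(\d[\d,]*\.\d+)', text)
if brand:
    print(brand.group(1))
if cost:
    print(cost.group(1))
```

Giving the file object a name other than str also keeps the built-in str type available.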


illright · Accepted Answer · 2016-03-24 18:16:34Z

0

Use the built-in string methods to find the index of a substring. For example, "abcdef".find("bc") returns 1, which is the index of the first character of the match. To parse your string, look for the surrounding markup and then extract the text you need with string slicing.
Here is an example of solving your problem, assuming the string to parse is stored in a variable named st:

with open("file.txt") as f:
    st = f.read()  # read the whole file into one string

# Index of the first character of the name: find the marker, then skip past it.
name_marker = 'data-default-alt="'
name_start = st.find(name_marker) + len(name_marker)
# Offset of the closing quote, relative to name_start.
name_end = st[name_start:].find('"')
# Slice out the text between the quotes.
name = st[name_start:name_start + name_end]

price_marker = 'class="price price-label ">'
price_start = st.find(price_marker) + len(price_marker)
price_end = st[price_start:].find('</p>')
# strip() removes the whitespace on both ends of the price.
price = st[price_start:price_start + price_end].strip()

The results are in the name and price variables. If you want to work with the price as a number and don't want the dollar sign, add it to the strip arguments (.strip("$ "); read more about that method in the Python docs). Passing an argument replaces the default set of whitespace characters, so list any others, such as "\n", that you also need removed. You can remove the comma by calling replace(",", "") on the price string, and finally convert the string to a float using float(price).
Note: it may just be the way you pasted the string, but I've added strip() to get rid of the whitespace on each end of the price string.
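
The same find-and-slice idea can be repeated over the whole string by passing a start offset to find, which also covers the case of several items per file. This is a minimal sketch in Python 3 syntax under the assumption that every name is followed by a price paragraph with the class shown in the question. The function name extract_all and the "Chevy Van" entry are made up for illustration.

```python
def extract_all(st):
    """Collect (name, price) pairs using only str.find and slicing."""
    name_tag = 'data-default-alt="'
    price_tag = 'class="price price-label ">'
    results = []
    pos = 0
    while True:
        name_start = st.find(name_tag, pos)
        if name_start == -1:      # find returns -1 when nothing is left
            break
        name_start += len(name_tag)
        name_end = st.find('"', name_start)
        price_start = st.find(price_tag, name_end)
        if name_end == -1 or price_start == -1:
            break
        price_start += len(price_tag)
        price_end = st.find('</p>', price_start)
        if price_end == -1:
            break
        # Trim whitespace, drop the dollar sign and the thousands separator.
        raw_price = st[price_start:price_end].strip().lstrip("$")
        price = float(raw_price.replace(",", ""))
        results.append((st[name_start:name_end], price))
        pos = price_end           # continue searching after this item
    return results


# Made-up sample with two items; real input would come from file.txt.
sample = (
    'data-default-alt="Ford Truck"> </h3> <p class="price price-label "> $1,000.00 </p>'
    'data-default-alt="Chevy Van"> </h3> <p class="price price-label "> $2.84 </p>'
)
print(extract_all(sample))
```

Checking for -1 matters here: without it, a missing marker would silently produce a wrong slice instead of ending the loop.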

edited Mar 24, 2016 at 18:16

answered Mar 24, 2016 at 17:58


11 Comments

turtle02 Over a year ago

I seem to have messed something up. I get this output: {{= $item.parent.data.itemAttributes.title}} $2.84. There will be multiple of these in each file; how do I get all of them?

illright Over a year ago

@turtle02 If you have several of those, you might be better off using regular expressions. If you're having trouble reading from a file, take a look at the first two lines of my code; they do just that.

turtle02 Over a year ago

Getting this error now. brand=re.search('<data-default-alt=\"(.*?)">',str) File "C:\Users\turtle02\Anaconda2\lib\re.py", line 146, in search return _compile(pattern, flags).search(string) TypeError: expected string or buffer

illright Over a year ago

@turtle02 Please print the contents of the str variable right before the line that causes the error, and post what it shows.
