
I am trying to take some information I got from a webpage and write one of its fields to a file, but I am having no luck. It is probably very easy, but I'm lost. Here is an example of one of the rows; there are 1253 rows in total.

<div class='entry qual-5 used-demoman slot-head bestprice custom' data-price='3280000' data-name="Kill-a-Watt Allbrero" data-quality="5" data-australium="normal" data-class="demoman" data-particle_effect="56" data-paint="" data-slot="cosmetic" data-consignment="consignment">

I am after the field called data-name. It is not at the same position in each row. I tried this, but it did not work:

mfile=open('itemlist.txt','r')
mfile2=open('output.txt','a')
for row in mfile:
    if char =='data-name':
        mfile2.write(char)

Edit 1:

I made an example file containing 'hello hi peanut'. If I did:

for row in mfile:
    print row.index('hello')

it would print 0 as expected. However, when I changed hello to hi, it didn't return 1; it returned nothing.

  • char is not defined in your code. You could use row.index('data-name') to figure out where the attribute begins. Then you can index again starting from that index to find the two quotation marks and use string manipulation to extract the value. Commented Jul 5, 2015 at 20:36
  • Could you put this as an answer with an example so I can accept it as an answer Commented Jul 5, 2015 at 20:40
  • I would actually want you to try it on your own first before showing you how to do it. So why don’t you give it a try and then if that fails, show what you have tried, and then we can try to explain you where you went wrong. That way, you learn best. Commented Jul 5, 2015 at 20:42
  • I'm trying it, but I've found that it only looks at the first value and doesn't look at the rest of the values in the row. Commented Jul 5, 2015 at 20:47

2 Answers


Let’s try to find the value using common string manipulation methods:

>>> line = '''<div class='entry qual-5 used-demoman slot-head bestprice custom' data-price='3280000' data-name="Kill-a-Watt Allbrero" data-quality="5" data-australium="normal" data-class="demoman" data-particle_effect="56" data-paint="" data-slot="cosmetic" data-consignment="consignment">'''

We can use str.index to find the position of a string within a string:

>>> line.index('data-name')
87

So now we know we need to start looking at index 87 for the attribute we are interested in:

>>> line[87:]
'data-name="Kill-a-Watt Allbrero" data-quality="5" data-australium="normal" data-class="demoman" data-particle_effect="56" data-paint="" data-slot="cosmetic" data-consignment="consignment">'

Now, we need to remove the data-name=" part too:

>>> start = line.index('data-name') + len('data-name="')
>>> start
98
>>> line[start:]
'Kill-a-Watt Allbrero" data-quality="5" data-australium="normal" data-class="demoman" data-particle_effect="56" data-paint="" data-slot="cosmetic" data-consignment="consignment">'

Now, we just need to find the index of the closing quotation mark too, and then we can extract just the attribute value:

>>> end = line.index('"', start)
>>> end
118
>>> line[start:end]
'Kill-a-Watt Allbrero'

And then we have our solution:

start = line.index('data-name') + len('data-name="')
end = line.index('"', start)
print(line[start:end])
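
One caveat with the snippet above: str.index raises ValueError when the substring is absent, so a line without data-name would stop the script. Here is a minimal sketch of a more forgiving variant built on str.find, which returns -1 instead. The name extract_attribute is a hypothetical helper, not part of the original answer, and it assumes the value is wrapped in double quotes. It is written for Python 3.

```python
def extract_attribute(line, name='data-name'):
    """Return the double-quoted value of `name` in `line`, or None if absent.

    Hypothetical helper for illustration; assumes name="value" quoting.
    """
    marker = name + '="'
    pos = line.find(marker)        # -1 when the attribute is missing
    if pos == -1:
        return None
    start = pos + len(marker)      # first character of the value
    end = line.find('"', start)    # closing quotation mark
    if end == -1:
        return None
    return line[start:end]


line = '<div data-price=\'3280000\' data-name="Kill-a-Watt Allbrero" data-quality="5">'
print(extract_attribute(line))                         # Kill-a-Watt Allbrero
print(extract_attribute('<p>no attributes here</p>'))  # None
```

Returning None lets the caller decide whether to skip the line or report it, instead of aborting on the first row that lacks the attribute.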

We can put that in the loop:

with open('itemlist.txt', 'r') as mfile, open('output.txt', 'a') as mfile2:
    for line in mfile:
        # position of the first character after data-name="
        start = line.index('data-name') + len('data-name="')
        # position of the closing quotation mark
        end = line.index('"', start)
        mfile2.write(line[start:end])
        mfile2.write('\n')
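
The loop above looks at one data-name per line. If the rows are not actually on separate lines (an assumption; the thread never confirms how itemlist.txt is laid out), only the first value on each line would be found. A rough sketch that collects every occurrence with a regular expression, shown on an in-memory string rather than the real file. The second row and the name "Example Hat" are made up for illustration.

```python
import re

# Two sample rows deliberately placed on ONE line to mimic a file without
# line breaks. The second row ("Example Hat") is invented for this sketch.
text = ('<div data-price=\'3280000\' data-name="Kill-a-Watt Allbrero" data-quality="5">'
        '<div data-price=\'100\' data-name="Example Hat" data-quality="6">')

# Capture everything between data-name=" and the next double quote.
names = re.findall(r'data-name="([^"]*)"', text)
print(names)  # ['Kill-a-Watt Allbrero', 'Example Hat']
```

For the real file, text could come from mfile.read(), and each entry of names could be written to the output file followed by '\n'.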

8 Comments

Pretty instructive and helpful answer. +1
I am trying this, but I noticed that you left my broken loop in, so I'm trying to fix that now. But when I add print start and print end to check that it is finding the line.index values, nothing comes out?
Oh yes, sorry, I copy/pasted too much without looking, fixed that code at the end now :)
I tried this and it isn't working for me. Nothing is written or printed, even when I add print start after the definition of start.
Hmm, that’s weird. Try printing the line right after for line in mfile to see if any lines actually appear.

You can also use BeautifulSoup:

a.html:

<html>
    <head>
        <title> Asdf </title>
    </head>
    <body>

        <div class='entry qual-5 used-demoman slot-head bestprice custom' data-price='3280000' data-name="Kill-a-Watt Allbrero" data-quality="5" data-australium="normal" data-class="demoman" data-particle_effect="56" data-paint="" data-slot="cosmetic" data-consignment="consignment">

    </body>
</html>

a.py:

from bs4 import BeautifulSoup
with open('a.html') as f:
    lines = f.readlines()
soup = BeautifulSoup(''.join(lines), 'html.parser')
result = soup.findAll('div')[0]['data-price']
print result
# prints 3280000; use ['data-name'] instead to get Kill-a-Watt Allbrero

In my opinion, if your task is as simple as in your example, there is no real need for BeautifulSoup. However, if it is more complicated, or is likely to become so, consider giving BeautifulSoup a try.
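
If installing a third-party package is not an option, the same parser-based idea can be sketched with the standard library alone. This is not what the answer above uses: it swaps BeautifulSoup for html.parser.HTMLParser, targets Python 3, and the class name DataNameCollector is a hypothetical name chosen for illustration.

```python
from html.parser import HTMLParser  # Python 3 standard library


class DataNameCollector(HTMLParser):
    """Collect the data-name attribute of every <div> start tag."""

    def __init__(self):
        super().__init__()
        self.names = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened
        if tag == 'div':
            value = dict(attrs).get('data-name')
            if value is not None:
                self.names.append(value)


html = '''<div class='entry qual-5' data-price='3280000' data-name="Kill-a-Watt Allbrero" data-paint="">'''
collector = DataNameCollector()
collector.feed(html)
print(collector.names)  # ['Kill-a-Watt Allbrero']
```

Because the parser handles quoting itself, this works whether the attribute uses single or double quotes and wherever it sits inside the tag.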

5 Comments

The BeautifulSoup module name suggests that you are using version 3, which is pretty old and does not support Python 3. Please update to BeautifulSoup 4 and change the module name in your answer to bs4.
I proudly prefer using Python 2.7.6 unless the OP explicitly asks for a Python 3 solution. There is only the Python tag in the question, as far as I can see.
Sure, but bs4 works in Python 2.6+ too, and it generally seems like a bad idea to promote outdated, and no-longer updated libraries when a newer version exists (especially when all you have to do is change it to from bs4 import BeautifulSoup)
@poke Okay, that makes sense. I updated my answer to bs4 while keeping print result to show it is still Python 2 :-)
Yes, that’s totally fine with me; my issue was only with the old module name. Thanks :)
