I'm writing a function to edit many strings in an HTML file at once. The requirements are a bit peculiar, however. Here's an example.

My String:

a href='http://en.wikipedia.org/wiki/Velocity'>
<img src="/uploads/3/3/9/3/3393839/____________________________________________________________________________________________________________________________________________________614162727.png" alt="Picture" style="width:100%;max-width:220px" />
</a>
<div style="display:block;font-size:90%"></div>
</div></div>

</td>
<td class='wsite-multicol-col' style='width:50%;padding:0 5px'>

<div><div class="wsite-image wsite-image-border-none " style="padding-top:0;padding-bottom:0;margin-left:0;margin-right:0;text-align:right">
<a href='http://www2.franciscan.edu/academic/MathSci/MathScienceIntegation/MathScienceIntegation-827.htm'>
<img src="/uploads/3/3/9/3/3393839/___________________________________________________________________________________________________________________________________308536556.png" alt="Picture" style="width:100%;max-width:595px" />
</a>

The actual string is much longer! I'm trying to replace every image inside a Wikipedia link with one image, and every image inside some other link with a different image.

Here's what I have so far:

wikiPath = r"www.somewebsite.com/myimage.png"

def dePolute(myString):

    newString =""

    # Last index found
    lastIndex = 0


    while True:
        wikiIndex = myString.index('wikipedia',lastIndex)
        picStartIndex = myString.index('<img ', wikiIndex)
        picEndIndex = myString.index('/>', wikiIndex)

        newString = re.sub(r'<img.*?/>','src="' + wikiPath ,myString,1)

    return newString 

So this obviously doesn't work, but the idea was to first find the index of the 'wikipedia' keyword, which appears in all of those links, and then substitute the img tag that follows that index. Unfortunately, I don't know how to make re.sub start at a particular index. I can't do newString = re.sub(specification, newEntry, originalString[wikiIndex:]) because that would return only the substring, not the entire string.


This is what I would like My String to look like after the program finishes running:

a href='http://en.wikipedia.org/wiki/Velocity'>
<img src="www.somewebsite.com/myimage.png" alt="Picture" style="width:100%;max-width:220px" />
</a>
<div style="display:block;font-size:90%"></div>
</div></div>

</td>
<td class='wsite-multicol-col' style='width:50%;padding:0 5px'>

<div><div class="wsite-image wsite-image-border-none " style="padding-top:0;padding-bottom:0;margin-left:0;margin-right:0;text-align:right">
<a href='http://www2.franciscan.edu/academic/MathSci/MathScienceIntegation/MathScienceIntegation-827.htm'>
<img src="/uploads/3/3/9/3/3393839/___________________________________________________________________________________________________________________________________308536556.png" alt="Picture" style="width:100%;max-width:595px" />
</a>
  • Thank you for showing an example input. Can you show us the output you want to make sure we understand what you're trying to do? Commented Feb 19, 2016 at 4:07
  • Edited, hope it helps! Commented Feb 22, 2016 at 20:33

1 Answer

I would do that with an HTML parser, like BeautifulSoup.

The idea is to use a CSS selector to locate img elements inside a elements whose href contains wikipedia. For every such img element, replace the src attribute value:

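from bs4 import BeautifulSoup

# Replacement image path, as defined in the question
wikiPath = r"www.somewebsite.com/myimage.png"

data = """your HTML"""

soup = BeautifulSoup(data, "html.parser")

# Select every <img> that has a src attribute and sits inside
# an <a> whose href contains "wikipedia"
for img in soup.select("a[href*=wikipedia] img[src]"):
    img["src"] = wikiPath

print(soup.prettify())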
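
If installing a parser is not an option, the index-based idea from the question can be made to work with the standard library alone. Compiled patterns accept a start position in `search`, so there is no need to slice the string first. The sketch below is only that: `replace_wiki_images` and `IMG_SRC` are hypothetical names, it assumes double-quoted `src` attributes as in the sample, and it replaces the first image after any occurrence of the word "wikipedia", even one outside an href. The parser approach above is more robust.

```python
import re

wikiPath = "www.somewebsite.com/myimage.png"

# Group 1: everything in the <img> tag up to the opening quote of src.
# Group 2: the old image path (assumes double quotes, as in the sample).
IMG_SRC = re.compile(r'(<img\b[^>]*?\bsrc=")([^"]*)')


def replace_wiki_images(html, new_src=wikiPath):
    """Replace the src of the first <img> found after each 'wikipedia' occurrence."""
    pieces = []
    pos = 0
    while True:
        wiki = html.find("wikipedia", pos)
        if wiki == -1:
            break  # no more wikipedia links
        # Compiled patterns can start searching at a given index.
        match = IMG_SRC.search(html, wiki)
        if match is None:
            break  # no image after this link
        pieces.append(html[pos:match.start(2)])  # text up to the old path
        pieces.append(new_src)                   # the replacement path
        pos = match.end(2)                       # continue after the old path
    pieces.append(html[pos:])                    # untouched remainder
    return "".join(pieces)
```

Because the result is assembled from slices of the original string, everything outside the replaced paths, including images under other links, comes back unchanged.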

1 Comment

Thanks! But I keep getting the following error:

Traceback (most recent call last):
  File "Remove_Images.py", line 66, in <module>
    print dePolute(sourceFileString)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 50077: ordinal not in range(128)
Program ended with exit code: 1
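
The traceback in this comment points at the print statement rather than at the parsing. The `print dePolute(...)` syntax suggests Python 2, where printing a unicode string to an output stream that defaults to ASCII most likely raises exactly this error on the non-breaking space (`u'\xa0'`). One way around it, sketched below, is to write the result to a file with an explicit encoding instead of printing it. The helper names `save_html` and `load_html`, the UTF-8 default, and the output filename in the comment are assumptions for illustration.

```python
import io


def save_html(text, path, encoding="utf-8"):
    """Write text to path with an explicit encoding instead of the console default."""
    with io.open(path, "w", encoding=encoding) as handle:
        handle.write(text)


def load_html(path, encoding="utf-8"):
    """Read the file back as text using the same explicit encoding."""
    with io.open(path, "r", encoding=encoding) as handle:
        return handle.read()


# Hypothetical usage with the names from the question:
# save_html(dePolute(sourceFileString), "cleaned.html")
```

`io.open` behaves the same way on Python 2 and Python 3, so the sketch should not depend on which interpreter runs the script.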
