Replacing the nth substring within a string in Python using regex

Question

I'm writing a function to edit many strings in an HTML file at once. The requirements are a bit peculiar, however. Here's an example.

My String:

a href='http://en.wikipedia.org/wiki/Velocity'>
<img src="/uploads/3/3/9/3/3393839/____________________________________________________________________________________________________________________________________________________614162727.png" alt="Picture" style="width:100%;max-width:220px" />
</a>
<div style="display:block;font-size:90%"></div>
</div></div>

</td>
<td class='wsite-multicol-col' style='width:50%;padding:0 5px'>

<div><div class="wsite-image wsite-image-border-none " style="padding-top:0;padding-bottom:0;margin-left:0;margin-right:0;text-align:right">
<a href='http://www2.franciscan.edu/academic/MathSci/MathScienceIntegation/MathScienceIntegation-827.htm'>
<img src="/uploads/3/3/9/3/3393839/___________________________________________________________________________________________________________________________________308536556.png" alt="Picture" style="width:100%;max-width:595px" />
</a>

The actual string is much longer! I'm trying to replace every image wrapped in a Wikipedia link with one image, and every image wrapped in some other link with a different image.

Here's what I have so far:

wikiPath = r"www.somewebsite.com/myimage.png"

def dePolute(myString):

    newString =""

    # Last index found
    lastIndex = 0


    while True:
        wikiIndex = myString.index('wikipedia',lastIndex)
        picStartIndex = myString.index('<img ', wikiIndex)
        picEndIndex = myString.index('/>', wikiIndex)

        newString = re.sub(r'<img.*?/>','src="' + wikiPath ,myString,1)

    return newString

This obviously doesn't work, but my idea was to first find the index of the 'wiki' keyword that appears in all of those links, and then substitute the img tag that follows that index. Unfortunately, I don't know how to make re.sub start at a particular index. I can't do newString = re.sub(specification, newEntry, originalString[wikiIndex:]) because that would return only the substring and not the entire string.

This is what I would like My String to look like after the program finishes running:

a href='http://en.wikipedia.org/wiki/Velocity'>
<img src="www.somewebsite.com/myimage.png" alt="Picture" style="width:100%;max-width:220px" />
</a>
<div style="display:block;font-size:90%"></div>
</div></div>

</td>
<td class='wsite-multicol-col' style='width:50%;padding:0 5px'>

<div><div class="wsite-image wsite-image-border-none " style="padding-top:0;padding-bottom:0;margin-left:0;margin-right:0;text-align:right">
<a href='http://www2.franciscan.edu/academic/MathSci/MathScienceIntegation/MathScienceIntegation-827.htm'>
<img src="/uploads/3/3/9/3/3393839/___________________________________________________________________________________________________________________________________308536556.png" alt="Picture" style="width:100%;max-width:595px" />
</a>

Thank you for showing an example input. Can you show us the output you want to make sure we understand what you're trying to do?
– Quentin Pradet, Commented Feb 19, 2016 at 4:07

alecxe · Accepted Answer · 2016-02-19 04:30:01Z

4

I would do that with an HTML parser, like BeautifulSoup.

The idea is to use a CSS selector to locate img elements that sit inside a elements whose href contains wikipedia. For every such img element, replace the src attribute value:

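from bs4 import BeautifulSoup

# Replacement image path, as defined in the question
wikiPath = r"www.somewebsite.com/myimage.png"

data = """your HTML"""

soup = BeautifulSoup(data, "html.parser")

# Select every img that has a src attribute and is nested in an a element whose href contains "wikipedia"
for img in soup.select("a[href*=wikipedia] img[src]"):
    img["src"] = wikiPath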

print(soup.prettify())
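
If a regex-only route is still wanted, as the question title asks, here is a minimal standard-library sketch. It relies on the fact that a compiled pattern's search method accepts start and end positions, so the search can begin at the index of 'wikipedia' without slicing away the rest of the string. The helper name replace_wiki_images, the IMG_SRC pattern, and the shortened sample HTML are illustrative assumptions, not part of the question. It only handles double-quoted src attributes and is more fragile than a parser on real-world HTML.

```python
import re

WIKI_PATH = "www.somewebsite.com/myimage.png"  # replacement image from the question

# Assumed pattern: captures everything up to the opening quote of src,
# skips the old value, and captures the closing quote (double quotes only).
IMG_SRC = re.compile(r'(<img\b[^>]*?\bsrc=")[^"]*(")')


def replace_wiki_images(html, new_src=WIKI_PATH):
    """Replace the src of the img inside each link that mentions 'wikipedia'."""
    pieces = []
    copied = 0  # everything before this index is already in `pieces`
    scan = 0    # where the next search for 'wikipedia' starts
    while True:
        wiki_index = html.find("wikipedia", scan)
        if wiki_index == -1:
            break
        # Only look as far as the end of the enclosing link.
        link_end = html.find("</a>", wiki_index)
        if link_end == -1:
            link_end = len(html)
        # Pattern.search takes start and end positions, so no slicing is needed.
        match = IMG_SRC.search(html, wiki_index, link_end)
        if match is None:
            scan = wiki_index + len("wikipedia")
            continue
        pieces.append(html[copied:match.start()])
        pieces.append(match.group(1) + new_src + match.group(2))
        copied = scan = match.end()
    pieces.append(html[copied:])
    return "".join(pieces)


# Shortened stand-in for the HTML in the question.
sample_html = (
    "<a href='http://en.wikipedia.org/wiki/Velocity'>\n"
    '<img src="/uploads/a.png" alt="Picture" />\n'
    "</a>\n"
    "<a href='http://www2.franciscan.edu/x.htm'>\n"
    '<img src="/uploads/b.png" alt="Picture" />\n'
    "</a>"
)
result = replace_wiki_images(sample_html)
print(result)
```

Collecting the untouched slices and the rewritten tags in a list, then joining once, returns the whole string rather than only the part after the index, which was the sticking point with re.sub on a slice.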


ilyaU Over a year ago

Thanks! But I keep getting the following error: Traceback (most recent call last): File "Remove_Images.py", line 66, in <module> print dePolute(sourceFileString) UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 50077: ordinal not in range(128) Program ended with exit code: 1
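
The traceback in this comment most likely comes from the Python 2 print statement trying to encode a non-breaking space (u'\xa0') with the ASCII codec. One possible workaround, sketched below under that assumption, is to write the result to a file opened with an explicit UTF-8 encoding instead of printing it. The file name cleaned.html and the sample text are made-up placeholders.

```python
import io
import os
import tempfile

# Sample text with a non-breaking space, like the character named in the traceback.
text = u"Velocity\xa0article"

# Hypothetical output location; a temporary directory keeps the sketch self-contained.
out_path = os.path.join(tempfile.mkdtemp(), "cleaned.html")

# io.open accepts an explicit encoding on both Python 2 and Python 3,
# so the text is written as UTF-8 rather than passed through the ASCII codec.
with io.open(out_path, "w", encoding="utf-8") as handle:
    handle.write(text)

# Read the file back to confirm the character survived.
with io.open(out_path, "r", encoding="utf-8") as handle:
    round_tripped = handle.read()
```

In Python 2, encoding the string explicitly before printing (for example with .encode("utf-8")) is another common way around the same error.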
