First of all, I'd like to say this is my first post on SO, which has been a great help to me for years, so thank you all!
Now on to my question:
- I have a string containing Unicode text, HTML tags and BBCode tags (extracted from a forum).
Sample:
This is my sample text.
It may contain <a href="http://www.somesite.org/test.htm">HTML tags</a>,
[b]BBCode[b],
or even <a href="http://www.someothersite.com/">[b][u]both[/u] intricated[/b]</a>!
- I also have a dictionary of keywords that may appear in the text described above, each mapped to an associated URL.
Sample:
kw = {'sample': 'http://www.sample.fr', 'BBCode': 'http://www.bbcode.sp'}
As you can see, I'm currently using Python because I'm used to the language, but I can be flexible.
My goal is to detect which of my keywords are present in the sample text, and to "decorate" each matching word with a link (preferably in BBCode) to the corresponding URL, without altering the rest of the string (just as wikis do).
Continuing the examples above, I'd like to obtain:
This is my [url=http://www.sample.fr]sample[/url] text.
It may contain <a href="http://www.somesite.org/test.htm">HTML tags</a>,
[b][url=http://www.bbcode.sp]BBCode[/url][b],
or even <a href="http://www.someothersite.com/">[b][u]both[/u] intricated[/b]</a>!
The main problem is that a keyword from my list sometimes appears inside a tag (in a URL attribute, for example), and I do not want to "decorate" that occurrence with a link, since doing so would break the markup.
In other words, the text I'd like to replace can only be located outside the anchor tags:
**HERE** <not here>[not here] **HERE** [/not here]</not here> **HERE**
I've already tried BeautifulSoup (along with PostMarkup to convert the BBCode to HTML before parsing), but that approach doesn't let me keep the initial string intact.
Remark: because of how my forum is used, "real" text can never appear between brackets (neither angle nor square), which simplifies the problem quite a bit.
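To make the rule concrete, here is a minimal sketch of one possible approach. It is only an illustration, not necessarily how any posted answer works. It assumes that anything between `<...>` or `[...]` is markup to be left alone and that text between tags is eligible, as the HERE/not-here diagram above suggests. The function name `decorate` is hypothetical. The regex tries to match a tag first; only if no tag starts at the current position does it try a keyword, so keywords inside tags are never touched.

```python
import re

def decorate(text, keywords):
    """Wrap each keyword found outside <...> and [...] in a BBCode [url] tag."""
    if not keywords:
        return text
    # Longest keywords first, so a longer keyword wins over a shorter one
    # that is a prefix of it.
    words = sorted(keywords, key=len, reverse=True)
    pattern = re.compile(
        r"<[^>]*>"          # an HTML tag: matched, then returned unchanged
        r"|\[[^\]]*\]"      # a BBCode tag: matched, then returned unchanged
        r"|\b(" + "|".join(re.escape(w) for w in words) + r")\b"
    )

    def repl(match):
        word = match.group(1)
        if word is None:
            # A tag was matched: leave it exactly as it was.
            return match.group(0)
        return "[url={}]{}[/url]".format(keywords[word], word)

    return pattern.sub(repl, text)

kw = {'sample': 'http://www.sample.fr', 'BBCode': 'http://www.bbcode.sp'}
print(decorate("This is my sample text, with [b]BBCode[b].", kw))
```

Note that this sketch is not idempotent: running it a second time on its own output would decorate the keyword inside the `[url=...]...[/url]` it just produced. Keywords in the text content of an existing link are also decorated; skipping those would require one more alternative in the pattern that matches a whole anchor element.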
I'm sorry for the very long question; I hope everything is clear!
Any help is appreciated. Thanks to everyone in advance!
Update: Casimir's solution in Python (see below) works just great. Thank you Casimir et Hippolyte!
Comment: So text inside [b] tags should be eligible for replacement? Is that the rule? Or can it simply be generalised to "not inside an anchor tag"?