First of all, I'd like to say this is my first post on SO, which has been a great help to me for years, so thank you all!

Now onto my question:

  • I have a string of characters containing unicode text, html tags and bbcode tags (which is obviously extracted from a forum).

Sample:

This is my sample text.
It may contain <a href="http://www.somesite.org/test.htm">HTML tags</a>,
[b]BBCode[b],
or even <a href="http://www.someothersite.com/">[b][u]both[/u] intricated[/b]</a>!
  • I also have a list of keywords which may appear in the text described above, and for each of these words I have an associated URL.

Sample:

kw = {'sample': 'http://www.sample.fr', 'BBCode': 'http://www.bbcode.sp'}

As you can see I'm currently using Python because I'm used to the language, but I can be flexible.

My goal is to detect which words in my keyword list are present in the sample text, and to "decorate" each matching word with a link (preferably in bbcode) to the corresponding URL, without altering the rest of the string (just like wikis do).

Continuing the examples above, I'd like to obtain:

This is my [url=http://www.sample.fr]sample[/url] text.
It may contain <a href="http://www.somesite.org/test.htm">HTML tags</a>,
[b][url=http://www.bbcode.sp]BBCode[/url][b],
or even <a href="http://www.someothersite.com/">[b][u]both[/u] intricated[/b]</a>!

The main problem here is that sometimes, one of the keywords in my list appears inside a tag, which I do not want to "decorate" with a link for obvious reasons.

In other words, the text I'd like to replace can be located only outside the anchor tags:

**HERE** <not here>[not here] **HERE** [/not here]</not here> **HERE**

Also, I've already tried using BeautifulSoup (along with PostMarkup to convert BBCode to HTML before parsing with BeautifulSoup) but it doesn't allow me to keep the initial string...

Remark: "real" text can never actually appear between brackets (neither angle nor square) because of how my forum is used, which simplifies the problem quite a bit.

I'm sorry for my very long question, I hope everything is clear!

Any help is appreciated, thanks to everyone in advance!

Update: Casimir's solution in Python (see below) works just great. Thank you Casimir et Hippolyte!

  • Your first question is fine and researched - good job. Your second regarding how to implement it server side is a bit broad - you may wish to consider editing your post to omit that and focus on a single issue. Commented Mar 7, 2015 at 23:25
  • So text within [b]'s should be eligible for replacement? Is that the rule? Commented Mar 7, 2015 at 23:32
  • Yes, just like "BBCode" in the example. But not inside the tag (for instance if I have: "<a href='www.BBCode.com/'>foo</a>", I want this instance of "BBCode" to stay unaltered). Commented Mar 7, 2015 at 23:36
  • Okay, but sample in the text gets URL'd as well? So, I guess the rule is - either in text, or between [b] ? Or... can it just be generalised to not inside an anchor tag? Commented Mar 7, 2015 at 23:47
  • Exactly. --> Replace text with URL'd text ? "YES <NO>[NO]YES[/NO]</NO> YES": that's what I was trying to say, sorry I wasn't clear enough! Commented Mar 7, 2015 at 23:52

1 Answer

The approach is always the same: first match what you want to avoid.

Example:

(?s)     # dotall mode
(      # capture group 1: everything you want to avoid
    <!--.*?--> # html comment
  |
    <[^>]+> # html tag
  |
    \[[^\]]+\] # bbcode
)
|    # OR
kw1|kw2|kw3|...

Then use a function as the replacement. Inside that function, when capture group 1 is defined, return the match unchanged; otherwise, return the replacement string for the matched keyword.
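
A minimal sketch of this approach in Python, using the kw dictionary and sample text from the question. The names SKIP, decorate and link_keywords are hypothetical, and two details are assumptions added here rather than part of the answer: the \b word boundaries around the keywords, and sorting keywords longest first so that a longer keyword wins over one of its prefixes.

```python
import re

# Keyword -> URL mapping, taken from the question's sample.
kw = {'sample': 'http://www.sample.fr', 'BBCode': 'http://www.bbcode.sp'}

# Group 1 captures what must be left alone:
# HTML comments, HTML tags, then BBCode tags.
SKIP = r'(<!--.*?-->|<[^>]+>|\[[^\]]+\])'

# Assumption: longest keywords first, each one escaped, wrapped in \b
# so that only whole words are matched.
words = '|'.join(re.escape(k) for k in sorted(kw, key=len, reverse=True))
pattern = re.compile(SKIP + r'|\b(?:' + words + r')\b', re.DOTALL)


def decorate(match):
    """Return skipped parts unchanged, and wrap keywords in a BBCode link."""
    if match.group(1) is not None:
        # A tag or comment was matched: give it back as is.
        return match.group(1)
    word = match.group(0)
    return '[url={}]{}[/url]'.format(kw[word], word)


def link_keywords(text):
    """Decorate every keyword found outside of tags and comments."""
    return pattern.sub(decorate, text)


if __name__ == '__main__':
    text = ('This is my sample text.\n'
            'It may contain <a href="http://www.somesite.org/test.htm">HTML tags</a>,\n'
            '[b]BBCode[b],\n'
            'or even <a href="http://www.someothersite.com/">'
            '[b][u]both[/u] intricated[/b]</a>!')
    print(link_keywords(text))
```

Because the alternation tries group 1 first at every position, a keyword that sits inside a tag (for example inside an href value) is consumed as part of that tag and never reaches the keyword branch.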


5 Comments

Thanks for your very interesting answer, indeed this was the first method I tried, but once I have isolated the "unwanted" part of the text, I don't know what to do with it... So you suggest I should define a subclass of re with a method replace that would search the string and replace each match with either the URL'd string if it is in the keywords or itself if it is not?
@ManuelANDIA: re.sub accepts a function as the replacement parameter instead of a string (see stackoverflow.com/questions/2094975/python-re-sub-question ). So all you need to do is return the unwanted part as is when group 1 exists inside the function, something like: if m.group(1): return m.group(1)
Indeed I remember having read this somewhere, thank you very much I'm going to try it ASAP!
Hi @Casimir, I've had some time to try and implement your solution into my code, however there's something I didn't mention because I hadn't thought about it: some words might be encountered in the text, not exactly in the form they have in the list of keywords... For instance I can have in my keywords Fellow and in the text Fella, is it possible to have regex or Python match them together "simply" or is it too difficult? Thank you very much anyway!
@ManuelANDIA: Why don't you create a new entry for Fella with the same url in your dictionary?
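
The suggestion in the last comment could be sketched as follows. The aliases table and the URL are hypothetical, introduced only for illustration; the idea is simply that each variant spelling gets its own dictionary entry pointing at the same URL as the canonical keyword.

```python
# Canonical keyword -> URL (hypothetical URL, for illustration only).
kw = {'Fellow': 'http://www.example.org/fellow'}

# Variant spelling -> canonical keyword (hypothetical alias table).
aliases = {'Fella': 'Fellow'}

# Give each variant its own entry that reuses the canonical keyword's URL.
kw.update({variant: kw[canonical] for variant, canonical in aliases.items()})
```

After this step the keyword pattern is built from the keys of kw as before, so the variants are matched like any other keyword.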
