1

I have a series of URLs in a web doc, something like this:

<a href="somepage.php?x=some_document.htm">click here</a>

What I want to do is replace the some_document.htm part:

<a href="somepage.php?x=some_document.htm">click here</a>

.. with some sort of encrypted variation (let's just say base64 encoding), something like this:

for each match, turn it into base64_encode(match)

Notes:

1. The phrase href="somepage.php?x= will always precede the phrase.
2. A double-quote (") will always follow the phrase.

2
  • Depending on how you want to run the replacement, different strategies might be appropriate. From what you write, it seems you have the document as a string somewhere and don't generate it yourself? The best solution would obviously be to handle the encoding/encryption when generating the document. Commented May 22, 2009 at 14:49
  • It is never appropriate to edit a question to add "the solution" to the question body. Resolving advice is to be posted as an answer. Commented Jun 6, 2024 at 10:11

5 Answers

6

I think you are looking for something like this:

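function doSomething($matches) {
   // The return value replaces the entire match, so restore the prefix and the closing quote.
   return 'href="somepage.php?x=' . base64_encode($matches[1]) . '"';
}

// Escape the literal "." and "?" so they are not treated as regex metacharacters.
$webdoc = preg_replace_callback('/href="somepage\.php\?x=([^"]+)"/', 'doSomething', $webdoc);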

The preg_replace answer works similarly. If you want to do something more elaborate, the callback would allow you to do that.
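
As a minimal sketch of the same idea with an anonymous function (PHP 5.3+): the callback's return value replaces the whole match, so this variant captures the fixed prefix in its own group and puts it back. The sample string in $webdoc is a hypothetical stand-in for the real document.

```php
<?php
// Hypothetical sample input; $webdoc stands in for the real document string.
$webdoc = '<a href="somepage.php?x=some_document.htm">click here</a>';

// Group 1 captures the fixed prefix, group 2 the value to encode.
// "." and "?" are escaped so they match literally.
$pattern = '/(href="somepage\.php\?x=)([^"]+)"/';

// The callback's return value replaces the entire match, so the prefix
// and the closing quote are restored explicitly.
$result = preg_replace_callback($pattern, function ($matches) {
    return $matches[1] . base64_encode($matches[2]) . '"';
}, $webdoc);

echo $result, "\n";
```

Base64 output can contain +, / and =, so if the receiving page reads the value from the query string, wrapping the encoded value in rawurlencode() is probably safer.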


4 Comments

I must be doing something wrong - can't get this to work. When I try it, it replaces the entire phrase with the encoded value instead of just the some_document.htm part.
Starting from PHP 5.3.0 you can use an anonymous function instead of 'doSomething'. Read more at php.net/manual/functions.anonymous.php
Chris - figured out the missing piece from your suggestion and updated my orig post with the solution which is derived from your idea - thanks!
Sorry about that, I didn't actually test the snippet. Checking the documentation revealed that the return value of the callback function will replace the entire matched string, not just the matched element, as you already figured out. I am glad I could help you to get it to work.
2

I would consider using the PHP DOM parser. Anything less is a hack. (Not that hacks are always bad, just know the difference between a simple regex and a DOM parser.) getElementsByTagName() will get your <a> tags, getAttribute() will get your href attributes, and setAttribute() modifies.
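
A rough sketch of how those three calls could fit together, assuming the document is available as a string. The sample markup in $html and the $prefix variable are illustrative only; libxml_use_internal_errors() is used here to keep warnings about sloppy markup quiet.

```php
<?php
// Hypothetical sample input standing in for the real document.
$html = '<p>Intro<a href="somepage.php?x=some_document.htm">click here</a>'
      . '<a href="other.php?y=1">untouched</a>';

// Prefix that identifies the links to rewrite (taken from the question).
$prefix = 'somepage.php?x=';

// Collect parse warnings internally instead of printing them;
// real-world HTML is often not well-formed.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

foreach ($doc->getElementsByTagName('a') as $anchor) {
    $href = $anchor->getAttribute('href');
    // Only rewrite links that start with the known prefix.
    if (strpos($href, $prefix) === 0) {
        $value = substr($href, strlen($prefix));
        $anchor->setAttribute('href', $prefix . base64_encode($value));
    }
}

// saveHTML() serializes the whole document, including the html/body
// wrappers that loadHTML() adds around a fragment.
$output = $doc->saveHTML();
echo $output;
```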

5 Comments

I think he's asking about doing this server side.
Yeah, you'll note that I linked to the PHP DOM parser.
That's good if the entire web doc is DOM-compatible. If it has been obtained from a remote server, are you sure that the people on that side wrote their HTML well? E.g., transitional HTML lets you leave some tags unclosed (<p>, etc.), which is AFAIK not DOM-compatible...
Oops. Excuse me for the confusing comment. I checked, and DOMDocument + SimpleXML works well if you set $doc->strictErrorChecking = false; Well... I had no idea that it works so well. Thanks to Adam.
@Adam, you linked to the PHP DOM parser after I posted my comment.
1

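preg_replace('/href="somepage\.php\?x=([^"]*)"/e', "'href=\"somepage.php?x='.base64_encode('$1').'\"'", $url)

(not tested). The /e modifier means you can use an expression in the replacement string. Note that /e was deprecated in PHP 5.5 and removed in PHP 7.0, so on current PHP versions preg_replace_callback is the way to do this.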

3 Comments

The replacement pattern will be passed to eval() as a whole, and [somepage.php?x=...] is not valid PHP.
using this, but halts script (any ideas?): $html = preg_replace('/href="document.php\?x=([^"]*)"/e', "href=\"document.php?x=" . base64_encode("$1") . '"', $html);
Yes, I missed out a ' at the start of the replacement text. I said it was untested ;-)
1

It seems like you might be conflating a multi-step task, which may ultimately create more trouble in the long run. You'd basically like to do three things:

  1. Find all anchor tags on a page
  2. Extract the URL in the href attribute from these tags
  3. Extract a specific variable in the query string from that URL

There are a number of ways to do this in PHP. Yes, one direct way is using a regular expression, but it's less transparent. For this particular case, you're fitting the code very tightly to one small problem, which reduces how well it scales to future applications.

My suggestion is to use a lightweight DOM parser available from SourceForge called SimpleHTMLDom. Using this parser, you can write much clearer code for the task you're undertaking.

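// $dom_object is assumed to come from SimpleHTMLDom, e.g. str_get_html($html).
foreach ($dom_object->find('a') as $anchor){
    $url = $anchor->href;
    $queryArray = array();
    parse_str(parse_url($url, PHP_URL_QUERY), $queryArray);
    $myVariable = $queryArray['x']; // value of the "x" query parameter
}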

And then of course $myVariable will be the value you're looking to get with that regex.
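
Once the value has been extracted, the URL still needs to be rebuilt with the encoded value. A minimal sketch using only PHP built-ins follows; encodeQueryParam() is a hypothetical helper, not part of SimpleHTMLDom, and it assumes relative URLs like the one in the question (scheme and host are not preserved).

```php
<?php
// Hypothetical helper: base64-encode one query parameter of a relative URL
// and rebuild the URL. Scheme, host and fragment are NOT preserved.
function encodeQueryParam($url, $name)
{
    $path  = parse_url($url, PHP_URL_PATH);
    $query = parse_url($url, PHP_URL_QUERY);

    // Split the query string into an associative array.
    $params = array();
    if (is_string($query)) {
        parse_str($query, $params);
    }

    if (!isset($params[$name])) {
        return $url; // nothing to rewrite
    }

    $params[$name] = base64_encode($params[$name]);

    // http_build_query() URL-encodes values, so "=" padding becomes %3D.
    return $path . '?' . http_build_query($params);
}

echo encodeQueryParam('somepage.php?x=some_document.htm', 'x'), "\n";
```

Inside the loop above, the result could then be written back with something like $anchor->href = encodeQueryParam($anchor->href, 'x'), assuming the parser allows assigning to attributes.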


0

Regexes are fundamentally bad at parsing HTML (see Can you provide some examples of why it is hard to parse XML and HTML with a regex? for why). What you need is an HTML parser. See Can you provide an example of parsing HTML with your favorite parser? for examples using a variety of parsers.

