I am trying to migrate old blog posts (from WordPress) to a new platform. One of the steps works like this:

  1. Get the full_text of each post
  2. Search for the full path/URL of old images (say https://stackoverflow.com/uploads/logo.png or just uploads/logo.png)
  3. Extract/save the images and get the guid() of the new images
  4. Replace the old path https://stackoverflow.com/uploads/logo.png with the new one (say https://quora.com/media/brand123.png)

I tried a regular expression to search for the old URLs: /(http:\/\/stackoverflow\.com\/uploads\/)+(.*?)[a-zA-Z0-9]+(\.jpg|\.png|\.gif)/

And then tried:

$old = array();
$pattern = "/(https:|http:\/\/stackoverflow\.com\/uploads\/)+(.*?)[a-zA-Z0-9]+(\.jpg|\.png|\.gif)/";
$text = "orem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor <img src='https://stackoverflow.com/uploads/image1.png'/> rem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor <img src='https://stackoverflow.com/uploads/image2.png'/>";

// search for and collect the old URLs
preg_match_all($pattern, $text, $old);

But it gives me something like this:

array(4) {
  [0]=>
  array(2) {
    [0]=>
    string(44) "https://stackoverflow.com/uploads/image1.png"
    [1]=>
    string(44) "https://stackoverflow.com/uploads/image2.png"
  }
  [1]=>
  array(2) {
    [0]=>
    string(6) "https:"
    [1]=>
    string(6) "https:"
  }
  [2]=>
  array(2) {
    [0]=>
    string(28) "//stackoverflow.com/uploads/"
    [1]=>
    string(28) "//stackoverflow.com/uploads/"
  }
  [3]=>
  array(2) {
    [0]=>
    string(4) ".png"
    [1]=>
    string(4) ".png"
  }
}

1 Answer

I think this regex will do the job a bit better:

#\b((?:https?://stackoverflow\.com/)?uploads/(.*?\.(?:jpg|png|gif)))\b#

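I've simplified yours a bit (e.g. replacing https:|http: with https?:) and removed what seems to be an unnecessary [a-zA-Z0-9]+. I've also improved the grouping, making some groups non-capturing. In your pattern the | splits the entire first group into the alternatives https: and http:\/\/stackoverflow\.com\/uploads\/, which is why group 1 captured only https: and the rest of the path ended up in group 2.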
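
To make the grouping difference concrete, here is a small sketch that runs both patterns against one URL from the question. The variable names $url, $loose and $tight are only illustrative.

```php
<?php
$url = 'https://stackoverflow.com/uploads/image1.png';

// Original pattern: "|" splits the whole first group, so its alternatives are
// "https:" and "http://stackoverflow.com/uploads/".
preg_match(
    '/(https:|http:\/\/stackoverflow\.com\/uploads\/)+(.*?)[a-zA-Z0-9]+(\.jpg|\.png|\.gif)/',
    $url,
    $loose
);

// Revised pattern: only the "s" is optional, and (?:...) groups are not captured.
preg_match(
    '#\b((?:https?://stackoverflow\.com/)?uploads/(.*?\.(?:jpg|png|gif)))\b#',
    $url,
    $tight
);

print_r($loose); // group 1 is just "https:", group 2 holds the rest of the path
print_r($tight); // group 1 is the full URL, group 2 is the file name
```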

The new code (note I added an extra image reference for testing):

$old = array();
$pattern = "#\b((?:https?://stackoverflow\.com/)?uploads/(.*?\.(?:jpg|png|gif)))\b#";
$text = "orem uploads/xyx.gif ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor <img src='https://stackoverflow.com/uploads/image1.png'/> rem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor <img src='https://stackoverflow.com/uploads/image2.png'/>";

// search for and collect the old URLs
preg_match_all($pattern, $text, $old);
print_r($old);

Output:

Array
(
    [0] => Array
        (
            [0] => uploads/xyx.gif
            [1] => https://stackoverflow.com/uploads/image1.png
            [2] => https://stackoverflow.com/uploads/image2.png
        )

    [1] => Array
        (
            [0] => uploads/xyx.gif
            [1] => https://stackoverflow.com/uploads/image1.png
            [2] => https://stackoverflow.com/uploads/image2.png
        )

    [2] => Array
        (
            [0] => xyx.gif
            [1] => image1.png
            [2] => image2.png
        )

)
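
With the matches in hand, the swap described in the question could be done in one pass with preg_replace_callback. This is a minimal sketch under assumptions, not a definitive implementation: newUrlFor() is a hypothetical stand-in for the step that saves the image on the new platform and returns its URL, the https://quora.com/media/ prefix comes from the question's example, and the sample text is shortened.

```php
<?php
// Hypothetical helper: a real migration would save the image here and
// return the URL the new platform assigns to it.
function newUrlFor(string $fileName): string
{
    return 'https://quora.com/media/' . $fileName;
}

$pattern = '#\b((?:https?://stackoverflow\.com/)?uploads/(.*?\.(?:jpg|png|gif)))\b#';
$text = "before <img src='https://stackoverflow.com/uploads/image1.png'/> middle uploads/xyx.gif after";

// $m[0] is the whole old path or URL; $m[2] is the file name from group 2.
$migrated = preg_replace_callback(
    $pattern,
    function (array $m): string {
        return newUrlFor($m[2]);
    },
    $text
);

echo $migrated, PHP_EOL;
```

Doing the replacement inside the callback avoids a second search over the text, since each old URL is swapped at the moment it is matched.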

If you want to insist that image names contain only [a-zA-Z0-9], change the .*? to [a-zA-Z0-9]+, i.e.

$pattern = "#\b((?:https?://stackoverflow\.com/)?uploads/([a-zA-Z0-9]+\.(?:jpg|png|gif)))\b#";
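
As a rough illustration of how the two variants differ, the sketch below runs both on a made-up sample string; the file name my-photo.jpg is a hypothetical example chosen because it contains a hyphen.

```php
<?php
$lenient = '#\b((?:https?://stackoverflow\.com/)?uploads/(.*?\.(?:jpg|png|gif)))\b#';
$strict  = '#\b((?:https?://stackoverflow\.com/)?uploads/([a-zA-Z0-9]+\.(?:jpg|png|gif)))\b#';

// Hypothetical sample: one plain name and one name containing a hyphen.
$text = 'uploads/logo.png and uploads/my-photo.jpg';

preg_match_all($lenient, $text, $a);
preg_match_all($strict, $text, $b);

print_r($a[2]); // file names found by the lenient pattern
print_r($b[2]); // file names found by the strict pattern
```

The strict pattern skips my-photo.jpg because of the hyphen. The lenient one accepts it, but since . matches any character except a newline, .*? can also run across spaces and quotes until it reaches an extension.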