
I have a list of over 500 strings I need to search for. (They're URLs, if that matters.) I have a web site with over 1,000 web pages. I want to search each of those web pages to find which URLs each links to.

Back when our web site was on a Unix box, I would've written a little shell script using find and grep to accomplish this, but now we're on a Windows machine, so that's not really an option. I've no experience with PowerShell at all, but I suspect this is what I need. However, I've no idea how to even start.

Ideally, what I would like to end up with is something like this:

<filename 1>
    <1st string found>
    <2nd string found>
    <3rd string found>
<filename 2>
    <1st string found>
    <2nd string found>

I don't need to know the line number; I just need to know which URLs are in which files. (We're going to be moving all 500+ target URLs to new locations, so we're going to have to manually update the links in the 1,000+ web pages. It will be a royal pain.)

Presumably the logic would be something like this:

for each file {
    print the filename
    for each string {
        if string found in file {
            print the string
        }
    }
}
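
For comparison, the same loop can be written outside PowerShell. The following is a minimal Python sketch of that logic, assuming Python is available on the machine and the pages are plain files on a local drive. The names `find_urls_in_files` and `print_report`, the `.html`/`.htm` suffix filter, and the lenient UTF-8 decoding are all illustrative assumptions, not part of our setup.

```python
from pathlib import Path


def find_urls_in_files(root, urls, suffixes=(".html", ".htm")):
    """Return {file path: [urls found in that file]} for files under root.

    Hypothetical helper: a plain substring test, no HTML parsing.
    """
    results = {}
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix.lower() not in suffixes:
            continue
        # Read the whole page once, then test every URL against it.
        text = path.read_text(encoding="utf-8", errors="ignore")
        found = [url for url in urls if url in text]
        if found:
            results[str(path)] = found
    return results


def print_report(results):
    """Print each file name followed by its indented list of URLs."""
    for filename, found in results.items():
        print(filename)
        for url in found:
            print("    " + url)
```

Calling `print_report(find_urls_in_files(r"\OurWebDirectory", urls))` with a list of target URLs would print only the files that contain at least one of them, each URL listed once per file.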

We can't do a find/replace directly because the web pages are located in a content management system. All we can do is locate which pages need to be updated (using a static copy of the web pages on a local drive), then manually update the individual pages in the CMS.

I'm hoping this is easy to do, but my complete unfamiliarity with PowerShell means I've no idea where to start. Any help would be greatly appreciated!

Update

Thanks to Travis Plunk for the help! Based upon his answer, here is the final version of the code I'll be using.

# Strings to search for
$strings = @(
    'http://www.ourwebsite.com/directory/somefile.pdf'
    'http://www.ourwebsite.com/otherdirectory/anotherfile.pdf'
    'http://www.otherwebsite.com/directory/otherfile.pdf'
)

# Directory containing web site files
cd \OurWebDirectory

$results = @(foreach($string in $strings)
{
    Write-Host "Searching files for $string"
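    # Excluding the images directory. -Exclude is matched against item names
    # rather than paths, so filter on the full path instead.
    dir . -Recurse | Where-Object { $_.FullName -notlike '*\imagedir\*' } | Select-String -SimpleMatch $string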
}) | Sort-Object -Property path

$results | Group-Object -Property path | %{
    "File: $($_.Name)"
    $_.Group | %{"`t$($_.pattern)"}
}

  • So, are you scraping the end-user-visible page (i.e. the rendered body only) or the full HTML contents? (EDIT: This matters because, for example, you would need to save the full HTML and search all the href attributes.) Commented May 26, 2016 at 21:08
  • Findstr? Commented May 26, 2016 at 21:14
  • I've got local disk access to the HTML files themselves, so no screen scraping or web crawling will be needed. Commented May 27, 2016 at 13:06

2 Answers

This comes very close to what you want.

# Strings to search for
$strings = @(
    'string1'
    'string2'
    )

$results = @(foreach($string in $strings)
    {
        # Be sure to update path to search and file search pattern
        dir .\testdir\*.* -Recurse | Select-String -SimpleMatch $string   
    } 
) | Sort-Object -Property path

$results | Select-Object 'path', 'pattern', 'LineNumber'

Example output

Path                             Pattern LineNumber
----                             ------- ----------
C:\Users\travi\testdir\test1.txt string1          1
C:\Users\travi\testdir\test1.txt string2          2
C:\Users\travi\testdir\test2.txt string1          2
C:\Users\travi\testdir\test2.txt string2          1

You can add 'Line' to the Select-Object statement to print the entire matching line.

To get output a little more like what you asked for, use this code to print the results:

$results | Group-Object -Property path | %{
    "File: $($_.Name)"
    $_.Group | %{"`t$($_.linenumber):$($_.line)"}
}

This will give output like this:

File: C:\Users\travi\testdir\test1.txt
    1:string1
    2:string2
File: C:\Users\travi\testdir\test2.txt
    2:string1
    1:string2
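
The grouping step can also be illustrated outside PowerShell. Below is a minimal Python sketch, assuming the search has already produced (path, pattern) pairs; the names `group_hits` and `format_report` are hypothetical. Unlike Group-Object on its own, it keeps each pattern only once per file, which matters when the same string appears on several lines of one file.

```python
from collections import defaultdict


def group_hits(hits):
    """Group (path, pattern) pairs by path, keeping each pattern once per file."""
    grouped = defaultdict(list)
    for path, pattern in hits:
        if pattern not in grouped[path]:
            grouped[path].append(pattern)
    # Sort by path, mirroring Sort-Object -Property path.
    return dict(sorted(grouped.items()))


def format_report(grouped):
    """Render grouped hits as 'File: <path>' lines followed by tab-indented patterns."""
    lines = []
    for path, patterns in grouped.items():
        lines.append(f"File: {path}")
        lines.extend(f"\t{pattern}" for pattern in patterns)
    return "\n".join(lines)
```

Passing the result of `group_hits` to `format_report` and printing it gives one "File:" line per file with the matched strings indented beneath it.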

2 Comments

This looks promising! I'll give it a try today and let you know how it goes. Thanks!
This looks like it's going to do the trick! I did need to make a change (your version was displaying the line number and the line, whereas what I wanted was the string that was searched for), but I figured that out, so it looks like we're good! I'll edit my question with the final version of the code. Thanks!

Per n00dl3's comment on the OP, findstr is a good solution for this.

From the Command-line reference for findstr:

If you want to search for several different items in the same set of files, create a text file that contains each search criterion on a new line. You can also list the exact files you want to search in a text file. To use the search criteria in the file Finddata.txt, search the files listed in Filelist.txt, and then store the results in the file Results.out, type the following:

findstr /g:finddata.txt /f:filelist.txt > results.out
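
The same two-file idea (one file of search criteria, one file listing the files to search) can be sketched in Python. This is only an illustration under stated assumptions: `read_nonempty_lines` and `search_listed_files` are hypothetical names, both input files are assumed to be UTF-8 text with one entry per line, and the matching is a literal substring test rather than findstr's own matching rules.

```python
from pathlib import Path


def read_nonempty_lines(path):
    """Return the stripped, non-empty lines of a text file."""
    text = Path(path).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]


def search_listed_files(criteria_file, filelist_file):
    """Return (file, criterion) pairs for each literal criterion found in a listed file."""
    criteria = read_nonempty_lines(criteria_file)
    matches = []
    for name in read_nonempty_lines(filelist_file):
        # Read each listed file once and test every criterion against it.
        content = Path(name).read_text(encoding="utf-8", errors="ignore")
        matches.extend((name, c) for c in criteria if c in content)
    return matches
```

Because the result pairs each file with the criterion that matched, it can be grouped per file afterwards, which is closer to the report format asked for in the question than a flat list of matching lines.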
