PowerShell 7.x: How to Select a Text Substring of Unknown Length Using Only Boundary Substrings

Question

I am trying to extract and store part of a text file: a substring defined by a known beginning and a known end. I am new to PowerShell, so my methods are simple and crude. Basically, my approach has been:

Roughly get what I want from the start of the string
Worry about trimming off what I don't want later

My minimal reproducible example is as follows:

# selectStringTest.ps    
         
$inputFile = Get-Content -Path "C:\test\test3\Copy of 31832_226140__0001-00006.txt"

#  selected text string needs to span from $refName up to $boundaryName 
[string]$refName = "001 BARTLETT"
[string]$boundaryName = "001 BEECH"

# a rough estimate of the text file lines required
[int]$lines = 200
   
if (Select-String  -InputObject $inputFile -pattern $refName) {
    Write-Host "Selected shortened string found!"
    # this selects the start of required string but with extra text
    [string]$newFileStart = $inputFile | Select-String $refName -CaseSensitive -SimpleMatch -Context 0, $lines   
}
else {
    Write-Host "Selected string NOT FOUND."
}
# tidy up the start of the string by removing rubbish
$newFileStart = $newFileStart.TrimStart('> ')

# this is the kind of thing I want but it doesn't work
$newFileStart = $newFileStart - $newFileStart.StartsWith($boundaryName)

$newFileStart | Out-File tempOutputFile

As it is, the output begins correctly, but I cannot remove the text from $boundaryName onward.

The original text file was generated by OCR (Optical Character Recognition), so it is unevenly formatted, with newlines in odd places. That limits my options when it comes to delimiting.

I am not sure that my if (Select-String -InputObject $inputFile -pattern $refName) test is valid, although it appears to work correctly. The general design also seems crude, in that I am guessing how many lines I will need. Finally, I have tried various methods of trimming the string from $boundaryName onward without success. For this:

string.split() is not practical
Replacing spaces with newlines to build an array and looping through to the elements of $boundaryName is possible, but I don't know how to terminate the array at that point before converting it back to a string.

Any suggestions would be appreciated.

Abbreviated content of a single file, Copy of 31832_226140__0001-00006.txt, which holds two sets of 200 listings:

Beginning of text file

________________

BARTLETT-BEDGGOOD
PENCARROW COMPOSITE ROLL
PAGE 6
PAGE 7
PENCARROW COMPOSITE ROLL
BEECH-BEST
www.
.......................
001 BARTLETT. Lois Elizabeth

Middle of text file

............. 15 St Ronans Av. Lower Hutt Marned 200 BEDGGOOD. Percy Lloyd
............15 St Ronans Av, Lower Mutt. Coachbuild
001 BEECH, Margaret ..........

End of text file

..............312 Munita Rood Eastbourne, Civil Eng 200 BEST, Dons Amy .........
..........50 Man Street, Wamuomata, Marned
SO NON

Each text file is approximately 30 KB and spans about 250 lines in Notepad++. I will add a 'sample', that is, an edited start-to-end excerpt.
– Dave, Commented Feb 22, 2022 at 21:28
Yeah, not asking for the complete file, just a representation of how the file looks and what you would like to have as a result.
– Santiago Squarzon, Commented Feb 22, 2022 at 21:29
Just to confirm I understood correctly: you're looking to extract all the text between 001 BARTLETT and 001 BEECH? And if so, do you want to include or exclude those key words?
– Santiago Squarzon, Commented Feb 22, 2022 at 22:10
Should the boundary strings always begin at the beginning of a line? Should the lines containing the boundary strings be included in the output?
– lit, Commented Feb 22, 2022 at 22:13

lit · Accepted Answer · 2022-02-23 00:03:18Z

1

To use a regex across newlines, the file needs to be read as a single string; Get-Content -Raw will do that. This assumes that you do not want the lines containing $refName and $boundaryName included in the output.

$c = Get-Content -Path '.\beech.txt' -Raw
$refName = "001 BARTLETT"
$boundaryName = "001 BEECH"

if ($c -match "(?smi).*$refName.*?`r`n(.*)$boundaryName.*?`r`n.*") {
    $result = $Matches[1]
}
$result

More information at https://stackoverflow.com/a/12573413/447901
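
A hedged variation on the above, for cases where the OCR output mixes CRLF and LF line endings or the boundary strings contain regex metacharacters. It is a sketch, not the answer's original code: the sample text in $text is made up for illustration, and the pattern assumes the end boundary string does not occur earlier inside the wanted span.

```powershell
# Hypothetical sample text standing in for the OCR file; note the mixed line endings.
$text = "header`r`n001 BARTLETT. Lois Elizabeth`nline A`r`nline B`n001 BEECH, Margaret`r`ntrailer"

$refName      = '001 BARTLETT'
$boundaryName = '001 BEECH'

# [regex]::Escape makes any regex metacharacters in the boundary strings literal.
$startRx = [regex]::Escape($refName)
$endRx   = [regex]::Escape($boundaryName)

# (?s) lets '.' match newlines; (?i) ignores case.
# [^\n]*\n skips the rest of the start line (works for both CRLF and LF);
# (.*?) lazily captures everything up to the first occurrence of the end boundary.
$pattern = '(?si){0}[^\n]*\n(.*?){1}' -f $startRx, $endRx

$result = if ($text -match $pattern) { $Matches[1] } else { $null }
$result
```

With a real file, $text would come from Get-Content -Raw as in the answer. The lazy (.*?) is what keeps the capture from running on to a later occurrence of the end boundary.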

edited Feb 23, 2022 at 0:03

answered Feb 22, 2022 at 22:22

lit


6 Comments

Dave Over a year ago

Great! Thanks. This works. The only negative is that it includes the boundary listing.

Dave Over a year ago

Thanks for the SO reference post. Your regex was confusing me. The post will help understanding.

Dave Over a year ago

Your change fixed the output end but screwed up the output start. Don't worry. You've been a great help. I have to dig into the details more. So I'll be able to correct it at some stage. Thanks.

lit Over a year ago

@Dave, are you wanting the text on the same line after the starting boundary to be in the output?

lit Over a year ago

Should the first line of output be 001 BARTLETT. Lois Elizabeth .......? That is easy enough to get in.
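
If the start line should be kept, one possible approach (a sketch, not the answer's posted code) is to let the capture group begin at the start boundary and stop with a lookahead just before the end boundary. The sample text in $text is hypothetical.

```powershell
# Hypothetical sample text; the real input would come from Get-Content -Raw.
$text = "PAGE 6`r`n001 BARTLETT. Lois Elizabeth`r`nline A`r`n001 BEECH, Margaret`r`nSO NON"

# Escape the boundary strings so they are matched literally.
$startRx = [regex]::Escape('001 BARTLETT')
$endRx   = [regex]::Escape('001 BEECH')

# (?s): '.' matches newlines; (?i): ignore case.
# Group 1 starts at the start boundary; (?=...) stops just before
# the end boundary without consuming it.
$pattern = '(?si)({0}.*?)(?={1})' -f $startRx, $endRx

$m = [regex]::Match($text, $pattern)
$result = if ($m.Success) { $m.Groups[1].Value } else { $null }
$result
```

Here the captured text begins with the 001 BARTLETT line and ends at the line break before 001 BEECH.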


Darin · Accepted Answer · 2022-02-23 03:04:48Z

1

How close does this come to what you want?

function Process-File {
    param (
        [Parameter(Mandatory = $true, Position = 0)]
        [string]$HeadText,
        [Parameter(Mandatory = $true, Position = 1)]
        [string]$TailText,
        [Parameter(ValueFromPipeline)]
        $File
    )
    Process {
        $Inside = $false;
        switch -Regex -File $File.FullName {
            #'^\s*$' { continue }
            "(?i)^\s*$TailText(?<Tail>.*)`$"    { $Matches.Tail; $Inside = $false }
            '^(?<Line>.+)$'                     { if($Inside) { $Matches.Line } }
            "(?i)^\s*$HeadText(?<Head>.*)`$"    { $Matches.Head; $Inside = $true }
            default { continue }
        }
    }
}
$File = 'Copy of 31832_226140__0001-00006.txt'
#$Path = $PSScriptRoot
$Path = 'C:\test\test3'

$Result = Get-ChildItem -Path "$Path\$File" | Process-File '001 BARTLETT' '001 BEECH'
$Result | Out-File -FilePath "$Path\SpanText.txt"

This is the output:

. Lois Elizabeth
............. 15 St Ronans Av. Lower Hutt Marned 200 BEDGGOOD. Percy Lloyd
............15 St Ronans Av, Lower Mutt. Coachbuild
, Margaret ..........

edited Feb 23, 2022 at 3:04

answered Feb 22, 2022 at 22:33

Darin


8 Comments

Darin Over a year ago

Some notes on this script. You can get rid of the tail line completely (, Margaret ..........) by removing the "$Matches.Tail;". The "." in front of "Lois Elizabeth" can be removed easily; it probably needs something like ([.]\s)? inserted, but I'm not sure without experimenting. I believe blank lines are skipped but lines containing only spaces are kept; that can easily be changed either way, to remove the space-only lines or to keep all lines. Just let me know and I should be able to make the changes.

Dave Over a year ago

Great! Thanks. It basically works to requirement. I like your code layout. I take it your approach is replicating UNIX head/tail functionality. So I will take this into account when making changes.

Dave Over a year ago

One thing that puzzles me in your code is how to swap out the hard-coded 001 BEECH and 001 BARTLETT for regex-escaped variables like $pattern1 and $pattern2?

Darin Over a year ago

Dave, you have to be careful doing that. I don't think I've ever tried that, but my first approach would be to replace '(?i)^\s*001 BEECH(?<Tail>.*)$' with '(?i)^\s*'+$Pattern1+'(?<Tail>.*)$' and see if that works. The alternate approach would be to replace single quotes 'RegEx' with double quotes "RegEx", but you really have to make sure you know how the double quotes are going to react to each character in the string. It is possible that "(?i)^\s*$pattern1(?<Tail>.*)$" will work. I will have to experiment with that.

Darin Over a year ago

Dave, it worked! Process-File accepts two parameters now, $HeadText and $TailText. Each is placed in the RegEx to give the function new flexibility.
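
On the regex-escaping question, a minimal sketch of one way to do it: build each pattern by concatenating a single-quoted regex fragment with [regex]::Escape of the variable, then use the resulting variables as switch -Regex conditions. This is an assumption-laden variant rather than the posted function: the name Select-SpanLine and the sample lines are hypothetical, it reads from an array instead of a file, and it drops both boundary lines entirely instead of emitting their remainders.

```powershell
function Select-SpanLine {   # hypothetical helper name
    param(
        [Parameter(Mandatory)] [string]   $HeadText,
        [Parameter(Mandatory)] [string]   $TailText,
        [Parameter(Mandatory)] [string[]] $Line
    )
    # Single-quoted fragments keep '^' and '\s' untouched by PowerShell;
    # [regex]::Escape makes the caller's text literal inside the regex.
    $headRx = '^\s*' + [regex]::Escape($HeadText)
    $tailRx = '^\s*' + [regex]::Escape($TailText)

    $inside = $false
    # switch -Regex is case-insensitive unless -CaseSensitive is added.
    switch -Regex ($Line) {
        $tailRx { $inside = $false; continue }   # end boundary: stop emitting
        $headRx { $inside = $true;  continue }   # start boundary: begin emitting
        default { if ($inside) { $_ } }          # emit lines between the boundaries
    }
}

# Hypothetical sample lines loosely modelled on the question's excerpt.
$lines  = 'PAGE 6', '001 BARTLETT. Lois Elizabeth', 'line A', 'line B', '001 BEECH, Margaret', 'SO NON'
$result = Select-SpanLine -HeadText '001 BARTLETT' -TailText '001 BEECH' -Line $lines
$result
```

Because the variables are escaped before they reach the pattern, characters such as "." or "(" in the boundary text are matched literally rather than as regex syntax.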
