I am trying to extract a substring from a text file, delimited by a known start string and a known end string. I am new to PowerShell, so my methods are simple and crude. My approach so far has been:

  1. Roughly get what I want from the start of the string
  2. Worry about trimming off what I don't want later

My minimal reproducible example is as follows:

# selectStringTest.ps    
         
$inputFile = Get-Content -Path "C:\test\test3\Copy of 31832_226140__0001-00006.txt"

#  selected text string needs to span from $refName up to $boundaryName 
[string]$refName = "001 BARTLETT"
[string]$boundaryName = "001 BEECH"

# a rough estimate of the text file lines required
[int]$lines = 200
   
if (Select-String  -InputObject $inputFile -pattern $refName) {
    Write-Host "Selected shortened string found!"
    # this selects the start of required string but with extra text
    [string]$newFileStart = $inputFile | Select-String $refName -CaseSensitive -SimpleMatch -Context 0, $lines   
}
else {
    Write-Host "Selected string NOT FOUND."
}
# tidy up the start of the string by removing rubbish
$newFileStart = $newFileStart.TrimStart('> ')

# this is the kind of thing I want but it doesn't work
$newFileStart = $newFileStart - $newFileStart.StartsWith($boundaryName)

$newFileStart | Out-File tempOutputFile

As it stands, the output begins correctly, but I cannot remove the text from $boundaryName onward.

The original text file is OCR-generated (optical character recognition), so it is unevenly formatted, with newlines in odd places. That limits my options for delimiting.

I am not sure my if (Select-String -InputObject $inputFile -pattern $refName) is valid, although it appears to work correctly. The general design seems crude, in that I am guessing how many lines I will need. Finally, I have tried various methods of trimming the string from $boundaryName without success. For this:

  • string.split() not practical
  • replacing spaces with newlines in an array & looping through to elements of $boundaryName is possible but I don't know how to terminate the array at this point before returning it to string.
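For illustration only, cutting one string at the boundary can be sketched with IndexOf and Substring. This assumes the text is held as a single string (for example, read with Get-Content -Raw) rather than as an array of lines, and the sample text below is made up to stand in for the OCR file.

```powershell
# Hypothetical sample text; in practice this might come from Get-Content -Raw.
$text = "001 BARTLETT. Lois Elizabeth`n... 15 St Ronans Av`n001 BEECH, Margaret`n... 50 Man Street"

$refName      = '001 BARTLETT'
$boundaryName = '001 BEECH'

# Ordinal comparison gives an exact, case-sensitive search.
$startIndex = $text.IndexOf($refName, [System.StringComparison]::Ordinal)
$trimmed = $null
if ($startIndex -ge 0) {
    # Look for the boundary only after the start marker.
    $endIndex = $text.IndexOf($boundaryName, $startIndex + $refName.Length, [System.StringComparison]::Ordinal)
    if ($endIndex -ge 0) {
        # Keep everything from the start marker up to, but not including, the boundary.
        $trimmed = $text.Substring($startIndex, $endIndex - $startIndex)
    }
}
$trimmed
```

With this approach no line count has to be guessed, since the cut points come from the two markers themselves.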

Any suggestions would be appreciated.

Abbreviated content of a single file, Copy of 31832_226140__0001-00006.txt (2 x 200 listings), is:

Beginning of text file

________________

BARTLETT-BEDGGOOD
PENCARROW COMPOSITE ROLL
PAGE 6
PAGE 7
PENCARROW COMPOSITE ROLL
BEECH-BEST
www.
.......................
001 BARTLETT. Lois Elizabeth

Middle of text file

............. 15 St Ronans Av. Lower Hutt Marned 200 BEDGGOOD. Percy Lloyd
............15 St Ronans Av, Lower Mutt. Coachbuild
001 BEECH, Margaret ..........

End of text file

..............312 Munita Rood Eastbourne, Civil Eng 200 BEST, Dons Amy .........
..........50 Man Street, Wamuomata, Marned
SO NON
6
  • Could you share an example of the file you want to parse? Commented Feb 22, 2022 at 21:10
  • Each text file is approximately 30KB. They span 250 lines in Notepad++. I will add a 'sample', that is, edited start text to end text. Commented Feb 22, 2022 at 21:28
  • Yeah, not asking for the complete file, just a representation of how the file looks and what you would like to have as a result. Commented Feb 22, 2022 at 21:29
  • Just to confirm I understood correctly, you're looking to extract all the text between 001 BARTLETT and 001 BEECH? And if so, do you want to include or exclude those key words? Commented Feb 22, 2022 at 22:10
  • Should the boundary strings always begin at the beginning of a line? Should the lines containing the boundary strings be included in the output? Commented Feb 22, 2022 at 22:13

2 Answers

To use a regex across newlines, the file needs to be read as a single string. Get-Content -Raw will do that. This assumes that you do not want the lines containing $refName and $boundaryName included in the output.

$c = Get-Content -Path '.\beech.txt' -Raw
$refName = "001 BARTLETT"
$boundaryName = "001 BEECH"

if ($c -match "(?smi).*$refName.*?`r`n(.*)$boundaryName.*?`r`n.*") {
    $result = $Matches[1]
}
$result

More information at https://stackoverflow.com/a/12573413/447901
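
As a hedged variation, not the regex above: if the marker strings might contain regex metacharacters, or the file might use LF rather than CRLF line endings, something along these lines could be tried. The sample text and the names $startRx and $endRx are made up for illustration. Unlike the code above, this sketch keeps the line containing the start marker and drops the whole line containing the boundary.

```powershell
# Hypothetical sample standing in for the OCR text (normally: Get-Content -Raw).
$text = @(
    'www.'
    '001 BARTLETT. Lois Elizabeth'
    '... 15 St Ronans Av. Lower Hutt'
    '001 BEECH, Margaret ...'
    '... 50 Man Street'
) -join "`r`n"

$refName      = '001 BARTLETT'
$boundaryName = '001 BEECH'

# Escape the markers so characters such as '.' or '(' are matched literally.
$startRx = [regex]::Escape($refName)
$endRx   = [regex]::Escape($boundaryName)

# (?s) lets '.' cross newlines; the lazy '.*?' stops at the first boundary;
# the lookahead keeps the boundary line out of the match; \r?\n accepts CRLF or LF.
$pattern = "(?s)$startRx.*?(?=\r?\n[^\r\n]*$endRx)"

$result = $null
if ($text -cmatch $pattern) {   # -cmatch is the case-sensitive form of -match
    $result = $Matches[0]
}
$result
```

The lazy quantifier is the main design choice here: a greedy `.*` would run on to the last occurrence of the boundary in the file instead of the first.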

6 Comments

Great! Thanks. This works. The only negative is that it includes the boundary listing.
Thanks for the SO reference post. Your regex was confusing me. The post will help understanding.
Your change fixed the output end but screwed up the output start. Don't worry. You've been a great help. I have to dig into the details more. So I'll be able to correct it at some stage. Thanks.
@Dave, are you wanting the text on the same line after the starting boundary to be in the output?
Should the first line of output be 001 BARTLETT. Lois Elizabeth .......? That is easy enough to get in.

How close does this come to what you want?

function Process-File {
    param (
        [Parameter(Mandatory = $true, Position = 0)]
        [string]$HeadText,
        [Parameter(Mandatory = $true, Position = 1)]
        [string]$TailText,
        [Parameter(ValueFromPipeline)]
        $File
    )
    Process {
        $Inside = $false;
        switch -Regex -File $File.FullName {
            #'^\s*$' { continue }
            "(?i)^\s*$TailText(?<Tail>.*)`$"    { $Matches.Tail; $Inside = $false }
            '^(?<Line>.+)$'                     { if($Inside) { $Matches.Line } }
            "(?i)^\s*$HeadText(?<Head>.*)`$"    { $Matches.Head; $Inside = $true }
            default { continue }
        }
    }
}
$File = 'Copy of 31832_226140__0001-00006.txt'
#$Path = $PSScriptRoot
$Path = 'C:\test\test3'

$Result = Get-ChildItem -Path "$Path\$File" | Process-File '001 BARTLETT' '001 BEECH'
$Result | Out-File -FilePath "$Path\SpanText.txt"

This is the output:

. Lois Elizabeth
............. 15 St Ronans Av. Lower Hutt Marned 200 BEDGGOOD. Percy Lloyd
............15 St Ronans Av, Lower Mutt. Coachbuild
, Margaret ..........

8 Comments

Some notes on this script. You can drop the tail line (, Margaret ..........) completely by removing the "$Matches.Tail;". The "." in front of "Lois Elizabeth" can also be removed easily, probably by inserting something like ([.]\s)? into the pattern, but I am not sure without experimenting. I believe blank lines are skipped while lines containing only spaces are kept; that can be changed easily either way (remove whitespace-only lines, or keep all lines). Just let me know and I should be able to make the changes.
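A rough sketch of the punctuation clean-up mentioned in that note, done as a separate -replace step rather than inside the switch patterns. The variable names are hypothetical, and it assumes only a single leading "." or "," needs to go.

```powershell
# Hypothetical remainders, as produced after the head/tail markers are stripped.
$remainders = '. Lois Elizabeth', ', Margaret ..........'

# Drop one leading '.' or ',' plus any whitespace that follows it.
$cleaned = $remainders | ForEach-Object { $_ -replace '^[.,]\s*', '' }
$cleaned
```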
Great! Thanks. It basically works to requirement. I like your code layout. I take it your approach is replicating UNIX head/tail functionality. So I will take this into account when making changes.
One thing that puzzles me, in your code, is how to swap out hard-coded 001 BEECH & 001 BARTLETT for regex escaped variable like $pattern1 & $pattern2??
Dave, you have to be careful doing that. I don't think I've ever tried that, but my first approach would be to replace '(?i)^\s*001 BEECH(?<Tail>.*)$' with '(?i)^\s*'+$Pattern1+'(?<Tail>.*)$' and see if that works. The alternate approach would be to replace single quotes 'RegEx' with double quotes "RegEx", but you really have to make sure you know how the double quotes are going to react to each character in the string. It is possible that "(?i)^\s*$pattern1(?<Tail>.*)$" will work. I will have to experiment with that.
Dave, it worked! Process-File accepts two parameters now, $HeadText and $TailText. Each are placed in the RegEx to give the function new flexibility.
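For readers wondering about the regex-escaped variable case raised in these comments, here is a minimal sketch of one way to splice a variable into such a pattern. The marker value and the names $pattern1, $escaped and $tailRegex are made up; [regex]::Escape is used so that any metacharacters in the marker are matched literally.

```powershell
# Hypothetical marker containing regex metacharacters, to show why escaping matters.
$pattern1 = '001 BEECH (jr.)'
$escaped  = [regex]::Escape($pattern1)

# Single-quoted pieces keep '$' and '\' literal; only the variable is spliced in.
$tailRegex = '(?i)^\s*' + $escaped + '(?<Tail>.*)$'

$lines = '  001 beech (jr.), Margaret', '001 BEECH jr, Margaret'
$tails = foreach ($line in $lines) {
    # Only the first line contains the literal text "(jr.)".
    if ($line -match $tailRegex) { $Matches.Tail }
}
$tails
```

Concatenating single-quoted fragments avoids having to think about how double quotes treat `$` and the backtick, at the cost of a slightly longer expression.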