Regex find string between first string and the first instance of another string using Notepad++

Question

Sorry - I've edited this for clarity (and I've tried to remove the bold later in the post, but it's not going away...the asterisks in the source file are throwing it off):

I'm parsing medical claim files and need to match the text between one string and another, but only if the match occurs before a third string is reached.

The specific strings between which I want to search are DTP*431 and REF*6R (I'm including the DTP*431 as part of it, since it will be eliminated under certain circumstances).

I need the regex to return a match if the 8 digits immediately following DTP*431*D8* exactly match the 8 digits immediately following the next instance of DTP*472*RD8* in the file, and to not continue the search after the 8 digits immediately following the next instance of DTP*472*RD8*

This example should not return a match, because the 8 digits immediately after DTP*431*D8* (20150101) do not match the 8 digits immediately following the next instance of DTP*472*RD8* (20150102):

DTP*431*D8*20150101~
[variable text in between]
LX*1~
DTP*472*RD8*20150102-20150102~
REF*6R*[more information]~
[variable text in between]
DTP*431*D8*20141231~
[variable text in between]
LX*1~
DTP*472*RD8*20150103-20150103~
REF*6R*[more information]~

This example should return a match, and only one match, because the 8 digits immediately following DTP*431*D8* (20150101) exactly match the 8 digits immediately following the first instance of DTP*472*RD8* appearing in the file (20150101):

DTP*431*D8*20150101~
[variable text in between]
LX*1~
DTP*472*RD8*20150101-20150101~
REF*6R*[more information]~
[variable text in between]
DTP*431*D8*20141231~
[variable text in between]
LX*1~
DTP*472*RD8*20150102-20150102~
REF*6R*[more information]~

This example should not return a match, because even though there is a match between the first 8 digits after an instance of DTP*431*D8* and an instance of DTP*472*RD8* (20150101), it is not the next instance of DTP*472*RD8* that provides that match:

DTP*431*D8*20150101~
[variable text in between]
LX*1~
DTP*472*RD8*20150103-20150103~
REF*6R*[more information]~
[variable text in between]
DTP*431*D8*20141231~
[variable text in between]
LX*1~
DTP*472*RD8*20150101-20150101~
REF*6R*[more information]~

Here's what I have so far:

(DTP*431*D8*)(?<=\DTP*431*D8*)([0-9]{8})(.*?)(?=\DTP*472)(DTP*472*RD8*)(\2)

...but if the date in the first instance of DTP*472 doesn't match, it continues past the first DTP*472 and keeps looking until it finds any DTP*472 with a date that matches the \2 group (the 8 digits captured after DTP*431*D8*).

So if I searched the following text, it would (undesirably) match everything from the first DTP*431*D8*20150101 through the 20150101 that follows the second DTP*472*RD8*:

DTP*431*D8*20150101~
[variable text in between]
LX*1~
DTP*472*RD8*20150102-20150102~
REF*6R*[more information]~
[variable text in between]
DTP*431*D8*20141231~
[variable text in between]
LX*1~
DTP*472*RD8*20150101-20150101~
REF*6R*[more information]~
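
The overshoot described above can be reproduced outside Notepad++. This is a minimal Python sketch, assuming Python's re module behaves like Notepad++'s regex engine for these constructs. The sample text, the numbers after REF*6R, and the names lazy and tempered are made up for illustration. It contrasts a lazy .*? with a "tempered" dot that refuses to step over the start of DTP*472.

```python
import re

# Hypothetical sample: the first DTP*472 date differs, the second one matches.
text = (
    "DTP*431*D8*20150101~\n"
    "LX*1~\n"
    "DTP*472*RD8*20150102-20150102~\n"
    "REF*6R*000001~\n"
    "DTP*431*D8*20141231~\n"
    "LX*1~\n"
    "DTP*472*RD8*20150101-20150101~\n"
    "REF*6R*000002~\n"
)

# Lazy ".*?" prefers short matches, but it still grows when the
# backreference fails, so it walks past the first DTP*472.
lazy = re.compile(r"DTP\*431\*D8\*(\d{8}).*?DTP\*472\*RD8\*\1", re.DOTALL)

# Tempered dot: each consumed character must not be the start of "DTP*472",
# so the pattern can only ever test the very next DTP*472.
tempered = re.compile(
    r"DTP\*431\*D8\*(\d{8})(?:(?!DTP\*472).)*DTP\*472\*RD8\*\1",
    re.DOTALL,
)

lazy_match = lazy.search(text)          # spans both claim blocks
tempered_match = tempered.search(text)  # no match in this sample
```

A lazy quantifier only changes the order in which lengths are tried; it does not stop the engine from trying longer ones. The negative lookahead is what actually forbids crossing the first DTP*472.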

What I'm trying to do is delete the entire DTP*431*D8* line if the 8 digits immediately following DTP*431*D8* exactly match the 8 digits immediately following the very next instance of DTP*472*RD8*.

Here is a sample I got off the net and modified (I can't use actual data). This item should produce a match:

ISA*00* *01*SECRET ZZSUBMITTERS.ID ZZRECEIVERS.ID *150101*0001*^*00501*00000001*1*T*:~
GS*HC*SENDER CODE*RECEIVER CODE*0*0001*1*X*005010X222~
ST*837*0021*005010X222~
BHT*0019*00*244579*20150101*1023*CH~
NM1*41*2*BILLING SERVICE*****46*9999999~
PER*IC*DOE*JOHN*3055552222*EX*111~
NM1*40*2*ABC INSURANCE COMPANY*****46*1111111~
HL*1**20*1~
PRV*BI*PXC*1234597890~
NM1*85*2*DOCTOR OFFICE*****XX*9876543210~
N3*1234 MAIN ST~
N4*LOS ANGELES*CA*11111~
REF*EI*222222222~
NM1*87*2~
N3*2345 MAIN ST~
N4*LOS ANGELES*CA*11111~
HL*2*1*22*1~
SBR*P********CI~
NM1*IL*1*DOE*JANE****MI*11332255~
DMG*D8*10000101*O~
NM1*PR*2*DEF INSURANCE COMPANY*****PI*999996666~
HL*3*2*23*0~
PAT*19~
NM1*QC*1*JONES*JOHN~
N3*111 N MAIN ST~
N4*LOS ANGELES*CA*22222~
DMG*D8*10000202*O~
CLM*888888*1***11:B:1*YAY*I~
DTP*431*D8*20150201~
REF*D9*1~
HI*BK:9999*BF:V999~
LX*1~
SV1*HC:99999*1*UN*1***1~
DTP*472*RD8*20150101-20150201~
REF*6R*000001~
SE*33*0021~
GE*1*1~
IEA*1*000000001~

Using this, if my match is found, I'd delete the DTP*431*D8 line entirely and the file would then appear as below:

ISA*00* *01*SECRET ZZSUBMITTERS.ID ZZRECEIVERS.ID *150101*0001*^*00501*00000001*1*T*:~
GS*HC*SENDER CODE*RECEIVER CODE*0*0001*1*X*005010X222~
ST*837*0021*005010X222~
BHT*0019*00*244579*20150101*1023*CH~
NM1*41*2*BILLING SERVICE*****46*9999999~
PER*IC*DOE*JOHN*3055552222*EX*111~
NM1*40*2*ABC INSURANCE COMPANY*****46*1111111~
HL*1**20*1~
PRV*BI*PXC*1234597890~
NM1*85*2*DOCTOR OFFICE*****XX*9876543210~
N3*1234 MAIN ST~
N4*LOS ANGELES*CA*11111~
REF*EI*222222222~
NM1*87*2~
N3*2345 MAIN ST~
N4*LOS ANGELES*CA*11111~
HL*2*1*22*1~
SBR*P********CI~
NM1*IL*1*DOE*JANE****MI*11332255~
DMG*D8*10000101*O~
NM1*PR*2*DEF INSURANCE COMPANY*****PI*999996666~
HL*3*2*23*0~
PAT*19~
NM1*QC*1*JONES*JOHN~
N3*111 N MAIN ST~
N4*LOS ANGELES*CA*22222~
DMG*D8*10000202*O~
CLM*888888*1***11:B:1*YAY*I~
REF*D9*1~
HI*BK:9999*BF:V999~
LX*1~
SV1*HC:99999*1*UN*1***1~
DTP*472*RD8*20150101-20150201~
REF*6R*000001~
SE*33*0021~
GE*1*1~
IEA*1*000000001~

Conversely, this item would not get a match and the file would be left as is:

ISA*00* *01*SECRET ZZSUBMITTERS.ID ZZRECEIVERS.ID *150101*0001*^*00501*00000001*1*T*:~
GS*HC*SENDER CODE*RECEIVER CODE*0*0001*1*X*005010X222~
ST*837*0021*005010X222~
BHT*0019*00*244579*20150101*1023*CH~
NM1*41*2*BILLING SERVICE*****46*9999999~
PER*IC*DOE*JOHN*3055552222*EX*111~
NM1*40*2*ABC INSURANCE COMPANY*****46*1111111~
HL*1**20*1~
PRV*BI*PXC*1234597890~
NM1*85*2*DOCTOR OFFICE*****XX*9876543210~
N3*1234 MAIN ST~
N4*LOS ANGELES*CA*11111~
REF*EI*222222222~
NM1*87*2~
N3*2345 MAIN ST~
N4*LOS ANGELES*CA*11111~
HL*2*1*22*1~
SBR*P********CI~
NM1*IL*1*DOE*JANE****MI*11332255~
DMG*D8*10000101*O~
NM1*PR*2*DEF INSURANCE COMPANY*****PI*999996666~
HL*3*2*23*0~
PAT*19~
NM1*QC*1*JONES*JOHN~
N3*111 N MAIN ST~
N4*LOS ANGELES*CA*22222~
DMG*D8*10000202*O~
CLM*888888*1***11:B:1*YAY*I~
DTP*431*D8*20150201~
REF*D9*1~
HI*BK:9999*BF:V999~
LX*1~
SV1*HC:99999*1*UN*1***1~
DTP*472*RD8*20150102-20150202~
REF*6R*000001~
SE*33*0021~
GE*1*1~
IEA*1*000000001~

The problem is not the start but the end. In your first example, what makes it move past the REF*6R*[ ? Are you looking for a balanced start/end?
– user557597, Commented Mar 11, 2015 at 16:46
I'm not sure what a balanced start/end is. I wish I were better at this, but most of what I've learned comes from perusing websites with no formal instruction. I'm reading up on balanced starts and endings now and might have an answer to that shortly. :)
– Bartleby the Scrivener, Commented Mar 11, 2015 at 16:53
Completely unclear. Please EDIT your question, adding 1) the exact rules you want to reproduce, and 2) a couple of examples with a clear explanation of why they should fail or succeed.
– Andrea Ligios, Commented Mar 11, 2015 at 17:02
I think you pointed me in the right direction. The problem is I don't know what I don't know. So would I do this? (?'open'(DTP*431*D8*)([0-9]{8}))+(?'-open'DTP*472*\2)+
– Bartleby the Scrivener, Commented Mar 11, 2015 at 17:03
@BartlebytheScrivener your question is not clear enough. Can you post sample data and expected output?
– Federico Piazza, Commented Mar 11, 2015 at 17:09

score 1 · Accepted Answer · 2015-03-13 00:03:19Z


Ok, redone. Looking at your samples now.
It gets very complicated. There are many things going on.

Explaining it all in prose would be tedious, so I've put the explanation in the
regex comments.

Just replace each match with an empty string ("") and you should be good to go.

Formatting, debug, testing and analysis by RegexFormat 5.

 # (?sm)(?:^[ \t]*DTP\*\d{3}\*R?D\d\*(\d{8})[^\r\n]*\r?\n(?=(?:(?!^[ \t]*DTP\*\d{3}\*R?D\d\*).)*^[ \t]*DTP\*\d{3}\*R?D\d\*(?:\1|\d{8}-\1).*?^[ \t]*REF\*6R)|^[ \t]*DTP\*\d{3}\*R?D\d\*\d{8}-(\d{8})[^\r\n]*\r?\n(?=(?:(?!^[ \t]*DTP\*\d{3}\*R?D\d\*).)*^[ \t]*DTP\*\d{3}\*R?D\d\*(?:\2|\d{8}-\2).*?^[ \t]*REF\*6R))

 (?sm)
 (?:
      ^ [ \t]* DTP\* \d{3} \*R?D \d \*
      ( \d{8} )                           # (1), Checking "first" NUMBER spot
      [^\r\n]* \r? \n                     # Grab the rest of this line

      (?=                                 # Lookahead 
           (?:                                 # Not a DTP line
                (?! ^ [ \t]* DTP\* \d{3} \*R?D \d \* )
                . 
           )*
           ^ [ \t]* DTP\* \d{3} \*R?D \d \*    # The very next 'DTP' line
           (?: \1 | \d{8} - \1 )               # Number must be in one of these spots
           .*? ^ [ \t]* REF\*6R                # The ending
      )

   |                                    ## Or, 

      ^ [ \t]* DTP\* \d{3} \*R?D \d \*
      \d{8} - 
      ( \d{8} )                           # (2), Checking "second" NUMBER spot
      [^\r\n]* \r? \n                     # Grab the rest of this line

      (?=                                 # Lookahead
           (?:                                 # Not a DTP line
                (?! ^ [ \t]* DTP\* \d{3} \*R?D \d \* )
                . 
           )*
           ^ [ \t]* DTP\* \d{3} \*R?D \d \*    # The very next 'DTP' line
           (?: \2 | \d{8} - \2 )               # Number must be in one of these spots
           .*? ^ [ \t]* REF\*6R                # The ending
      )
 )
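
The lookahead idea above can be tried outside Notepad++. This is a simplified Python sketch rather than the answer's exact pattern: it assumes Python's re module as a stand-in for Notepad++'s engine, narrows the start to DTP*431*D8* and the target to DTP*472*RD8*, and omits the trailing REF*6R check. The names PATTERN and drop_matching_431 and the sample strings are hypothetical.

```python
import re

# Simplified sketch of the lookahead idea, restricted to 431 -> 472
# (an assumption; the answer's pattern accepts any DTP qualifier).
PATTERN = re.compile(
    r"""
    ^[ \t]*DTP\*431\*D8\*(\d{8})[^\r\n]*\r?\n   # the DTP*431 line, date in group 1
    (?=                                          # look ahead without consuming
        (?:(?!^[ \t]*DTP\*472\*RD8\*).)*         # anything that is not a DTP*472 line start
        ^[ \t]*DTP\*472\*RD8\*                   # the very next DTP*472 line
        (?:\1|\d{8}-\1)                          # date in either spot of the range
    )
    """,
    re.MULTILINE | re.DOTALL | re.VERBOSE,
)


def drop_matching_431(text: str) -> str:
    """Delete each DTP*431 line whose date appears on the very next DTP*472 line."""
    return PATTERN.sub("", text)


# Hypothetical claim fragments modelled on the question's samples.
matching = (
    "CLM*888888~\nDTP*431*D8*20150201~\nLX*1~\n"
    "DTP*472*RD8*20150101-20150201~\nREF*6R*000001~\n"
)
later_only = (
    "DTP*431*D8*20150101~\nLX*1~\nDTP*472*RD8*20150103-20150103~\nREF*6R*1~\n"
    "DTP*431*D8*20141231~\nLX*1~\nDTP*472*RD8*20150101-20150101~\nREF*6R*2~\n"
)

cleaned = drop_matching_431(matching)      # DTP*431 line removed
untouched = drop_matching_431(later_only)  # unchanged: only a later 472 matches
```

Because the lookahead consumes nothing, only the DTP*431 line itself is part of the match, which is why replacing with an empty string is enough.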

You might want to turn each lookahead into a capture group, (?=..) to (..), then adjust the backreferences so they point to \1 and \3.
The replacement then becomes \2\4 (or $2$4), which writes the captured text back.
This moves the search position past the ending, avoiding possible overlap between matches.

 # (?sm)(?:^[ \t]*DTP\*\d{3}\*R?D\d\*(\d{8})[^\r\n]*\r?\n((?:(?!^[ \t]*DTP\*\d{3}\*R?D\d\*).)*^[ \t]*DTP\*\d{3}\*R?D\d\*(?:\1|\d{8}-\1).*?^[ \t]*REF\*6R)|^[ \t]*DTP\*\d{3}\*R?D\d\*\d{8}-(\d{8})[^\r\n]*\r?\n((?:(?!^[ \t]*DTP\*\d{3}\*R?D\d\*).)*^[ \t]*DTP\*\d{3}\*R?D\d\*(?:\3|\d{8}-\3).*?^[ \t]*REF\*6R))

 (?sm)
 (?:
      ^ [ \t]* DTP\* \d{3} \*R?D \d \*
      ( \d{8} )                           # (1), Checking "first" NUMBER spot
      [^\r\n]* \r? \n                     # Grab the rest of this line

      (                                   # (2 start), Part to be written back 
           (?:                                 # Not a DTP line
                (?! ^ [ \t]* DTP\* \d{3} \*R?D \d \* )
                . 
           )*
           ^ [ \t]* DTP\* \d{3} \*R?D \d \*    # The very next 'DTP' line
           (?: \1 | \d{8} - \1 )               # Number must be in one of these spots
           .*? ^ [ \t]* REF\*6R                # The ending
      )                                   # (2 end)

   |                                    ## Or, 

      ^ [ \t]* DTP\* \d{3} \*R?D \d \*
      \d{8} - 
      ( \d{8} )                           # (3), Checking "second" NUMBER spot
      [^\r\n]* \r? \n                     # Grab the rest of this line

      (                                   # (4 start), Part to be written back
           (?:                                 # Not a DTP line
                (?! ^ [ \t]* DTP\* \d{3} \*R?D \d \* )
                . 
           )*
           ^ [ \t]* DTP\* \d{3} \*R?D \d \*    # The very next 'DTP' line
           (?: \3 | \d{8} - \3 )               # Number must be in one of these spots
           .*? ^ [ \t]* REF\*6R                # The ending
      )                                   # (4 end)
 )
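
As a rough check of the capture-group variant, the sketch below rebuilds the same pattern from named pieces in Python and writes groups 2 and 4 back. It assumes Python's re module agrees with Notepad++'s engine for these constructs. The helper names (DTP, NOT_DTP, END, remove_redundant_dtp) and the shortened sample are made up for illustration.

```python
import re

DTP = r"^[ \t]*DTP\*\d{3}\*R?D\d\*"      # start of any DTP line
NOT_DTP = r"(?:(?!" + DTP + r").)*"      # text up to the next DTP line
END = r".*?^[ \t]*REF\*6R"               # through the closing REF*6R

PATTERN = re.compile(
    r"(?:"
    + DTP + r"(\d{8})[^\r\n]*\r?\n"                            # (1) first date spot
    + r"(" + NOT_DTP + DTP + r"(?:\1|\d{8}-\1)" + END + r")"   # (2) written back
    + r"|"
    + DTP + r"\d{8}-(\d{8})[^\r\n]*\r?\n"                      # (3) second date spot
    + r"(" + NOT_DTP + DTP + r"(?:\3|\d{8}-\3)" + END + r")"   # (4) written back
    + r")",
    re.MULTILINE | re.DOTALL,
)


def remove_redundant_dtp(text: str) -> str:
    """Drop the leading DTP line of each match, keeping the text captured after it."""
    # Only one of groups 2 and 4 participates; the other expands to "".
    return PATTERN.sub(r"\g<2>\g<4>", text)


# Shortened, hypothetical version of the question's matching sample.
sample = (
    "CLM*888888*1***11:B:1*YAY*I~\n"
    "DTP*431*D8*20150201~\n"
    "REF*D9*1~\n"
    "LX*1~\n"
    "DTP*472*RD8*20150101-20150201~\n"
    "REF*6R*000001~\n"
    "SE*33*0021~\n"
)
result = remove_redundant_dtp(sample)
```

Since the match now extends through REF*6R, the next search starts after that point instead of on the DTP*472 line.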

edited Mar 13, 2015 at 0:03

answered Mar 11, 2015 at 17:09

user557597


9 Comments

Bartleby the Scrivener Over a year ago

I'm hoping my edited entry gives clarity, but this has the same issue my first example has. If the first instance of DTP*472*RD8*<date> doesn't match, it continues the search until it finds one that does.

user557597 Over a year ago

@BartlebytheScrivener - Fortunately, that's exactly what this regex does. Did you examine the output?

Bartleby the Scrivener Over a year ago

Edit: I think I missed something. It's actually starting with a different DTP segment in my production file and that threw me off. I'm re-testing now.

Bartleby the Scrivener Over a year ago

I've tried editing it to only use the 431 to start and the 472 to end and it's going past the first 472 to find the next instance of a 472 that has a date matching that of the first. I've also tried leaving the 472 as just being \d{3} to no avail.

user557597 Over a year ago

@BartlebytheScrivener - Have a look at the modified one.


Federico Piazza · Accepted Answer · 2015-03-11 17:05:47Z


I'm not quite sure I understood your question, but assuming you want to capture the content between DTP*431 and REF*6R, including DTP*431, you can use this regex:

(DTP\*431.*?)REF\*6R

Working demo

You save the content into a capturing group discarding REF*6R. You can see in blue the matches and in green the capturing group content.
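
For readers without the demo, here is a small Python sketch of the same pattern. It assumes the dot is allowed to match newlines (re.DOTALL here; in Notepad++ that would presumably be the ". matches newline" option). The sample text and variable names are made up.

```python
import re

# Hypothetical two-segment sample ending in REF*6R.
text = (
    "DTP*431*D8*20150101~\n"
    "LX*1~\n"
    "DTP*472*RD8*20150102-20150102~\n"
    "REF*6R*000001~\n"
)

# Group 1 holds everything from DTP*431 up to, but not including, REF*6R.
span = re.compile(r"(DTP\*431.*?)REF\*6R", re.DOTALL)
match = span.search(text)
captured = match.group(1) if match else None
```

Note that this captures the span whether or not the two dates agree, so on its own it does not implement the date comparison the question asks for.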


answered Mar 11, 2015 at 17:05

Federico Piazza
