I want to be able to delete all instances of newlines within <p> tags, but not the ones outside. Example:

<p dir="ltr">Test<br>\nA\naa</p>\n<p dir="ltr">Bbb</p>

This is the regex I came up with:

(<p[^>]*?>)(?:(.*)\n*)*(.*)(</p[^>]*?>)

and I replace with:

$1$2$3$4

I was hoping that this would work, but (?:(.*)\n*)* seems to be causing issues. Is there any way to do repeated matches like this with a capturing group?

Thanks in advance!

  • Are there two p tags? Do you want \n removed separately for each of them? Commented May 23, 2016 at 18:12
  • Separately for each p tag is fine. It's just that I'm hoping to replace all the \n within the p tags in one fell swoop. I was hoping that it's possible with regex without nested loops. Commented May 23, 2016 at 18:16
  • stackoverflow.com/questions/1732348/… Commented May 23, 2016 at 18:24
  • I would recommend using something like JSoup for this kind of work. Commented May 23, 2016 at 18:25
  • @ThePerson makes sense. Thanks. Commented May 23, 2016 at 18:25

1 Answer

Solution

You can use this regex. It works in PCRE but not in Java, because Java's regex engine does not support \K; see the Java version further down.

(?s)(?:<p|\G(?!\A))(?:(?!<\/p>).)*?\K[\n\r]+

Regex Breakdown

(?s) # Enable . to match newlines

(?:
   <p # Ensure that whatever we find comes after an opening <p tag
    | # Alternation (OR)
   \G(?!\A) # Continue from where the previous match ended, but not at the start of the input
)

(?:
  (?!<\/p>). # Match any character, as long as </p> does not start at this position
)*? # Repeat lazily

\K # Discard everything matched so far from the reported match

[\n\r]+ # Match one or more \n or \r characters
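
To see what the \G(?!\A) branch contributes on its own, here is a minimal Java sketch on a made-up input. The class name, the '#' marker, and the sample string are illustrative assumptions, not part of the original answer. A digit is replaced only if it directly follows '#' or directly follows the previous match, which is the same chaining idea the pattern uses for newlines after <p.

```java
class GAnchorDemo {
    // Replace every digit in a run that starts right after '#'.
    static String maskDigitsAfterHash(String input) {
        // '#' starts a chain; \G(?!\A) continues it from the end of the
        // previous match, and (?!\A) stops \G from matching at the very
        // start of the input, where no previous match exists.
        return input.replaceAll("(#|\\G(?!\\A))\\d", "$1x");
    }

    public static void main(String[] args) {
        System.out.println(maskDigitsAfterHash("12 #345 67")); // 12 #xxx 67
    }
}
```

The digits in "12" and "67" are left alone because neither run is connected to a '#' through an unbroken chain of matches.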

Java Code

With a bit of modification, I was able to do it in Java. Since Java has no \K, the text before the newlines is captured in group 1 and put back with $1:

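String line = "<p dir=\"ltr\">Test<br>\nA\naa</p>\nabcd\n<p dir=\"ltr\">Bbb</p>";
// Group 1 holds the text before each newline run inside a <p> element; "$1" puts it back without the newlines.
System.out.println(line.replaceAll("(?s)((?:<p|\\G(?!\\A))(?:(?!<\\/p>).)*?)[\\n\\r]+", "$1"));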
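
If the single-pattern approach feels too dense, a two-step alternative is to match each <p>...</p> element first and then strip line breaks from the matched text. This is a different technique from the \G pattern above, shown only as a hedged sketch: the class and method names are hypothetical, and it assumes that <p> elements are not nested and that regex-based handling is acceptable for the input.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ParagraphNewlines {
    // Matches one <p ...>...</p> element; DOTALL lets '.' cross line breaks.
    private static final Pattern P_BLOCK =
            Pattern.compile("<p\\b[^>]*>.*?</p>", Pattern.DOTALL);

    static String stripNewlinesInParagraphs(String html) {
        Matcher m = P_BLOCK.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Remove line breaks only inside the matched element.
            String cleaned = m.group().replaceAll("[\\n\\r]+", "");
            // quoteReplacement keeps any '$' or '\' in the text literal.
            m.appendReplacement(out, Matcher.quoteReplacement(cleaned));
        }
        m.appendTail(out); // copy whatever follows the last <p> element
        return out.toString();
    }

    public static void main(String[] args) {
        String line = "<p dir=\"ltr\">Test<br>\nA\naa</p>\nabcd\n<p dir=\"ltr\">Bbb</p>";
        System.out.println(stripNewlinesInParagraphs(line));
    }
}
```

The trade-off is an explicit loop in exchange for two simple patterns; the newlines around "abcd" are untouched because they never fall inside a match.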

17 Comments

Holy... Wow. That's pretty darn amazing.
Damn my regex noobness! Good job, rock. I was too slow to be the savior.
@Jun First let me check it in Java.
I was just about to add the answer "(?s)\\n+(?=(?:(?!<p).)*?</p)" as a Java idea when I saw you had updated yours for Java. I like your answer.
@rock321987 Your pattern is more accurate, and if there is a long HTML input with many \n outside of <p, I think your approach will outperform my idea, as mine would trigger the lookahead at any \n+.
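
The two Java patterns from this thread, the one in the answer and the lookahead variant from the comment, can be compared side by side. This is only a small sketch: the class and method names are made up for illustration, and the sample string is the one used in the answer.

```java
class NewlinePatternComparison {
    // Pattern from the answer: keep the prefix in group 1, drop the line breaks.
    static String withGAnchor(String html) {
        return html.replaceAll("(?s)((?:<p|\\G(?!\\A))(?:(?!<\\/p>).)*?)[\\n\\r]+", "$1");
    }

    // Pattern from the comment: drop \n runs that reach a </p before any <p.
    static String withLookahead(String html) {
        return html.replaceAll("(?s)\\n+(?=(?:(?!<p).)*?</p)", "");
    }

    public static void main(String[] args) {
        String line = "<p dir=\"ltr\">Test<br>\nA\naa</p>\nabcd\n<p dir=\"ltr\">Bbb</p>";
        System.out.println(withGAnchor(line));
        // Both patterns agree on this input.
        System.out.println(withGAnchor(line).equals(withLookahead(line))); // true
    }
}
```

One difference to keep in mind: the lookahead variant removes only \n, while the answer's pattern also removes \r.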