Java/Parsing: how to replace & symbol but not html codes

Question

I need to replace all "&" symbols with "&#38" in my text file, but not the HTML codes such as &amp; or &quot;

I'm currently using row = row.replace("& ", "&#38");

but, as I said, the HTML codes are also replaced, e.g. &quot;, and I don't want this. Thanks.

PS: I cannot rely on a space after the & because I also need to replace it in text such as M&M or Ella & David.

Possible duplicate of: stackoverflow.com/questions/240546/…
– Adriaan Koster, Commented Feb 24, 2011 at 11:41

Johan Sjöberg · Accepted Answer · 2011-02-24 10:32:10Z

4

You could try a regex, e.g.,

row = row.replaceAll("&(?![#a-zA-Z0-9]+;)", "&#38;");

The regex replaces an & only when it is not followed by a run of characters from [#a-zA-Z0-9] ending with ';'.
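As a rough illustration of how the negative lookahead behaves, here is a minimal sketch that applies the same pattern to the examples from the question. The class and method names (AmpersandEscaper, escape) are made up for this example and are not part of the thread.

```java
import java.util.regex.Pattern;

// Illustrative helper; the class and method names are assumptions, not from the thread.
final class AmpersandEscaper {
    // Matches an ampersand that is NOT followed by [#a-zA-Z0-9]+ and a semicolon.
    private static final Pattern BARE_AMPERSAND =
            Pattern.compile("&(?![#a-zA-Z0-9]+;)");

    static String escape(String row) {
        // The replacement contains no '$' or '\', so it is inserted literally.
        return BARE_AMPERSAND.matcher(row).replaceAll("&#38;");
    }

    public static void main(String[] args) {
        // Prints: M&#38;M, Ella &#38; David, &quot;hi&quot; &amp; &#333;
        System.out.println(escape("M&M, Ella & David, &quot;hi&quot; &amp; &#333;"));
    }
}
```

One limitation of this approach: text that merely looks like an entity, such as "Q&A; next", is left untouched because the lookahead sees "A;" after the ampersand.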

edited Feb 24, 2011 at 10:32

answered Feb 24, 2011 at 9:59

Johan Sjöberg


3 Comments

aneuryzm Over a year ago

Sorry, there was an error in my question. The HTML codes do not have a # after the &; they have a few letters (of varying length) ending with a ;

adarshr Over a year ago

Your regex doesn't work for strings of the form &#333;. What you probably need is row.replaceAll("&(?![#a-zA-Z0-9]+;)", "&amp;");

Johan Sjöberg Over a year ago

@adarshr, that wasn't clear from the question, but in all fairness, you are completely right! I'll update accordingly, thx.

maaartinus · Accepted Answer · 2011-02-24 10:13:26Z

1

There's no general solution, since in your text there may be things like

&amp;

which may mean either a single ampersand or be a malformed way of writing the literal text &amp;, which should be expressed as

&amp;amp;

However, the latter is quite improbable (unless you're escaping some HTML).

So try something like

row = row.replaceAll("&(?!(?:#\\d+|amp|quot|nbsp);)", "&amp;");

Btw., &#38 is missing the final semicolon. Prefer &amp; to numeric character codes.
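To make the whitelist idea concrete, here is a small sketch that builds the lookahead from a list of entity names. The list contents and the names WhitelistEscaper and KNOWN_ENTITIES are assumptions for illustration; a real list would need whatever entities actually occur in the data.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Illustrative sketch; names and the whitelist contents are assumptions.
final class WhitelistEscaper {
    // Hypothetical whitelist of entity names that should be left alone.
    private static final List<String> KNOWN_ENTITIES =
            Arrays.asList("amp", "quot", "nbsp", "lt", "gt");

    // An ampersand not followed by a decimal reference or a whitelisted name plus ';'.
    private static final Pattern BARE_AMPERSAND = Pattern.compile(
            "&(?!(?:#\\d+|" + String.join("|", KNOWN_ENTITIES) + ");)");

    static String escape(String row) {
        return BARE_AMPERSAND.matcher(row).replaceAll("&amp;");
    }

    public static void main(String[] args) {
        // "T;" is not a whitelisted entity, so this ampersand is escaped.
        System.out.println(escape("AT&T; fees &amp; &quot;taxes&quot;"));
    }
}
```

Because &amp; itself is on the whitelist, running escape twice gives the same result as running it once, which sidesteps the double-escaping ambiguity described above for text that is already partly escaped.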

answered Feb 24, 2011 at 10:13

maaartinus


Christoffer Hammarström · Accepted Answer · 2011-02-24 10:11:30Z

0

Try

String replacedAmpersands = row.replaceAll("&(?!(?:#\\d+|\\p{L}+);)", "&#38;");

This will only replace ampersands that are not followed by #\d+; (hash, digits, semicolon) or \p{L}+; (letters, semicolon).
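Two kinds of reference fall outside this pattern: hexadecimal references such as &#x14D; and entity names that contain digits such as &frac12;. The sketch below compares the pattern above with an extended variant. The extended pattern and the names EntityAwareEscaper, escapeStrict, and escapeExtended are my own assumptions, not something proposed in the thread.

```java
// Illustrative comparison; class and method names are assumptions.
final class EntityAwareEscaper {
    // Pattern from the answer: decimal references or letter-only names are kept.
    static String escapeStrict(String row) {
        return row.replaceAll("&(?!(?:#\\d+|\\p{L}+);)", "&#38;");
    }

    // Assumed extension: also keep hex references (&#x14D;) and
    // names that contain digits after the first letter (&frac12;).
    static String escapeExtended(String row) {
        return row.replaceAll(
                "&(?!(?:#\\d+|#[xX][0-9a-fA-F]+|\\p{L}[\\p{L}\\d]*);)", "&#38;");
    }

    public static void main(String[] args) {
        String sample = "M&M &#x14D; &frac12; &quot;";
        System.out.println(escapeStrict(sample));
        System.out.println(escapeExtended(sample));
    }
}
```

Whether the extension is worth having depends on the input: if the file only ever contains decimal references and plain names, the original pattern is enough.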

answered Feb 24, 2011 at 10:11

Christoffer Hammarström


MattLBeck · Accepted Answer · 2011-02-24 10:13:59Z

0

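The pattern "& " should be "&\\s", since whitespace has its own regex character class (\s).

So the line should read row = row.replaceAll("&\\s", "&#38"); (replaceAll rather than replace, because replace treats its first argument as literal text, not as a regex).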
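A short sketch of the difference between the literal and the regex call, and of what a whitespace-based pattern can and cannot catch. The class name and the lookahead variant (?=\s), which keeps the whitespace character instead of consuming it, are my own additions for illustration.

```java
// Illustrative demo; class and method names are assumptions.
final class WhitespaceOnlyDemo {
    // String.replace is literal: it searches for the three characters & \ s.
    static String literal(String row) {
        return row.replace("&\\s", "&#38;");
    }

    // String.replaceAll takes a regex. The lookahead (?=\s) requires whitespace
    // after the ampersand but leaves that whitespace in place.
    static String regex(String row) {
        return row.replaceAll("&(?=\\s)", "&#38;");
    }

    public static void main(String[] args) {
        System.out.println(literal("Ella & David")); // unchanged
        System.out.println(regex("Ella & David"));   // ampersand escaped
        System.out.println(regex("M&M"));            // unchanged: no whitespace follows
    }
}
```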

edited Feb 24, 2011 at 10:13

answered Feb 24, 2011 at 10:05

MattLBeck


1 Comment

MattLBeck Over a year ago

except I just noticed your postscript, so this wouldn't catch the & in M&M

Community · Accepted Answer · 2017-05-23 10:33:08Z

0

This solution is more involved, but my feeling is that it is foolproof, whereas the regex solutions may not be 100% correct (as per the famous "do not use regex for HTML" Stack Overflow thread).

Using Jsoup:

import org.jsoup.Jsoup;

// Parse the HTML and return only its text content (tags stripped).
public static String html2text(String html) {
    return Jsoup.parse(html).text();
}

This will give you text that contains only the ampersands you need, not the HTML codes.

Then create a Map with phrases like M&M and Ella & David on the left-hand side and their escaped forms, e.g. M&#38;M and Ella &#38; David, on the right-hand side.

The final step is going back to the initial HTML text and replacing the strings on the LHS of the map with those of the RHS.

Edit: you can of course use any HTML parser you like - just wanted to give you a quick example of how easy it is to use one.
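The map-and-replace step could look roughly like the sketch below. It uses only the standard library and assumes the phrases have already been picked out of the parsed text; how they are picked out is not specified above. The names PhraseReplacer and replacePhrases are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch of the map-based replacement; names are assumptions.
final class PhraseReplacer {
    // Replaces each left-hand phrase with its right-hand (escaped) form.
    // String.replace is literal, so no regex escaping is needed here.
    static String replacePhrases(String html, Map<String, String> replacements) {
        String result = html;
        for (Map.Entry<String, String> entry : replacements.entrySet()) {
            result = result.replace(entry.getKey(), entry.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps insertion order; put longer phrases first if they overlap.
        Map<String, String> replacements = new LinkedHashMap<>();
        replacements.put("M&M", "M&#38;M");
        replacements.put("Ella & David", "Ella &#38; David");

        // Prints: <p>M&#38;M &quot;and&quot; Ella &#38; David</p>
        System.out.println(
                replacePhrases("<p>M&M &quot;and&quot; Ella & David</p>", replacements));
    }
}
```

One thing to watch: as far as I know, Jsoup's text() returns decoded text, so an &amp; in the source shows up as a plain & in the extracted text. A phrase taken from that text only matches in the original HTML if it appears there verbatim.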

edited May 23, 2017 at 10:33

CommunityBot


answered Feb 24, 2011 at 10:39

Lucas Zamboulis
