5

I need to replace all "&" symbols with "&#38" in my text file, but not the HTML codes such as &amp; or &quot;

I'm currently using row = row.replace("& ", "&#38");

but, as I said, the HTML codes are also replaced, e.g. &quot;, and I don't want this. Thanks.

P.S. I cannot require a space after the & because I need to replace it in words such as M&M as well as in Ella & David.

5 Answers

4

You could try a regex, e.g.

row = row.replaceAll("&(?![#a-zA-Z0-9]+;)", "&amp;");

The regex replaces an & only if it is not followed by a sequence of characters from '#a-zA-Z0-9' ending with ';'.
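To see the negative lookahead in action, here is a minimal sketch that runs the pattern above on the examples from the question. The class and method names (`BareAmpersandDemo`, `escapeBareAmpersands`) are made up for illustration, and `&amp;` is assumed as the replacement text.

```java
import java.util.regex.Pattern;

final class BareAmpersandDemo {
    // "&" that is NOT followed by one or more of [#a-zA-Z0-9] and then ";".
    private static final Pattern BARE_AMP = Pattern.compile("&(?![#a-zA-Z0-9]+;)");

    // Hypothetical helper: escapes only the ampersands that do not start an entity.
    static String escapeBareAmpersands(String row) {
        return BARE_AMP.matcher(row).replaceAll("&amp;");
    }

    public static void main(String[] args) {
        // Existing entities such as &quot; and &#38; are left alone.
        System.out.println(escapeBareAmpersands("M&M, Ella & David, &quot;hi&quot;, &#38;"));
    }
}
```

Note that the lookahead is deliberately loose: plain text that merely looks like an entity, such as `A&B;`, is left untouched as well.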

3 Comments

Sorry, there was an error in my question. The HTML codes do not have a # after the &; they consist of a few letters (of varying length) ending with a ;
Your regex doesn't work for strings of the form &#x14D;. What you probably need is row.replaceAll("&(?![#a-zA-Z0-9]+;)", "&amp;");
@adarshr, that wasn't clear from the question, but in all fairness, you are completely right! I'll update accordingly, thx.
1

There's no general solution, since your text may contain things like

&amp;

which may mean either a single ampersand or be a malformed way of writing the literal text "&amp;", which should be expressed as

&amp;amp;

However, the latter is quite improbable (unless you're escaping some HTML).

So try something like

row = row.replaceAll("&(?!(?:#\\d+|amp|quot|nbsp);)", "&amp;");

By the way, &#38 is missing the final semicolon. Prefer &amp; to numeric character codes.
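The whitelist idea can be sketched as follows. The entity list (`KNOWN_NAMES`) and the class name `KnownEntityEscaper` are assumptions for illustration; extend the list with whatever entity names actually occur in your data. Numeric references of the form `&#38;` are assumed to be valid and are kept.

```java
import java.util.regex.Pattern;

final class KnownEntityEscaper {
    // Hypothetical whitelist of entity names to leave untouched.
    private static final String[] KNOWN_NAMES = {"amp", "quot", "nbsp", "lt", "gt"};

    // "&" not followed by a decimal reference or a whitelisted name, then ";".
    private static final Pattern BARE_AMP =
            Pattern.compile("&(?!(?:#\\d+|" + String.join("|", KNOWN_NAMES) + ");)");

    static String escape(String row) {
        return BARE_AMP.matcher(row).replaceAll("&amp;");
    }

    public static void main(String[] args) {
        System.out.println(escape("AT&T and AT&amp;T, &foo; and &#38;"));
    }
}
```

Because `&amp;` itself is on the whitelist, running `escape` a second time on its own output changes nothing, which sidesteps the double-escaping ambiguity described above as long as the input was not meant to contain a literal "&amp;".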

0

Try

String replacedAmpersands = row.replaceAll("&(?!(?:#\\d+|\\p{L}+);)", "&amp;");

This will only replace ampersands that are not followed by #\d+; (hash, numbers, semicolon) or \p{L}+; (letters, semicolon).
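One limitation of this pattern is that it treats hexadecimal references such as `&#x14D;` and names containing digits such as `&frac12;` as bare ampersands. A possible refinement, offered as a sketch rather than a definitive rule, accepts decimal references, hexadecimal references, and names that start with an ASCII letter. The class name `StrictReferenceEscaper` is hypothetical.

```java
import java.util.regex.Pattern;

final class StrictReferenceEscaper {
    // "&" not followed by one of:
    //   #123;      decimal reference
    //   #x14D;     hexadecimal reference
    //   frac12;    name: an ASCII letter, then ASCII letters or digits
    private static final Pattern BARE_AMP = Pattern.compile(
            "&(?!(?:#\\d+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);)");

    static String escape(String row) {
        return BARE_AMP.matcher(row).replaceAll("&amp;");
    }

    public static void main(String[] args) {
        System.out.println(escape("M&M &#x14D; &frac12; &#; &12;"));
    }
}
```

The name branch only checks the shape of the reference, not whether the name is a real HTML entity, so something like `&notanentity;` would still be left unescaped.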

0

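The pattern "& " should be "&\\s", since whitespace has a pattern identifier too. Note that String.replace treats its first argument as literal text, so a regex needs replaceAll instead.

So the line should read row = row.replaceAll("&(\\s)", "&#38;$1"); where the capture group puts the matched whitespace back after the replacement.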

1 Comment

except I just noticed your postscript, so this wouldn't catch the & in M&M
0

This solution is more involved, but my feeling is that it is foolproof, whereas the regex solutions may not be 100% correct (as per the famous "do not use regex for HTML" Stack Overflow thread).

Using Jsoup:

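import org.jsoup.Jsoup;

// Parse the HTML and return only its text content, with the markup stripped.
public static String html2text(String html) {
    return Jsoup.parse(html).text();
}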

This will give you for sure a text only containing the ampersands you need, not the rest.

Then create a Map containing, on the left-hand side, phrases like M&M and Ella & David and, on the right-hand side, the escaped phrases M&amp;M and Ella &amp; David.

The final step is going back to the initial HTML text and replacing the strings on the LHS of the map with those of the RHS.

Edit: you can of course use any HTML parser you like - just wanted to give you a quick example of how easy it is to use one.
