4

Edit: The 100% correct answer is that you don't want to do this at all. However, I have accepted the answer that helped the most.

So I'm being given ugly XML from a client that promises to fix it. In the meantime I need to clean it up myself. I'm looking for a regex to use in Java to add quotes around unquoted attributes. The general case is better, but so far it is only one attribute that is broken so the regex can specifically refer to "attr1". The value of the attribute is unknown, so I can't include that in the search.

<tag attr1 = VARIABLETEXT>
<tag attr1 = "VARIABLETEXT">not quoted</tag>
<tag attr1 = VARIABLETEXT attr2 = "true">
<otherTag>buncha junk</otherTag>
<tag attr1 = "VARIABLETEXT">"quoted"</tag>

Should turn into

<tag attr1 = "VARIABLETEXT">
<tag attr1 = "VARIABLETEXT">not quoted</tag>
<tag attr1 = "VARIABLETEXT" attr2 = "true">
<otherTag>buncha junk</otherTag>
<tag attr1 = "VARIABLETEXT">"quoted"</tag>

EDIT: Thank you very much for telling me not to do what I'm trying to do. However, this isn't some random, anything goes XML, where I'll run into all the "don't do it" issues. I have read the other threads. I'm looking for specific help for a specific hack.

4
  • vi filename.xml; :%s/attr1 = false/attr1 = "false"/g ... There is also gVim for Windows. Commented Feb 11, 2010 at 17:51
  • If it's only temporary, why don't you just use a cleaning/validation library to preprocess it? Commented Feb 11, 2010 at 17:52
  • One question: How can you tell in VARIABLETEXTattr2 where to split? Is it the fact that the next attribute starts with attr? Or a uppercase/lowercase switch? Commented Feb 11, 2010 at 20:38
  • Sorry Tim, there was supposed to be a space between VARIABLETEXT and attr2. Commented Feb 11, 2010 at 20:58

3 Answers

4

Do not use regex to fix/parse/process markup languages. Read here why.

Use a forgiving parser like tidy to read and fix the document in a few easy steps. There is a Java library (jtidy) you can use.


5 Comments

Thank you for that thread reference. It made life worth living.
Yeah I've read that. Can anyone just help me with the regex without preaching?
No, I'm sorry. Because there is no way to get it 100% right, there is always some weird corner case. Why is using a parser not an option?
I'll settle for 89% right then. Thanks for the parser idea. It's not not an option. I just don't have the time to do it right right now, which is why I came here for regex help.
If you suggest using tidy you should suggest the configuration options that would let OP achieve what (s)he wants
2

OK, given your constraints, you could:

Search for

<tag attr1\s*=\s*([^" >]+)

and replace with

<tag attr1 = "\1"

So, in Java, that could be (according to RegexBuddy):

String resultString = subjectString.replaceAll("<tag attr1\\s*=\\s*([^\" >]+)", "<tag attr1 = \"$1\"");
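To check that replacement against the sample lines from the question, a minimal self-contained sketch could look like the following. The class name `Attr1Quoter` and method name `quoteAttr1` are made up for illustration, and the sketch assumes the unquoted value never contains a space, a double quote, or `>`, and is followed directly by a space or `>`.

```java
import java.util.regex.Pattern;

final class Attr1Quoter {
    // Same pattern as above, compiled once. Group 1 captures the unquoted value.
    // An already-quoted value starts with '"', which [^" >]+ refuses to match,
    // so those tags are left alone.
    private static final Pattern UNQUOTED_ATTR1 =
            Pattern.compile("<tag attr1\\s*=\\s*([^\" >]+)");

    static String quoteAttr1(String xml) {
        return UNQUOTED_ATTR1.matcher(xml).replaceAll("<tag attr1 = \"$1\"");
    }

    public static void main(String[] args) {
        // The five sample lines from the question.
        String input = String.join("\n",
                "<tag attr1 = VARIABLETEXT>",
                "<tag attr1 = \"VARIABLETEXT\">not quoted</tag>",
                "<tag attr1 = VARIABLETEXT attr2 = \"true\">",
                "<otherTag>buncha junk</otherTag>",
                "<tag attr1 = \"VARIABLETEXT\">\"quoted\"</tag>");
        System.out.println(quoteAttr1(input));
    }
}
```

Only the first and third sample lines should change; the others already have a quoted `attr1` or do not contain the tag at all.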

EDIT: Simplified regex a bit more.

1 Comment

Sorry, there is definitely a space between the variable text and attr2.
0

This solution wraps the first occurrence of an unquoted attribute value, even if it is in between other properly-quoted attributes (or is the first or last attribute):

<a id="a2" href=https://twitter.com/nlm_nih class="ff">

becomes:

<a id="a2" href="https://twitter.com/nlm_nih" class="ff">

    final String SPACE = " \r\n";
    final String ATTNAME_PATTERN = "[a-z]+(?:[-][a-z]+)*";

    // Remove any spaces before and after = (simplifies next regex)
    String wrappedAtts = targetHtml.replaceAll("[" + SPACE + "]*=[" + SPACE + "]*", "=");

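    // Quote the first unquoted value; '>' is excluded from the value so that a
    // trailing attribute does not swallow the tag's closing bracket
    wrappedAtts = wrappedAtts.replaceAll("([<][a-z]+(?:[" + SPACE + "]+" + ATTNAME_PATTERN + "[=][\"][^\"]*[\"])*)[" + SPACE + "]+(" + ATTNAME_PATTERN + ")=([^\"" + SPACE + ">]+)", "$1 $2=\"$3\"");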

If you need to handle multiple occurrences in a tag, just put that last line in a loop and iterate until you don't find any more.
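A minimal sketch of that loop, repeating the replacement until a pass changes nothing. The class name `AttributeQuoter` and method name `quoteAll` are hypothetical. The pattern is a variant of the one above that assumes lowercase tag and attribute names and values containing no whitespace, double quote, or `>`. Note that the first step strips whitespace around every `=` in the input, including any that appear in text content.

```java
import java.util.regex.Pattern;

final class AttributeQuoter {
    private static final String SPACE = " \r\n";
    private static final String ATTNAME = "[a-z]+(?:-[a-z]+)*";

    // Whitespace on either side of '=' (removed to simplify the main pattern).
    private static final Pattern AROUND_EQUALS =
            Pattern.compile("[" + SPACE + "]*=[" + SPACE + "]*");

    // Group 1: tag name plus any leading, properly quoted attributes.
    // Group 2: name of the first unquoted attribute. Group 3: its value.
    private static final Pattern FIRST_UNQUOTED = Pattern.compile(
            "(<[a-z]+(?:[" + SPACE + "]+" + ATTNAME + "=\"[^\"]*\")*)"
            + "[" + SPACE + "]+(" + ATTNAME + ")=([^\"" + SPACE + ">]+)");

    static String quoteAll(String html) {
        String current = AROUND_EQUALS.matcher(html).replaceAll("=");
        while (true) {
            // Each pass quotes at most one attribute per tag.
            String next = FIRST_UNQUOTED.matcher(current).replaceAll("$1 $2=\"$3\"");
            if (next.equals(current)) {
                return current; // nothing left to quote
            }
            current = next;
        }
    }
}
```

Calling `AttributeQuoter.quoteAll(targetHtml)` then replaces both steps above. The loop terminates because every pass that changes the string quotes at least one more attribute, and a quoted attribute no longer matches.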
