We receive XML files as input, and their format is not under our control.

<?xml version="1.0" encoding="UTF-8"?>
<GroupFile..>
    <Group id="10" desc="Description">
        <Member id="117">&#x00B0;</Member>
    </Group>    
</GroupFile>

This file can contain HTML-style numeric character references for symbols such as "°" (written as "&#x00B0;" in hex). The file is deserialized into Group and Member objects. During deserialization the Member element value is correctly read as "°" and displayed in a grid. When the same objects are serialized back to XML, the Member value is saved as "°" instead of "&#x00B0;".

Deserialization - Correct

<Member id="117">&#x00B0;</Member> deserializes into Member object with value °

Serialization - Issue here

The same Member object with value ° serializes into <Member id="117">°</Member> instead of <Member id="117">&#x00B0;</Member>

How can this be prevented, so that the value is serialized back as "&#x00B0;"?

2 Answers

To do that, you need custom serialization and deserialization for the class.

Using HttpUtility.HtmlEncode/HtmlDecode alone is not sufficient, since HtmlEncode produces the decimal form (&#176; for "°"). I added the following (error handling could be improved) to keep hex-escaped characters in the XML output.
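The hex conversion can also be sketched without going through HtmlEncode at all. The block below is only an illustration under my own assumptions: the HexEntities class and its Encode method are hypothetical names that do not appear in the answer, and the choice to escape every character above ASCII is mine. It escapes the XML markup characters and writes each non-ASCII code point as a hex reference padded to at least four digits.

```csharp
using System;
using System.Text;

public static class HexEntities
{
    // Hypothetical helper, not part of the answer's code.
    // Escapes &, < and > and writes every code point above 127 as &#xXXXX;.
    // Unpaired surrogates are not handled specially in this sketch.
    public static string Encode(string text)
    {
        if (string.IsNullOrEmpty(text))
        {
            return text;
        }

        var sb = new StringBuilder(text.Length);
        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            int codePoint = c;

            // Combine a surrogate pair into one code point so that the
            // output contains a single, valid character reference.
            if (char.IsHighSurrogate(c) && i + 1 < text.Length && char.IsLowSurrogate(text[i + 1]))
            {
                codePoint = char.ConvertToUtf32(c, text[i + 1]);
                i++;
            }

            if (codePoint == '&') sb.Append("&amp;");
            else if (codePoint == '<') sb.Append("&lt;");
            else if (codePoint == '>') sb.Append("&gt;");
            else if (codePoint > 127) sb.Append("&#x").Append(codePoint.ToString("X4")).Append(';');
            else sb.Append(c);
        }
        return sb.ToString();
    }

    public static void Main()
    {
        // \u00B0 is the degree sign from the question.
        Console.WriteLine(Encode("25\u00B0C & rising")); // prints 25&#x00B0;C &amp; rising
    }
}
```

Because the result already contains markup, it only survives intact if it is written with WriteRaw; passing it through the normal text path would escape the leading & again.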

Update: To stop the serializer from escaping the special characters automatically, the class must implement its own XML serialization (IXmlSerializable), as shown below, and write the value with WriteRaw.

If you use the XmlSerializer:

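using System;
using System.Web;                 // HttpUtility
using System.Xml;
using System.Xml.Schema;
using System.Xml.Serialization;

public class GroupFile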
{
    [XmlElement("Group")]
    public Group[] Groups { get; set; }
}

public class Group
{
    [XmlAttribute("id")]
    public int Id { get; set; }

    [XmlElement("Member")]
    public Member[] Members { get; set; }
}

[Serializable]
public class Member : IXmlSerializable
{

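    // Rewrites every decimal character reference produced by HtmlEncode
    // (e.g. "&#176;") as a four-digit hex reference (e.g. "&#x00B0;"),
    // leaving the rest of the text untouched.
    public static string DecimalToHexadecimalEncoding(string html)
    {
        if (string.IsNullOrEmpty(html))
        {
            return html;
        }
        return System.Text.RegularExpressions.Regex.Replace(
            html,
            @"&#(\d+);",
            m => "&#x" + Int32.Parse(m.Groups[1].Value).ToString("X4") + ";");
    }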

    [XmlAttribute("id")]
    public int Id { get; set; }       

    [XmlIgnore]
    public string Value { get; set; }

    [XmlText]
    public string HexValue
    {
        get
        {
            // convert to hex representation
            var res = HttpUtility.HtmlEncode(Value);
            res = DecimalToHexadecimalEncoding(res);
            return res;
        }
    }

    public XmlSchema GetSchema()
    {
        return null;
    }

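    public void ReadXml(XmlReader reader)
    {
        var attributeValue = reader.GetAttribute("id");
        if (attributeValue != null)
        {
            Id = Int32.Parse(attributeValue);
        }
        // The reader has already resolved the character reference,
        // so the value arrives here as the string "°".
        // ReadElementString also consumes the closing </Member> tag,
        // so no extra ReadEndElement call is needed.
        Value = reader.ReadElementString();
    }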

    public void WriteXml(XmlWriter writer)
    {
        writer.WriteAttributeString("id", Id.ToString());
        writer.WriteRaw(HexValue);
    }
}
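The reason the update switches to WriteRaw can be seen in isolation with a plain XmlWriter. This is a minimal sketch, not the answer's code: the RawVersusEscaped class and its Write helper are hypothetical names introduced only for the demonstration.

```csharp
using System;
using System.Text;
using System.Xml;

public static class RawVersusEscaped
{
    // Hypothetical helper: writes one <Member id="117"> element and lets
    // the caller decide how the element content is written.
    public static string Write(Action<XmlWriter> writeBody)
    {
        var sb = new StringBuilder();
        var settings = new XmlWriterSettings { OmitXmlDeclaration = true };
        using (var writer = XmlWriter.Create(sb, settings))
        {
            writer.WriteStartElement("Member");
            writer.WriteAttributeString("id", "117");
            writeBody(writer);
            writer.WriteEndElement();
        }
        return sb.ToString();
    }

    public static void Main()
    {
        // WriteString treats its argument as text and escapes the ampersand.
        Console.WriteLine(Write(w => w.WriteString("&#x00B0;")));
        // prints <Member id="117">&amp;#x00B0;</Member>

        // WriteRaw copies its argument to the output unchanged.
        Console.WriteLine(Write(w => w.WriteRaw("&#x00B0;")));
        // prints <Member id="117">&#x00B0;</Member>
    }
}
```

Since WriteRaw performs no checks, whatever is passed to it must already be well-formed XML content, which is why the value has to be fully encoded before it reaches WriteXml.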

3 Comments

&, <, > etc. are special XML characters, which are escaped during XML serialization. The question is about HTML entity symbols like °, which deserialize properly from their escape codes but are serialized as the literal character rather than as the code. <Member id="117">&#x00B0;</Member> deserializes properly and I get the member value °. But when serializing back I get <Member id="117">°</Member> instead of <Member id="117">&#x00B0;</Member>.
@Socrates I've updated the answer with your comment.
Take a Member object with Value = "°". HexValue after calling DecimalToHexadecimalEncoding will be "&#x00b0;". During serialization the & will be escaped automatically by the serializer, so the end result will be <Member id="117">&amp;#x00b0;</Member>, which makes the result invalid again.

You can use HSharp to deserialize HTML. HSharp is a library for analyzing markup languages such as HTML quickly and easily. Install it with: Install-Package Obisoft.HSharp

var NewDocument = HtmlConvert.DeserializeHtml($@"
<html>
<head>
    <meta charset={"\"utf-8\""}>
    <meta name={"\"viewport\""}>
    <title>Example</title>
</head>
<body>
<h1>Some Text</h1>
<table>
    <tr>OneLine</tr>
    <tr>TwoLine</tr>
    <tr>ThreeLine</tr>
</table>
</body>
</html>");

Console.WriteLine(NewDocument["html"]["head"]["meta",0].Properties["charset"]);
Console.WriteLine(NewDocument["html"]["head"]["meta",1].Properties["name"]);
foreach (var Line in NewDocument["html"]["body"]["table"])
{
    Console.WriteLine(Line.Son);
}

That will output:

utf-8
viewport
OneLine
TwoLine
ThreeLine

You can also use foreach to iterate over the tags in html.
