My program receives text like this:

Ký Sinh Trùng   - (2019)

This is wrong; it should read as follows:

Ký Sinh Trùng  - (2019)

I tried the following code, but it had no effect:

byte[] bytes = Encoding.Default.GetBytes(nodes.InnerText);
var myString = Encoding.UTF8.GetString(bytes);

How can I fix this?

Update: Full Code:

HtmlWeb Webget = new HtmlWeb();

var docx = await Webget.LoadFromWebAsync(@"https://isubtitles.org/search?kwd=parasite");

var items = docx.DocumentNode.SelectNodes("//div[@class='movie-list-info']");

foreach (var node in items)
{
    var name = node?.SelectSingleNode(".//div/div[2]/h3/a");
    var xxxx = name?.InnerText;

    byte[] bytes = Encoding.UTF8.GetBytes(xxxx);
    var myString = Encoding.UTF8.GetString(bytes);
    Debug.WriteLine(myString);
    return;
}
  • Those characters appear to be HTML encoded. I would double check whether the source has these HTML encoded characters to determine whether HtmlAgilityPack is to blame. Commented Sep 3, 2021 at 9:11
  • There's no problem at all. A sequence such as &#253; is the HTML-encoded form of ý, not UTF8. Browsers will display it just fine. This page is UTF8, which is why it can display Ký Sinh Trùng or Αυτό Εδώ without requiring explicit ... HTML encoding Commented Sep 3, 2021 at 9:37
  • UTF8 specifies how text is converted to bytes. It doesn't specify any kind of escape sequences. It's no different to ASCII/Latin1 in that regard. Almost all web sites use UTF8. Commented Sep 3, 2021 at 9:42

1 Answer

That's just HTML encoded text. It's fine. If you need to decode it, then:

System.Net.WebUtility.HtmlDecode(theHtmlEncodedString)

https://learn.microsoft.com/en-us/dotnet/api/system.net.webutility.htmldecode?view=net-5.0

or (if you have System.Web loaded):

System.Web.HttpUtility.HtmlDecode(theHtmlEncodedString)

https://learn.microsoft.com/en-us/dotnet/api/system.web.httputility.htmldecode?view=net-5.0
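As a minimal sketch of the first option, the snippet below decodes a sample string. The input is an assumption made for illustration (numeric character references plus &nbsp;), since the raw markup the site returns is not shown in the question.

```csharp
using System;
using System.Net;

class HtmlDecodeSketch
{
    static void Main()
    {
        // Hypothetical input: what an HTML-encoded title might look like.
        string encoded = "K&#253; Sinh Tr&#249;ng &nbsp;- (2019)";

        // HtmlDecode turns character references back into the characters they stand for.
        // Note that &nbsp; becomes a non-breaking space (U+00A0), not an ordinary space.
        string decoded = WebUtility.HtmlDecode(encoded);

        Console.WriteLine(decoded);
    }
}
```

In the loop from the question, the same call would presumably be applied to name?.InnerText in place of the GetBytes/GetString round trip, which only converts the string to bytes and back and so leaves it unchanged.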

2 Comments

Thank you, it works now. One more question: if the text is not encoded, will decoding it cause a problem?
@karma Only if it coincidentally contains escape sequences such as &nbsp;, &lt; etc. This seems unlikely.
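
The point in the last comment can be checked with a short sketch. The sample strings are made up for illustration; only WebUtility.HtmlDecode comes from the answer.

```csharp
using System;
using System.Net;

class DecodePlainTextSketch
{
    static void Main()
    {
        // Text with no escape sequences passes through unchanged.
        Console.WriteLine(WebUtility.HtmlDecode("Parasite - (2019)"));   // prints "Parasite - (2019)"

        // A lone ampersand that does not start an entity is left alone.
        Console.WriteLine(WebUtility.HtmlDecode("Tom & Jerry"));         // prints "Tom & Jerry"

        // Literal text that happens to contain an entity is converted,
        // which is the unlikely case the comment warns about.
        Console.WriteLine(WebUtility.HtmlDecode("use &lt; for less-than")); // prints "use < for less-than"
    }
}
```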