I want a regex solution to find text values that look like MLA818214667, where the value appears in an id attribute such as id="MLA818214667". A value should be matched in the HTML only if it meets these three conditions:

  1. It starts with MLA and is placed inside id="".
  2. The number after MLA is more than 6 characters long.
  3. The number is fully numeric, not mixed with letters.

Note: I want to avoid HtmlAgilityPack here because the text is not always valid HTML. So I want to treat it as plain text rather than HTML, and I need a solution that does not use any HTML parser.

C#:

var listOfIds = new List<string>();
string html = @"below html sample goes here";

// My current attempt; this pattern does not find the ids
Match match = Regex.Match(html, @"/([A-Za-z0-9\-]+)\.$",
            RegexOptions.IgnoreCase);

// The matched ids should be added to listOfIds

Html:

<span class="main-title">
  Casco Integral Halcon H57 + Combo Termico Invierno Sti Motos
</span>
</h2>
<div class="item__status">
  <div class="item__condition">541 vendidos</div>
</div>
</div>
</a>
<form class="item__bookmark-form" action="/search/bookmarks/MLA614364106/make" method="post" id="bookmarkForm" class="bookmark-form">
  <button type="submit" class="bookmarks favorite" data-id="MLA614364106">
    <div class="item__bookmark">
      <div class="icon"></div>
    </div>
  </button>
  <input type="hidden" name="method" value='add'/>
  <input type="hidden" name="itemId" value='MLA614364106'/>
  <input type="hidden" name="_csrf" value="5fe7b4e6-19d3-42bc-a3bb-15eaeee81f64"/>
</form>
</div>
</li>
<li class="results-item highlighted article grid item-info-height-179">
  <div class="rowItem item highlighted item--grid item--has-row-logo new" id="MLA751765547">
    <div class="item__image item__image--grid">
      <div class="images-viewer" item-url="https://articulo.mercadolibre.com.ar/MLA-751765547-casco-moto-hawk-htl-dr46-rebatible-lett-store-_JM#position=5&amp;type=item&amp;tracking_id=897c653e-1565-4371-8a4d-b2ea29d09d4d" item-id="MLA751765547">
        <div class="carousel">
          <ul>
            <li><a href="https://articulo.mercadolibre.com.ar/MLA-751765547-casco-moto-hawk-htl-dr46-rebatible-lett-store-_JM#position=5&amp;type=item&amp;tracking_id=897c653e-1565-4371-8a4d-b2ea29d09d4d" class="item-link item__js-link">
                  <img class='lazy-load' width='284' height='284' alt='Casco Moto Hawk Htl Dr46 Rebatible Lett Store' src='https://http2.mlstatic.com/casco-moto-hawk-htl-dr46-rebatible-lett-store-D_NQ_NP_624166-MLA31021954439_062019-W.jpg'/>
                </a>
            </li>
          </ul>
        </div>
      </div>
    </div>
    <span class="item-loading__status-bar item-loading__hide"></span>
    <a href="https://articulo.mercadolibre.com.ar/MLA-751765547-casco-moto-hawk-htl-dr46-rebatible-lett-store-_JM#position=5&amp;type=item&amp;tracking_id=897c653e-1565-4371-8a4d-b2ea29d09d4d" class="item__info-link item__js-link">
      <div class="item__info ">
        <div class="item__price ">
          <span class="price__symbol">$</span>
          <span class="price__fraction">3.725</span>
        </div>
        <span class="item-installments item__installments--show-card-icon highlighted free-interest item--has-shipping">
          <span class="item-installments-text">Hasta 6 cuotas sin inter&eacute;s</span>
        </span>
        <div class="item__shipping-promise item__shipping highlighted free-shipping">
          <span class="text-shipping next_day">Llega gratis el lunes</span>
        </div>
        <div class="item__brand-logo item__brand-img--ultra-wide">
          <span class="item__brand-img-container">
            <img src="https://http2.mlstatic.com/D_NQ_NP_796276-MLA31050681849_062019-T.jpg"/>
          </span>
        </div>
        <h2 class="item__title list-view-item-title">
          <span class="main-title">Casco Moto Hawk Htl Dr46 Rebatible Lett Store</span>
        </h2>
        <div class="item__status">
          <div class="item__condition">362 vendidos</div>
        </div>
      </div>
    </a>
    <form class="item__bookmark-form" action="/search/bookmarks/MLA751765547/make" method="post" id="bookmarkForm" class="bookmark-form">
      <button type="submit" class="bookmarks favorite" data-id="MLA751765547">
        <div class="item__bookmark">
          <div class="icon"></div>
        </div>
      </button>
      <input type="hidden" name="method" value='add'/>
      <input type="hidden" name="itemId" value='MLA751765547'/>
      <input type="hidden" name="_csrf" value="5fe7b4e6-19d3-42bc-a3bb-15eaeee81f64"/>
    </form>
  </div>
</li>
<li class="results-item highlighted article grid item-info-height-179">
  <div class="rowItem item highlighted item--grid item--has-row-logo new to-item" id="MLA817988063">
    <div class="item__image item__image--grid">
      <div class="images-viewer" item-url="https://articulo.mercadolibre.com.ar/MLA-817988063-cascos-motos-vega-vflow-motocross-mx-enduro-atv-acces-cam-_JM#position=6&amp;type=item&amp;tracking_id=897c653e-1565-4371-8a4d-b2ea29d09d4d" item-id="MLA817988063">
        <div class="carousel">
          <ul>
            <li>
              <a href="https://articulo.mercadolibre.com.ar/MLA-817988063-cascos-motos-vega-vflow-motocross-mx-enduro-atv-acces-cam-_JM#position=6&amp;type=item&amp;tracking_id=897c653e-1565-4371-8a4d-b2ea29d09d4d" class="item-link item__js-link">
                <img class='lazy-load' width='284' height='284' alt='Cascos Motos Vega Vflow Motocross Mx Enduro Atv + Acces Cam' src='https://http2.mlstatic.com/cascos-motos-vega-vflow-motocross-mx-enduro-atv-acces-cam-D_NQ_NP_629038-MLA32405702773_102019-W.jpg' />
              </a>
            </li>
          </ul>
        </div>
  • 1) Is there a reason you can't use an HTML parser (example) and then validate the Id? 2) The value you are trying to match doesn't even exist in the HTML sample you provided. Commented Nov 9, 2019 at 8:27
  • Mandatory reading Commented Nov 9, 2019 at 8:28
  • Yes, in this HTML you will find multiple values like MLA1234567891. I want to treat it as normal text and find the values that fit the pattern. I don't want to treat the document as HTML, because it is not always valid HTML; the text is totally random. @AhmedAbdelhameed Commented Nov 9, 2019 at 8:30
  • With HtmlAgilityPack, you may even extract it all without a regex. Or with both HtmlAgilityPack and a small regex, it will at least make id identification safer. Commented Nov 9, 2019 at 8:46
  • @WiktorStribiżew I know I can go with HtmlAgilityPack, but as I said in my previous comments, the input is not always valid HTML. Sometimes it is a random string, so I can't use HtmlAgilityPack in this case. I am looking to avoid HtmlAgilityPack. Commented Nov 9, 2019 at 8:48

1 Answer

You can use the pattern "id=\"(MLA[0-9]{6,})\"" to find all the id values in the HTML.

Paste the regex into https://regex101.com to see how it works.

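using System.Collections.Generic;
using System.Text.RegularExpressions;

class Program
{
    static void Main(string[] args)
    {
        var listOfIds = new List<string>();
        string html = " id=\"MLA12334566\"  id=\"MLA123354566\" id=\"MLA123346566\"";

        // Group 1 captures the id value without the surrounding id="..."
        Regex idRegex = new Regex("id=\"(MLA[0-9]{6,})\"");

        MatchCollection matches = idRegex.Matches(html);

        // Type the loop variable as Match so Groups is accessible
        foreach (Match match in matches)
        {
            // match.ToString() would return the whole id="..." text
            listOfIds.Add(match.Groups[1].Value);
        }
    }
}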
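
One thing to watch with the pattern above: the text id="MLA..." also occurs inside data-id="MLA..." and item-id="MLA..." in the question's sample, so those attributes would be matched as well. The following is a minimal sketch of a stricter variant, not a definitive implementation. The class name MlaIdExtractor and the method Extract are hypothetical names chosen for illustration. It assumes that "more than 6 characters" means at least 7 digits, that single-quoted values should also count, and that duplicates are unwanted.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, named for illustration only.
public static class MlaIdExtractor
{
    // (?<![\w-])     "id" must not follow a word character or a hyphen,
    //                which skips data-id="..." and item-id="..."
    // \s*=\s*        tolerate whitespace around the equals sign
    // (["'])         group 1: the opening quote, double or single
    // (MLA[0-9]{7,}) group 2: MLA plus at least 7 ASCII digits
    //                (assumption: "more than 6" is read as 7 or more)
    // \1             the closing quote must equal the opening quote
    private static readonly Regex IdPattern = new Regex(
        @"(?<![\w-])id\s*=\s*([""'])(MLA[0-9]{7,})\1");

    // Returns the distinct ids in order of first appearance.
    public static List<string> Extract(string text)
    {
        var ids = new List<string>();
        var seen = new HashSet<string>();
        foreach (Match m in IdPattern.Matches(text))
        {
            string id = m.Groups[2].Value;
            if (seen.Add(id))   // Add returns false for a repeated id
            {
                ids.Add(id);
            }
        }
        return ids;
    }
}
```

Under these assumptions, calling MlaIdExtractor.Extract(html) on the question's sample should yield only the ids of the two rowItem divs, MLA751765547 and MLA817988063. The match is case-sensitive, so an attribute written as ID= would be ignored unless RegexOptions.IgnoreCase is added.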

1 Comment

You posted a corrupt verbatim string literal (@"id=\"(MLA[0-9]{6,})\""); it won't compile in C#. Even if you fix that, you need to add correct code for using it, since the OP's current code won't work with this regex.
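
To illustrate the point about string literals, here is a small sketch of the two valid ways to write this pattern in C#. The class and constant names are hypothetical. In a regular literal a double quote is escaped with a backslash; in a verbatim literal (prefixed with @) it is escaped by doubling it, which is why \" inside @"..." ends the string early.

```csharp
// Hypothetical holder class, named for illustration only.
public static class PatternLiterals
{
    // Regular string literal: \" produces one double-quote character.
    public const string Regular = "id=\"(MLA[0-9]{6,})\"";

    // Verbatim string literal: "" produces one double-quote character.
    public const string Verbatim = @"id=""(MLA[0-9]{6,})""";
}
```

Both constants hold the same text, so either can be passed to the Regex constructor.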
