C# Regex filter problems

Question

I posted a question earlier about the same kind of Regex problem. It has given me headaches; I have read loads of documentation on how to use regex, but I still can't put my finger on it. I don't want to waste another 6 hours trying to filter what are (I think) simple expressions.

So basically what I want to do is match all file names that end in an HTML extension (the '*' stars come from a WinForms TabControl and signify that the file has been modified). I also need the match to ignore case:

.html, .htm, .shtml, .shtm, .xhtml
.html*, .htm*, .shtml*, .shtm*, .xhtml*

Also filtering some CSS files:

.css
.css*

And some SQL Files:

.sql, .ddl, .dml
.sql*, .ddl*, .dml*

My previous question got an answer to filtering Python files:

.py, .py3, .pyi, .pyx, .pyw
Expression would be: \.py[3ixw]?\*?$

But when I tried to adapt the expression above, I always ended up matching only .xhtml; the rest were rejected.

For the HTML expression, I currently have this: \.html|.html|.shtml|.shtm|.xhtml\*?$ with RegexOptions.IgnoreCase. But it only accepts .xhtml, whether upper or lower case; .html, .htm and the rest do not match. I would really appreciate an explanation of each expression you provide (so I don't have to ask the same question ever again).

Thank you.

* in wildcards stands for any 0 or more chars. You probably want (?i)\.[xs]?htm\w*$, (?i)\.css\w*$ and (?i)\.py\w*$ / (?i)\.py[3ixw]?$. Note you still do not escape all .s. — Wiktor Stribiżew
– Wiktor Stribiżew, Commented Jul 23, 2020 at 11:26
@WiktorStribiżew So the expression I had above, \.html|.html|.shtml|.shtm|.xhtml\*?$, wouldn't work because everything other than .xhtml had the . escaped?
– Kirtstar web, Commented Jul 23, 2020 at 11:32
Your question is not quite clear. See what your regex matches. — Wiktor Stribiżew
– Wiktor Stribiżew, Commented Jul 23, 2020 at 11:39
@WiktorStribiżew Thank you for providing that. One last question: I'm taking a guess here with the filter for the SQL files I mentioned above. I tried this expression: \.[a-zA-Z]+$. Would that be a correct way to implement it, or is there a more efficient way? Thanks again.
– Kirtstar web, Commented Jul 23, 2020 at 11:51
SQLite3 files can have .sq3 extension, then, you need to add digits to the regex, \.[a-zA-Z0-9]+$ — Wiktor Stribiżew
– Wiktor Stribiżew, Commented Jul 23, 2020 at 12:07

Chrᴉz remembers Monica · Accepted Answer · 2021-10-15 09:42:36Z


For such cases you can start with a simple, verbose regex and simplify it step by step into a compact one:

In C#, with IgnoreCase, this would basically be:

Regex myRegex = new Regex("PATTERN", RegexOptions.IgnoreCase);
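As a minimal sketch of that constructor call in use (the `\.css$` pattern and the file names are placeholders chosen for illustration, not part of the original answer), `RegexOptions.IgnoreCase` makes the same pattern accept any casing:

```csharp
using System;
using System.Text.RegularExpressions;

// Placeholder pattern for illustration only; the real pattern is derived in the steps that follow.
var myRegex = new Regex(@"\.css$", RegexOptions.IgnoreCase);

Console.WriteLine(myRegex.IsMatch("style.css")); // True
Console.WriteLine(myRegex.IsMatch("STYLE.CSS")); // True, because of IgnoreCase
Console.WriteLine(myRegex.IsMatch("notes.txt")); // False
```

The verbatim string (`@"..."`) keeps the backslash from having to be doubled in C# source.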

Now the pattern. The easiest starting point is to join all valid endings with OR (|), escaping every dot:

\.html|\.htm|\.shtml|\.shtm|\.xhtml|\.html*|\.htm*|\.shtml*|\.shtm*|\.xhtml*

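With .html* you presumably mean .html followed by anything, which in regex is written as .* (any character, zero or more times). Written as \.html*, the star would apply only to the preceding l. If the star is instead the literal "modified" marker from the tab title, as described in the question, use \*? (an optional literal star) in place of .* in the patterns below.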

\.html|\.htm|\.shtml|\.shtm|\.xhtml|\.html.*|\.htm.*|\.shtml.*|\.shtm.*|\.xhtml.*

Then you can take all repeating parts and group them together. Every ending starts with a dot, and since .* can also match nothing, ending.* already covers the plain ending, so the first five alternatives can be dropped:

\.(html|htm|shtml|shtm|xhtml).*
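A quick way to convince yourself that the grouping changed nothing is to run both forms over a few names and compare. This is only a sketch; the sample names are hypothetical tab titles made up for the comparison:

```csharp
using System;
using System.Text.RegularExpressions;

// The long alternation from the previous step and the grouped form.
var longForm = new Regex(
    @"\.html|\.htm|\.shtml|\.shtm|\.xhtml|\.html.*|\.htm.*|\.shtml.*|\.shtm.*|\.xhtml.*",
    RegexOptions.IgnoreCase);
var grouped = new Regex(@"\.(html|htm|shtml|shtm|xhtml).*", RegexOptions.IgnoreCase);

// Hypothetical file names, chosen only for this comparison.
string[] samples = { "index.html", "page.HTM*", "include.shtml", "doc.xhtml*", "style.css", "notes.txt" };

foreach (var name in samples)
{
    // Both columns should agree for every name.
    Console.WriteLine($"{name}: long={longForm.IsMatch(name)}, grouped={grouped.IsMatch(name)}");
}
```

The four HTML-style names match under both patterns, and style.css and notes.txt match under neither.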

Next, htm appears in every alternative, so I pull it out and make the characters before and after it optional (? means 0 or 1 occurrences):

\.(s|x)?(htm)l?.*
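The sketch below tries this final pattern on a few hypothetical names. It also includes a stricter variant that is an assumption on my part, not part of the answer: it treats the star as the literal "modified" marker from the question (`\*?`) and anchors the match at the end of the name (`$`), which is how the Python expression quoted in the question is written.

```csharp
using System;
using System.Text.RegularExpressions;

// Pattern from the answer: anything may follow the extension.
var loose = new Regex(@"\.(s|x)?(htm)l?.*", RegexOptions.IgnoreCase);

// Assumed stricter variant: optional literal '*' and nothing else up to the end of the name.
var strict = new Regex(@"\.(s|x)?html?\*?$", RegexOptions.IgnoreCase);

void Show(string name) =>
    Console.WriteLine($"{name}: loose={loose.IsMatch(name)}, strict={strict.IsMatch(name)}");

Show("index.html");      // loose=True,  strict=True
Show("index.htm*");      // loose=True,  strict=True
Show("page.SHTML");      // loose=True,  strict=True
Show("page.xhtm*");      // loose=True,  strict=True
Show("report.html.bak"); // loose=True,  strict=False
Show("style.css");       // loose=False, strict=False
```

Which of the two is right depends on whether names such as report.html.bak should count as HTML files.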

And I always check that it still works in regexstorm for .NET.

In the same way you can derive regular expressions for the other two groups (CSS and SQL) and, if needed, join them all together with | at the end.
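Applying the same steps to the CSS and SQL extensions from the question might give something like the following. The CSS, SQL, and combined patterns are my own derivation under that assumption, not expressions given in the answer, and the sample names are hypothetical:

```csharp
using System;
using System.Text.RegularExpressions;

// One pattern per file family, following the same "dot, endings, .*" shape.
var html = new Regex(@"\.(s|x)?html?.*", RegexOptions.IgnoreCase);
var css  = new Regex(@"\.css.*", RegexOptions.IgnoreCase);
var sql  = new Regex(@"\.(sql|ddl|dml).*", RegexOptions.IgnoreCase);

// All three joined with | inside a single group.
var any = new Regex(@"\.((s|x)?html?|css|sql|ddl|dml).*", RegexOptions.IgnoreCase);

string[] samples = { "index.xhtml", "theme.css*", "schema.SQL", "tables.ddl*", "main.py" };
foreach (var name in samples)
{
    string kind = html.IsMatch(name) ? "HTML"
                : css.IsMatch(name) ? "CSS"
                : sql.IsMatch(name) ? "SQL"
                : "other";
    Console.WriteLine($"{name}: {kind}, combined={any.IsMatch(name)}");
}
```

Keeping the three patterns separate is probably more useful when each family needs different handling; the combined pattern only answers whether a name belongs to any of them.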

edited Oct 15, 2021 at 9:42

answered Jul 23, 2020 at 11:52


3 Comments

Kirtstar web Over a year ago

All right, I can now see a bunch of ways to implement this. For instance, the filter for the .html files can be done like Wiktor said: (?i)\.[xs]?htm\w*$. But I want to use RegexOptions.IgnoreCase. Is that the best way to make it case insensitive? Also, I find that .shtm and .shtml do not match.

Chrᴉz remembers Monica Over a year ago

@Kirtstarweb In other languages you set these flags with /gmi at the end (global, multiline, case insensitive); in .NET you set them through RegexOptions, as in the constructor call in the answer. The inline (?i) works in .NET as well. [xs] (a character class) is essentially the same as (x|s). Wiktor also replaced .* (any character) with \w*, which matches only word characters (letters, digits, and the underscore), so no white space, line breaks, or symbols such as *. Those are further steps to refine an expression.
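The points in this comment can be checked with a short sketch. The file names are hypothetical, and the last pattern (with `\*?` added) is an assumed variant for names that carry the "modified" star, not something either commenter wrote. Note that in .NET `\w` covers letters, digits, and the underscore, but not `*`.

```csharp
using System;
using System.Text.RegularExpressions;

// Three spellings that should behave the same way.
var inlineFlag = new Regex(@"(?i)\.[xs]?htm\w*$");
var optionFlag = new Regex(@"\.[xs]?htm\w*$", RegexOptions.IgnoreCase);
var groupForm  = new Regex(@"\.(x|s)?htm\w*$", RegexOptions.IgnoreCase);

Console.WriteLine(inlineFlag.IsMatch("page.SHTM"));  // True
Console.WriteLine(optionFlag.IsMatch("page.SHTM"));  // True
Console.WriteLine(groupForm.IsMatch("page.SHTM"));   // True

// \w* accepts trailing digits but not the '*' marker.
Console.WriteLine(inlineFlag.IsMatch("page.html5")); // True
Console.WriteLine(inlineFlag.IsMatch("page.html*")); // False

// Assumed variant: allow one optional literal '*' before the end.
var withStar = new Regex(@"(?i)\.[xs]?htm\w*\*?$");
Console.WriteLine(withStar.IsMatch("page.html*"));   // True
```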

Kirtstar web Over a year ago

I think I see it now. I'll consider which expression suits it best. Thank you for your help.
