I posted an earlier question about the same kind of Regex problem. It has given me headaches: I have read loads of documentation on how to use regex, but I still cannot put my finger on it. I don't want to waste another 6 hours trying to write what are (I think) simple filter expressions.

So basically, what I want to do is filter all file names that end in HTML extensions (the '*' stars come from a WinForms TabControl and signify that the file has been modified). I also need the match to ignore case:

.html, .htm, .shtml, .shtm, .xhtml
.html*, .htm*, .shtml*, .shtm*, .xhtml*

Also filtering some CSS files:

.css
.css*

And some SQL Files:

.sql, .ddl, .dml
.sql*, .ddl*, .dml*

My previous question got an answer to filtering Python files:

.py, .py3, .pyi, .pyx, .pyw
Expression would be: \.py[3ixw]?\*?$

But when I tried to adapt the expression above, I always ended up with a filter that opens only .xhtml files; the rest are not accepted.

For the HTML expression, I currently have this: \.html|.html|.shtml|.shtm|.xhtml\*?$ with RegexOptions.IgnoreCase. But it only accepts .xhtml, whether case sensitive or insensitive; .html, .htm and the rest do not match. I would really appreciate an explanation of each expression you provide (so I don't have to ask the same question ever again).

Thank you.

  • * in wildcards stands for any 0 or more chars. You probably want (?i)\.[xs]?htm\w*$, (?i)\.css\w*$ and (?i)\.py\w*$ / (?i)\.py[3ixw]?$. Note you still do not escape all .s. Commented Jul 23, 2020 at 11:26
  • @WiktorStribiżew So the expression I had above, \.html|.html|.shtml|.shtm|.xhtml\*?$ wouldn't work because everything other than .xhtml had the . escaped? Commented Jul 23, 2020 at 11:32
  • Your question is not quite clear. See what your regex matches. Commented Jul 23, 2020 at 11:39
  • @WiktorStribiżew Thank you for providing that. One last question: I'm taking a guess here with the filters for the SQL files I mentioned above. I tried out this expression: \.[a-zA-Z]+$. Would that be a correct way to implement it? Or is there a more efficient way? Thanks again. Commented Jul 23, 2020 at 11:51
  • SQLite3 files can have .sq3 extension, then, you need to add digits to the regex, \.[a-zA-Z0-9]+$ Commented Jul 23, 2020 at 12:07
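
To see what the pattern from the question actually matches, here is a minimal sketch. The sample file names are made up for illustration, and how the asker applied the pattern in their program is not shown, so this only probes the pattern itself. Two things stand out: alternation (|) has the lowest precedence, so \*?$ belongs only to the last alternative, and an unescaped . matches any character.

```csharp
using System;
using System.Text.RegularExpressions;

// The pattern from the question, unchanged.
var original = new Regex(@"\.html|.html|.shtml|.shtm|.xhtml\*?$", RegexOptions.IgnoreCase);

// Hypothetical tab titles, chosen to show each effect.
string[] samples = { "index.html", "index.htm", "notes.html.bak", "xhtml_notes.txt" };

foreach (var name in samples)
{
    Match m = original.Match(name);
    // An unanchored alternative can match in the middle of the name.
    Console.WriteLine($"{name}: matched={m.Success} value='{m.Value}'");
}
```

"index.htm" fails because no alternative spells htm without the l, while "notes.html.bak" and "xhtml_notes.txt" are accepted because the first four alternatives are not anchored to the end and .html also matches "xhtml".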

1 Answer

For such cases, you can start with a simple, verbose regex and simplify it step by step into a good expression.

In C# with IgnoreCase, this would basically be

Regex myRegex = new Regex("PATTERN", RegexOptions.IgnoreCase);
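
As a minimal sketch of how that constructor is used, assuming a placeholder pattern (\.html$ is only for illustration, not the final expression) and made-up file names:

```csharp
using System;
using System.Text.RegularExpressions;

// Placeholder pattern for illustration; the real one is built up step by step.
var myRegex = new Regex(@"\.html$", RegexOptions.IgnoreCase);

bool lower = myRegex.IsMatch("index.html"); // same case as the pattern
bool upper = myRegex.IsMatch("INDEX.HTML"); // accepted because of IgnoreCase
bool other = myRegex.IsMatch("index.css");  // different extension

Console.WriteLine($"{lower} {upper} {other}"); // prints "True True False"
```

The verbatim string (@"...") keeps the backslash in \. from being treated as a C# escape sequence.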

Now the pattern. The easiest one is simply concatenating all valid results with OR (|), escaping where needed:

\.html|\.htm|\.shtml|\.shtm|\.xhtml|\.html*|\.htm*|\.shtml*|\.shtm*|\.xhtml*

If by .html* you mean .html + anything, that is written as .* (any character, 0 to infinite times) in regex:

\.html|\.htm|\.shtml|\.shtm|\.xhtml|\.html.*|\.htm.*|\.shtml.*|\.shtm.*|\.xhtml.*

Then, you can take all repeating patterns and group them together. All file endings start with a dot and may have an optional tail, and because .* can also match nothing, ending.* already covers the bare ending, so the alternatives without .* are redundant:

\.(html|htm|shtml|shtm|xhtml).*

Then, I see htm pretty often, so I try to extract that, taking all possible characters before and after htm together (? means 0 or 1 occurrences):

\.(s|x)?(htm)l?.*
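
A quick way to check the simplified pattern is to run it over a few names. The sketch below does that and, as an assumption on my part, compares it with a stricter variant (\.[sx]?html?\*?$) that treats the * as a literal, optional asterisk and anchors the match to the end of the name. The variable names and sample file names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

// The simplified pattern from the step above.
var loose = new Regex(@"\.(s|x)?(htm)l?.*", RegexOptions.IgnoreCase);

// Assumed stricter variant: literal optional '*' and an end anchor.
var strict = new Regex(@"\.[sx]?html?\*?$", RegexOptions.IgnoreCase);

string[] samples =
{
    "a.html", "a.htm", "a.SHTML", "a.shtm*", "a.xhtml*", "a.css", "notes.html.bak"
};

foreach (var name in samples)
{
    // Both patterns agree on every sample except "notes.html.bak".
    Console.WriteLine($"{name,-16} loose={loose.IsMatch(name),-5} strict={strict.IsMatch(name)}");
}
```

Because .* accepts any tail and nothing anchors the match, the loose pattern also accepts "notes.html.bak"; whether that is acceptable depends on how the filter is used.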

And I always check that it's still working in regexstorm for .NET.

That way, you can also build regular expressions for the other two groups and concatenate them all together in the end.
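
As a rough sketch of where that could end up, assuming the * is a literal asterisk appended to modified tab titles: the patterns below, the Classify helper, and the sample names are my own illustration, not something taken from the question.

```csharp
using System;
using System.Text.RegularExpressions;

// One pattern per file group; \*? allows an optional literal '*' before the end.
var html = new Regex(@"\.[sx]?html?\*?$", RegexOptions.IgnoreCase);
var css = new Regex(@"\.css\*?$", RegexOptions.IgnoreCase);
var sql = new Regex(@"\.(?:sql|ddl|dml)\*?$", RegexOptions.IgnoreCase);

// All groups concatenated into a single alternation.
var any = new Regex(@"\.(?:[sx]?html?|css|sql|ddl|dml)\*?$", RegexOptions.IgnoreCase);

// Hypothetical helper: names the group a tab title belongs to.
string Classify(string tabTitle) =>
    html.IsMatch(tabTitle) ? "html" :
    css.IsMatch(tabTitle) ? "css" :
    sql.IsMatch(tabTitle) ? "sql" : "other";

foreach (var title in new[] { "page.xhtml", "style.CSS*", "schema.ddl", "script.py", "data.sqlite" })
{
    Console.WriteLine($"{title}: {Classify(title)} (any={any.IsMatch(title)})");
}
```

The (?:...) group keeps the alternatives together without capturing them, so \.  and \*?$ apply to every alternative rather than only the first or last one.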

3 Comments

All right, I can now see a bunch of ways to implement this. For instance, the filter for the .html files can be done like Wiktor said: (?i)\.[xs]?htm\w*$. But I want it with RegexOptions.IgnoreCase. Is this the best way to make it case insensitive? Also, I find that .shtm and .shtml do not match.
@Kirtstarweb In other languages you set these flags with /gmi at the end (global, multiline, case insensitive); in .NET you can set them through RegexOptions. The inline (?i) works in .NET as well. And [xs] (character class) is essentially the same as (x|s). Wiktor also replaced your .* (any char) with \w, which matches only word characters (letters, digits, and underscore), so no whitespace or line breaks. Those are further steps to refine an expression.
I think I see it now. I'll consider which expression suits it best. Thank you for your help.
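
The points from these comments can be checked with a short sketch. The sample names are made up, and the last check reflects my reading that the * in the question is a literal asterisk in the tab title.

```csharp
using System;
using System.Text.RegularExpressions;

// Inline flag and RegexOptions.IgnoreCase are two ways to request the same thing.
var inline = new Regex(@"(?i)\.[xs]?htm\w*$");
var option = new Regex(@"\.[xs]?htm\w*$", RegexOptions.IgnoreCase);

foreach (var name in new[] { "page.shtm", "PAGE.SHTML", "page.html*" })
{
    Console.WriteLine($"{name}: inline={inline.IsMatch(name)} option={option.IsMatch(name)}");
}

// \w covers letters, digits, and underscore, but not '*'.
bool digitIsWord = Regex.IsMatch("7", @"^\w$");
bool starIsWord = Regex.IsMatch("*", @"^\w$");
Console.WriteLine($"digit={digitIsWord} star={starIsWord}");
```

With this pattern .shtm and .shtml are accepted, but a title ending in a literal * is not, since \w does not match an asterisk; adding \*? before $ would cover that case.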
