Is this PCRE regex convertible by pcre2el to Emacs regex syntax?

Question

Is this PCRE regex convertible by pcre2el to Emacs syntax?

\[\[[^\s\/]+:\/*(\/[^]]*)\]*\[[^]]*\]\]

The result of the conversion is [ /[^]+:/*\(/[^]]*\)]*[[^]]*]] and when I apply it in a query-replace-regexp command it doesn't work.

 #+begin_src emacs-lisp
   (pcre-to-elisp "\[\[[^\s\/]+:\/*(\/[^]]*)\]*\[[^]]*\]\]")
 #+end_src

 #+RESULTS:
 : [ /[^]+:/*\(/[^]]*\)]*[[^]]*]]

I am not sure whether the original PCRE string needs some additional processing before it is passed to pcre-to-elisp, or whether it is simply a string that pcre-to-elisp is unable to translate.

The regex comes from an answer I got on the r/regex subreddit to my question "Regex to reduce repeated instances of a character to a set number (usually 1)".

That question is reproduced in full below.

This is an example of an org-mode link

[[file:/abc/def/ghi][Abc Def Ghi]]

I've found myself with a file (actually my own doing) where some of the lines have multiple slashes after the url type, eg.

[[file://////abc/def/ghi][Abc Def Ghi]]

I need a regex that can extract the actual link. I have succeeded partially, but I want to do it in one go as it will be used in a script.

So applying the regex to [[file://////abc/def/ghi][Abc Def Ghi]] should result in /abc/def/ghi.

I have come up with \[\[\([a-z0-9_/.]*\)\].* -> \1, but I need something more to strip the url type and the superfluous forward slashes, i.e. all but the last one.

Can you describe in words what the regex is supposed to match?
– NickD, Commented Sep 29, 2024 at 12:47

db48x · Accepted Answer · 2024-09-29 19:48:57Z

You forgot that every \ character inside an elisp string must be escaped, turning it into \\.
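A minimal sketch of what this means in practice, using only built-in functions except for the last form. That last form assumes pcre2el is installed and is skipped otherwise; its output is deliberately not shown.

```elisp
;; Written with single backslashes: the Lisp reader strips them
;; before any regex code sees the string.
(string= "\[\[" "[[")        ; => t, both are just two brackets
(string= "\s" " ")           ; => t, PCRE's \s has become a literal space

;; Written with doubled backslashes: the string keeps one backslash
;; per pair, which is what the regex engine needs.
(length "\\[\\[")            ; => 4, i.e. \ [ \ [
(string-match-p "\\[\\[" "[[file:/abc/def/ghi][Abc Def Ghi]]")  ; => 0

;; Assumption: pcre2el is installed. Same PCRE as in the question,
;; with every backslash doubled for the Lisp reader.
(when (require 'pcre2el nil t)
  (pcre-to-elisp "\\[\\[[^\\s\\/]+:\\/*(\\/[^]]*)\\]*\\[[^]]*\\]\\]"))
```

The doubling applies only to strings read as Lisp source. A regexp typed at the query-replace-regexp prompt does not go through the Lisp reader, so it is entered with single backslashes.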

Edit:

There is also the newer rx notation that provides a much more readable regex syntax. You could write your regex like this:

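;; Covers only the start of the pattern, up to the slashes after the colon.
(rx "[["
    (one-or-more (not (in space "/")))  ; link type, e.g. "file"
    ":"
    (zero-or-more "/"))                 ; any run of slashes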

See chapter 35.3.3, The ‘rx’ Structured Regexp Notation, of the Emacs Lisp Reference Manual.
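As a sketch of how the rx form might be extended to the whole link, here is one possible completion. The name my/org-link-path-rx and the choice to capture the path in group 1 are assumptions made for illustration, not part of the answer above.

```elisp
(require 'rx)

;; Hypothetical name; group 1 captures the path with a single leading slash.
(defconst my/org-link-path-rx
  (rx "[["
      (one-or-more (not (in space "/")))          ; link type, e.g. "file"
      ":"
      (zero-or-more "/")                          ; surplus slashes
      (group "/" (zero-or-more (not (in "]"))))   ; the path itself
      "]["
      (zero-or-more (not (in "]")))               ; description
      "]]"))

(let ((s "[[file://////abc/def/ghi][Abc Def Ghi]]"))
  (when (string-match my/org-link-path-rx s)
    (match-string 1 s)))                          ; => "/abc/def/ghi"
```

Because rx quotes string literals itself, no backslashes appear in the source at all, which sidesteps the double-escaping problem.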

edited Sep 29, 2024 at 19:48

answered Sep 29, 2024 at 15:40

db48x


I thought I had that covered by applying regexp-quote, which in my mind seemed to be an Elisp version of PHP's preg_quote, but it clearly didn't work. I take it then that there is no way to get elisp to process a string with backslashes in it correctly without doubling the backslashes manually first? Is there no syntax or metasyntax for doing that at all?

– vfclists, Commented Sep 29, 2024 at 19:29

Your question indicates a level of confusion. The elisp string syntax uses backslashes as an escape mechanism, and the elisp regex syntax uses backslashes as a syntax mechanism. Within a regex, a '[' is the beginning of a character class, but '\[' matches an open square bracket. But every regex goes inside a string, which treats '\[' as an escaped square bracket character. The escape mechanism means that the string ends up containing only the square bracket. So you need to double up the backslashes: write "\\[" in the string so that the regex receives \[.

– db48x, Commented Sep 29, 2024 at 19:41

There is however an alternate syntax that you can use if you want to avoid this type of thing. Let me add it to my answer.

– db48x, Commented Sep 29, 2024 at 19:42


NickD · Accepted Answer · 2024-09-30 02:41:16Z

Try this example:

#+begin_src elisp
  ;; Replace the whole link with the captured path (group 1).
  (let ((s "[[file://///abc/def/ghi][some link]]"))
    (replace-regexp-in-string
     "\\[\\[[a-z0-9_]*:/*\\(/[^]]*\\)\\].*"
     (lambda (x) (match-string 1 x))
     s))
#+end_src

The regex matches two literal opening square brackets (\\[\\[), followed by a character class that matches lower case letters, digits and underscore ([a-z0-9_]) repeated any number of times (*), followed by a literal colon (:), followed by any number of forward slashes (/*), (A) followed by a literal forward slash (/), followed by a character class that matches anything other than a closing square bracket ([^]]) repeated any number of times (*), (B) followed by a literal closing square bracket (\\]). The part between (A) and (B) is remembered because it is enclosed in funny parentheses (\\( and \\)) and can be retrieved on its own (here with (match-string 1 x)).

So the remembered part starts at the last forward slash before the a of the pathname, followed by everything that is not a closing square bracket and it stops before the closing square bracket.

Some editorializing: I didn't try to translate the PCRE you give to Emacs syntax: I started with the description in your original question. That avoids several headaches: I go cross-eyed when I look at any moderately complicated regexp; trying to understand two different regex syntaxes and translate between the two would make my head explode; the original PCRE may or may not work but I don't have to worry about that; the translation function may or may not work but I don't have to worry about that either; and if you understand what the regexp does, you have a much better chance of implementing small modifications without breaking it (e.g. if upper case letters are to be included, you can change the character class [a-z0-9_] to [a-zA-Z0-9_] and it's almost guaranteed that it's going to work - although there are better ways to describe the character class in general, this is probably good enough for your purpose).
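One of the more general ways to describe that character class is the POSIX class [:alnum:]. The sketch below is a variation on the example above, not a replacement for it: the helper name my/org-link-path is hypothetical, and it uses string-match rather than replace-regexp-in-string so that a non-link returns nil instead of the unchanged input.

```elisp
(defun my/org-link-path (link)   ; hypothetical helper
  "Return the path in org LINK with the type and extra slashes removed.
Return nil when LINK does not look like a bracketed org link."
  (when (string-match "\\[\\[[[:alnum:]_]*:/*\\(/[^]]*\\)\\]" link)
    (match-string 1 link)))

(my/org-link-path "[[File://///abc/def/ghi][some link]]")  ; => "/abc/def/ghi"
(my/org-link-path "no link here")                           ; => nil
```

Note that [:alnum:] also matches letters and digits outside ASCII, which may or may not be wanted for link types.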

Where regexps are involved, keeping it as simple as possible is always a win.
