This query against the Wikidata SPARQL endpoint returns the Wikitext content of the first 50 files in the Wikimedia Commons category "1930s photographs in Auckland Museum". For each file, I want to extract several pieces of data from that content.

Working with just one file, File:("Ultimate" stall) (AM 79483-1).jpg, as an example, the content looks like this:

== {{int:filedesc}} ==
{{Artwork 
| description = {{en|1=At the equestrian show. A man stands in front of a stall selling radios.}} 
| title = ("Ultimate" stall) 
| artist = {{Creator:Tudor Washington Collins}} 
| date = 1938 
| place of creation = 
| source = {{Images from Auckland Museum|section=library|object=photography|id=79483}}
           [https://api.aucklandmuseum.com/id/media/p/806abf5c0952f972e56bc95fed841c5031bcb9ff Photo] 
| accession number = 79483 (object number) 
| object type = 
| technique = Silver gelatin dry plate 
| dimensions = 
| institution = {{Institution:Auckland War Memorial Museum}} 
| permission = This image has been released as "CCBY" by Auckland Museum. For details refer to the
               [[Commons:Batch_uploading/AucklandMuseumCCBY|Commons project page]]. 
| credit line = 
| notes = 
| other_versions = <gallery> ("Ultimate" stall) (AM 79483-2).jpg </gallery>
}}

== {{int:license-header}} ==
{{CC-BY-4.0|1=Auckland Museum}}
[[Category:Images uploaded by Fæ]] [[Category:1930s photographs in Auckland Museum]]
[[Category:Tudor Washington Collins]] [[Category:Radio in Auckland Museum]]
[[Category:Images from Auckland Museum]]

I'm interested in the three values (section, object, and id) inside the {{Images from Auckland Museum}} template in the source parameter. I've tried to parse this content with regex; this is the first expression I wrote, which deals with the bulk of the Wikitext:

^(?>.+{{Images from Auckland Museum\|)(.*?)(?>}}.+)$

I used regex101.com to write this, and from what I can tell it says:

  1. Find (and discard) everything up to the string {{Images from Auckland Museum|, including that string. (This was the most obvious delimiter I could think of).
  2. Capture everything that occurs afterward.
  3. Find (and discard) everything from the first occurrence of a pair of right curly brackets (}}) to the end.

This leaves only the portion I'm interested in:

section=library|object=photography|id=79483

So far, so good.
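For anyone who wants to reproduce this outside regex101.com, here is a minimal Python sketch of the same three steps. It rests on a few assumptions that are not part of the expression above: the sample is an abbreviated copy of the Wikitext, the atomic groups (?>...) are swapped for plain non-capturing groups (?:...) because Python only accepts atomic groups from version 3.11, the braces are escaped for safety, and the "dot matches newline" flag is switched on because the content spans several lines.

```python
import re

# Abbreviated sample of the Wikitext shown above (only the relevant lines).
wikitext = """== {{int:filedesc}} ==
{{Artwork
| title = ("Ultimate" stall)
| source = {{Images from Auckland Museum|section=library|object=photography|id=79483}}
           [https://api.aucklandmuseum.com/id/media/p/806abf5c0952f972e56bc95fed841c5031bcb9ff Photo]
| accession number = 79483 (object number)
}}
"""

# Same shape as the expression above, with two assumed changes:
# (?:...) replaces (?>...), and the curly brackets are escaped.
pattern = re.compile(
    r"^(?:.+\{\{Images from Auckland Museum\|)(.*?)(?:\}\}.+)$",
    re.DOTALL,  # let "." cross line breaks, since the Wikitext is multi-line
)

match = pattern.match(wikitext)
source_params = match.group(1) if match else None
print(source_params)
```

Whether the SPARQL endpoint's regex engine behaves the same way is not something this sketch demonstrates; it only checks the logic of the expression itself.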

I then created another regex101.com session to work on just that portion, with this expression:

(?>.*?\=)(.*)(?>.*?\|)(?>.*?\=)(.*)(?>.*?\|)(?>.*?\=)(.*)

From what I can tell, this expression says:

  1. Find (and discard) everything up to, and including, the first =.
  2. Capture everything after that, up to, but not including, the first |.

This pattern repeats three times, once for each capture group (the last one has no trailing |), giving me the three data points I want.

It seems to work: [screenshot of the regex101.com evaluation of the expression "(?>.*?\=)(.*)(?>.*?\|)(?>.*?\=)(.*)(?>.*?\|)(?>.*?\=)(.*)" against the string "section=library|object=photography|id=79483"]
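The same check can be sketched in Python. As an assumption for portability, the atomic groups (?>...) are replaced with non-capturing groups (?:...), since Python only supports atomic groups from 3.11 onward; on this particular string the captures come out the same. The last two lines show a regex-free alternative, which is a different technique from the expression above and is included only for comparison.

```python
import re

params = "section=library|object=photography|id=79483"

# The expression above with (?:...) in place of (?>...) (an assumed substitution).
pattern = re.compile(r"(?:.*?=)(.*)(?:.*?\|)(?:.*?=)(.*)(?:.*?\|)(?:.*?=)(.*)")
match = pattern.match(params)
values = match.groups() if match else ()
print(values)

# Regex-free alternative (not the approach above): split on "|", then on "=".
pairs = dict(item.split("=", 1) for item in params.split("|"))
print(pairs)
```

The split version keeps the parameter names alongside the values, which may help if the order of section, object, and id ever varies between files.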

My questions are these:

  1. How can I combine these regular expressions? Simply slotting the second into the first in place of its (.*?) does not appear to work.
  2. Given that regex allows recursion, is there a better (i.e., more efficient) way to write the second expression? (Would the SPARQL endpoint/language allow this?)
  3. Is there any way in the first expression to simply say, after obtaining the first capture group, something like, "I've got what I want; stop"—and would there be any efficiency gain in doing so?

Thanks in advance.

  • I don't understand why you think you need recursion. Couldn't you just use {{Images from Auckland Museum\|[^=]*=(.*?)\|[^=]*=(.*?)\|[^=]*=(.*?)}}? See regex101.com/r/dGU208/1 Commented Sep 20, 2021 at 8:07
  • @horcrux Wow, yep, thank you! That's perfect. I had no idea you could simply write bits of the string in like that at the start of the expression; I thought everything had to be in brackets to denote groups that you do or don't want to capture. Thank you very much! Commented Sep 20, 2021 at 10:01
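
A minimal Python sketch of the combined expression suggested in the comment above, run against the source line from the example file. The variable names are illustrative, and the curly brackets are escaped here as a precaution; the comment's expression leaves them unescaped.

```python
import re

source_line = (
    "| source = {{Images from Auckland Museum"
    "|section=library|object=photography|id=79483}}"
)

# Literal template name first, then three "name=value" pairs separated by "|".
combined = re.compile(
    r"\{\{Images from Auckland Museum\|"
    r"[^=]*=(.*?)\|[^=]*=(.*?)\|[^=]*=(.*?)\}\}"
)

# search() scans for the template anywhere in the text, so no leading or
# trailing "match everything else" groups are needed.
match = combined.search(source_line)
section, obj, object_id = match.groups() if match else (None, None, None)
print(section, obj, object_id)
```

Because the expression is not anchored with ^ and $, the engine has no reason to consume the rest of the Wikitext once the closing }} is found, which is close to the "I've got what I want; stop" behaviour asked about in question 3.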
