regex for python (version number - date format)

Question

I have a file like the one below. Each section starts with a header in the format <version> space(s) dash space(s) <date>. I want to create a dictionary with 4.11.1 - 2020-02-25 as a key and everything after it, up to 3.25.0 - 2019-01-01, as its value, and so on until the end of the file.

##################
Some texts

4.11.1 - 2020-02-25
-------------------

*some text

** Some more text

3.25.0 - 2019-01-01
-------------------

*some text

** Some more text

This is what I tried:

import re

# Text holds the contents of the file shown above
result = {}
matches = re.findall(r'([\d.]+[^\n]+)\s*(.*?)(?=\s*[\d.]+[^\n]+|$)', Text, re.S)
for match in matches:
    result[match[0]] = match[1]
print(result)

It works for most cases, but it also produces keys like these:

.com/sth/sth/sth/6)

1.8.2 (https://github.com/sth/sth/sth/5)

1.8.1.

20160918 (see commands under 'some text')

. text text tex

What code/regex have you written to achieve this and what exactly is the issue with it?
– ForceBru, Commented Jun 30, 2020 at 21:40
You could use 2 capturing groups ^(\d+(?:\.\d+)+ - \d{4}-\d{2}-\d{2})\r?\n((?:(?!\d+\.\d).*(?:\r?\n|$))*) regex101.com/r/dxEkrg/1
– The fourth bird, Commented Jun 30, 2020 at 21:53

The fourth bird · Accepted Answer · 2020-06-30 22:18:13Z

3

You could use 2 capturing groups and, instead of re.S, use re.M so that ^ and $ match at the start and end of every line.
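As a minimal sketch of what re.M changes (the sample string below is made up for illustration), compare the same anchored pattern with and without the flag:

```python
import re

# Hypothetical sample: a version-like token mid-line, then a real header line
text = "intro 1.8.1.\n4.11.1 - 2020-02-25\nbody\n"

# Without re.M, ^ only matches at the very start of the string.
no_flag = re.findall(r'^\d+(?:\.\d+)+', text)
# With re.M, ^ also matches right after each newline.
multiline = re.findall(r'^\d+(?:\.\d+)+', text, re.M)

print(no_flag)    # []
print(multiline)  # ['4.11.1']
```

The 1.8.1 in the middle of the first line is not picked up in either case, because it does not sit at the start of a line.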

The pattern captures the version followed by space(s), a dash, and space(s) in group 1 using \d+(?:\.\d+)+ +- +, followed by a date-like pattern \d{4}-\d{2}-\d{2}

Note that it does not validate the date itself. This page shows how you can make that date pattern more specific.
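If real date validation is needed, one option is to check the captured key after matching rather than inside the regex. The helper below is a hypothetical sketch (has_valid_date is not part of the answer); it assumes the date is the last whitespace-separated token of the key:

```python
from datetime import datetime

def has_valid_date(key):
    """Return True if the last token of the key parses as a real YYYY-MM-DD date."""
    # The date is assumed to be the last whitespace-separated token of the key
    date_part = key.rsplit(None, 1)[-1]
    try:
        datetime.strptime(date_part, "%Y-%m-%d")
        return True
    except ValueError:
        return False

print(has_valid_date("4.11.1 - 2020-02-25"))  # True
print(has_valid_date("4.11.1 - 2020-13-45"))  # False
```

Keys that fail the check could then be skipped when the dictionary is built.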

Capture group 2 matches all following lines that do not start with 1+ digits, a dot, and a digit. You can make that part more specific if you want.
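One way to make that part more specific, sketched here as an assumption rather than as part of the original answer, is to reuse the full header pattern inside the negative lookahead. Body lines such as 1.8.2 (https://github.com/sth/sth/sth/5) from the question then stay inside group 2 instead of ending it. The names HEADER and pattern are illustrative:

```python
import re

HEADER = r'\d+(?:\.\d+)+ +- +\d{4}-\d{2}-\d{2}'
# Group 2 now stops only at a line that is entirely a version header.
pattern = re.compile(
    r'^(' + HEADER + r')\r?\n((?:(?!' + HEADER + r'\r?$).*(?:\r?\n|$))*)',
    re.M,
)

text = (
    "4.11.1 - 2020-02-25\n"
    "-------------------\n"
    "1.8.2 (https://github.com/sth/sth/sth/5)\n"
    "3.25.0 - 2019-01-01\n"
    "-------------------\n"
    "*some text"
)

# findall returns (header, body) tuples, which dict() accepts directly
result = dict(pattern.findall(text))
print(list(result))  # ['4.11.1 - 2020-02-25', '3.25.0 - 2019-01-01']
```

With the shorter lookahead (?!\d+\.\d), the 1.8.2 line would end the first body early and that line would be dropped from the result.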

^(\d+(?:\.\d+)+ +- +\d{4}-\d{2}-\d{2})\r?\n((?:(?!\d+\.\d).*(?:\r?\n|$))*)

Regex demo

import re

result ={}
Text = ("##################\n"
    "Some texts\n\n"
    "4.11.1 - 2020-02-25\n"
    "-------------------\n\n"
    "*some text\n\n"
    "** Some more text\n\n"
    "3.25.0 - 2019-01-01\n"
    "-------------------\n\n"
    "*some text\n\n"
    "** Some more text")
matches = re.findall(r'^(\d+(?:\.\d+)+ +- +\d{4}-\d{2}-\d{2})\r?\n((?:(?!\d+\.\d).*(?:\r?\n|$))*)', Text, re.M)

for match in matches:
    result[match[0]] = match[1]
print(result)

Output

{'4.11.1 - 2020-02-25': '-------------------\n\n*some text\n\n** Some more text\n\n', '3.25.0 - 2019-01-01': '-------------------\n\n*some text\n\n** Some more text'}

edited Jun 30, 2020 at 22:18

answered Jun 30, 2020 at 21:57

3 Comments

David Wierichs Over a year ago

Nice job! I think the only generalization teased in the question (but not made use of in the sample input) is that there might be multiple spaces between the dash and the version and between the dash and the date, respectively.

The fourth bird Over a year ago

@DavidWierichs Sharp! Thank you, I have updated it. If there can be whitespace chars other than newlines, it could also become [^\S\r\n]*-[^\S\r\n]*
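A small sketch of that whitespace variant, using made-up input lines: [^\S\r\n] matches any whitespace character except \r and \n, so tabs are accepted around the dash but the header cannot spill onto the next line. Note that with * the whitespace becomes optional.

```python
import re

# [^\S\r\n] = any whitespace character except carriage return and newline
header = re.compile(
    r'^\d+(?:\.\d+)+[^\S\r\n]*-[^\S\r\n]*\d{4}-\d{2}-\d{2}$',
    re.M,
)

# Hypothetical lines: tabs around the dash, no whitespace, and a header split over two lines
text = "4.11.1\t-\t2020-02-25\n3.25.0-2019-01-01\n1.0.0 -\n2018-01-01\n"

found = header.findall(text)
print(found)  # ['4.11.1\t-\t2020-02-25', '3.25.0-2019-01-01']
```

Using + instead of * would require at least one whitespace character on each side of the dash, as in the answer's pattern.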

urawesome Over a year ago

Group 2 will always start with a string whose first character is a capital letter, something like this: 4.11.1 - 2020-02-25 ------------------- Changed. I tried to change the (?:(?!\d+\.\d).*(?:\r?\n|$) part, giving ^(\d+(?:\.\d+)+ +- +\d{4}-\d{2}-\d{2})\r?\n((?:([A-Z][a-z]+).*(?:\r?\n|$))*), but it didn't work.
