I have a file like the one below. Each version header has the format <version> space(s) dash space(s) date. I want to create a dictionary with 4.11.1 - 2020-02-25 as a key and everything after it, up to 3.25.0 - 2019-01-01, as its value, and so on until the end of the file.

##################
Some texts

4.11.1 - 2020-02-25
-------------------

*some text

** Some more text

3.25.0 - 2019-01-01
-------------------

*some text

** Some more text

This is what I tried:

import re

result = {}
# Text holds the contents of the file shown above
matches = re.findall(r'([\d.]+[^\n]+)\s*(.*?)(?=\s*[\d.]+[^\n]+|$)', Text, re.S)
for match in matches:
    result[match[0]] = match[1]
print(result)

It works in most cases, but it also produces these as keys:

.com/sth/sth/sth/6)

1.8.2 (https://github.com/sth/sth/sth/5)

1.8.1.

20160918 (see commands under 'some text')

. text text tex
  • What code/regex have you written to achieve this and what exactly is the issue with it? Commented Jun 30, 2020 at 21:40
  • edited my attempt in the question. Commented Jun 30, 2020 at 21:48
  • You could use 2 capturing groups: ^(\d+(?:\.\d+)+ - \d{4}-\d{2}-\d{2})\r?\n((?:(?!\d+\.\d).*(?:\r?\n|$))*) See regex101.com/r/dxEkrg/1 Commented Jun 30, 2020 at 21:53

1 Answer

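You could use 2 capturing groups, and use re.M instead of re.S. With re.M, ^ matches at the start of every line, and without re.S the dot does not match a newline, so .* stays on a single line.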

The pattern captures in group 1 a version followed by space(s), a dash and space(s) using \d+(?:\.\d+)+ +- + followed by a date-like pattern \d{4}-\d{2}-\d{2}

Note that it does not validate the date itself. This page shows how you can make that date pattern more specific.
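If you need real calendar dates, one option is to let the regex find the candidates and check the date part separately. This is only a sketch: the names HEADER and is_valid_header are made up for illustration, and it assumes the date is always written as YYYY-MM-DD.

```python
import re
from datetime import datetime

# Hypothetical helper pattern: version in group 1, date-like text in group 2.
HEADER = re.compile(r'^(\d+(?:\.\d+)+) +- +(\d{4}-\d{2}-\d{2})$')

def is_valid_header(line):
    """Return True if line looks like '<version> - <date>' and the date exists."""
    m = HEADER.match(line)
    if m is None:
        return False
    try:
        # strptime raises ValueError for impossible dates such as 2020-02-30
        datetime.strptime(m.group(2), "%Y-%m-%d")
    except ValueError:
        return False
    return True

print(is_valid_header("4.11.1 - 2020-02-25"))
print(is_valid_header("4.11.1 - 2020-02-30"))
```

The first call should report a valid header and the second should not, because February 30 is not a real date even though it fits \d{4}-\d{2}-\d{2}.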

Capture group 2 matches all following lines that do not start with one or more digits, a dot and a digit. You can make that part more specific if you want.

^(\d+(?:\.\d+)+ +- +\d{4}-\d{2}-\d{2})\r?\n((?:(?!\d+\.\d).*(?:\r?\n|$))*)
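To make the parts of the pattern easier to follow, here is the same regex written with re.X (verbose mode) and one comment per part. It is meant as an annotated sketch rather than a different solution. The name HEADER_AND_BODY and the shortened sample string are only for illustration. Because re.X ignores unescaped spaces, the literal spaces are written as [ ].

```python
import re

HEADER_AND_BODY = re.compile(r'''
    ^(                          # group 1: the header line
        \d+(?:\.\d+)+           #   version: digits, then one or more ".digits"
        [ ]+-[ ]+               #   space(s), dash, space(s)
        \d{4}-\d{2}-\d{2}       #   date-like pattern (not validated)
    )
    \r?\n                       # end of the header line
    (                           # group 2: the body
        (?:
            (?!\d+\.\d)         #   stop at a line starting with digits, dot, digit
            .*(?:\r?\n|$)       #   otherwise consume the whole line
        )*
    )
''', re.M | re.X)

# Shortened sample in the same shape as the file from the question
sample = "4.11.1 - 2020-02-25\n---\n\n*some text\n3.25.0 - 2019-01-01\n---\n"
for header, body in HEADER_AND_BODY.findall(sample):
    print(repr(header), repr(body))
```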

Regex demo

import re

result ={}
Text = ("##################\n"
    "Some texts\n\n"
    "4.11.1 - 2020-02-25\n"
    "-------------------\n\n"
    "*some text\n\n"
    "** Some more text\n\n"
    "3.25.0 - 2019-01-01\n"
    "-------------------\n\n"
    "*some text\n\n"
    "** Some more text")
matches = re.findall(r'^(\d+(?:\.\d+)+ +- +\d{4}-\d{2}-\d{2})\r?\n((?:(?!\d+\.\d).*(?:\r?\n|$))*)', Text, re.M)

for match in matches:
    result[match[0]] = match[1]
print(result)

Output

{'4.11.1 - 2020-02-25': '-------------------\n\n*some text\n\n** Some more text\n\n', '3.25.0 - 2019-01-01': '-------------------\n\n*some text\n\n** Some more text'}
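One thing to be aware of: with (?!\d+\.\d), a body line such as 1.8.2 (https://github.com/sth/sth/sth/5) ends group 2 early, so the rest of that section is dropped. If your file has such lines, a possible way to "make that part more specific" is to reuse the whole header pattern in the lookahead. The following is a sketch under that assumption. The names HEADER and PATTERN and the sample text are invented for the example.

```python
import re

# Header: version, space(s), dash, space(s), date-like pattern
HEADER = r'\d+(?:\.\d+)+ +- +\d{4}-\d{2}-\d{2}'

# Assumption: a body ends only at the next complete header line,
# not at any line that merely starts with "digits.digit".
PATTERN = re.compile(
    rf'^({HEADER})\r?\n((?:(?!{HEADER}\r?$).*(?:\r?\n|$))*)',
    re.M,
)

text = (
    "4.11.1 - 2020-02-25\n"
    "-------------------\n"
    "* fixes the bug reported for\n"
    "1.8.2 (https://github.com/sth/sth/sth/5)\n"
    "3.25.0 - 2019-01-01\n"
    "-------------------\n"
    "* some text\n"
)

result = dict(PATTERN.findall(text))
for key, value in result.items():
    print(repr(key), repr(value))
```

Here the 1.8.2 line stays inside the value for 4.11.1 - 2020-02-25 instead of cutting it short, and only the two real headers become keys.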

3 Comments

Nice job! I think the only generalization teased in the question (but not made use of in the sample input) is that there might be multiple spaces between the dash and the version and between the dash and the date, respectively.
@DavidWierichs Sharp! Thank you, I have updated it. If there can be whitespace chars except newlines, it could also become [^\S\r\n]*-[^\S\r\n]*
Group 2 will always start with a line whose first char is a capital letter, something like this: 4.11.1 - 2020-02-25 ------------------- Changed. I tried to change the (?:(?!\d+\.\d).*(?:\r?\n|$))* part, giving ^(\d+(?:\.\d+)+ +- +\d{4}-\d{2}-\d{2})\r?\n((?:([A-Z][a-z]+).*(?:\r?\n|$))*) but it didn't work.
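For reference, the [^\S\r\n]* variant from the second comment could be dropped into the pattern roughly like this. It is a sketch, and the tab-separated sample header is made up to show the difference. Note that * also accepts no whitespace at all around the dash. Use + instead if at least one whitespace char is required.

```python
import re

# [^\S\r\n] matches any whitespace char except CR and LF (a space or a tab, for example)
PATTERN = re.compile(
    r'^(\d+(?:\.\d+)+[^\S\r\n]*-[^\S\r\n]*\d{4}-\d{2}-\d{2})\r?\n'
    r'((?:(?!\d+\.\d).*(?:\r?\n|$))*)',
    re.M,
)

# Hypothetical header using a tab before the dash and two spaces after it
text = "4.11.1\t-  2020-02-25\n-----\n* some text\n"
result = dict(PATTERN.findall(text))
print(result)
```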
