
I need to transform some data so that it is compatible with a data load. The nested key:value pairs need to be flattened, with each pair prefixed by its group id.

I've been trying to understand the page "Repeating a Capturing Group vs. Capturing a Repeated Group" but can't wrap my head around it.

My expression so far:

"(?'group'[\w]+)": {\n((\s*"(?'key'[^"]+)": "(?'value'[^"]+)"(?:,\n)?)+)\n},?

Working sample: https://regex101.com/r/Wobej7/1

I'm aware that using one or more intermediate steps would simplify the process, but at this point I want to know whether it's even possible with a single expression.

Source Data Example:

"g1": {
  "k1": "v1",
  "k2": "v2",
  "k3": "v3"
},
"g2": {
  "k4": "v4",
  "k5": "v5",
  "k6": "v6"
},
"g3": {
  "k7": "v7",
  "k8": "v8",
  "k9": "v9"
}

Desired transformation:

{"g1","k1","v1"},
{"g1","k2","v2"},
{"g1","k3","v3"},
{"g2","k4","v4"},
{"g2","k5","v5"},
{"g2","k6","v6"},
{"g3","k7","v7"},
{"g3","k8","v8"},
{"g3","k9","v9"}
  • Where are you using the regex? If in Notepad++, you might use ^("(\w+)":\h*{\h*)(?:\R\h+"(\w+)":\h*"(\w+)",?|\s*\}(?:,\R)?) and replace with (?{3}\{"$2","$3","$4"\},\n$1:), but you will have to click Replace all several times. Commented Mar 10, 2018 at 18:57
  • I've been using it in Sublime Text. I tested your solution in N++ and while it solves for the end solution, it doesn't capture more than one child at a time. The reason I posted on Stack Overflow is really to see if someone can help me understand repeating nested capture groups but thank you! Commented Mar 10, 2018 at 19:17
  • As far as I'm aware, it's not possible in one single step. You have to use at least two regular expressions, which means one more mouse click. Commented Mar 10, 2018 at 19:18
  • I'm not sure I see how it could be done even in 2 steps. One thing to clarify is that the groups in the real application do not all hold the same number of entries; each has anywhere from 1 to 15 k:v pairs. Commented Mar 10, 2018 at 19:27
  • @Rumpled In Sublime Text, you might still get it to work in 2 steps. However, you should specify the format. What is the real format of the input string? Regarding repeated capturing groups, you cannot work with them in text editors, and only a few programming languages give you access to them. Commented Mar 10, 2018 at 21:01

1 Answer

TL;DR

Step 1

Search for:

("[^"]+"):\s*{[^}]*},?\K

Replace with:

\1

Live demo

Step 2

Search for:

(?:"[^"]+":\s*{|\G(?!\A))\s*("[^"]+"):\s*((?1))(?=[^}]*},?((?1)))(?|(,)|\s*}(,?).*\R*)

Replace with:

{\3,\1,\2}\4\n

Live demo

Whole philosophy

This is not going to be a one-liner regex solution, for several reasons. The most important one is that in PCRE we can neither store part of a match to refer to it in later matches nor use unbounded-length lookbehinds. Fortunately, most similar problems can be solved in two steps.
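To see why the repeated capturing group from the question is not enough on its own, here is a minimal Python sketch. It uses a lightly simplified form of the question's pattern, with `(?'name'...)` rewritten as Python's `(?P<name>...)`; the names `PATTERN`, `SAMPLE` and `last_captures` are made up for illustration. As in most engines, a group that sits inside a repetition keeps only its last iteration:

```python
import re

# The question's pattern, with (?'name'...) rewritten as (?P<name>...)
# and the extra wrapping groups dropped for brevity.
PATTERN = re.compile(
    r'"(?P<group>\w+)": \{\n'
    r'(?:\s*"(?P<key>[^"]+)": "(?P<value>[^"]+)"(?:,\n)?)+'
    r'\n\},?'
)

SAMPLE = '"g1": {\n  "k1": "v1",\n  "k2": "v2",\n  "k3": "v3"\n},'


def last_captures(text):
    """Return what the named groups hold after one successful match."""
    m = PATTERN.match(text)
    return m.group('group'), m.group('key'), m.group('value')


# The whole block matches, but k1/v1 and k2/v2 were overwritten
# by the last iteration of the repeated group.
print(last_captures(SAMPLE))  # ('g1', 'k3', 'v3')
```

So a single match over a whole block cannot hand every pair to the replacement string, which is why the approach here matches one pair at a time instead.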

The very first step is to copy the group name to the end of its {...} block. This way the group name is within reach every time we turn a key:value match into a single output line.

("[^"]+"):\s*{[^}]*},?\K
  • ( Start of capturing group #1
    • "[^"]+" Match a group name
  • ) End of CG #1
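  • :\s*{ The group name must be followed by a colon, optional whitespace and an opening brace
  • [^}]*},? Consume everything up to and including the end of the block
  • \K Discard everything matched so far, so the replacement is inserted here instead of overwriting the block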

The group name is held in the first capturing group. Because \K has reset the start of the match to the end of the block, replacing with the following inserts the name right after the block:

\1

Now a block like this:

"g1": {
  .
  .
  .
},

Ends up like this:

"g1": {
  .
  .
  .
},"g1"
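Outside an editor, step 1 can be approximated in Python as a rough sketch. Python's built-in `re` module has no `\K`, so this version swaps in a different technique: it re-emits the whole match with `\g<0>` and then appends group 1. The names `BLOCK`, `append_group_name` and `sample` are hypothetical.

```python
import re

# Step 1's pattern without the trailing \K (not supported by Python's `re`).
BLOCK = re.compile(r'("[^"]+"):\s*\{[^}]*\},?')


def append_group_name(text):
    """Copy each group name to the end of its {...} block."""
    # \g<0> writes the matched block back unchanged; \1 appends the name.
    return BLOCK.sub(r'\g<0>\1', text)


sample = '"g1": {\n  "k1": "v1",\n  "k2": "v2"\n},\n"g2": {\n  "k4": "v4"\n}'
print(append_group_name(sample))
```

The result has `"g1"` placed after `},` and `"g2"` after the final `}`, the same shape as the block shown above.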

The next step is to match each key:value pair of a block while also capturing the group name that was just added at the end of that block.

(?:"[^"]+":\s*{|\G(?!\A))\s*("[^"]+"):\s*((?1))(?=[^}]*},?((?1)))(?|(,)|\s*}(,?).*\R*)
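  • (?: Start of a non-capturing group
    • "[^"]+" Try to match a group name
    • :\s*{ The group name must be followed by a colon, optional whitespace and an opening brace
    • | Or
    • \G(?!\A) Continue from where the previous match ended (but not at the very start of the input)
  • ) End of NCG
  • \s*("[^"]+"):\s*((?1)) Then match and capture a key:value pair; (?1) reuses the pattern of CG #1 for the value
  • (?=[^}]*},?((?1))) Look ahead, without consuming anything, and capture the group name at the end of the block
  • (?|(,)|\s*}(,?).*\R*) Branch-reset group: consume either the comma after a pair, or the closing brace, its optional comma, the appended group name and any newlines; both branches capture into CG #4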
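The `\G(?!\A)` branch is what forces each match to start exactly where the previous one ended. Python's `re` has no `\G`; a loose equivalent, assuming a script is acceptable, is to call `Pattern.match(text, pos)` in a loop. This is a different technique from the editor regex: the sketch below keeps the group name in an ordinary variable, which is exactly what a search-and-replace cannot do. All names (`HEADER`, `PAIR`, `FOOTER`, `walk_blocks`) are made up for illustration.

```python
import re

HEADER = re.compile(r'\s*("[^"]+"):\s*\{')           # "g1": {
PAIR = re.compile(r'\s*("[^"]+"):\s*("[^"]+")\s*,?')  # "k1": "v1",
FOOTER = re.compile(r'\s*\},?')                       # },


def walk_blocks(text):
    """Collect (group, key, value) tuples by matching contiguously."""
    rows = []
    pos = 0
    while True:
        header = HEADER.match(text, pos)  # anchored at pos, like \G
        if header is None:
            break
        group = header.group(1)
        pos = header.end()
        while True:
            pair = PAIR.match(text, pos)
            if pair is None:
                break
            rows.append((group, pair.group(1), pair.group(2)))
            pos = pair.end()
        footer = FOOTER.match(text, pos)
        if footer is None:
            break
        pos = footer.end()
    return rows


source = '"g1": {\n  "k1": "v1",\n  "k2": "v2"\n},\n"g2": {\n  "k4": "v4"\n}'
for row in walk_blocks(source):
    print('{%s,%s,%s}' % row)
```

Because every `match` call is anchored at `pos`, nothing between two matches can be skipped, which is the guarantee `\G` gives the PCRE version.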

This way, every successful match gives us four captures, and what matters is the order in which the replacement emits them:

{\3,\1,\2}\4\n
  • \3 Group name (that one added at the end of block)
  • \1 Key
  • \2 Value
  • \4 Comma (may be there or may not)
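The lookahead idea of step 2 carries over to Python's `re` module, although `\G`, `(?1)` and `(?|...)` do not. The following is a simplified sketch rather than a port of the regex above: it assumes step 1 has already been applied (the `MARKED` literal is a hand-written example of such input), repeats the quoted-string pattern instead of calling `(?1)`, and joins the rows in code instead of capturing the comma. `PAIR` and `flatten` are hypothetical names.

```python
import re

# Input as it looks after step 1: each block is followed by its group name.
MARKED = (
    '"g1": {\n  "k1": "v1",\n  "k2": "v2"\n},"g1"\n'
    '"g2": {\n  "k4": "v4"\n}"g2"'
)

# key, value, then a lookahead that captures the name behind the next "}".
PAIR = re.compile(r'("[^"]+"):\s*("[^"]+")(?=[^}]*\},?("[^"]+"))')


def flatten(marked_text):
    """Emit one {group,key,value} row per key:value pair."""
    rows = [
        '{%s,%s,%s}' % (group, key, value)
        for key, value, group in PAIR.findall(marked_text)
    ]
    return ',\n'.join(rows)


print(flatten(MARKED))
# {"g1","k1","v1"},
# {"g1","k2","v2"},
# {"g2","k4","v4"}
```

Group headers such as `"g1": {` are skipped automatically because the pattern requires a quoted value after the colon, and the appended names are skipped because no colon follows them.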