I need to build a function that transforms a string (which encodes a tree-like structure of file directories) into a list (array) of separate directory paths. Example below:

[EXAMPLE INPUT] This is an example string that carries encoded information about the tree-like structure of directories:

"bucket[f1[a&b&f[]&ff1[a&b]]&f2[a&b&ff1[a&b]]&f3[ff1[a]]]"

Brackets [] appear only after a folder name and delimit that folder's contents (empty brackets mean an empty folder). All items are separated by &.

[EXPECTED OUTPUT] And this is what the output should look like given the example input:

[ '/bucket/f1/a',
  '/bucket/f1/b',
  '/bucket/f1/f/',   // <- folder path (note the trailing slash)
  '/bucket/f1/ff1/a',
  '/bucket/f1/ff1/b',
  '/bucket/f2/a',
  '/bucket/f2/b',
  '/bucket/f2/ff1/a',
  '/bucket/f2/ff1/b',
  '/bucket/f3/ff1/a' ]

I have tried a recursive algorithm that uses string splits and regex to get the content of the brackets, but I failed to distinguish the & separators at different "directory depths" using regex.

  • Why is /bucket/f1/b missing the trailing slash whereas /bucket/f1/f/ has one? Commented Jul 18, 2022 at 13:46
  • /bucket/f1/b is missing the trailing slash because it is a file path, whereas /bucket/f1/f/ is a folder path. This can be noticed in the string containing the whole structure: ...a&b&f[]&... Commented Jul 18, 2022 at 13:52
  • You need much more than .split(). You should look into parsers. Commented Jul 18, 2022 at 14:38
  • Voting to reopen because, while this question could use more explanation, something more than a casual read does show what's being asked, and it's an interesting problem. It also already has one excellent answer. Commented Jul 19, 2022 at 12:45
  • @ScottSauyet Agreed, the question is okay-ish. The amazing answer is really the only reason to keep this post open. Commented Jul 19, 2022 at 18:25

2 Answers


background

This is a non-trivial task, but it offers a great opportunity to learn about lexing and parsing, an essential step in every compiler for any programming language. Maybe this is your first time writing a language and an interpreter. Maybe you didn't know your input string is a miniature language. Either way, I hope you find this post useful. Please ask follow-up questions if you get stuck.

lexer

We write lex to return a stream of tokens. We define four kinds of token: open, close, and, and identifier.

function *lex(t) {
  for (const lexeme of t.match(/([a-z0-9]+|[\[\]&])/g)) {
    s: switch (lexeme) {
      case "[":
        yield { open: true }; break s
      case "]":
        yield { close: true }; break s
      case "&":
        yield { and: true }; break s
      default:
        yield { identifier: lexeme }; break s
    }
  }
}

Our program input is the string of syntax in your question -

const program =
  "bucket[f1[a&b&f[]&ff1[a&b]]&f2[a&b&ff1[a&b]]&f3[ff1[a]]]"

for (const v of lex(program))
  console.log(v)

Here are the tokens we get -

{identifier: "bucket"}
{open: true}
{identifier: "f1"}
{open: true}
{identifier: "a"}
{and: true}
{identifier: "b"}
{and: true}
{identifier: "f"}
{open: true}
{close: true}
{and: true}
{identifier: "ff1"}
{open: true}
{identifier: "a"}
{and: true}
{identifier: "b"}
{close: true}
{close: true}
{and: true}
{identifier: "f2"}
{open: true}
{identifier: "a"}
{and: true}
{identifier: "b"}
{and: true}
{identifier: "ff1"}
{open: true}
{identifier: "a"}
{and: true}
{identifier: "b"}
{close: true}
{close: true}
{and: true}
{identifier: "f3"}
{open: true}
{identifier: "ff1"}
{open: true}
{identifier: "a"}
{close: true}
{close: true}
{close: true}

parse

Next we have to parse the tokens and give them meaning. We keep a stack of partially-built nodes (r): for each token, we check which kind it is and modify the stack accordingly -

function parse(t, base = "") {
  const r = [[base]] // a stack of partially-built nodes; the base path seeds the root
  let _0, _1
  for (const v of t) {
    s: switch (true) {
      case v.open: // "[" : the most recent identifier becomes a new child node on the stack
        _0 = r.pop()
        r.push(_0)
        r.push([_0.pop()])
        break s
      case v.and: // "&" : separator only, nothing to do
        break s
      case v.close: // "]" : the finished node is attached to its parent
        _0 = r.pop()
        _1 = r.pop()
        _1.push(_0)
        r.push(_1)
        break s
      default: // identifier: append it to the node currently being built
        _0 = r.pop()
        _0.push(v.identifier)
        r.push(_0)
        break s
    }
  }
  return r[0]
}

We feed the output of lex to parse -

console.log(parse(lex(program)))

And we get back a tree of our file structure -

[
  "",
  [
    "bucket",
    [
      "f1",
      "a",
      "b",
      [
        "f"
      ],
      [
        "ff1",
        "a",
        "b"
      ]
    ],
    [
      "f2",
      "a",
      "b",
      [
        "ff1",
        "a",
        "b"
      ]
    ],
    [
      "f3",
      [
        "ff1",
        "a"
      ]
    ]
  ]
]

flatten

Finally we write a flatten function which flattens our file tree into a list of paths. Because your expected output asks for a trailing slash on directories without children, we handle that as a special case in the code -

function *flatten(t) {
  switch (t?.constructor) {
    case Array:
      if (t.length === 1) // special case for empty dir
        yield [t[0], ""]
      else
        for (const child of t.slice(1))
          for (const path of flatten(child))
            yield [t[0], ...path]
      break
    default:
      yield [t]
  }
}

We take the output from parse and feed it to flatten -

for (const path of flatten(parse(lex(program))))
  console.log(path.join("/"))

Each path is an array like ["", "bucket", "f1", "a"], allowing the caller to manipulate it however they see fit. Here we construct a string by joining each segment with "/" -

/bucket/f1/a
/bucket/f1/b
/bucket/f1/f/
/bucket/f1/ff1/a
/bucket/f1/ff1/b
/bucket/f2/a
/bucket/f2/b
/bucket/f2/ff1/a
/bucket/f2/ff1/b
/bucket/f3/ff1/a
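
If you want the result as an array of strings, matching the expected output in your question, here's one more way to consume flatten (the paths name is just introduced for this sketch) -

// Array.from accepts a mapping function, so we can collect the joined
// path strings in a single pass over the flatten generator
const paths = Array.from(flatten(parse(lex(program))), path => path.join("/"))

console.log(paths)
// [ "/bucket/f1/a", "/bucket/f1/b", "/bucket/f1/f/", ... ]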

demo

Here's a fully functioning demo. Run it below to verify the results.

function *lex(t) {
  for (const lexeme of t.match(/([a-z0-9]+|[\[\]&])/g)) {
    s: switch (lexeme) {
      case "[":
        yield { open: true }; break s
      case "]":
        yield { close: true }; break s
      case "&":
        yield { and: true }; break s
      default:
        yield { identifier: lexeme }; break s
    }
  }
}

function parse(t, base = "") {
  const r = [[base]]
  let _0, _1
  for (const v of t) {
    s: switch (true) {
      case v.open:
        _0 = r.pop()
        r.push(_0)
        r.push([_0.pop()])
        break s
      case v.and:
        break s
      case v.close:
        _0 = r.pop()
        _1 = r.pop()
        _1.push(_0)
        r.push(_1)
        break s
      default:
        _0 = r.pop()
        _0.push(v.identifier)
        r.push(_0)
        break s
    }
  }
  return r[0]
}

function *flatten(t) {
  switch (t?.constructor) {
    case Array:
      if (t.length === 1) // special case for empty dir
        yield [t[0], ""]
      else
        for (const child of t.slice(1))
          for (const path of flatten(child))
            yield [t[0], ...path]
      break
    default:
      yield [t]
  }
}

const program =
  "bucket[f1[a&b&f[]&ff1[a&b]]&f2[a&b&ff1[a&b]]&f3[ff1[a]]]"

for (const path of flatten(parse(lex(program))))
  console.log(path.join("/"))

changing the base path

Did you notice parse accepts a second argument, base? You can specify a base path and it will use this when interpreting all paths in the input program. Below we call parse with /usr/local as the base -

for (const path of flatten(parse(lex(program), "/usr/local")))
  console.log(path.join("/"))
/usr/local/bucket/f1/a
/usr/local/bucket/f1/b
/usr/local/bucket/f1/f/
/usr/local/bucket/f1/ff1/a
/usr/local/bucket/f1/ff1/b
/usr/local/bucket/f2/a
/usr/local/bucket/f2/b
/usr/local/bucket/f2/ff1/a
/usr/local/bucket/f2/ff1/b
/usr/local/bucket/f3/ff1/a

We could give a relative path too -

for (const path of flatten(parse(lex(program), ".")))
  console.log(path.join("/"))
./bucket/f1/a
./bucket/f1/b
./bucket/f1/f/
./bucket/f1/ff1/a
./bucket/f1/ff1/b
./bucket/f2/a
./bucket/f2/b
./bucket/f2/ff1/a
./bucket/f2/ff1/b
./bucket/f3/ff1/a

visualizing parse

How does parse turn a linear list of tokens into a hierarchical tree? Below we visualize the steps taken for each token, using . to stand for the base path -

token state
initial state [[.]]
{identifier: bucket} [[., bucket]]
{open} [[.], [bucket]]
{identifier: f1} [[.], [bucket, f1]]
{open} [[.], [bucket], [f1]]
{identifier: a} [[.], [bucket], [f1, a]]
{identifier: b} [[.], [bucket], [f1, a, b]]
{identifier: f} [[.], [bucket], [f1, a, b, f]]
{open} [[.], [bucket], [f1, a, b], [f]]
{close} [[.], [bucket], [f1, a, b, [f]]]
{identifier: ff1} [[.], [bucket], [f1, a, b, [f], ff1]]
{open} [[.], [bucket], [f1, a, b, [f]], [ff1]]
{identifier: a} [[.], [bucket], [f1, a, b, [f]], [ff1, a]]
{identifier: b} [[.], [bucket], [f1, a, b, [f]], [ff1, a, b]]
{close} [[.], [bucket], [f1, a, b, [f], [ff1, a, b]]]
{close} [[.], [bucket, [f1, a, b, [f], [ff1, a, b]]]]
{identifier: f2} [[.], [bucket, [f1, a, b, [f], [ff1, a, b]], f2]]
{open} [[.], [bucket, [f1, a, b, [f], [ff1, a, b]]], [f2]]
{identifier: a} [[.], [bucket, [f1, a, b, [f], [ff1, a, b]]], [f2, a]]
{identifier: b} [[.], [bucket, [f1, a, b, [f], [ff1, a, b]]], [f2, a, b]]
{identifier: ff1} [[.], [bucket, [f1, a, b, [f], [ff1, a, b]]], [f2, a, b, ff1]]
{open} [[.], [bucket, [f1, a, b, [f], [ff1, a, b]]], [f2, a, b], [ff1]]
{identifier: a} [[.], [bucket, [f1, a, b, [f], [ff1, a, b]]], [f2, a, b], [ff1, a]]
{identifier: b} [[.], [bucket, [f1, a, b, [f], [ff1, a, b]]], [f2, a, b], [ff1, a, b]]
{close} [[.], [bucket, [f1, a, b, [f], [ff1, a, b]]], [f2, a, b, [ff1, a, b]]]
{close} [[.], [bucket, [f1, a, b, [f], [ff1, a, b]], [f2, a, b, [ff1, a, b]]]]
{identifier: f3} [[.], [bucket, [f1, a, b, [f], [ff1, a, b]], [f2, a, b, [ff1, a, b]], f3]]
{open} [[.], [bucket, [f1, a, b, [f], [ff1, a, b]], [f2, a, b, [ff1, a, b]]], [f3]]
{identifier: ff1} [[.], [bucket, [f1, a, b, [f], [ff1, a, b]], [f2, a, b, [ff1, a, b]]], [f3, ff1]]
{open} [[.], [bucket, [f1, a, b, [f], [ff1, a, b]], [f2, a, b, [ff1, a, b]]], [f3], [ff1]]
{identifier: a} [[.], [bucket, [f1, a, b, [f], [ff1, a, b]], [f2, a, b, [ff1, a, b]]], [f3], [ff1, a]]
{close} [[.], [bucket, [f1, a, b, [f], [ff1, a, b]], [f2, a, b, [ff1, a, b]]], [f3, [ff1, a]]]
{close} [[.], [bucket, [f1, a, b, [f], [ff1, a, b]], [f2, a, b, [ff1, a, b]], [f3, [ff1, a]]]]
{close} [[., [bucket, [f1, a, b, [f], [ff1, a, b]], [f2, a, b, [ff1, a, b]], [f3, [ff1, a]]]]]

homework

Our code above works but there are some scenarios we should address to make our program more robust.

lex -

  • How would we handle invalid syntax?
  • How would we include the character position offset for each token? (one possible direction for these first two items is sketched after the parse list below)

parse -

  • Why is it safe to ignore the & tokens?
  • How would we throw an error for an unexpected [ token?
  • How would we throw an error for an unexpected ] token?
  • How would we limit the maximum depth of the generated file tree?
  • Where would we throw an error for tokens of invalid type?
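
If you get stuck on the first two lex items, here is one possible direction, sketched under my own assumptions; the sticky-flag regex and the lexWithOffsets name are not part of the code above -

function *lexWithOffsets(t) {
  // the sticky flag makes exec match only at lastIndex, so any character
  // the grammar does not allow shows up as a failed match
  const re = /[a-z0-9]+|[\[\]&]/y
  let offset = 0
  while (offset < t.length) {
    re.lastIndex = offset
    const m = re.exec(t)
    if (m === null)
      throw new SyntaxError(`unexpected character ${JSON.stringify(t[offset])} at offset ${offset}`)
    const lexeme = m[0]
    switch (lexeme) {
      case "[": yield { open: true, offset }; break
      case "]": yield { close: true, offset }; break
      case "&": yield { and: true, offset }; break
      default:  yield { identifier: lexeme, offset }
    }
    offset += lexeme.length
  }
}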

2 Comments

I love the solution and the pedagogy! See my answer for a completely different approach.
This is a well explained and detailed answer, thank you for sharing it 👍

Another approach

There is already a fantastic answer from Mulan. You can learn a great deal from that code, and I would suggest studying it carefully if you don't have much experience writing parsers.

But there are many ways to solve such problems, and here we examine an entirely different approach. It can't teach as much as Mulan's answer, but there is a useful technique that might be new to some.

We'll use a different example from yours for discussion, one with longer and more familiar node names. We want to convert this:

'root[bin&export[home[user1&user2&user3]]&lib&usr[ccs&lib]]'

into this:

[
  'root/bin',
  'root/export/home/user1',
  'root/export/home/user2',
  'root/export/home/user3',
  'root/lib',
  'root/usr/ccs',
  'root/usr/lib'
]

Mulan wrote a parser for this mini-language in that answer. Here we instead do a series of transformations that ends in

'root/bin&root/export/home/user1&root/export/home/user2&root/export/home/user3&root/lib&root/usr/ccs&root/usr/lib'

and then simply split the result at every '&' to get your requested output. The trick is that we work with one bracket-pair at a time, always one that nests no further brackets inside it. When we've replaced that, we scan the new string for any further unnested brackets, and continue recursively until there are none to be found. The steps will look like this:

'root[bin&export[home[user1&user2&user3]]&lib&usr[ccs&lib]]'
'root[bin&export[home/user1&home/user2&home/user3]&lib&usr[ccs&lib]]'
'root[bin&export/home/user1&export/home/user2&export/home/user3&lib&usr[ccs&lib]]'
'root[bin&export/home/user1&export/home/user2&export/home/user3&lib&usr/ccs&usr/lib]'
'root/bin&root/export/home/user1&root/export/home/user2&root/export/home/user3&root/lib&root/usr/ccs&root/usr/lib'

Implementation

We do this using a regular expression that captures a pair of brackets with no brackets inside, plus the preceding name that applies to them:

root[bin&export[home[user1&user2&user3]]&lib&usr[ccs&lib]]
               |    |_________________|`--- closing bracket (']')
               |    |        `--------+---- parts (['user1', 'user2', 'user3'])
               |    |`----------------+---- opening bracket ('[')
               |____|                 |
               |  `-------------------+---- prefix ('home')
               |______________________|
                          `---------------- full match ('home[user1&user2&user3]')

Then we replace the string we've found, joining each of the words inside the brackets with the prefix and a '/' to yield 'home/user1&home/user2&home/user3'. Then we recur on this new value, ending when there is no match.

The code looks like this:

const brackets = /([^\[\]\&]+)\[([^\[\]]*)\]/

const expand = (s) =>
  brackets .test (s)
    ? expand (s .replace (
        brackets, 
        (_, prefix, ps) => ps .split ('&') .map (p => prefix + '/' + p) .join ('&')
      ))
    : s

const parse = (s) =>
  expand (s) .split ('&') 


console .log (parse ('root[bin&export[home[user1&user2&user3]]&lib&usr[ccs&lib]]'))
console .log (parse ('bucket[f1[a&b&f[]&ff1[a&b]]&f2[a&b&ff1[a&b]]&f3[ff1[a]]]'))

Notes

  • Our running example does not include empty folders, but we also run the example from the question to show that they come out as required. No special processing turns out to be necessary for this.

  • This is recursive, and could run into recursion-depth problems if you're using it on large inputs. It would be a simple, mechanical process to turn this into a while loop (see the sketch after these notes).

  • Regular expressions are very powerful, but often difficult to understand. I personally find them quite worth the effort. You can find an explanation of this regex on Regex101.com. But when my first attempt at a regex solution failed, I wrote a mostly-equivalent solution without the regex. It might be instructive to compare that imperative code with this version. After writing it, I was able to recognize what I'd done wrong with my initial regular expression and write this cleaner version.

  • We do nothing here to catch bad input. It shouldn't be hard to do so, but we leave it as an exercise for the reader. (Hint: it seems possible that any bad input would necessarily leave some brackets behind after the expansion; the sketch after these notes includes a check along those lines.)
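
Here is a quick sketch of the while-loop rewrite and of a bad-input check along the lines of that hint; expandLoop and parseChecked are names introduced only for this sketch, and it reuses the brackets regex defined above -

// an iterative version of expand: same regex, same replacement,
// just a while loop in place of the recursion
const expandLoop = (s) => {
  while (brackets .test (s)) {
    s = s .replace (
      brackets,
      (_, prefix, ps) => ps .split ('&') .map (p => prefix + '/' + p) .join ('&')
    )
  }
  return s
}

// a rough bad-input check following the hint above: any bracket left
// over after expansion suggests the input was malformed
const parseChecked = (s) => {
  const expanded = expandLoop (s)
  if (/[\[\]]/ .test (expanded)) {
    throw new Error (`Malformed input: ${expanded}`)
  }
  return expanded .split ('&')
}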

Update

Mulan (in a comment below) noted that by invoking the regular expression engine twice per step (test and then replace), we are adding inefficiency. That's surely right, although I haven't actually measured it. We can do it with a single exec call per step, with code like this:

const expand = (s, match = /([^\[\]\&]+)\[([^\[\]]*)\]/ .exec (s)) => match  
  ? expand (
      s .slice (0, match .index) 
      + match [2] .split ('&') .map (p => match [1] + '/' + p) .join ('&') 
      + s .slice (match .index + match [0] .length))
  : s

const parse = (s) =>
  expand (s) .split ('&') 


console .log (parse ('root[bin&export[home[user1&user2&user3]]&lib&usr[ccs&lib]]'))
console .log (parse ('bucket[f1[a&b&f[]&ff1[a&b]]&f2[a&b&ff1[a&b]]&f3[ff1[a]]]'))

I have mixed feelings. Unless and until I test the relative performance and find that this variant is substantially faster, it loses a lot, to my mind, because of having to deal with the output of exec and because of the index manipulations. At the very least, though, this is a very worthy contender.

7 Comments

terrific work with a super thin implementation. i was originally thinking of re.exec in a loop to take advantage of stateful regex, but opted for lex/parse because of the added challenge. that, and i've come to understand the regex engine is not so efficient. on that note, you could cut your execution time in half (approx) by removing brackets.test in your condition, which invokes the regex engine twice as much as you need. you could use mutual recursion to keep the elegance you have attained. if i find time later today, i'll try to measure the actual difference.
@Mulan: I added an update to replace the regex test and replace calls with a single exec and some index manipulations. I am much less happy with the code, though. The index manipulations and the very format of the exec results are an anathema to me. At some point I'll try to run a benchmark to help me decide whether the speed improvement is worth the code complexity. Did you have something else in mind? I don't see any good way to drop test without substituting exec for replace; test helps distinguish the base case from the recursive one.
Confirmed. A simple benchmark says that the new one is about 50% faster. We could probably further speed things up by replacing the recursion with a while loop.
Ahh, that mutually recursive version is beautiful! I agree. I was expecting the true parser to be faster, but thought it would only become significantly so with much larger inputs than your (nicely-constructed!) input. In any case, kudos on your excellent answer and the analysis of this one. (I do have to say, though, that while ((match = re.exec(s)) !== null) reminds me of bad old Java days! ;-) )
@Mulan: While looking at another question, I found a version that -- while still only half the speed of your lexer-parser -- is substantially faster than the above, still with fairly simple code. But when I tried to extend it to cover more ground... it grew quite shaky.
