I am trying to write a lexer in Java, using regex, for a custom markdown-like "language" I'm making. It's my first time working with this kind of thing, so I'm a little lost on a few points.
An example of possible syntax is:
Some <#000000>*text* [<#ffffff>Some more](action: Other <#gradient>text) and **finally** some more <#000>text!
I was able to capture a few things. For example, I'm using (?<hex><#\w+>) to capture the "hex" and (?<action>\[[^]]*]\([^]]*\)) to get the entire "action" block.
My problem is capturing everything together, that is, how to combine the patterns. For example, the lexer needs to output something like:

TEXT - Some
HEX - <#000000>
TEXT - *text*
ACTION - [<#ffffff>Some more](action: Other <#gradient>text)
TEXT - and **finally** some more
HEX - <#000>
TEXT - text!

I'll handle the bold and italic later.
Would love just some suggestions on how to combine all of them!

  • You could create (named) capturing groups for all the parts ^(.*?) (?<hex1><#\w+>)(\*[^*]*\*) (?<action>\[[^]]*]\([^]]*\)) (.*?) (?<hex2><#\w+>)(.*)$ regex101.com/r/iocBCR/1 Commented Jul 29, 2020 at 16:38
  • @Thefourthbird Hey, thank you! Though that means it needs to be exactly like the example I sent, right? For example, changing the order or adding new tags won't be recognized. Commented Jul 29, 2020 at 17:26
  • The current pattern depends on all the parts being present in that order. You might use an alternation to make it more flexible (?<hex><#\w+>)|(?<action>\[[^]]*]\([^]]*\))|(?<text>[\w!* ]+) regex101.com/r/JWHNP9/1 Commented Jul 29, 2020 at 18:56
  • @Thefourthbird Oh okay, I'm starting to understand now, thank you! If you wanna post that as an answer. Commented Jul 29, 2020 at 19:08

2 Answers

One option is to use an alternation that matches each of the separate parts, with a character class such as [\w!* ]+ for the text part.

In Java, you can then check which named capturing group matched.

(?<hex><#\w+>)|(?<action>\[[^]]*]\([^]]*\))|(?<text>[\w!* ]+)

Explanation

  • (?<hex><#\w+>) Capture group hex, match <#, then 1+ word chars, then >
  • | Or
  • (?<action> Capture group action
    • \[[^]]*]\([^]]*\) Match [...] followed by (...)
  • ) Close group
  • | Or
  • (?<text>[\w!* ]+) Capture group text, match 1+ times any char listed in the character class
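
One thing to keep in mind with the character class [\w!* ]+ is that find() silently skips any character that none of the alternatives can match (a comma or a colon outside a tag, for example). Below is a minimal sketch of one way around that. The broader text alternative and the class name BroadTextLexer are my own assumptions, not part of the pattern above: text becomes any run of characters other than < and [, or a single stray < or [ that did not start a tag.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BroadTextLexer {
    // hex and action are unchanged. The text alternative is an assumed
    // variation: a run of characters other than '<' and '[', or one stray
    // '<' / '[' that neither hex nor action could match.
    private static final Pattern TOKEN = Pattern.compile(
        "(?<hex><#\\w+>)|(?<action>\\[[^]]*]\\([^]]*\\))|(?<text>[^<\\[]+|[<\\[])");

    static List<String> lex(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            // Alternatives are tried left to right, so text only wins
            // when neither tag pattern matches at this position.
            if (m.group("hex") != null) {
                tokens.add("HEX - " + m.group("hex"));
            } else if (m.group("action") != null) {
                tokens.add("ACTION - " + m.group("action"));
            } else {
                tokens.add("TEXT - " + m.group("text"));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The comma, colon, period, and unclosed '[' all end up in TEXT tokens.
        lex("Hello, world: <#fff>done. [not an action").forEach(System.out::println);
    }
}
```

Because the text alternative comes last, the tag patterns still take priority, and every character of the input ends up in some token.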

Example code:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Lexer {
    public static void main(String[] args) {
        String regex = "(?<hex><#\\w+>)|(?<action>\\[[^]]*]\\([^]]*\\))|(?<text>[\\w!* ]+)";
        String string = "Some <#000000>*text* [<#ffffff>Some more](action: Other <#gradient>text) and **finally** some more <#000>text!";

        Pattern pattern = Pattern.compile(regex);
        Matcher matcher = pattern.matcher(string);

        // Each match comes from exactly one alternative, so only one
        // of the named groups is non-null per iteration.
        while (matcher.find()) {
            if (matcher.group("hex") != null) {
                System.out.println("HEX - " + matcher.group("hex"));
            } else if (matcher.group("text") != null) {
                System.out.println("TEXT - " + matcher.group("text"));
            } else if (matcher.group("action") != null) {
                System.out.println("ACTION - " + matcher.group("action"));
            }
        }
    }
}

Output

TEXT - Some 
HEX - <#000000>
TEXT - *text* 
ACTION - [<#ffffff>Some more](action: Other <#gradient>text)
TEXT -  and **finally** some more 
HEX - <#000>
TEXT - text!
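
If a later stage needs to work on the tokens (for example to handle bold and italic inside the TEXT parts), it may be easier to collect them than to print them. This is only a sketch under my own assumptions: the names TokenLexer, Type and Token are hypothetical, and the pattern is the same alternation as above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenLexer {
    // Hypothetical token kinds; each name matches a group name in lowercase.
    enum Type { HEX, ACTION, TEXT }

    // Hypothetical value holder for one lexed piece of the input.
    static final class Token {
        final Type type;
        final String value;

        Token(Type type, String value) {
            this.type = type;
            this.value = value;
        }

        @Override
        public String toString() {
            return type + " - " + value;
        }
    }

    private static final Pattern TOKEN = Pattern.compile(
        "(?<hex><#\\w+>)|(?<action>\\[[^]]*]\\([^]]*\\))|(?<text>[\\w!* ]+)");

    static List<Token> lex(String input) {
        List<Token> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            // Find the one named group that took part in this match.
            for (Type type : Type.values()) {
                String value = m.group(type.name().toLowerCase(Locale.ROOT));
                if (value != null) {
                    tokens.add(new Token(type, value));
                    break;
                }
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String input = "Some <#000000>*text* [<#ffffff>Some more](action: Other <#gradient>text) and **finally** some more <#000>text!";
        lex(input).forEach(System.out::println);
    }
}
```

Adding a new tag then means adding one named group to the pattern and one constant to the enum, and the order of the tags in the input does not matter.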

You can achieve this using regex capturing groups, like this: ^(.*?) (?<hex1><#\w+>)(\*[^*]*\*) (?<action>\[[^]]*]\([^]]*\)) (.*?) (?<hex2><#\w+>)(.*)$
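
Note that this anchored pattern expects every part to be present, in exactly this order. A small sketch to illustrate that (the class name FixedLayoutDemo, the helper fits and the second input string are made up for the demonstration):

```java
import java.util.regex.Pattern;

class FixedLayoutDemo {
    // The single anchored pattern from this answer, unchanged.
    private static final Pattern FIXED = Pattern.compile(
        "^(.*?) (?<hex1><#\\w+>)(\\*[^*]*\\*) (?<action>\\[[^]]*]\\([^]]*\\)) (.*?) (?<hex2><#\\w+>)(.*)$");

    // True only when the whole input follows the fixed layout.
    static boolean fits(String input) {
        return FIXED.matcher(input).find();
    }

    public static void main(String[] args) {
        String original = "Some <#000000>*text* [<#ffffff>Some more](action: Other <#gradient>text) and **finally** some more <#000>text!";
        // Same kinds of tags, but in a different order (made-up input).
        String reordered = "<#000>text [Some more](action: other) and *text*";

        System.out.println(fits(original));  // true
        System.out.println(fits(reordered)); // false
    }
}
```

So this approach works for the exact example in the question, but reordered or additional tags will not be recognized.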
