Regex in Java for recognizing specific text in a file

Question

I'm trying to use the split function to cut the text below so that only the title and booktitle sections are returned. A sample text looks like this:

@inproceedings{DBLP:conf/crowncom/Chatzikokolakis15,
  author    = {Konstantinos Chatzikokolakis and
               Alexandros Kaloxylos and
               Panagiotis Spapis and
               Nancy Alonistioti and
               Chan Zhou and
               Josef Eichinger and
               {\"{O}}mer Bulakci},
  title     = {On the Way to Massive Access in 5G: Challenges and Solutions for Massive
               Machine Communications - (Invited Paper)},
  booktitle = {Cognitive Radio Oriented Wireless Networks - 10th International Conference,
               {CROWNCOM} 2015, Doha, Qatar, April 21-23, 2015, Revised Selected
               Papers},
  pages     = {708--717},
  year      = {2015},
  crossref  = {DBLP:conf/crowncom/2015},
  url       = {http://dx.doi.org/10.1007/978-3-319-24540-9_58},
  doi       = {10.1007/978-3-319-24540-9_58},
  timestamp = {Wed, 14 Oct 2015 08:42:42 +0200},
  biburl    = {http://dblp.uni-trier.de/rec/bib/conf/crowncom/Chatzikokolakis15},
  bibsource = {dblp computer science bibliography, http://dblp.org}
}

and I want these two blocks as output, as separate strings:

booktitle = {Cognitive Radio Oriented Wireless Networks - 10th International Conference,
               {CROWNCOM} 2015, Doha, Qatar, April 21-23, 2015, Revised Selected
               Papers}


title     = {On the Way to Massive Access in 5G: Challenges and Solutions for Massive
               Machine Communications - (Invited Paper)}

Can anyone please help me with the regular expression that applies in Java and especially in the split method?

"especially in the split method": why do you want to use split? It would require you to describe everything you don't want. Consider describing what you do want to find instead, and use a Pattern/Matcher combination.
– Pshemo, Commented Nov 7, 2015 at 3:26
Also, is the format used here fixed, or can it change? Can we rely on the fact that each section describing an attribute starts with exactly two spaces?
– Pshemo, Commented Nov 7, 2015 at 3:29
It looks like you need to accept braces nested to any depth within the matched sections, like {{{{}}}}. That can't really be done properly with a regular expression.
– Matt Timmermans, Commented Nov 7, 2015 at 3:32
@MattTimmermans True, but if we don't try to do everything in one regex and ignore { and } for now, we can use the number of spaces at the start of each line to separate the sections. The rest is simply checking the first word in each section (title or booktitle). But since I don't like fixing problems that could be avoided by knowing more about the format of the text, I will wait until the OP confirms that we can actually rely on the number of spaces.
– Pshemo, Commented Nov 7, 2015 at 3:37
Besides what everyone has said, what have you tried to solve your problem? Or are you waiting for us to just do it for you?
– Jorge Campos, Commented Nov 7, 2015 at 3:41

Andreas · Accepted Answer · 2015-11-07 04:13:36Z

This regex (written with Java string-literal escaping) can find your two subtexts:

(?s)(?<=[\r\n]+  )(?:title|booktitle) += \\{.*?\\}(?=,[\r\n]+  \\w|[\r\n]+\\})

You can then use startsWith() to find which subtext is which.

Test

import java.util.regex.Matcher;
import java.util.regex.Pattern;

String input = "@inproceedings{DBLP:conf/crowncom/Chatzikokolakis15,\r\n" +
               "  author    = {Konstantinos Chatzikokolakis and\r\n" +
               "               Alexandros Kaloxylos and\r\n" +
               "               Panagiotis Spapis and\r\n" +
               "               Nancy Alonistioti and\r\n" +
               "               Chan Zhou and\r\n" +
               "               Josef Eichinger and\r\n" +
               "               {\"{O}}mer Bulakci},\r\n" +
               "  title     = {On the Way to Massive Access in 5G: Challenges and Solutions for Massive\r\n" +
               "               Machine Communications - (Invited Paper)},\r\n" +
               "  booktitle = {Cognitive Radio Oriented Wireless Networks - 10th International Conference,\r\n" +
               "               {CROWNCOM} 2015, Doha, Qatar, April 21-23, 2015, Revised Selected\r\n" +
               "               Papers},\r\n" +
               "  pages     = {708--717},\r\n" +
               "  year      = {2015},\r\n" +
               "  crossref  = {DBLP:conf/crowncom/2015},\r\n" +
               "  url       = {http://dx.doi.org/10.1007/978-3-319-24540-9_58},\r\n" +
               "  doi       = {10.1007/978-3-319-24540-9_58},\r\n" +
               "  timestamp = {Wed, 14 Oct 2015 08:42:42 +0200},\r\n" +
               "  biburl    = {http://dblp.uni-trier.de/rec/bib/conf/crowncom/Chatzikokolakis15},\r\n" +
               "  bibsource = {dblp computer science bibliography, http://dblp.org}\r\n" +
               "}\r\n";
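// (?s) lets . match line breaks. The look-behind requires a line break plus a
// two-space indent before the key; the look-ahead stops at the closing brace
// that is followed by the next attribute or by the end of the entry.
String regex = "(?s)(?<=[\r\n]+  )(?:title|booktitle) += \\{.*?\\}(?=,[\r\n]+  \\w|[\r\n]+\\})";
Matcher m = Pattern.compile(regex).matcher(input);
while (m.find())
    System.out.println(m.group());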

Output

title     = {On the Way to Massive Access in 5G: Challenges and Solutions for Massive
               Machine Communications - (Invited Paper)}
booktitle = {Cognitive Radio Oriented Wireless Networks - 10th International Conference,
               {CROWNCOM} 2015, Doha, Qatar, April 21-23, 2015, Revised Selected
               Papers}
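
As an alternative to calling startsWith() on each match, a capture group around the key can report which subtext is which. The following is only a sketch under assumptions, not the answer's exact regex: it swaps the look-behind for ^ under the multiline flag, uses \R (Java 8+) for line breaks, and collapses the wrapped whitespace inside each value. The class name TitleFields and the method extract are made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TitleFields {
    // Assumption: every attribute starts a line with exactly two spaces.
    // Group 1 is the key, group 2 is the text between the outer braces.
    private static final Pattern FIELD = Pattern.compile(
        "(?ms)^ {2}(title|booktitle) += \\{(.*?)\\}(?=,\\R {2}\\w|\\R\\})");

    static Map<String, String> extract(String entry) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher m = FIELD.matcher(entry);
        while (m.find()) {
            // Collapse line breaks and indentation inside the value to one space.
            fields.put(m.group(1), m.group(2).replaceAll("\\s+", " "));
        }
        return fields;
    }

    public static void main(String[] args) {
        String entry = "@inproceedings{key,\n"
            + "  title     = {A {B} title,\n"
            + "               second line},\n"
            + "  booktitle = {Some Conference},\n"
            + "  year      = {2015}\n"
            + "}\n";
        extract(entry).forEach((k, v) -> System.out.println(k + " -> " + v));
    }
}
```

If the original "key = {...}" block is needed instead of the bare value, m.group() still returns the whole match.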

What puzzles me is how Pattern manages to compile (?<=[\r\n]+ ) without complaining about the lack of an obvious maximum length in the look-behind. It fails for (?<=_f+), for instance, but works fine for (?<=f+) or (?<=f+_).

user2329125 · Answer · 2015-11-07 03:44:52Z

I don't know if this helps, but here is one approach:

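String bibtex  = "<your giant string>";

// Split after each "}," that closes an attribute. split() consumes the
// delimiter, so the closing brace has to be added back when printing.
for ( String s : bibtex.split("\\}\\s*,") )
{
    if ( s.trim().startsWith("booktitle") ||  s.trim().startsWith("title") )
        System.out.println(s.trim() + "}");
}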


Comment from Pshemo:

While this can work in this case, I would avoid splitting on }, because we can't be sure that the data we want to read contains no nested braces, like {foo{bar},baz}, in which case we would split after bar and after baz.
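
One way around the nested-brace problem raised here is to locate the key with a small regex and then count brace depth by hand. This is a minimal sketch, not code from the thread: the class BraceScanner and the method field are hypothetical names, and it assumes braces inside a value are balanced and that backslash-escaped braces do not need special handling.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BraceScanner {
    // Returns "name = {...}" for the given attribute, or null if the attribute
    // is missing or its braces never balance.
    static String field(String entry, String name) {
        Matcher m = Pattern
            .compile("(?m)^[ \\t]*" + Pattern.quote(name) + "[ \\t]*=[ \\t]*\\{")
            .matcher(entry);
        if (!m.find()) {
            return null;
        }
        int depth = 0;
        // Start on the opening brace that ends the regex match.
        for (int i = m.end() - 1; i < entry.length(); i++) {
            char c = entry.charAt(i);
            if (c == '{') {
                depth++;
            } else if (c == '}' && --depth == 0) {
                // Depth is back to zero: this brace closes the attribute.
                return entry.substring(m.start(), i + 1).trim();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String entry = "@x{key,\n  title = {foo{bar},baz},\n  year = {2015}\n}\n";
        System.out.println(field(entry, "title"));
    }
}
```

Because the key must follow the start of a line, asking for title does not accidentally pick up booktitle.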

Pshemo · Answer · 2015-11-07 03:49:58Z

Assuming the format of the text is exactly the same as what you posted, you could:

1. Remove the first and last lines.
2. Split on two spaces that are placed at the start of a line and are not followed by another space. You will need ^ with the multiline flag so that it matches the start of each line, and a look-ahead to test what follows the delimiter without including it in the delimiter.
3. Iterate over the sections obtained from the split and print the ones that start with title or booktitle.
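
The steps above could be sketched roughly as follows. This is one possible reading of the description, not code from the answer: the class SplitSketch and the method titleSections are illustrative names, the delimiter regex (?m)^ {2}(?=\S) is my interpretation of "two spaces at the start of a line with no space after them", and removing the trailing comma is an extra step.

```java
import java.util.ArrayList;
import java.util.List;

final class SplitSketch {
    // Assumption: each attribute starts a line with exactly two spaces and
    // continuation lines are indented further.
    static List<String> titleSections(String entry) {
        List<String> result = new ArrayList<>();
        String trimmed = entry.trim();
        int firstBreak = trimmed.indexOf('\n');
        int lastBreak = trimmed.lastIndexOf('\n');
        if (firstBreak < 0 || lastBreak <= firstBreak) {
            return result; // fewer than three lines, nothing to split
        }
        // Step 1: drop the "@inproceedings{...," line and the closing "}" line.
        String body = trimmed.substring(firstBreak + 1, lastBreak);
        // Step 2: split on a two-space indent at the start of a line; the
        // look-ahead keeps the first character of the key out of the delimiter.
        for (String section : body.split("(?m)^ {2}(?=\\S)")) {
            // Step 3: keep only the sections that start with the wanted keys.
            String s = section.trim();
            if (s.startsWith("title") || s.startsWith("booktitle")) {
                result.add(s.endsWith(",") ? s.substring(0, s.length() - 1) : s);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String entry = "@inproceedings{key,\n"
            + "  author    = {A and\n"
            + "               B},\n"
            + "  title     = {First line\n"
            + "               second line},\n"
            + "  booktitle = {Some {Conf}, 2015},\n"
            + "  year      = {2015}\n"
            + "}\n";
        titleSections(entry).forEach(System.out::println);
    }
}
```

Since the split depends only on indentation, nested braces inside a value do not affect it, but any change in the indentation convention would break it.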
