0

I'm trying to use the split function to cut the text below so that only the title and booktitle sections are returned. For example, a sample text looks like this:

@inproceedings{DBLP:conf/crowncom/Chatzikokolakis15,
  author    = {Konstantinos Chatzikokolakis and
               Alexandros Kaloxylos and
               Panagiotis Spapis and
               Nancy Alonistioti and
               Chan Zhou and
               Josef Eichinger and
               {\"{O}}mer Bulakci},
  title     = {On the Way to Massive Access in 5G: Challenges and Solutions for Massive
               Machine Communications - (Invited Paper)},
  booktitle = {Cognitive Radio Oriented Wireless Networks - 10th International Conference,
               {CROWNCOM} 2015, Doha, Qatar, April 21-23, 2015, Revised Selected
               Papers},
  pages     = {708--717},
  year      = {2015},
  crossref  = {DBLP:conf/crowncom/2015},
  url       = {http://dx.doi.org/10.1007/978-3-319-24540-9_58},
  doi       = {10.1007/978-3-319-24540-9_58},
  timestamp = {Wed, 14 Oct 2015 08:42:42 +0200},
  biburl    = {http://dblp.uni-trier.de/rec/bib/conf/crowncom/Chatzikokolakis15},
  bibsource = {dblp computer science bibliography, http://dblp.org}
}

and I want these two blocks as output, as separate strings:

booktitle = {Cognitive Radio Oriented Wireless Networks - 10th International Conference,
               {CROWNCOM} 2015, Doha, Qatar, April 21-23, 2015, Revised Selected
               Papers}


title     = {On the Way to Massive Access in 5G: Challenges and Solutions for Massive
               Machine Communications - (Invited Paper)}

Can anyone please help me with a regular expression that works in Java, especially with the split method?

  • "especially in the split method": why do you want to use split? It would require you to describe everything you don't want. Maybe focus on the things you want to find instead and use a Pattern/Matcher combination. Commented Nov 7, 2015 at 3:26
  • Anyway, is the format used here fixed, or can it change? Can we rely on the fact that the lines describing attributes start with exactly two spaces? Commented Nov 7, 2015 at 3:29
  • It looks like you need to accept braces nested to any depth within the matched sections, like {{{{}}}}. That can't quite be done properly with a regular expression. Commented Nov 7, 2015 at 3:32
  • @MattTimmermans true, but if we don't try to do everything in one regex and ignore { and } for now, we can use the number of spaces at the start of each line to separate the sections. The rest is simply checking the first word in each section (title or booktitle). But since I really don't like correcting problems that could have been avoided with more information about the format of the text, I will wait until the OP confirms that we can actually rely on the number of spaces. Commented Nov 7, 2015 at 3:37
  • Besides what everyone said, what have you tried to solve your problem? Or are you waiting for us to just do it for you? Commented Nov 7, 2015 at 3:41

3 Answers

1

This Java regex can find your two subtexts:

(?s)(?<=[\r\n]+  )(?:title|booktitle) += \\{.*?\\}(?=,[\r\n]+  \\w|[\r\n]+\\})

You can then use startsWith() to find which subtext is which.

Test

String input = "@inproceedings{DBLP:conf/crowncom/Chatzikokolakis15,\r\n" +
               "  author    = {Konstantinos Chatzikokolakis and\r\n" +
               "               Alexandros Kaloxylos and\r\n" +
               "               Panagiotis Spapis and\r\n" +
               "               Nancy Alonistioti and\r\n" +
               "               Chan Zhou and\r\n" +
               "               Josef Eichinger and\r\n" +
               "               {\"{O}}mer Bulakci},\r\n" +
               "  title     = {On the Way to Massive Access in 5G: Challenges and Solutions for Massive\r\n" +
               "               Machine Communications - (Invited Paper)},\r\n" +
               "  booktitle = {Cognitive Radio Oriented Wireless Networks - 10th International Conference,\r\n" +
               "               {CROWNCOM} 2015, Doha, Qatar, April 21-23, 2015, Revised Selected\r\n" +
               "               Papers},\r\n" +
               "  pages     = {708--717},\r\n" +
               "  year      = {2015},\r\n" +
               "  crossref  = {DBLP:conf/crowncom/2015},\r\n" +
               "  url       = {http://dx.doi.org/10.1007/978-3-319-24540-9_58},\r\n" +
               "  doi       = {10.1007/978-3-319-24540-9_58},\r\n" +
               "  timestamp = {Wed, 14 Oct 2015 08:42:42 +0200},\r\n" +
               "  biburl    = {http://dblp.uni-trier.de/rec/bib/conf/crowncom/Chatzikokolakis15},\r\n" +
               "  bibsource = {dblp computer science bibliography, http://dblp.org}\r\n" +
               "}\r\n";
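// (?s) lets . match line breaks. The look-behind requires a line break plus two spaces
// before the field name; the look-ahead requires the closing brace to end the field.
String regex = "(?s)(?<=[\r\n]+  )(?:title|booktitle) += \\{.*?\\}(?=,[\r\n]+  \\w|[\r\n]+\\})";
Matcher m = Pattern.compile(regex).matcher(input); // needs java.util.regex.Matcher and Pattern
while (m.find())
    System.out.println(m.group());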

Output

title     = {On the Way to Massive Access in 5G: Challenges and Solutions for Massive
               Machine Communications - (Invited Paper)}
booktitle = {Cognitive Radio Oriented Wireless Networks - 10th International Conference,
               {CROWNCOM} 2015, Doha, Qatar, April 21-23, 2015, Revised Selected
               Papers}

1 Comment

What puzzles me is how Pattern is able to compile (?<=[\r\n]+ ) without complaining about the lack of an obvious maximum length in the look-behind. This won't work, for instance, with (?<=_f+), but it works fine for (?<=f+) or (?<=f+_).

0

I don't know if this helps, but you could try:

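String bibtex  = "<your giant string>";

// Split after each closing brace that is followed by a comma (the end of a field).
// Note: the closing brace and the comma are consumed by the split.
for ( String s : bibtex.split("\\}\\s*,") )
{
    String field = s.trim();
    if ( field.startsWith("booktitle") || field.startsWith("title") )
        System.out.println(field);
}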

1 Comment

While it can work in this case, I would avoid relying on splitting on }, since we can't be sure there won't be any nested inside the data we want to read, like {foo{bar},baz}, in which case we would split after bar and baz.
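
One way to sidestep the nested-brace concern is to count brace depth instead of splitting on }. The following is only a minimal sketch under some assumptions: the class name BraceField and the method field are hypothetical, each field name is assumed to start its own line, and escaped braces such as \{ are not handled.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BraceField {
    // Hypothetical helper: returns "name = {...}" with balanced braces,
    // or null if the field is missing or its braces never close.
    static String field(String bibtex, String name) {
        // Find the field name at the start of a line, followed by "= {".
        Pattern p = Pattern.compile("(?m)^[ \\t]*(" + Pattern.quote(name) + "\\s*=\\s*\\{)");
        Matcher m = p.matcher(bibtex);
        if (!m.find()) {
            return null;
        }
        int start = m.start(1); // position of the field name
        int depth = 1;          // the opening brace has already been consumed
        for (int i = m.end(); i < bibtex.length(); i++) {
            char c = bibtex.charAt(i);
            if (c == '{') {
                depth++;
            } else if (c == '}' && --depth == 0) {
                // Matching close for the field's opening brace.
                return bibtex.substring(start, i + 1);
            }
        }
        return null; // unbalanced braces
    }

    public static void main(String[] args) {
        String sample = "@x{key,\n  a = {foo{bar},baz},\n  title = {A {B} c}\n}\n";
        System.out.println(field(sample, "a"));
        System.out.println(field(sample, "title"));
    }
}
```

Because the scan stops only when the depth returns to zero, a value like {foo{bar},baz} is returned whole rather than being cut after bar.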
0

Assuming that the format of the text is exactly the same as what you posted, you could:

  1. Remove the first and last lines.
  2. Split on the two spaces that appear at the start of a line and are not followed by another space. You will need ^ with the multiline flag so that it matches the start of each line, and a look-ahead to test what follows the delimiter without including it in the delimiter.
  3. Iterate over the sections obtained from the split and print the ones that start with title or booktitle.
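
The three steps above could be sketched roughly as follows. This is only an illustration under the same assumption (fields indented by exactly two spaces, continuation lines indented further, and an entry spanning at least three lines); the class name BibtexSplit and the method extract are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;

class BibtexSplit {
    // Hypothetical helper: returns the title and booktitle sections of one entry.
    static List<String> extract(String bibtex) {
        // Step 1: drop the first line ("@inproceedings{...,") and the last line ("}").
        String body = bibtex.trim();
        body = body.substring(body.indexOf('\n') + 1, body.lastIndexOf('\n'));

        List<String> result = new ArrayList<>();
        // Step 2: split on exactly two spaces at the start of a line.
        // (?m) makes ^ match at every line start; (?! ) rejects deeper indentation.
        for (String section : body.split("(?m)^ {2}(?! )")) {
            // Step 3: keep only the wanted sections, without the trailing comma.
            String s = section.trim();
            if (s.endsWith(",")) {
                s = s.substring(0, s.length() - 1);
            }
            if (s.startsWith("title") || s.startsWith("booktitle")) {
                result.add(s);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String sample = "@inproceedings{key,\n"
                + "  author    = {A and\n"
                + "               B},\n"
                + "  title     = {Some {Nested} Title,\n"
                + "               continued},\n"
                + "  booktitle = {Some Proceedings},\n"
                + "  year      = {2015}\n"
                + "}\n";
        extract(sample).forEach(System.out::println);
    }
}
```

Since the split relies only on indentation, braces and commas inside a value do not affect it; it breaks, however, as soon as the indentation differs from the posted sample.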
