Scraping a site with java regex

Question

I would love to scrape the titles of the top 250 movies (https://www.imdb.com/chart/top/) for educational purposes.

I have tried a lot of things but I messed up at the end every time. Could you please help me scrape the titles with Java and regex?

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class scraping {

    public static void main (String args[]) {
        try {
            URL URL1=new URL("https://www.imdb.com/chart/top/");

            URLConnection URL1c=URL1.openConnection();
            BufferedReader br=new BufferedReader(new 
            InputStreamReader(URL1c.getInputStream(),"ISO8859_7"));

            String line;int lineCount=0;

            Pattern pattern = Pattern.compile("<td\\s+class=\"titleColumn\"[^>]*>"+ ".*?</a>");
            Matcher matcher = pattern.matcher(br.readLine());

            while(matcher.find()){
                System.out.println(matcher.group());
            }
        } catch (Exception e) {
            System.out.println("Exception: " + e.getClass() + ", Details: " + e.getMessage());
        }
    }
}

Thank you for your time.

@azro could you please help me with that? My brain is about to explode after so many tries. How can I print the content of td class=titleColumn? – superuser, Commented Dec 23, 2019 at 18:10
Check out regex101.com/r/XfiaB7/1 if you're adamant on making it work in regex – MonkeyZeus, Commented Dec 23, 2019 at 18:13

azro · Accepted Answer · 2019-12-23 21:24:38Z

Parsing Mode

To parse XML or HTML content, a dedicated parser is almost always easier than a regex. For HTML in Java there is Jsoup, which gets you the films very easily:

Document doc = Jsoup.connect("https://www.imdb.com/chart/top/").get();
Elements films = doc.select("td.titleColumn");
for (Element film : films) {
    System.out.println(film);
}

<td class="titleColumn"> 1. <a href="/title/tt0111161/?pf_rd_m=A2FGELUUNOQJNL&amp;pf_rd_p=e31d89dd-322d-4646-8962-327b42fe94b1&amp;pf_rd_r=5BDHP4VZE8EGSEZC4ZSF&amp;pf_rd_s=center-1&amp;pf_rd_t=15506&amp;pf_rd_i=top&amp;ref_=chttp_tt_1" title="Frank Darabont (dir.), Tim Robbins, Morgan Freeman">Les évadés</a> <span class="secondaryInfo">(1994)</span> </td>
<td class="titleColumn"> 2. <a href="/title/tt0068646/?pf_rd_m=A2FGELUUNOQJNL&amp;pf_rd_p=e31d89dd-322d-4646-8962-327b42fe94b1&amp;pf_rd_r=5BDHP4VZE8EGSEZC4ZSF&amp;pf_rd_s=center-1&amp;pf_rd_t=15506&amp;pf_rd_i=top&amp;ref_=chttp_tt_2" title="Francis Ford Coppola (dir.), Marlon Brando, Al Pacino">Le parrain</a> <span class="secondaryInfo">(1972)</span> </td>

To get the titles only:

for (Element film : films) {
    System.out.println(film.getElementsByTag("a").text());
}

Les évadés
Le parrain
Le parrain, 2ème partie

Regex Mode

You were not reading the whole content of the page. It is also markup, so a table cell is not all on one line and you cannot find the opening and closing tags on the same line. Read everything first, then apply the regex, which gives something like this:

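StringBuilder sb = new StringBuilder();
// try-with-resources closes the stream and the reader automatically
try (InputStream is = new URL("https://www.imdb.com/chart/top/").openStream();
     BufferedReader br = new BufferedReader(new InputStreamReader(is))) {
    String line;
    while ((line = br.readLine()) != null) {
        // Line breaks are dropped, so '.' in the patterns below can span a whole cell
        sb.append(line);
    }
} catch (MalformedURLException e) {
    // Thrown by new URL(...) when the address cannot be parsed
    e.printStackTrace();
    throw new MalformedURLException("URL is malformed: " + e.getMessage());
} catch (IOException e) {
    // Keep the original exception as the cause instead of discarding it
    e.printStackTrace();
    throw new IOException("Could not read the page", e);
}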

String content = sb.toString();

// Full cell: everything from <td class="titleColumn"> to the closing </td>
Pattern cellPattern = Pattern.compile("<td class=\"titleColumn\">.*?</td>");
Matcher cellMatcher = cellPattern.matcher(content);
while (cellMatcher.find()) {
    System.out.println(cellMatcher.group());
}

// Title only: group 1 captures the text between <a ...> and </a>
Pattern titlePattern = Pattern.compile("<td class=\"titleColumn\">.+?<a href=.+?>(.+?)</a>.+?</td>");
Matcher titleMatcher = titlePattern.matcher(content);
while (titleMatcher.find()) {
    System.out.println(titleMatcher.group(1));
}
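
If the goal is a list of bare titles rather than printed output, the title-only pattern can fill a list instead. This is a minimal sketch, assuming the page has already been joined into a single string as described in this answer. The class name, the method name and the sample HTML are made up for illustration and are not taken from the real page.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TitleRegexDemo {

    // Same "title only" pattern as in the answer; group 1 is the text inside <a>...</a>.
    private static final Pattern TITLE = Pattern.compile(
            "<td class=\"titleColumn\">.+?<a href=.+?>(.+?)</a>.+?</td>");

    // Returns the captured titles in page order, without any tags.
    static List<String> extractTitles(String html) {
        List<String> titles = new ArrayList<>();
        Matcher matcher = TITLE.matcher(html);
        while (matcher.find()) {
            titles.add(matcher.group(1).trim());
        }
        return titles;
    }

    public static void main(String[] args) {
        // Hypothetical sample shaped like the rows shown earlier, already on one line.
        String html =
                "<td class=\"titleColumn\"> 1. <a href=\"/title/tt0000001/\" title=\"x\">First Film</a>"
              + " <span class=\"secondaryInfo\">(1994)</span> </td>"
              + "<td class=\"titleColumn\"> 2. <a href=\"/title/tt0000002/\" title=\"y\">Second Film</a>"
              + " <span class=\"secondaryInfo\">(1972)</span> </td>";
        System.out.println(extractTitles(html)); // prints [First Film, Second Film]
    }
}
```

In real use, `extractTitles(sb.toString())` would replace the second printing loop. The pattern still depends on the exact markup of the page, so it may stop matching if the site changes.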

Thank you for your code. In the Regex mode, I have a problem with the results. I want to make a list with the names only, without the tags. How can I isolate the titles from the HTML tags? Thank you very much again.

Nikolas · Accepted Answer · 2019-12-23 18:30:03Z

As the other answer says, Jsoup or another HTML parser should be used for the sake of correctness.

I will only complete your current solution, in case you want to use a similar approach for a more reasonable use case. It cannot work because you read only the first line from the buffer:

Matcher matcher = pattern.matcher(br.readLine());

The regex pattern is also wrong: your solution reads a single line and tests only that line against the regex, but the page source shows that the content of each table row is spread across multiple lines.

A solution that reads one line at a time should use a much simpler regex (sorry, the example contains movie names in my native language):

\" ?>([^<]+)<\/a>

An example of working code is:

try {
    URL url = new URL("https://www.imdb.com/chart/top/");
    URLConnection connection = url.openConnection();

    Pattern pattern = Pattern.compile("\" ?>([^<]+)<\\/a>"); // Compiled once

    // Charset kept from the question; adjust it if the page uses another encoding
    try (BufferedReader br = new BufferedReader(
            new InputStreamReader(connection.getInputStream(), "ISO8859_7"))) {
        br.lines()                       // Stream<String>
          .map(pattern::matcher)         // Stream<Matcher>
          .filter(Matcher::find)         // Stream<Matcher> .. if Regex matches
          .limit(250)                    // Stream<Matcher> .. to avoid possible mess below
          .map(m -> m.group(1))          // Stream<String>  .. captured movie name
          .forEach(System.out::println); // Printed out
    }
} catch (Exception e) {
    System.out.println("Exception: " + e.getClass() + ", Details: " + e.getMessage());
}
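
To see why testing only the first line finds nothing while testing every line works, the same idea can be tried on an in-memory page. This is a sketch under assumptions: the class name, the method name and the sample lines are hypothetical, and the sample only imitates a row whose markup is spread across several lines.

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class LineByLineDemo {

    // The simple per-line pattern from this answer.
    private static final Pattern ANCHOR_TEXT = Pattern.compile("\" ?>([^<]+)<\\/a>");

    // Tests every line on its own and keeps the captured text of the lines that match.
    static List<String> titlesPerLine(BufferedReader reader) {
        return reader.lines()
                .map(ANCHOR_TEXT::matcher)
                .filter(Matcher::find)
                .map(m -> m.group(1))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Hypothetical row, split over several lines like the real page source.
        String page = String.join("\n",
                "<td class=\"titleColumn\">",
                "      1.",
                "      <a href=\"/title/tt0000001/\"",
                "title=\"Director (dir.), Actor\" >First Film</a>",
                "<span class=\"secondaryInfo\">(1994)</span>",
                "</td>");

        // Only the first line, as in the question: no match.
        String firstLine = page.split("\n")[0];
        System.out.println(ANCHOR_TEXT.matcher(firstLine).find()); // prints false

        // Every line: the line that closes the anchor matches.
        System.out.println(titlesPerLine(new BufferedReader(new StringReader(page)))); // prints [First Film]
    }
}
```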

Note the following:

Regex is not suitable for this. Use a library built for this use case.
My solution is a working example, but the performance is poor (Stream API, regex matching on each line).
A solution like this does not guarantee clean results. The regex can capture more than intended.
The website content, CSS class names etc. might change in the future.
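
The over-capture mentioned in the notes is easy to reproduce: the simple pattern matches the text of any link, not only the links in the title cells. The sketch below is illustrative only. The class name, the method names and the sample markup (including the navigation link) are assumptions. It compares the loose pattern with a variant that first narrows the input to the titleColumn cells.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OverCaptureDemo {

    // Loose pattern: the text of any anchor whose opening tag ends with a quote.
    private static final Pattern ANCHOR_TEXT = Pattern.compile("\" ?>([^<]+)</a>");
    // Narrowing pattern: one title cell, assuming the page is joined into one line.
    private static final Pattern TITLE_CELL = Pattern.compile("<td class=\"titleColumn\">.*?</td>");

    // Captures the text of every anchor in the input.
    static List<String> allAnchorTexts(String html) {
        List<String> texts = new ArrayList<>();
        Matcher matcher = ANCHOR_TEXT.matcher(html);
        while (matcher.find()) {
            texts.add(matcher.group(1));
        }
        return texts;
    }

    // Captures anchor text only inside <td class="titleColumn"> cells.
    static List<String> anchorTextsInTitleCells(String html) {
        List<String> texts = new ArrayList<>();
        Matcher cell = TITLE_CELL.matcher(html);
        while (cell.find()) {
            texts.addAll(allAnchorTexts(cell.group()));
        }
        return texts;
    }

    public static void main(String[] args) {
        // Hypothetical markup: one navigation link followed by one title cell.
        String html = "<a href=\"/chart/boxoffice\">Box Office</a>"
                + "<td class=\"titleColumn\"> 1. <a href=\"/title/tt0000001/\" title=\"x\">First Film</a> </td>";

        System.out.println(allAnchorTexts(html));          // prints [Box Office, First Film]
        System.out.println(anchorTextsInTitleCells(html)); // prints [First Film]
    }
}
```

Narrowing the input reduces stray matches but does not remove the dependence on the page markup, which is why a parser remains the safer choice.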
