1

I developed a web scraper. The scraper uses 6 threads; each thread opens a web page, gets the text of an article, then writes (using a driver) each single word of the text to a MySQL database.

During the execution of the program I get a java.lang.OutOfMemoryError: Java heap space. I installed Memory Analyzer in Eclipse and found that the problem is caused by the MySQL driver connection: when I run the program, after 5 minutes the RAM occupied by the driver is 6 MB, after another 5 minutes 200 MB, after another 5 minutes 500 MB, and then I get the Java heap space error.

I don't understand why this happens.

Here is the code I use for the model (to access the MySQL DB):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.ArrayList;
import java.util.List;

public class model {

    private Connection connect = null;

    public model(){
         try {

              Class.forName("com.mysql.jdbc.Driver");
              connect = DriverManager.getConnection("jdbc:mysql://localhost/system?user=keyword_tool&password=l0gripp0");

            } catch (Exception e) {
                System.out.println(e);
            }
    }

    public synchronized void insertCat(String parola, String categoria){

        try{
            PreparedStatement statement = connect.prepareStatement("insert into sostantivi (nome, categoria) values (?, ?)");
            statement.setString(1, parola);
            statement.setString(2, categoria);

            statement.executeUpdate();
            statement.close();

        } catch (Exception e){
            //System.out.println(e);
        }

    }

    public void closeDBConnection() {
        try {
            connect.close();
        } catch (Exception e) {
            System.out.println(e);
        }
    }
}

Each thread simply calls the method insertCat to insert a word with a category into the database.

The Memory Analyzer plugin of Eclipse says:

[Two Memory Analyzer screenshots; images not included]

2
  • How do you know closeDBConnection() is getting called? Put some logging in. Since you don't show that code, the most likely explanation is that the model() constructor is called repeatedly, but due to some flaw in your code connections are leaked or not closed. Commented Aug 27, 2013 at 9:09
  • I put closeDBConnection() at the end of the scraping program. I create one model in the main class and each thread uses that model. After scraping what I need, I call closeDBConnection on that single model. Is that wrong? Commented Aug 27, 2013 at 9:12

3 Answers

2

According to your comments, you only create one 'model' (that's a bad class name) and share it among 6 threads.

This is not a particularly great design -- it's either performance-limited by synchronizing on a single DB connection (when you could use one per thread), or it runs into potential concurrency problems/errors.

I only see one com.mysql.jdbc.JDBC4Connection in your heap dump.

This may be due to a misleading display, or (a theory which fits your claimed "single model" approach) that one connection is filling up with PreparedStatements or similar objects.

In theory, these should be cached & reused -- in practice, you've got a problem. There are three steps to try:

  1. Update the MySQL driver version;
  2. Close and re-open the connection every 1000 or so statements;
  3. Give each thread its own connection, or use a connection pool.

This looks like some kind of problem with the PreparedStatement cache. Unless you can find some other bug with PreparedStatement or ResultSet handling in your code -- and none is obvious -- 1) and 2) are most likely to provide the solution or a specific workaround.
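Step 2 could be sketched roughly as follows. This is only an illustration under stated assumptions, not a confirmed fix for the driver behaviour: the class name `RecyclingModel`, the `ConnectionSource` interface and the configurable threshold are hypothetical names introduced here, and the SQL is taken from the question.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Hypothetical variant of the question's 'model' class that drops and
// re-opens its connection after a fixed number of statements.
final class RecyclingModel {

    // Hypothetical abstraction over "how to open a connection",
    // e.g. a lambda wrapping DriverManager.getConnection(url).
    interface ConnectionSource {
        Connection open() throws SQLException;
    }

    private final ConnectionSource source;
    private final int maxStatementsPerConnection;
    private Connection connect;
    private int statementsOnConnection;

    RecyclingModel(ConnectionSource source, int maxStatementsPerConnection) {
        this.source = source;
        this.maxStatementsPerConnection = maxStatementsPerConnection;
    }

    public synchronized void insertCat(String parola, String categoria) throws SQLException {
        // Recycle the connection once it has served its quota of statements.
        if (connect == null || statementsOnConnection >= maxStatementsPerConnection) {
            closeDBConnection();
            connect = source.open();
            statementsOnConnection = 0;
        }
        // try-with-resources closes the statement even if executeUpdate throws.
        try (PreparedStatement statement = connect.prepareStatement(
                "insert into sostantivi (nome, categoria) values (?, ?)")) {
            statement.setString(1, parola);
            statement.setString(2, categoria);
            statement.executeUpdate();
        }
        statementsOnConnection++;
    }

    public synchronized void closeDBConnection() throws SQLException {
        if (connect != null) {
            connect.close();
            connect = null;
        }
    }
}
```

Usage would look something like `new RecyclingModel(() -> DriverManager.getConnection(url), 1000)`, where `url` stands for your own JDBC URL. The connection-setup cost is then paid once per 1000 inserts rather than once per insert.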


2 Comments

That is good. I used fix number 2) and my program now rocks!!! Kevin's solution was not good because I got a drastic drop in performance (before, I could download 6 pages/second; with it, I could download only 0.5 pages/second, because opening and closing the connection for every write operation takes time). Thank you
Exactly. Though there's still probably a bug with the JDBC driver. If you try a different version, it might be fixed.
0

If you create a new model without destroying it, a new connection will be created. There are 2000000 models in your map, so you will have 2000000 connections.

You should extract all your connection code into a connection manager or pool, and manage the connections yourself.
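A minimal sketch of what such a self-managed pool might look like is shown here. The names `SimpleConnectionPool`, `ConnectionSource`, `borrow` and `giveBack` are hypothetical and introduced only for illustration; a real application would more likely use an established pooling library, which also handles validation and timeouts that this sketch ignores.

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical fixed-size pool: connections are opened once up front
// and then handed out to worker threads and returned after use.
final class SimpleConnectionPool {

    // Hypothetical abstraction over "how to open a connection",
    // e.g. a lambda wrapping DriverManager.getConnection(url).
    interface ConnectionSource {
        Connection open() throws SQLException;
    }

    private final BlockingQueue<Connection> idle;

    SimpleConnectionPool(ConnectionSource source, int size) throws SQLException {
        idle = new ArrayBlockingQueue<>(size);
        for (int i = 0; i < size; i++) {
            idle.add(source.open());
        }
    }

    // Blocks until some thread has returned a connection.
    Connection borrow() throws InterruptedException {
        return idle.take();
    }

    // Must be called (ideally in a finally block) for every borrowed connection.
    void giveBack(Connection connection) {
        idle.offer(connection);
    }

    // Closes the connections that are currently idle; call at shutdown,
    // after all workers have returned theirs.
    void closeAll() throws SQLException {
        Connection connection;
        while ((connection = idle.poll()) != null) {
            connection.close();
        }
    }
}
```

With six scraper threads, a pool of size six would give each thread its own connection while it writes, without a shared lock around a single connection.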

Comments

0

The code never closes the connection to the database.

Try creating/closing the connection in the insertCat method. Connections should be obtained and released as quickly as possible. The connection should only be open for the necessary amount of time to perform the persistence operation.

public class model {

    public synchronized void insertCat(String parola, String categoria){
        Connection connect = null;
        try{
            Class.forName("com.mysql.jdbc.Driver");
            connect = DriverManager.getConnection("jdbc:mysql://localhost/system?user=keyword_tool&password=l0gripp0");
            PreparedStatement statement = connect.prepareStatement("insert into sostantivi (nome, categoria) values (?, ?)");
            statement.setString(1, parola);
            statement.setString(2, categoria);

            statement.executeUpdate();
            statement.close();

        } catch (Exception e){
            //System.out.println(e);
        }finally{
           if(connect != null){
               try {
                  connect.close();
               } catch (Exception e) {
                  System.out.println(e);
               }
           }
        }

    }

}

5 Comments

You are introducing a potential NPE on connect.close() there, you have to check for null. In addition, it is good practice to close statements, resultsets as well, in the reverse order that you obtain them, even if it is usually done automatically on closing of connection. Ignoring exceptions is not a good idea either.
An obvious assumption & common error to jump to, but not apparently supported by the 'Eclipse Memory Analyzer' or 'Accumulated Objects' heap reports -- these indicate only one JDBC4Connection instance. He also states in comments that he creates only one 'model' with one connection and shares it.
@ThomasW I don't think Kevin claimed there are multiple connection instances, he was just saying it's not closed. On closing the connection, other resources obtained from it should be freed as well, so I think it should fix the problem of statements piling up.
Still incorrect after edit -- returning connections ASAP only helps performance if they're pooled. Closing & reopening without a pool incurs connection-setup overhead (very expensive) on every statement.
@eis -- Kevin's pretty good, but this time he saw the "obvious" problem rather than the "actual" one. What he claimed is what he claimed -- not some general lather about the good of closing connections every statement (which wouldn't be true). Developers need to solve the "actual" problem, which is why my answer recommends trying to close & reopen -- but only every 1000 statements. I also recommend 2 other possible areas to investigate. My solution is designed to solve or work around the precise problem without unnecessary loss of performance.
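The points raised in the first comment above (close resources in the reverse order they were obtained, and don't swallow exceptions) can be illustrated with try-with-resources. This is a sketch only: `SafeInsert` and `ConnectionSource` are hypothetical names, and, as the later comments note, opening a connection per call is only reasonable when the source hands out pooled connections.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Hypothetical helper showing resource handling only; it does not address
// the cost of opening a fresh, unpooled connection on every call.
final class SafeInsert {

    // Hypothetical abstraction; ideally backed by a connection pool.
    interface ConnectionSource {
        Connection open() throws SQLException;
    }

    static int insertCat(ConnectionSource source, String parola, String categoria)
            throws SQLException {
        // Resources are closed automatically in reverse order of declaration:
        // first the statement, then the connection. No null check is needed,
        // and any SQLException propagates to the caller instead of being ignored.
        try (Connection connect = source.open();
             PreparedStatement statement = connect.prepareStatement(
                     "insert into sostantivi (nome, categoria) values (?, ?)")) {
            statement.setString(1, parola);
            statement.setString(2, categoria);
            return statement.executeUpdate();
        }
    }
}
```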
