26

I am trying to insert 100,000 rows in a MYSQL table under 5 seconds using Hibernate(JPA). I have tried every trick hibernate offers and still can not do better than 35 seconds.

1st optimisation : I started with IDENTITY sequence generator which was resulting in 60 seconds to insert. I later abandoned the sequence generator and started assigning the @Id field myself by reading the MAX(id) and using AtomicInteger.incrementAndGet() to assign fields myself. That reduced the insert time to 35 seconds.

2nd optimisation : I enabled batch inserts, by adding

<prop key="hibernate.jdbc.batch_size">30</prop> <prop key="hibernate.order_inserts">true</prop> <prop key="hibernate.current_session_context_class">thread</prop> <prop key="hibernate.jdbc.batch_versioned_data">true</prop>

to the configuration. I was shocked to find that batch inserts did absolutely nothing to decrease insert time. It was still 35 seconds!

Now, I am thinking about trying to insert using multiple threads. Anyone has any pointers? Should I have chosen MongoDB?

Below is my configuration: 1. Hibernate configuration `

<bean id="entityManagerFactoryBean" class="org.springframework.orm.jpa.LocalContainerEntityManagerFactoryBean">
        <property name="dataSource" ref="dataSource" />
        <property name="packagesToScan" value="com.progresssoft.manishkr" />
        <property name="jpaVendorAdapter">
            <bean class="org.springframework.orm.jpa.vendor.HibernateJpaVendorAdapter" />
        </property>
        <property name="jpaProperties">
            <props>
                <prop key="hibernate.hbm2ddl.auto">${hibernate.hbm2ddl.auto}</prop>
                <prop key="hibernate.dialect">${hibernate.dialect}</prop>
                <prop key="hibernate.show_sql">${hibernate.show_sql}</prop>
                <prop key="hibernate.format_sql">${hibernate.format_sql}</prop>
                <prop key="hibernate.jdbc.batch_size">30</prop>
                <prop key="hibernate.order_inserts">true</prop>
                <prop key="hibernate.current_session_context_class">thread</prop>
                <prop key="hibernate.jdbc.batch_versioned_data">true</prop>
            </props>
        </property>
    </bean>

    <bean class="org.springframework.jdbc.datasource.DriverManagerDataSource"
          id="dataSource">
        <property name="driverClassName" value="${database.driver}"></property>
        <property name="url" value="${database.url}"></property>
        <property name="username" value="${database.username}"></property>
        <property name="password" value="${database.password}"></property>
    </bean>

    <bean id="transactionManager" class="org.springframework.orm.jpa.JpaTransactionManager">
        <property name="entityManagerFactory" ref="entityManagerFactoryBean" />
    </bean>



    <tx:annotation-driven transaction-manager="transactionManager" />

`

  1. Entity configuration :

`

@Entity
@Table(name = "myEntity")
public class MyEntity {

    @Id
    private Integer id;

    @Column(name = "deal_id")
    private String dealId;

    ....
    ....

    @Temporal(TemporalType.TIMESTAMP)
    @Column(name = "timestamp")
    private Date timestamp;

    @Column(name = "amount")
    private BigDecimal amount;

    @OneToOne(cascade = CascadeType.ALL)
    @JoinColumn(name = "source_file")
    private MyFile sourceFile;

    public Deal(Integer id,String dealId, ....., Timestamp timestamp, BigDecimal amount, SourceFile sourceFile) {
        this.id = id;
        this.dealId = dealId;
        ...
        ...
        ...
        this.amount = amount;
        this.sourceFile = sourceFile;
    }


    public String getDealId() {
        return dealId;
    }

    public void setDealId(String dealId) {
        this.dealId = dealId;
    }

   ...

   ...


    ....

    public BigDecimal getAmount() {
        return amount;
    }

    public void setAmount(BigDecimal amount) {
        this.amount = amount;
    }

    ....


    public Integer getId() {
        return id;
    }

    public void setId(Integer id) {
        this.id = id;
    }

`

  1. Persisting code (service) :

`

@Service
@Transactional
public class ServiceImpl implements MyService{

    @Autowired
    private MyDao dao;
....

`void foo(){
        for(MyObject d : listOfObjects_100000){
            dao.persist(d);
        }
}

` 4. Dao class :

`

@Repository
public class DaoImpl implements MyDao{

    @PersistenceContext
    private EntityManager em;

    public void persist(Deal deal){
        em.persist(deal);
    }
}

`

Logs: `

DEBUG o.h.e.j.b.internal.AbstractBatchImpl - Reusing batch statement
18:26:32.906 [http-nio-8080-exec-2] DEBUG org.hibernate.SQL - insert into deal (amount, deal_id, timestamp, from_currency, source_file, to_currency, id) values (?, ?, ?, ?, ?, ?, ?)
18:26:32.906 [http-nio-8080-exec-2] DEBUG o.h.e.j.b.internal.AbstractBatchImpl - Reusing batch statement
18:26:32.906 [http-nio-8080-exec-2] DEBUG org.hibernate.SQL - insert into deal (amount, deal_id, timestamp, from_currency, source_file, to_currency, id) values (?, ?, ?, ?, ?, ?, ?)
18:26:32.906 [http-nio-8080-exec-2] DEBUG o.h.e.j.b.internal.AbstractBatchImpl - Reusing batch statement
18:26:32.906 [http-nio-8080-exec-2] DEBUG org.hibernate.SQL - insert into deal (amount, deal_id, timestamp, from_currency, source_file, to_currency, id) values (?, ?, ?, ?, ?, ?, ?)
18:26:32.906 [http-nio-8080-exec-2] DEBUG o.h.e.j.b.internal.AbstractBatchImpl - Reusing batch statement
18:26:32.906 [http-nio-8080-exec-2] DEBUG org.hibernate.SQL - insert into deal (amount, deal_id, timestamp, from_currency, source_file, to_currency, id) values (?, ?, ?, ?, ?, ?, ?)
18:26:32.906 [http-nio-8080-exec-2] DEBUG o.h.e.j.b.internal.AbstractBatchImpl - Reusing batch statement
18:26:32.906 [http-nio-8080-exec-2] DEBUG org.hibernate.SQL - insert into deal (amount, deal_id, timestamp, from_currency, source_file, to_currency, id) values (?, ?, ?, ?, ?, ?, ?)
18:26:32.906 [http-nio-8080-exec-2] 

... ...

DEBUG o.h.e.j.b.internal.AbstractBatchImpl - Reusing batch statement
18:26:34.002 [http-nio-8080-exec-2] DEBUG org.hibernate.SQL - insert into deal (amount, deal_id, timestamp, from_currency, source_file, to_currency, id) values (?, ?, ?, ?, ?, ?, ?)
18:26:34.002 [http-nio-8080-exec-2] DEBUG o.h.e.j.b.internal.AbstractBatchImpl - Reusing batch statement
18:26:34.002 [http-nio-8080-exec-2] DEBUG org.hibernate.SQL - insert into deal (amount, deal_id, timestamp, from_currency, source_file, to_currency, id) values (?, ?, ?, ?, ?, ?, ?)
18:26:34.002 [http-nio-8080-exec-2] DEBUG o.h.e.j.b.internal.AbstractBatchImpl - Reusing batch statement
18:26:34.002 [http-nio-8080-exec-2] DEBUG org.hibernate.SQL - insert into deal (amount, deal_id, timestamp, from_currency, source_file, to_currency, id) values (?, ?, ?, ?, ?, ?, ?)
18:26:34.002 [http-nio-8080-exec-2] DEBUG o.h.e.j.b.internal.AbstractBatchImpl - Reusing batch statement
18:26:34.002 [http-nio-8080-exec-2] DEBUG org.hibernate.SQL - insert into deal (amount, deal_id, timestamp, from_currency, source_file, to_currency, id) values (?, ?, ?, ?, ?, ?, ?)
18:26:34.002 [http-nio-8080-exec-2] DEBUG o.h.e.j.batch.internal.BatchingBatch - Executing batch size: 27
18:26:34.011 [http-nio-8080-exec-2] DEBUG org.hibernate.SQL - update deal_source_file set invalid_rows=?, source_file=?, valid_rows=? where id=?
18:26:34.015 [http-nio-8080-exec-2] DEBUG o.h.e.j.batch.internal.BatchingBatch - Executing batch size: 1
18:26:34.018 [http-nio-8080-exec-2] DEBUG o.h.e.t.i.jdbc.JdbcTransaction - committed JDBC Connection
18:26:34.018 [http-nio-8080-exec-2] DEBUG o.h.e.t.i.jdbc.JdbcTransaction - re-enabling autocommit
18:26:34.032 [http-nio-8080-exec-2] DEBUG o.s.orm.jpa.JpaTransactionManager - Closing JPA EntityManager [org.hibernate.jpa.internal.EntityManagerImpl@2354fb09] after transaction
18:26:34.032 [http-nio-8080-exec-2] DEBUG o.s.o.jpa.EntityManagerFactoryUtils - Closing JPA EntityManager
18:26:34.032 [http-nio-8080-exec-2] DEBUG o.h.e.j.internal.JdbcCoordinatorImpl - HHH000420: Closing un-released batch
18:26:34.032 [http-nio-8080-exec-2] DEBUG o.h.e.j.i.LogicalConnectionImpl - Releasing JDBC connection
18:26:34.033 [http-nio-8080-exec-2] DEBUG o.h.e.j.i.LogicalConnectionImpl - Released JDBC connection

'

13
  • what is your batchsize. If from your time more is spent on I/O then you can try by increasing the batchsize.Also you can consider multi threading to insert some chunk of data.As there is no contentious point it should insert parallel way. Commented May 29, 2017 at 13:31
  • @gati sahu My configured batch size is 30, though I see this in logs Executing batch size: 27 . Also I see Executing batch size: 27 only once. Kindly see logs, which I have updated. Another point is that I have read that even with multi threading, the JDBC driver inserts are in synchronised() method, hence actual inserts happen one by one? Commented May 29, 2017 at 13:39
  • @JacobBelanger How would internet pose a problem? It is a localhost environment with local tomcat & mysql running! Commented May 29, 2017 at 13:40
  • 1
    Make sure you know where your bottleneck is. The insert-statements should only be executed at commit-time, so what part of the 35 seconds is used doing database-busines and which part is just hibernate-overhead? How is the heap-usage? would flush()ing the em in some interval help? Commented May 29, 2017 at 13:42
  • 3
    1. Don't use DriverManagerDataSource use a proper connection pool. 2. Don't use Spring for tx management and mess around with the hibernate.current_session_context_class as that breaks proper integration. 3. your for loop is kind of flawed, you should flush and clear the entity manager after each x records (preferably the same as your batch size setting). Commented May 29, 2017 at 14:41

4 Answers 4

29

After trying all possible solutions I finally found a solution to insert 100,000 rows under 5 seconds!

Things I tried:

1) Replaced hibernate/database's AUTOINCREMENT/GENERATED id's by self generated ID's using AtomicInteger

2) Enabling batch_inserts with batch_size=50

3) Flushing cache after every 'batch_size' number of persist() calls

4) multithreading (did not attempt this one)

Finally what worked was using a native multi-insert query and inserting 1000 rows in one sql insert query instead of using persist() on every entity. For inserting 100,000 entities, I create a native query like this "INSERT into MyTable VALUES (x,x,x),(x,x,x).......(x,x,x)" [1000 row inserts in one sql insert query]

Now it takes around 3 seconds for inserting 100,000 records! So the bottleneck was the orm itself! For bulk inserts, the only thing that seems to work is native insert queries!

Sign up to request clarification or add additional context in comments.

5 Comments

The native multi-insert query rewrite that you did yourself, you can have the mysql/mariadb driver will do for you automatically if you set rewriteBatchedStatements=true for the connection. No need to rewrite the application!
I don't see it combined in my logs after adding <property name="reWriteBatchedInserts" value="true"> to my persistence.xml This article is also not clear of where to take PGSimpleDatasource from: vladmihalcea.com/…
@PerHuss with your native query approach what will happen if one or some of the inserts fail due to some integrity constraint violation?
how will you handle sql injection here?
i don't believe it's so fast, i used rewriteBatchedStatements and it inserts 17k records in 8 seconds.
7
  1. You are using Spring for managing the transaction but break it by using thread as the current session context. When using Spring to manage your transactions don't mess around with the hibernate.current_session_context_class property. Remove it.

  2. Don't use the DriverManagerDataSource use a proper connection pool like HikariCP.

  3. In your for loop you should flush and clear the EntityManager at regular intervals, preferably the same as your batch size. If you don't a single persist takes longer and longer, because when you do that Hibernate checks the first level cache for dirty objects, the more objects the more time it takes. With 10 or 100 it is acceptable but checking 10000s of objects for each persist will take its toll.

-

@Service
@Transactional
public class ServiceImpl implements MyService{

    @Autowired
    private MyDao dao;

    @PersistenceContext
    private EntityManager em;


    void foo(){
        int count = 0;
        for(MyObject d : listOfObjects_100000){
            dao.persist(d);
            count++;
            if ( (count % 30) == 0) {
               em.flush();
               em.clear();
            }    
        }
    }

For a more in depth explanation see this blog and this blog.

5 Comments

I did (1) & (3) and it had no effect. It still takes ~30 seconds for 100,000 inserts
Looks like this is as good as it gets? I will try multithreading as a last resort.
Multithreading will definitely help. A colleague of mine did that, splitting an ETL process in 10 subprocesses. The gain was around 7x.
Multithreading might help it might also make things work. It depends on what kind of processing needs to be done, what kind of transactionality is used etc. Never underestimate the power of a single thread.
I think it meant to say "Multithreading might help it might also make things worse"
2

Another option to consider is StatelessSession:

A command-oriented API for performing bulk operations against a database.

A stateless session does not implement a first-level cache nor interact with any second-level cache, nor does it implement transactional write-behind or automatic dirty checking, nor do operations cascade to associated instances. Collections are ignored by a stateless session. Operations performed via a stateless session bypass Hibernate's event model and interceptors. Stateless sessions are vulnerable to data aliasing effects, due to the lack of a first-level cache.

For certain kinds of transactions, a stateless session may perform slightly faster than a stateful session.

Related discussion: Using StatelessSession for Batch processing

1 Comment

But I need to work with Hibernate's JPA API's. hence need to work with EntityManager and not Session Is it possible with EntityManagers? Maybe I can try and disable the 1st level cache from EntityManager ?
-1

Uff. You can do a lot of things to increase speed.

1.) Use @DynamicInsert and @DynamicUpdate to prevent the DB from inserting non-empty columns and updating changed columns.

2.) Try to insert the columns directly (without using hibernate) into your database to see if hibernate is really your bottleneck.

3.) Use a sessionfactory and only commit your transaction every e.g. 100 inserts. Or only open and close the transaction once and flush your data every 100 inserts.

4.) Use the ID generation strategy "sequence" and let hibernate preallocate (via the parameter allocationsize) the IDs.

5.) Use caches.

Some of this possible solutions can have timing disadvantages when not used correctly. But you have a lot of opportunities.

2 Comments

1. can actually lead to worse performance as hibernate needs to create a single query per row and needs to check if a field changed... All of that takes time and might even destroy the possibility to batch statements resulting in single statements instead of batched statements.
Yes M. Deinum, can but must not! If you are having columns with BLOBs than those annotations give you a lot more performance because they will not be updated when unchanged! As I wrote: "some options can have timinig disadvantages".

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.