I write data to Elasticsearch using the parallel_bulk function in Python, but the performance is very low: writing 10,000 documents takes 180 s. I set the following index settings:

"settings": {
            "number_of_shards": 5,
            "number_of_replicas": 0,
            "refresh_interval": "30s",
            "index.translog.durability": "async",
            "index.translog.sync_interval": "30s"
       }

and in the elasticsearch.yml, I set:

bootstrap.memory_lock: true
indices.memory.index_buffer_size: 20%
indices.memory.min_index_buffer_size: 96mb
# Search pool
thread_pool.search.size: 5
thread_pool.search.queue_size: 100

thread_pool.bulk.queue_size: 300
thread_pool.index.queue_size: 300

indices.fielddata.cache.size: 40%

discovery.zen.fd.ping_timeout: 120s
discovery.zen.fd.ping_retries: 6
discovery.zen.fd.ping_interval: 30s

But it doesn't improve the performance. How can I speed this up? I use Elasticsearch 6.5.4 on Windows 10 with only one node, and I yield the data from Oracle to Elasticsearch.
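For reference, this is a minimal sketch of how parallel_bulk is usually driven from an Oracle cursor and tuned; the connection string, index name, chunk_size, and thread_count below are illustrative assumptions, not values from the question:

    from elasticsearch import Elasticsearch
    from elasticsearch.helpers import parallel_bulk
    import cx_Oracle  # assumed Oracle driver

    es = Elasticsearch("http://localhost:9200")

    # Placeholder credentials -- replace with your own connection string.
    conn = cx_Oracle.connect("user/password@localhost/orcl")
    cursor = conn.cursor()
    cursor.execute("select * from tem_search_engine_1 where rownum <= 10000")
    cols = [c[0].lower() for c in cursor.description]

    def gen_actions():
        # Build one bulk action per row, streamed lazily from the cursor.
        for row in cursor:
            doc = dict(zip(cols, (str(v) for v in row)))
            yield {
                "_index": "your_index_name",
                "_type": "your_doc_type",
                "_id": doc.get("tbl_id"),
                "_source": doc,
            }

    # Larger chunks and a few worker threads usually matter more than
    # index-level settings for a 10,000-document load.
    for ok, info in parallel_bulk(es, gen_actions(), thread_count=4, chunk_size=1000):
        if not ok:
            print(info)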


1 Answer


Following the code from yesterday's post, you can try to create an ES dump of the Oracle DB:

import json
import time

class CreateDump(object):
    def __init__(self):
        self.output = r"/home/littlely/Scrivania/oracle_dump.json"
        self.index_name = "your_index_name"
        self.doc_type = "your_doc_type"
        # self.cursor is assumed to be an already-open Oracle cursor,
        # assigned from outside before calling _gen_data()
        self.cursor = None

    def _gen_data(self, index, doc_type, chunk_size):
        sql = """select * from tem_search_engine_1 where rownum <= 10000"""
        self.cursor.execute(sql)
        col_name_list = [col[0].lower() for col in self.cursor.description]
        col_name_len = len(col_name_list)

        start = time.time()
        for row in self.cursor:
            source = {}
            tbl_id = ""
            for i in range(col_name_len):
                source.update({col_name_list[i]: str(row[i])})
                if col_name_list[i] == "tbl_id":
                    tbl_id = row[i]
            # write each document inside the row loop, not after it
            self.writeOnFS(source, tbl_id)
        print("dump written in %.1fs" % (time.time() - start))

    def writeOnFS(self, source, tbl_id):
        # Append one bulk action line followed by its document line.
        # Note: reopening the file for every row is costly for large dumps.
        with open(self.output, 'a') as f:
            prep = json.dumps({"index": {"_index": self.index_name, "_type": self.doc_type, "_id": tbl_id}})
            data = json.dumps(source)
            f.write(prep + "\n")
            f.write(data + "\n")
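For completeness, a short sketch of how this class might be wired to Oracle; the cx_Oracle connection string is a placeholder, and the cursor is assigned from outside as the class expects:

    import cx_Oracle  # assumed Oracle driver

    # Placeholder credentials -- replace with your own connection string.
    conn = cx_Oracle.connect("user/password@localhost/orcl")

    dumper = CreateDump()
    dumper.cursor = conn.cursor()      # the class expects an open cursor
    dumper._gen_data(dumper.index_name, dumper.doc_type, chunk_size=1000)

    conn.close()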

Then you will find the Oracle dump at the self.output path. So you only need to bulk-load the JSON file; the --data-binary path is the self.output path:

curl -s -H "Content-Type: application/x-ndjson" -XPOST localhost:9200/<your_index_name>/<your_doc_type>/_bulk --data-binary @/home/littlely/Scrivania/oracle_dump.json

Or, if the file is too big, install GNU Parallel. On Ubuntu:

sudo apt-get install parallel

and then:

cat /home/littlely/Scrivania/oracle_dump.json | parallel --pipe -L 2 -N 2000 -j3 'curl -H "Content-Type: application/x-ndjson" -s http://localhost:9200/<your_index_name>/_bulk --data-binary @- > /dev/null'

Enjoy!


1 Comment

When using writeOnFS, it's still very slow
