I write data to Elasticsearch using the parallel_bulk function in Python, but the performance is very low: writing 10,000 documents takes 180 s. I set the following index settings:

"settings": {
            "number_of_shards": 5,
            "number_of_replicas": 0,
            "refresh_interval": "30s",
            "index.translog.durability": "async",
            "index.translog.sync_interval": "30s"
       }

and in the elasticsearch.yml, I set:

bootstrap.memory_lock: true
indices.memory.index_buffer_size: 20%
indices.memory.min_index_buffer_size: 96mb
# Search pool
thread_pool.search.size: 5
thread_pool.search.queue_size: 100

thread_pool.bulk.queue_size: 300
thread_pool.index.queue_size: 300

indices.fielddata.cache.size: 40%

discovery.zen.fd.ping_timeout: 120s
discovery.zen.fd.ping_retries: 6
discovery.zen.fd.ping_interval: 30s

But it doesn't improve the performance. How can I speed this up? I use Elasticsearch 6.5.4 on Windows 10 with only one node, and I yield the data from Oracle to Elasticsearch.
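For reference, this is a minimal sketch of how parallel_bulk is usually driven from an Oracle cursor and tuned; the connection string, index name, chunk_size, and thread_count below are illustrative assumptions, not values from the question:

    from elasticsearch import Elasticsearch
    from elasticsearch.helpers import parallel_bulk
    import cx_Oracle  # assumed Oracle driver

    es = Elasticsearch("http://localhost:9200")

    # Placeholder credentials -- replace with your own connection string.
    conn = cx_Oracle.connect("user/password@localhost/orcl")
    cursor = conn.cursor()
    cursor.execute("select * from tem_search_engine_1 where rownum <= 10000")
    cols = [c[0].lower() for c in cursor.description]

    def gen_actions():
        # Build one bulk action per row, streamed lazily from the cursor.
        for row in cursor:
            doc = dict(zip(cols, (str(v) for v in row)))
            yield {
                "_index": "your_index_name",
                "_type": "your_doc_type",
                "_id": doc.get("tbl_id"),
                "_source": doc,
            }

    # Larger chunks and a few worker threads usually matter more than
    # index-level settings for a 10,000-document load.
    for ok, info in parallel_bulk(es, gen_actions(), thread_count=4, chunk_size=1000):
        if not ok:
            print(info)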


1 Answer


Following the code from yesterday's post, you can try to create an ES dump of the Oracle DB:

import json
import time

class CreateDump(object):
    def __init__(self):
        self.output = r"/home/littlely/Scrivania/oracle_dump.json"
        self.index_name = "your_index_name"
        self.doc_type = "your_doc_type"
        # self.cursor is assumed to be an already-open Oracle cursor,
        # assigned from outside before calling _gen_data()
        self.cursor = None

    def _gen_data(self, index, doc_type, chunk_size):
        sql = """select * from tem_search_engine_1 where rownum <= 10000"""
        self.cursor.execute(sql)
        col_name_list = [col[0].lower() for col in self.cursor.description]
        col_name_len = len(col_name_list)

        start = time.time()
        for row in self.cursor:
            source = {}
            tbl_id = ""
            for i in range(col_name_len):
                source.update({col_name_list[i]: str(row[i])})
                if col_name_list[i] == "tbl_id":
                    tbl_id = row[i]
            # write each document inside the row loop, not after it
            self.writeOnFS(source, tbl_id)
        print("dump written in %.1fs" % (time.time() - start))

    def writeOnFS(self, source, tbl_id):
        # Append one bulk action line followed by its document line.
        # Note: reopening the file for every row is costly for large dumps.
        with open(self.output, 'a') as f:
            prep = json.dumps({"index": {"_index": self.index_name, "_type": self.doc_type, "_id": tbl_id}})
            data = json.dumps(source)
            f.write(prep + "\n")
            f.write(data + "\n")
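For completeness, a short sketch of how this class might be wired to Oracle; the cx_Oracle connection string is a placeholder, and the cursor is assigned from outside as the class expects:

    import cx_Oracle  # assumed Oracle driver

    # Placeholder credentials -- replace with your own connection string.
    conn = cx_Oracle.connect("user/password@localhost/orcl")

    dumper = CreateDump()
    dumper.cursor = conn.cursor()      # the class expects an open cursor
    dumper._gen_data(dumper.index_name, dumper.doc_type, chunk_size=1000)

    conn.close()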

Then you will find the Oracle dump at the self.output path. So you only need to bulk-load the JSON file; the --data-binary path is the self.output path:

curl -s -H "Content-Type: application/x-ndjson" -XPOST localhost:9200/<your_index_name>/<your_doc_type>/_bulk --data-binary @/home/littlely/Scrivania/oracle_dump.json

Or, if the file is too big, install GNU Parallel. On Ubuntu:

sudo apt-get install parallel

and then:

cat /home/littlely/Scrivania/oracle_dump.json | parallel --pipe -L 2 -N 2000 -j3 'curl -H "Content-Type: application/x-ndjson" -s http://localhost:9200/<your_index_name>/_bulk --data-binary @- > /dev/null'

Enjoy!


1 Comment

When using writeOnFS, it's still very slow
