
I am a beginner with Elasticsearch, and I have to write one million random events into an Elasticsearch cluster (hosted in the cloud) with a Python script...

from elasticsearch import Elasticsearch
import certifi

es = Elasticsearch(
    [host_name],                 # host_name is defined elsewhere in the script
    port=9243,
    http_auth=("*****", "*******"),
    use_ssl=True,
    verify_certs=True,
    ca_certs=certifi.where(),
    sniff_on_start=True
)

Here's my code for the indexing:

import random
import uuid
from datetime import timedelta

import numpy as np

# 'data center a' is chosen as the source about twice as often as the others.
src_centers = ['data center a', 'data center b', 'data center c', 'data center d', 'data center e']
transfer_status = ['transfer-success', 'transfer-failure']

for i in range(1000000):
    transfer_src = np.random.choice(src_centers, p=[0.3, 0.175, 0.175, 0.175, 0.175])

    # The destination is any site other than the source, chosen uniformly.
    dst_centers = [x for x in src_centers if x != transfer_src]
    transfer_dst = np.random.choice(dst_centers)

    transfer_starttime = generate_timestamp()            # user-defined helper
    file_size = random.randrange(1024, 10000000000)      # random size from 1 KiB up to ~10 GB
    ftp = {
        'event_type': 'transfer-queued',
        'uuid': str(uuid.uuid4()),   # stringified so the document is JSON-serializable
        'src_site': transfer_src,
        'dst_site': transfer_dst,
        'timestamp': transfer_starttime,
        'bytes': file_size
    }
    print(i)
    es.index(index='ft_initial', id=(i + 1), doc_type='initial_transfer_details', body=ftp)

    # 95% of transfers succeed; failures get a fixed 10 s delay.
    final_status = np.random.choice(transfer_status, p=[0.95, 0.05])
    ftp['event_type'] = final_status

    if final_status == 'transfer-failure':
        time_delay = 10
    else:
        time_delay = int(transfer_time(file_size))   # user-defined helper; ranges roughly from 0-10000 s

    ftp['timestamp'] = transfer_starttime + timedelta(seconds=time_delay)
    es.index(index='ft_final', id=(i + 1), doc_type='final_transfer_details', body=ftp)

Is there a faster way to do this indexing?

Any help/pointers will be appreciated. Thanks.

  • What do you want to speed up? The indexing? The program itself? Please clarify your request. Commented Mar 12, 2017 at 21:40
  • Can you share your cluster topology with us: number of shards, nodes (master/data), and the hardware specifications of the cluster machines? It would also help to add your elasticsearch.yml file. Commented Mar 13, 2017 at 4:51
  • Topology: {event_type: "transfer-queued", uuid: 471a885a-9d8a-4212-8ebc-d1bc96c91b3b, bytes: 5411345, timestamp: 2017-03-04T05:40:40, src_site: "data centre a", dst_site: "data centre c"} Commented Mar 14, 2017 at 10:59

1 Answer

  1. Use bulk requests; otherwise you pay a lot of per-request overhead for every single document (see the first sketch below): https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html
  2. Lower the refresh rate, or ideally disable it entirely until you're done (see the second sketch below): https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-update-settings.html#bulk
  3. Use monitoring (there's a free basic license) to see what the bottleneck actually is (IO, memory, CPU): https://www.elastic.co/guide/en/x-pack/current/xpack-monitoring.html
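For point 1, here is a minimal sketch of what the bulk approach could look like with the Python client's helpers module (the asker confirms below that helpers.bulk() solved it). build_event is a hypothetical stand-in for the document-building logic in the question's loop; the es client, index, and doc type are taken from the question:

from elasticsearch import helpers

def generate_actions(n):
    # Yield one bulk action per document instead of calling es.index() per document.
    for i in range(n):
        yield {
            '_index': 'ft_initial',
            '_type': 'initial_transfer_details',
            '_id': i + 1,
            '_source': build_event(i),   # hypothetical helper: builds one event dict
        }

# helpers.bulk batches the generator into bulk requests;
# chunk_size is the number of documents sent per request, tune to taste.
helpers.bulk(es, generate_actions(1000000), chunk_size=5000)

For point 2, disabling refresh for the duration of the load and restoring it afterwards could look roughly like this ('1s' is the default interval; adjust if your index uses a different value):

# Stop refreshing while the bulk load runs.
es.indices.put_settings(index='ft_initial', body={'index': {'refresh_interval': '-1'}})

# ... run the bulk load ...

# Restore the interval and force one refresh so the documents become searchable.
es.indices.put_settings(index='ft_initial', body={'index': {'refresh_interval': '1s'}})
es.indices.refresh(index='ft_initial')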

1 Comment

Thanks, I solved it by doing exactly that: I used the helpers.bulk() function.
