I need to generate a few hundred GB of test data to load-test an Elasticsearch index. This means hundreds of millions of documents ("docs").

The format I need to generate looks like:

{
field 1 : <ip>
field 2 : <uuid>
field 3 : <date>
field 4 : <json-array of strings>
...
...
}

with around 40-50 fields per doc. The generated data needs to match the index template of the specific index I am required to test.

That sounds straightforward: a normal JSON dataset generator that can produce a few hundred million JSON docs would be the way to go, provided I can find one that supports this format.

The problem is that the ES Bulk API requires the upload to be supplied in the following way: for EACH doc, first a "command" JSON line containing the metadata for the doc, then the JSON doc itself on the next line:

POST _bulk
{ "index" : { "_index" : "test", "_id" : "1" } }
{ "field1" : "value1" }
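To make the target output concrete, here is a minimal Python sketch of a generator that emits the command line followed by the doc line. It is only an illustration under assumptions: the field names `field1` to `field4` stand in for the real template, the value pools are made up, and reusing the UUID field as `_id` is my own choice, not something the Bulk API requires.

```python
import json
import random
import uuid
from datetime import datetime, timedelta, timezone
from ipaddress import IPv4Address


def make_doc(rng):
    """Build one doc. Field names and value ranges are placeholders."""
    base = datetime(2020, 1, 1, tzinfo=timezone.utc)
    return {
        "field1": str(IPv4Address(rng.getrandbits(32))),               # <ip>
        "field2": str(uuid.UUID(int=rng.getrandbits(128), version=4)),  # <uuid>
        "field3": (base + timedelta(seconds=rng.randrange(86400 * 365))).isoformat(),
        "field4": [rng.choice(["a", "b", "c"]) for _ in range(3)],      # array of strings
    }


def bulk_lines(index, count, rng):
    """Yield newline-terminated lines: one command line, then one doc line."""
    for _ in range(count):
        doc = make_doc(rng)
        # Assumption: the doc's UUID doubles as its _id.
        action = {"index": {"_index": index, "_id": doc["field2"]}}
        yield json.dumps(action) + "\n"
        yield json.dumps(doc) + "\n"


if __name__ == "__main__":
    import sys
    sys.stdout.writelines(bulk_lines("test", 2, random.Random()))
```

Every line, including the last one, ends with a newline, so the output can be written to a file and sent as the body of a `_bulk` request.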

The one free solution that looks like it might handle generating huge datasets only supports generating JSON in a single uniform format, which means I can't make it emit the command line followed by the doc.

So I tried generating the data with my own bash script, based on some pre-existing data (I randomize the doc ID and some other fields). The problem is that I need to run the script in parallel, up to hundreds of instances at once, to generate the data in a timely manner. Reading from /dev/urandom in bash appears to "conflict": the scripts produce the -same- random data when run in parallel, whereas I need the doc IDs to be unique.
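One way to sidestep the collision problem, whatever its cause, is to make doc IDs unique by construction instead of relying on randomness. The sketch below assumes each parallel instance is started with its own worker number; the names `doc_ids` and `worker_rng`, the ID layout, and the seed arithmetic are all hypothetical.

```python
import random


def doc_ids(worker_id, count):
    """Yield IDs of the form '<worker>-<sequence>'.

    Two workers with different worker_id values can never produce the
    same ID, because the worker number is part of every ID.
    """
    for seq in range(count):
        yield f"{worker_id:04d}-{seq:012d}"


def worker_rng(worker_id, base_seed=12345):
    """Return a generator seeded per worker for the other random fields.

    Assumption: distinct seeds are good enough for test data; this does
    not guarantee statistically independent streams.
    """
    return random.Random(base_seed * 1_000_003 + worker_id)


if __name__ == "__main__":
    rng = worker_rng(7)
    for doc_id in doc_ids(7, 3):
        print(doc_id, rng.randrange(1_000_000))
```

Seeding per worker also makes a run reproducible: the same worker number always regenerates the same data, which can help when a load test has to be repeated.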

This is getting long, but any help with either of the following would be appreciated:

1) A free solution which can generate large JSON datasets in the format I need

OR

2) A way to make random generation in bash produce distinct values when the script is run in parallel

Thanks.

  • I should have added: the generated data needs to fit a specific index template, so I can't just ingest the Wikipedia data, as it would not fit that template. Thank you. Commented Mar 25, 2020 at 13:51
  • Can't you use the same JSON with some different values, using JMeter? Commented Mar 25, 2020 at 15:40
  • There are several solutions: stackoverflow.com/a/40586333/4604579 (using logstash) or stackoverflow.com/a/33981143/4604579 (using python) or stackoverflow.com/a/45604500/4604579 (using jq) Commented Mar 25, 2020 at 16:19
