Why does pandas not provide a method that allocates a DataFrame of a given size (defaulting to List[Series[object]], or, if a schema is given, allocating columns with those dtypes)?

1 Answer

Generating randomized data in pandas is quite straightforward (see "How to create a DataFrame of random integers with Pandas?"). Even with the added complexity of strings and dates, it is doable in a few lines of code, as long as you know exactly what you need.
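As a rough illustration of the "few lines of code" claim, here is a minimal sketch that builds a small frame with random integers, floats, strings and dates. The column names, value ranges, string pool and base date are all my own illustrative assumptions, not anything pandas prescribes.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)  # seeded so the sample is reproducible
n_rows = 5

df = pd.DataFrame({
    # random integers in [0, 100)
    "id": rng.integers(0, 100, size=n_rows),
    # random floats in [0, 1)
    "score": rng.random(n_rows),
    # random strings drawn from a small fixed pool (illustrative choice)
    "label": rng.choice(["a", "b", "c"], size=n_rows),
    # random dates: a base timestamp plus a random number of days
    "when": pd.Timestamp("2024-01-01")
    + pd.to_timedelta(rng.integers(0, 365, size=n_rows), unit="D"),
})

print(df.dtypes)
```

Every one of those four columns needed its own decision (range, pool, base date), which is the part that is hard to generalise.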

As for why there isn't a convenient built-in method for this, it is probably because of the vast number of possibilities and edge cases to handle. Off the top of my head:

  1. Table schema definition (pandas lacks an elaborate schema management abstraction like the one in PySpark)
  2. String generation
  3. Date generation, timezone management
  4. Handling all other objects as column values
  5. Defining axes
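To make the schema point concrete, here is a minimal sketch of what a preallocation helper might look like. `allocate_frame` is a hypothetical name of mine, not a pandas function, and I am assuming the schema is a plain dict mapping column names to NumPy dtype strings, with every column zero-initialised.

```python
import numpy as np
import pandas as pd


def allocate_frame(n_rows, schema):
    """Hypothetical helper: build a zero-filled DataFrame from a schema.

    `schema` is assumed to map column names to NumPy dtype strings.
    """
    return pd.DataFrame(
        # one zero-initialised NumPy array per column, in the requested dtype
        {name: np.zeros(n_rows, dtype=dtype) for name, dtype in schema.items()}
    )


df = allocate_frame(3, {"id": "int64", "price": "float64", "ts": "datetime64[ns]"})
print(df.dtypes)
```

Even this tiny version has to pick a fill value (zero here, which becomes the Unix epoch for the datetime column) and it only covers NumPy dtypes. Timezone-aware dates, nullable or categorical dtypes and arbitrary objects would each need separate handling, which is the variability listed above.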

The need for something like this isn't lost on me. It's just that the desired outcome varies so much that building a general solution around it isn't really feasible. For POCs, demos and the like, it's almost always easier to load popular datasets from seaborn or sklearn.

Besides, I'm guessing most organisations in the analytics space have internal tools to generate sample datasets as per their needs (or better yet, sandbox DBs schematically identical to production DBs).

That said, I'd love to contribute if there's a feature request in the future :D
