A simple and straightforward question: why does pandas not provide a method that allocates a DataFrame of a given size (defaulting to List[Series[object]], or, if a schema is given, allocating columns with those dtypes accordingly)?
1 Answer
Generating randomized data in pandas is quite straightforward (see: How to create a DataFrame of random integers with Pandas?). Even with the added complexity of strings and dates, it is quite doable in a few lines of code, as long as you know exactly what you need.
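For instance, a small mixed-type frame with integers, floats, strings and dates takes only a handful of lines (the column names, sizes and value ranges here are arbitrary, illustrative choices):

```python
import numpy as np
import pandas as pd

# Illustrative only: names, sizes and ranges are arbitrary choices.
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "a": rng.integers(0, 100, size=5),                      # random ints
    "b": rng.normal(size=5),                                # random floats
    "s": [f"item_{i}" for i in range(5)],                   # strings
    "d": pd.date_range("2024-01-01", periods=5, freq="D"),  # dates
})
```

But notice that every column already required an explicit decision about its type and how to fill it, which is exactly the problem below.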
As to why there isn't a convenient API endpoint to do so: it is probably because of the vast number of possibilities and edge cases to handle. Off the top of my head:
- Table schema definition (pandas lacks an elaborate schema-management abstraction like the one in pyspark)
- String generation
- Date generation, timezone management
- Handling all other objects as column values
- Defining axes
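To illustrate why even a "simple" allocator already has to make several of those choices, here is a minimal sketch of what such a method could look like. To be clear, `allocate`, the default column name `col0`, and the per-dtype fill values are all my own assumptions, not a pandas API:

```python
import numpy as np
import pandas as pd

def allocate(n_rows, schema=None):
    """Hypothetical helper (not part of pandas): pre-allocate a DataFrame.

    schema maps column name -> dtype; defaults to a single object column,
    mirroring the List[Series[object]] default from the question.
    """
    schema = schema or {"col0": "object"}
    cols = {}
    for name, dtype in schema.items():
        dtype = pd.api.types.pandas_dtype(dtype)
        if pd.api.types.is_numeric_dtype(dtype):
            # Arbitrary choice: numeric columns start zero-filled.
            cols[name] = pd.Series(np.zeros(n_rows, dtype=dtype))
        elif pd.api.types.is_datetime64_any_dtype(dtype):
            # Arbitrary choice: datetimes start as NaT.
            cols[name] = pd.Series([pd.NaT] * n_rows, dtype=dtype)
        else:
            # Everything else starts as None.
            cols[name] = pd.Series([None] * n_rows, dtype=dtype)
    return pd.DataFrame(cols)

df = allocate(4, {"x": "int64", "t": "datetime64[ns]", "s": "object"})
```

Even this toy version had to hard-code fill values per dtype, ignore timezones, extension dtypes and MultiIndex axes, and pick a default schema, so a general-purpose version would be a much larger surface to design and maintain.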
The need for something like this isn't lost on me. It's just that the desired outcome has too much variability associated with it, so much so that building a general solution around it is not really feasible. It's almost always easier to load popular datasets from seaborn or sklearn for POCs, demos, etc.
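For example, sklearn bundles the classic iris table locally (seaborn's load_dataset, by contrast, fetches its sample data over the network), so a realistic demo frame is one call away:

```python
# Assumes scikit-learn is installed; load_iris ships with the package,
# so no network access is needed.
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True).frame  # 150 rows: 4 features + target
```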
Besides, I'm guessing most organisations in the analytics space have internal tools to generate sample datasets as per their needs (or better yet, sandbox DBs schematically identical to production DBs).
That said, I'd love to contribute if there's a feature request in the future :D