Imagine that you are working with a large dataset spread across a bunch of CSV files. You open an IPython notebook, explore the data, apply some transformations, reorder and clean things up.
Then you start running experiments on the data, create a few more notebooks, and in the end find yourself buried under a pile of notebooks, each with data transformation pipelines hidden inside.
How do you organize the process of exploring, transforming, and learning from data in such a way that:
- complexity grows gradually instead of blowing up;
- your codebase stays manageable and navigable;
- you are able to reproduce and adjust data transformation pipelines?