
Imagine that you are working with a large dataset distributed over a bunch of CSV files. You open an IPython notebook and start exploring: you apply some transformations, reorder and clean up the data.

Then you start experimenting with the data, create a few more notebooks, and eventually find yourself with a pile of different notebooks, each with data transformation pipelines buried inside it.

How can you organize the data exploration/transformation/learning process in such a way that:

  • complexity doesn't blow up, but grows gradually;
  • your codebase stays manageable and navigable;
  • you can reproduce and adjust your data transformation pipelines?

1 Answer


Well, I run into this problem now and then when working with a big set of data. Complexity is something I've learned to live with; sometimes it's hard to keep things simple.

What I think helps me a lot is putting everything in a Git repository. If you manage it well and make frequent commits with well-written messages, you can easily track the transformations applied to your data.
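A minimal sketch of what that looks like in practice (the notebook file name and commit messages here are made up for illustration):

```shell
# Create a throwaway repository to demonstrate the workflow
cd "$(mktemp -d)"
git init -q

# Commit each transformation step with a descriptive message
echo "load csv files" > transform.ipynb
git add transform.ipynb
git commit -q -m "Load raw CSV files"

echo "drop empty rows" >> transform.ipynb
git commit -q -am "Drop empty rows before aggregation"

# The history now reads as a log of your pipeline's evolution
git log --oneline
```

With messages written this way, `git log` itself becomes a readable summary of how your dataset was transformed over time.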

Every time I run an experiment, I create a new branch and do my work on it. If it leads nowhere, I just go back to my master branch and keep working from there, but the work I did is still available for reference if I need it.

If it leads to something useful, I merge it into my master branch and keep working on new experiments, creating new branches as needed.
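The branch-per-experiment workflow above can be sketched like this (the branch name and file contents are illustrative; `main_branch` is read from the repository rather than hard-coded, since the default branch name varies between Git versions):

```shell
# Set up a throwaway repository with one committed notebook
cd "$(mktemp -d)"
git init -q
echo "baseline pipeline" > pipeline.ipynb
git add pipeline.ipynb
git commit -q -m "Baseline pipeline"
main_branch="$(git symbolic-ref --short HEAD)"

# Start an experiment on its own branch
git checkout -q -b experiment/normalize-columns
echo "normalize column names" >> pipeline.ipynb
git commit -q -am "Normalize column names"

# The experiment worked out, so merge it back;
# if it hadn't, checking out $main_branch alone would
# leave the baseline untouched, with the branch kept for reference
git checkout -q "$main_branch"
git merge -q experiment/normalize-columns
```

The key property is that a failed experiment costs nothing: the master branch is never touched until you deliberately merge.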

I don't think this answers all of your question, and I don't know whether you already use some form of version control for your notebooks, but it is something that helps me a lot, and I really recommend it when working with Jupyter notebooks.


