
Imagine that you are working with a large dataset distributed over a bunch of CSV files. You open an IPython notebook and start exploring: you apply some transformations, reorder and clean up the data.

Then you start experimenting with the data, create a few more notebooks, and eventually find yourself with a pile of different notebooks, each with data transformation pipelines buried inside it.

How can you organize the data exploration/transformation/learning process in such a way that:

  • complexity doesn't blow up, but grows gradually;
  • your codebase stays manageable and navigable;
  • you can reproduce and adjust your data transformation pipelines?

1 Answer


Well, I run into this problem now and then when working with a big set of data. Complexity is something I've learned to live with; sometimes it's hard to keep things simple.

What I think helps me a lot is putting everything in a Git repository. If you manage it well and make frequent commits with well-written messages, you can easily track the transformations applied to your data.
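A minimal sketch of what that looks like in practice (the notebook file name and commit messages here are made up for illustration):

```shell
# Create a throwaway repository to demonstrate the workflow
cd "$(mktemp -d)"
git init -q

# Commit each transformation step with a descriptive message
echo "load csv files" > transform.ipynb
git add transform.ipynb
git commit -q -m "Load raw CSV files"

echo "drop empty rows" >> transform.ipynb
git commit -q -am "Drop empty rows before aggregation"

# The history now reads as a log of your pipeline's evolution
git log --oneline
```

With messages written this way, `git log` itself becomes a readable summary of how your dataset was transformed over time.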

Every time I run an experiment, I create a new branch and do my work on it. If it leads nowhere, I just go back to my master branch and keep working from there, but the work I did is still available for reference if I need it.

If it leads to something useful, I merge it into my master branch and keep working on new experiments, creating new branches as needed.
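The branch-per-experiment workflow above can be sketched like this (the branch name and file contents are illustrative; `main_branch` is read from the repository rather than hard-coded, since the default branch name varies between Git versions):

```shell
# Set up a throwaway repository with one committed notebook
cd "$(mktemp -d)"
git init -q
echo "baseline pipeline" > pipeline.ipynb
git add pipeline.ipynb
git commit -q -m "Baseline pipeline"
main_branch="$(git symbolic-ref --short HEAD)"

# Start an experiment on its own branch
git checkout -q -b experiment/normalize-columns
echo "normalize column names" >> pipeline.ipynb
git commit -q -am "Normalize column names"

# The experiment worked out, so merge it back;
# if it hadn't, checking out $main_branch alone would
# leave the baseline untouched, with the branch kept for reference
git checkout -q "$main_branch"
git merge -q experiment/normalize-columns
```

The key property is that a failed experiment costs nothing: the master branch is never touched until you deliberately merge.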

I don't think this answers all of your question, and I don't know whether you already use some form of version control for your notebooks, but it is something that helps me a lot, and I really recommend it when working with Jupyter notebooks.


