
I have a 500+ MB CSV data file. My question is: which would be faster for data manipulation (e.g., reading, processing)? On one hand, the Python MySQL client might be faster, since all the work is mapped into SQL queries and optimization is left to the query optimizer. On the other hand, Pandas deals with the file directly, which should be faster than communicating with a server.

I have already checked "Large data" work flows using pandas, Best practices for importing large CSV files, Fastest way to write large CSV with Python, and Most efficient way to parse a large .csv in python?. However, I haven't really found any comparison regarding Pandas and MySQL.

Use Case:

I am working on a text dataset that consists of 1,737,123 rows and 8 columns. I am feeding this dataset into an RNN/LSTM network. I do some preprocessing prior to feeding, namely encoding using a customized encoding algorithm.

More details

I have 250+ experiments to do and 12 architectures (different model designs) to try.

I am confused; I feel I am missing something.

  • I've found the fastest way for loading MySQL data is to do it through LOAD DATA INFILE. It's by far the most efficient route (see the sketch after this list). Commented Oct 20, 2018 at 19:46
  • @FrankerZ Could you please clarify: do you mean it's the most efficient even when compared with other Python techniques, or only the most efficient way of loading into MySQL? Commented Oct 20, 2018 at 19:48
  • Voting to close as unclear: impossible to answer without knowing your use scenario(s). Commented Oct 20, 2018 at 19:53
  • Okay, I will give more details. Commented Oct 20, 2018 at 19:55
  • @ivan_pozdeev is the use case scenario clear enough? Commented Oct 20, 2018 at 19:59
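
For reference, here is a minimal sketch of driving LOAD DATA INFILE from Python, assuming MySQL Connector/Python as the client; the table name, credentials, and file path are hypothetical:

    import mysql.connector  # assumed driver; any DB-API client works similarly

    # LOCAL INFILE must be enabled explicitly in recent Connector/Python versions.
    conn = mysql.connector.connect(user="user", password="secret",
                                   database="db", allow_local_infile=True)
    cur = conn.cursor()

    # Bulk-load the CSV server-side in one statement instead of row-by-row INSERTs.
    cur.execute(r"""
        LOAD DATA LOCAL INFILE '/path/to/data.csv'
        INTO TABLE experiments
        FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
        LINES TERMINATED BY '\n'
        IGNORE 1 LINES
    """)
    conn.commit()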

1 Answer


There's no comparison online because these two scenarios give different results:

  • With Pandas, you end up with a DataFrame in memory (backed by a NumPy ndarray under the hood), accessible as native Python objects
  • With the MySQL client, you end up with the data in a MySQL database on disk (unless you're using an in-memory database), accessible via IPC/sockets (both scenarios are sketched below)
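
A minimal sketch contrasting the two access patterns; the credentials, file path, and experiments table are assumptions, not from the question:

    import pandas as pd
    import mysql.connector  # assumed driver; any DB-API client behaves similarly

    # Scenario 1: Pandas parses the file straight into an in-memory DataFrame;
    # no server round-trip, columns are backed by NumPy ndarrays.
    df = pd.read_csv("data.csv")

    # Scenario 2: the data lives server-side; every read crosses IPC/sockets,
    # and each row is converted into Python tuples on fetch.
    conn = mysql.connector.connect(user="user", password="secret", database="db")
    cur = conn.cursor()
    cur.execute("SELECT * FROM experiments")
    rows = cur.fetchall()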

So, the performance will depend on:

  • how much data needs to be transferred by lower-speed channels (IPC, disk, network)
  • how fast transferring is relative to processing (i.e. which of them is the bottleneck)
  • which data format your processing facilities prefer (i.e. what additional conversions will be involved)

E.g.:

  • If your processing can live in the same (Python) process that reads the data, reading it directly into Python types is preferable, since you won't need to transfer it all to the MySQL process and then back again (converting formats each time). See the sketch after this list.
  • OTOH, if your processing is implemented in some other process and/or language, or e.g. resides within a computing cluster, hooking it up to MySQL directly may be faster: it takes the comparatively slow Python out of the equation, and you would need to transfer the data again and convert it into the processing app's native objects anyway.
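
For the asker's use case (encoding rows in the same Python process before feeding an RNN/LSTM), the first option might look like this chunked read; encode_batch is a hypothetical stand-in for the custom encoding algorithm:

    import pandas as pd

    def encode_batch(frame):
        # Placeholder for the customized encoding described in the question.
        return frame

    # Stream the 500+ MB file in chunks: everything stays in one process,
    # so no bytes cross a socket and no SQL <-> Python conversion happens.
    for chunk in pd.read_csv("data.csv", chunksize=100_000):
        batch = encode_batch(chunk)
        # ... feed `batch` into the RNN/LSTM training loop here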