Q : "How can I ... Any help would be much appreciated!"
A :
Best follow the laws of the ECONOMY-of-COMPUTING :
Your briefly sketched problem has, without question, immense "setup"-costs, with an unspecified amount of useful work to be computed on an unspecified HPC-ecosystem.
Even without the hardware & rental details ( the devil is always hidden in the detail(s) & one can easily pay ridiculous amounts of money for trying to make a ( covertly ) "shared" platform deliver any improved computing performance - many startups have experienced this on voucher-sponsored promises, the more so if the overall computing strategy was poorly designed ),
I cannot resist quoting the so-called 1st Etore's Law of the Evolution of Systems' Dynamics :
If we open a can of worms,
the only way how to put them back
is to use a bigger can
Closing our eyes so as not to see the accumulating inefficiencies is the worst sin of sins, as a devastatingly exponential growth of all costs, time & resources, incl. energy consumption, is common to meet on such complex systems, often stacked with many levels of inefficiencies.
ELEMENTARY RULES-of-THUMB ... how much we pay in [TIME]
Sorry if these were already known to you beforehand; I am just trying to build some common ground, as a platform on which to lay further argumentation rock-solid. More details are here, and this is only a needed beginning, as more problems will definitely come from any real-world O( Mx * Ny * ... )-scaling related issues in further modelling.
0.1 ns - CPU NOP - a DO-NOTHING instruction
0.5 ns - CPU L1 dCACHE reference (1st introduced in the late 80s )
1 ns - speed-of-light (a photon) travel a 1 ft (30.5cm) distance -- will stay, throughout any foreseeable future :o)
3~4 ns - CPU L2 CACHE reference (2020/Q1)
7 ns - CPU L2 CACHE reference
19 ns - CPU L3 CACHE reference (2020/Q1 considered slow on 28c Skylake)
______________________on_CPU______________________________________________________________________________________
71 ns - CPU cross-QPI/NUMA best case on XEON E5-46*
100 ns - own DDR MEMORY reference
135 ns - CPU cross-QPI/NUMA best case on XEON E7-*
325 ns - CPU cross-QPI/NUMA worst case on XEON E5-46*
2,500 ns - Read 10 kB sequentially from MEMORY------ HPC-node
25,000 ns - Read 100 kB sequentially from MEMORY------ HPC-node
250,000 ns - Read 1 MB sequentially from MEMORY------ HPC-node
2,500,000 ns - Read 10 MB sequentially from MEMORY------ HPC-node
25,000,000 ns - Read 100 MB sequentially from MEMORY------ HPC-node (abstracted from shared physical RAM-I/O-channels)
250,000,000 ns - Read 1 GB sequentially from MEMORY------ HPC-node (abstracted from shared physical RAM-I/O-channels)
2,500,000,000 ns - Read 10 GB sequentially from MEMORY------ HPC-node (abstracted from shared physical RAM-I/O-channels)
25,000,000,000 ns - Read 100 GB sequentially from MEMORY------ HPC-node (abstracted from shared physical RAM-I/O-channels)
_____________________________________________________________________________own_CPU/DDR__________________________
| | | | |
| | | | ns|
| | | us|
| | ms|
| s|
h|
500,000 ns - Round trip within the same DataCenter ------- HPC-node / HPC-storage latency on each access
20,000,000 ns - Send 2 MB over 1 Gbps NETWORK
200,000,000 ns - Send 20 MB over 1 Gbps NETWORK
2,000,000,000 ns - Send 200 MB over 1 Gbps NETWORK
20,000,000,000 ns - Send 2 GB over 1 Gbps NETWORK
200,000,000,000 ns - Send 20 GB over 1 Gbps NETWORK
2,000,000,000,000 ns - Send 200 GB over 1 Gbps NETWORK
____________________________________________________________________________via_LAN_______________________________
150,000,000 ns - Send a NETWORK packet CA -> Netherlands
____________________________________________________________________________via_WAN_______________________________
10,000,000 ns - DISK seek latency spent to start file-I/O on spinning disks, paid on any next piece of data to seek/read
30,000,000 ns - DISK 1 MB sequential READ from a DISK
300,000,000 ns - DISK 10 MB sequential READ from a DISK
3,000,000,000 ns - DISK 100 MB sequential READ from a DISK
30,000,000,000 ns - DISK 1 GB sequential READ from a DISK
300,000,000,000 ns - DISK 10 GB sequential READ from a DISK
3,000,000,000,000 ns - DISK 100 GB sequential READ from a DISK
______________________on_DISK_______________________________________________own_DISK______________________________
| | | | |
| | | | ns|
| | | us|
| | ms|
| s|
h|
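To turn these rules-of-thumb into a quick sanity check for the use case sketched below, here is a plain-Python back-of-envelope calculation ( the ~ 25 GB payload size is an assumption taken from the AS-WAS figure below; the throughput figures are read off the table above ) :

GB = 1e9                                  # bytes

payload       = 25 * GB                   # ~ size of the .CSV inputs ( assumption, per the figure below )
net_1Gbps_Bps =  2 * GB / 20              # ~ 20 s   per 2 GB over 1 Gbps NETWORK   ( table above )
hdd_seq_Bps   =  1 * GB / 30              # ~ 30 s   per 1 GB sequential DISK read  ( table above )
ram_seq_Bps   =  1 * GB / 0.25            # ~ 0.25 s per 1 GB sequential RAM read   ( table above )

print( "LAN/WAN transfer : ~ %5.0f s" % ( payload / net_1Gbps_Bps ) )   # ~ 250 s, paid per direction
print( "DISK sequential  : ~ %5.0f s" % ( payload / hdd_seq_Bps   ) )   # ~ 750 s, paid per ( re- )read
print( "RAM  sequential  : ~ %5.0f s" % ( payload / ram_seq_Bps   ) )   # ~   6 s

Each avoidable re-read or re-transfer of the ~ 25 GB payload thus costs minutes before any useful work even starts - and the AS-WAS figure below repeats exactly such passes about six times over.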
Given these elements, the end-to-end computing strategy may and shall be improved.
AS-WAS STATE
... where the crash prevented any computing at all
A naive figure says more than a thousand words
localhost |
: file-I/O ~ 25+GB SLOWEST/EXPENSIVE
: 1st time 25+GB file-I/O-s
: |
: | RAM |
: | |
+------+ |IOIOIOIOIOI|
|.CSV 0| |IOIOIOIOIOI|
|+------+ |IOIOIOIOIOI|
||.CSV 1| |IOIOIOIOIOI|
||+------+ |IOIOIOIOIOI|-> local ssh()-encrypt+encapsulate-process
|||.CSV 2| |IOIOIOIOIOI| 25+GB of .CSV
+|| | |IOIOIOIOIOI|~~~~~~~|
|| | |IOIOIOIOIOI|~~~~~~~|
+| | | |~~~~~~~|
| | | |~~~~~~~|
+------+ | |~~~~~~~|
... | |~~~~~~~|-> LAN SLOW
... | | WAN SLOWER
... | | transfer of 30+GB to "HPC" ( ssh()-decryption & file-I/O storage-costs omitted for clarity )
+------+ | | | 30+GB file-I/O ~ 25+GB SLOWEST/EXPENSIVE
|.CSV 9| | |~~~~~~~~~~~~~~~| 2nd time 25+GB file-I/O-s
|+------+ | |~~~~~~~~~~~~~~~|
||.CSV 9| | |~~~~~~~~~~~~~~~|
||+------+ | |~~~~~~~~~~~~~~~|
|||.CSV 9| | |~~~~~~~~~~~~~~~|
+|| 9| | |~~~~~~~~~~~~~~~|
|| 9| | |~~~~~~~~~~~~~~~|
+| | | |~~~~~~~~~~~~~~~|
| | | |~~~~~~~~~~~~~~~|-> file-I/O into python
+------+ | | all .CSV file to RAM ~ 25+GB SLOWEST/EXPENSIVE
| |***| 3rd time 25+GB file-I/O-s
| | RAM .CSV to df CPU work
| |***| df to LIST new RAM-allocation + list.append( df )-costs
| |***| + 25+GB
| |***|
many hours | |***|
[SERIAL] flow ...| |***|
/\/\/\/\/\/\/\/\/\/|\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\|***|/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/
| crashed |***|
| on about |***|
| 360-th file |***|
| |***|->RAM ~ 50~GB with a LIST of all 25+GB dataframes held in LIST
| | CPU +mem-I/O costs LIST to new 25+GB dataframe RAM-allocation & DATA-processing
| |~~~~~| mem-I/O RAM| :: GB
| |~~~~~| mem-I/O |RAM flow of ~ 50+GB over only 2/3/? mem-I/O HW-channels
| |~~~~~| only if "HPC"
| |~~~~~| is *NOT* a "shared"-rental of cloud HW,
| |~~~~~| remarketed as an "HPC"-illusion
| |~~~~~|
| :::::::::::::?
| :::::::::::::?
| :::::::::::::?
| <...some amount of some useful work --"HPC"-processing the ~ 25+GB dataframe...>
| <...some amount of some useful work ...>
| <...some amount of some useful work the more ...>
| <...some amount of some useful work the better ...>
| <...some amount of some useful work as ...>
| <...some amount of some useful work it ...>
| <...some amount of some useful work dissolves to AWFULLY ...>
| <...some amount of some useful work HIGH ...>
| <...some amount of some useful work SETUP COSTS ...>
| <...some amount of some useful work ...>
| <...some amount of some useful work --"HPC"-processing the ~ 25+GB dataframe...>
| :::::::::::::?
| :::::::::::::?
| :::::::::::::?
| |-> file-I/O ~ 25+GB SLOWEST/EXPENSIVE
| |~~~~~| 4th time 25+GB file-I/O-s
| |~~~~~|
| |~~~~~|->file left on remote storage (?)
| |
| O?R
| |
| |-> file-I/O ~ 25+GB SLOWEST/EXPENSIVE
| |~~~~~| 5th time 25+GB file-I/O-s
| |~~~~~|
| |~~~~~|
| |~~~~~|
| |~~~~~|-> RAM / CPU ssh()-encrypt+encapsulate-process
| |????????????| 25+GB of results for repatriation
| |????????????| on localhost
| |????????????|
| |????????????|
| |????????????|-> LAN SLOW
| | WAN SLOWER
| | transfer of 30+GB from "HPC" ( ssh()-decryption & file-I/O storage-costs omitted for clarity )
| | | 30+GB file-I/O ~ 25+GB SLOWEST/EXPENSIVE
| |~~~~~~~~~~~~~~~| 6th time 25+GB file-I/O-s
| |~~~~~~~~~~~~~~~|
| |~~~~~~~~~~~~~~~|
| |~~~~~~~~~~~~~~~|
| |~~~~~~~~~~~~~~~|
SUCCESS ? | |~~~~~~~~~~~~~~~|-> file transferred back and stored on localhost storage
after |
how many |
failed |
attempts |
having |
how high |
recurring|
costs |
for any |
next |
model|
recompute|
step(s) |
|
|
All |
that |
( at what overall |
[TIME]-domain |
& "HPC"-rental |
costs ) |
Tips :
- review and reduce, where possible, expensive data-item representations ( avoid using int64 where 8 bits are enough; packed bitmaps can help a lot )
- precompute on localhost all items that can be precomputed ( avoiding repetitive steps )
- join such "reduced" CSV-files, using a trivial O/S command, into a single input
- compress all data before any transport ( typically an order of magnitude or more saved on text-like CSV data )
- prefer to formulate your computing with algorithms that can stream-process items along the data-flow, i.e. not waiting to first load everything in-RAM before computing an average or similar trivially on-the-fly stream-computable values ( like
.mean(), .sum(), .min(), .max() or even .rsi(), .EMA(), .TEMA(), .BollingerBands() ... and many more alike ) - stream-computing formulations reduce the RAM-allocations, can be & shall be pre-computed ( once ) & minimise the [SERIAL]-one-after-another processing pipeline latency ( see the sketch right after this list )
- if indeed in need of pandas while fighting the overall physical-RAM ceiling, try smart
numpy-tools instead, where all the array syntax & methods remain the same, yet they can, by design, work without moving all data at once from disk into physical RAM ( using this has been my life-saving trick ever since, the more so when running many-model simulations & HyperParameterSPACE optimisations on a few tens of GB of data on 32-bit hardware )
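A minimal sketch of the first few tips combined ( the file name, column names and dtypes below are illustrative assumptions, not your real schema ) : read the single, O/S-joined CSV in chunks with reduced dtypes and update stream-computable aggregates on the fly, so the RAM footprint stays at ~ one chunk instead of 25+ GB :

import numpy as np
import pandas as pd

CSV_PATH = "joined_input.csv"                    # the single input made by the O/S-level join ( illustrative name )
USE_COLS = [ "id", "value" ]                     # read only the columns really needed       ( illustrative names )
DTYPES   = { "id":    np.int32,                  # avoid int64 / float64 where narrower types are enough
             "value": np.float32 }

count = 0
total = 0.0
v_min =  np.inf
v_max = -np.inf

for chunk in pd.read_csv( CSV_PATH,              # stream-process chunk-by-chunk, never all-at-once in RAM
                          usecols   = USE_COLS,
                          dtype     = DTYPES,
                          chunksize = 1_000_000 ):
    count += len( chunk )
    total += float( chunk["value"].sum() )
    v_min  = min( v_min, chunk["value"].min() )
    v_max  = max( v_max, chunk["value"].max() )

print( "mean = %f  min = %f  max = %f  over %d rows" % ( total / count, v_min, v_max, count ) )

The same chunked loop extends to EMA-like indicators, as each needs only a small carried-over state, never the whole history in RAM, and pd.read_csv also reads a gzip-compressed input directly ( compression is inferred from a .gz suffix ), which plays well with the compress-before-transport tip above.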
For more details on the numpy-tools direction just mentioned - RAM-protecting, memory-mapped np.ndarray processing, with all the smart numpy-vectorised and other high-performance-tuned tricks - read more in this ( a minimal usage sketch follows the docstring excerpt below ) :
>>> print( np.memmap.__doc__ )
Create a memory-map to an array stored in a *binary* file on disk.
Memory-mapped files are used for accessing small segments of large files
on disk, without reading the entire file into memory. NumPy's
memmap's are array-like objects. (...)
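A minimal usage sketch, under the assumption that the ( "reduced" ) data was first converted, once, into a flat binary file of a known dtype & shape ( the file name and the shape below are illustrative only ) :

import numpy as np

# a one-off conversion of the CSV data into data_f32.bin ( float32, row-major ) is assumed to exist already

N_ROWS, N_COLS = 600_000_000, 10                 # illustrative shape, must match the stored binary ( ~ 24 GB )

mm = np.memmap( "data_f32.bin",                  # illustrative file name
                dtype = np.float32,
                mode  = "r",                     # read-only mapping, nothing gets loaded up front
                shape = ( N_ROWS, N_COLS ) )

col_mean = mm[:, 3].mean()                       # reductions run over the mapping, never allocating
window   = mm[1_000_000:2_000_000, :]            # a full in-RAM copy; slices behave like ordinary ndarrays

print( col_mean, window.std() )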
Last but not least, note that pd.concat allocates additional space so as to copy its output, so you need roughly twice the space of the data being concatenated. Also, computing machines like a common PC have swap memory - an additional amount of virtual memory stored on ( much, much slower ) storage devices; when there is not enough physical RAM, the OS becomes very slow and may start killing processes ( the critical case ). Calling del df right after the loop that builds it helps reduce the memory footprint, yet that loop stays slow anyway, because of the file-storage latencies and the CSV parsing.
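If a single big dataframe is nevertheless unavoidable, a minimal sketch of that minimum-damage pattern ( the file mask and the all-numeric float32 dtype are placeholder assumptions ) : build the per-file pieces, concatenate exactly once and release the pieces immediately afterwards :

import gc
import glob
import pandas as pd

pieces = [ pd.read_csv( fname, dtype = "float32" )        # reduced dtypes, as per the tips above
           for fname in sorted( glob.glob( "reduced_*.csv" ) ) ]

big = pd.concat( pieces, ignore_index = True )            # pd.concat builds its own copy of all the data,
del pieces                                                # so drop the per-file pieces right afterwards
gc.collect()                                              # ( the ~ 2x peak during the concat itself remains )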