Q : "How can I ... Any help would be much appreciated!"
A :
Best follow the laws of the ECONOMY-of-COMPUTING :
Your briefly sketched problem has, without question, immense "setup"-costs, with an unspecified amount of useful work to be computed on an unspecified HPC-ecosystem.
Even without the hardware & rental details ( the devil is always hidden in the detail(s) & one can easily pay ridiculous amounts of money for trying to make a ( covertly ) "shared" platform deliver any improved computing performance - many startups have experienced this on voucher-sponsored promises, the more so if the overall computing strategy was poorly designed ),
I cannot resist quoting the so-called 1st Etore's Law of the Evolution of Systems' Dynamics :
If we open a can of worms,
the only way how to put them back
is to use a bigger can
Closing our eyes so as not to see the accumulating inefficiencies is the worst sin of sins, as a devastatingly exponential growth of all costs, time & resources, incl. energy consumption, is common to meet on such complex systems, often stacked with many levels of inefficiencies.
ELEMENTARY RULES-of-THUMB ... how much we pay in [TIME]
Sorry if these were already known to you beforehand; I am just trying to build some common ground, as a platform on which to lay further argumentation rock-solid. More details are here, and this is only a needed beginning, as more problems will definitely come from any real-world O( Mx * Ny * ... )-scaling related issues in further modelling.
0.1 ns - CPU NOP - a DO-NOTHING instruction
0.5 ns - CPU L1 dCACHE reference (1st introduced in the late 80s )
1 ns - speed-of-light (a photon) travel a 1 ft (30.5cm) distance -- will stay, throughout any foreseeable future :o)
3~4 ns - CPU L2 CACHE reference (2020/Q1)
7 ns - CPU L2 CACHE reference
19 ns - CPU L3 CACHE reference (2020/Q1 considered slow on 28c Skylake)
______________________on_CPU______________________________________________________________________________________
71 ns - CPU cross-QPI/NUMA best case on XEON E5-46*
100 ns - own DDR MEMORY reference
135 ns - CPU cross-QPI/NUMA best case on XEON E7-*
325 ns - CPU cross-QPI/NUMA worst case on XEON E5-46*
2,500 ns - Read 10 kB sequentially from MEMORY------ HPC-node
25,000 ns - Read 100 kB sequentially from MEMORY------ HPC-node
250,000 ns - Read 1 MB sequentially from MEMORY------ HPC-node
2,500,000 ns - Read 10 MB sequentially from MEMORY------ HPC-node
25,000,000 ns - Read 100 MB sequentially from MEMORY------ HPC-node (abstracted from shared physical RAM-I/O-channels)
250,000,000 ns - Read 1 GB sequentially from MEMORY------ HPC-node (abstracted from shared physical RAM-I/O-channels)
2,500,000,000 ns - Read 10 GB sequentially from MEMORY------ HPC-node (abstracted from shared physical RAM-I/O-channels)
25,000,000,000 ns - Read 100 GB sequentially from MEMORY------ HPC-node (abstracted from shared physical RAM-I/O-channels)
_____________________________________________________________________________own_CPU/DDR__________________________
| | | | |
| | | | ns|
| | | us|
| | ms|
| s|
h|
500,000 ns - Round trip within the same DataCenter ------- HPC-node / HPC-storage latency on each access
20,000,000 ns - Send 2 MB over 1 Gbps NETWORK
200,000,000 ns - Send 20 MB over 1 Gbps NETWORK
2,000,000,000 ns - Send 200 MB over 1 Gbps NETWORK
20,000,000,000 ns - Send 2 GB over 1 Gbps NETWORK
200,000,000,000 ns - Send 20 GB over 1 Gbps NETWORK
2,000,000,000,000 ns - Send 200 GB over 1 Gbps NETWORK
____________________________________________________________________________via_LAN_______________________________
150,000,000 ns - Send a NETWORK packet CA -> Netherlands
____________________________________________________________________________via_WAN_______________________________
10,000,000 ns - DISK seek latency spent to start file-I/O on spinning disks, paid on any next piece of data to seek/read
30,000,000 ns - DISK 1 MB sequential READ from a DISK
300,000,000 ns - DISK 10 MB sequential READ from a DISK
3,000,000,000 ns - DISK 100 MB sequential READ from a DISK
30,000,000,000 ns - DISK 1 GB sequential READ from a DISK
300,000,000,000 ns - DISK 10 GB sequential READ from a DISK
3,000,000,000,000 ns - DISK 100 GB sequential READ from a DISK
______________________on_DISK_______________________________________________own_DISK______________________________
| | | | |
| | | | ns|
| | | us|
| | ms|
| s|
h|
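To turn these rules-of-thumb into a quick sanity check for the use case sketched below, here is a plain-Python back-of-envelope calculation ( the ~ 25 GB payload size is an assumption taken from the AS-WAS figure below; the throughput figures are read off the table above ) :

GB = 1e9                                  # bytes

payload       = 25 * GB                   # ~ size of the .CSV inputs ( assumption, per the figure below )
net_1Gbps_Bps =  2 * GB / 20              # ~ 20 s   per 2 GB over 1 Gbps NETWORK   ( table above )
hdd_seq_Bps   =  1 * GB / 30              # ~ 30 s   per 1 GB sequential DISK read  ( table above )
ram_seq_Bps   =  1 * GB / 0.25            # ~ 0.25 s per 1 GB sequential RAM read   ( table above )

print( "LAN/WAN transfer : ~ %5.0f s" % ( payload / net_1Gbps_Bps ) )   # ~ 250 s, paid per direction
print( "DISK sequential  : ~ %5.0f s" % ( payload / hdd_seq_Bps   ) )   # ~ 750 s, paid per ( re- )read
print( "RAM  sequential  : ~ %5.0f s" % ( payload / ram_seq_Bps   ) )   # ~   6 s

Each avoidable re-read or re-transfer of the ~ 25 GB payload thus costs minutes before any useful work even starts - and the AS-WAS figure below repeats exactly such passes about six times over.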
Given these elements, the end-to-end computing strategy may and shall be improved.
AS-WAS STATE
... where the crash prevented any computing at all
A naive figure says more than a thousand words
localhost |
: file-I/O ~ 25+GB SLOWEST/EXPENSIVE
: 1st time 25+GB file-I/O-s
: |
: | RAM |
: | |
+------+ |IOIOIOIOIOI|
|.CSV 0| |IOIOIOIOIOI|
|+------+ |IOIOIOIOIOI|
||.CSV 1| |IOIOIOIOIOI|
||+------+ |IOIOIOIOIOI|-> local ssh()-encrypt+encapsulate-process
|||.CSV 2| |IOIOIOIOIOI| 25+GB of .CSV
+|| | |IOIOIOIOIOI|~~~~~~~|
|| | |IOIOIOIOIOI|~~~~~~~|
+| | | |~~~~~~~|
| | | |~~~~~~~|
+------+ | |~~~~~~~|
... | |~~~~~~~|-> LAN SLOW
... | | WAN SLOWER
... | | transfer of 30+GB to "HPC" ( ssh()-decryption & file-I/O storage-costs omitted for clarity )
+------+ | | | 30+GB file-I/O ~ 25+GB SLOWEST/EXPENSIVE
|.CSV 9| | |~~~~~~~~~~~~~~~| 2nd time 25+GB file-I/O-s
|+------+ | |~~~~~~~~~~~~~~~|
||.CSV 9| | |~~~~~~~~~~~~~~~|
||+------+ | |~~~~~~~~~~~~~~~|
|||.CSV 9| | |~~~~~~~~~~~~~~~|
+|| 9| | |~~~~~~~~~~~~~~~|
|| 9| | |~~~~~~~~~~~~~~~|
+| | | |~~~~~~~~~~~~~~~|
| | | |~~~~~~~~~~~~~~~|-> file-I/O into python
+------+ | | all .CSV file to RAM ~ 25+GB SLOWEST/EXPENSIVE
| |***| 3rd time 25+GB file-I/O-s
| | RAM .CSV to df CPU work
| |***| df to LIST new RAM-allocation + list.append( df )-costs
| |***| + 25+GB
| |***|
many hours | |***|
[SERIAL] flow ...| |***|
/\/\/\/\/\/\/\/\/\/|\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\|***|/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/
| crashed |***|
| on about |***|
| 360-th file |***|
| |***|->RAM ~ 50~GB with a LIST of all 25+GB dataframes held in LIST
| | CPU +mem-I/O costs LIST to new 25+GB dataframe RAM-allocation & DATA-processing
| |~~~~~| mem-I/O RAM| :: GB
| |~~~~~| mem-I/O |RAM flow of ~ 50+GB over only 2/3/? mem-I/O HW-channels
| |~~~~~| only if "HPC"
| |~~~~~| is *NOT* a "shared"-rental of cloud HW,
| |~~~~~| remarketed as an "HPC"-illusion
| |~~~~~|
| :::::::::::::?
| :::::::::::::?
| :::::::::::::?
| <...some amount of some useful work --"HPC"-processing the ~ 25+GB dataframe...>
| <...some amount of some useful work ...>
| <...some amount of some useful work the more ...>
| <...some amount of some useful work the better ...>
| <...some amount of some useful work as ...>
| <...some amount of some useful work it ...>
| <...some amount of some useful work dissolves to AWFULLY ...>
| <...some amount of some useful work HIGH ...>
| <...some amount of some useful work SETUP COSTS ...>
| <...some amount of some useful work ...>
| <...some amount of some useful work --"HPC"-processing the ~ 25+GB dataframe...>
| :::::::::::::?
| :::::::::::::?
| :::::::::::::?
| |-> file-I/O ~ 25+GB SLOWEST/EXPENSIVE
| |~~~~~| 4th time 25+GB file-I/O-s
| |~~~~~|
| |~~~~~|->file left on remote storage (?)
| |
| O?R
| |
| |-> file-I/O ~ 25+GB SLOWEST/EXPENSIVE
| |~~~~~| 5th time 25+GB file-I/O-s
| |~~~~~|
| |~~~~~|
| |~~~~~|
| |~~~~~|-> RAM / CPU ssh()-encrypt+encapsulate-process
| |????????????| 25+GB of results for repatriation
| |????????????| on localhost
| |????????????|
| |????????????|
| |????????????|-> LAN SLOW
| | WAN SLOWER
| | transfer of 30+GB from "HPC" ( ssh()-decryption & file-I/O storage-costs omitted for clarity )
| | | 30+GB file-I/O ~ 25+GB SLOWEST/EXPENSIVE
| |~~~~~~~~~~~~~~~| 6th time 25+GB file-I/O-s
| |~~~~~~~~~~~~~~~|
| |~~~~~~~~~~~~~~~|
| |~~~~~~~~~~~~~~~|
| |~~~~~~~~~~~~~~~|
SUCCESS ? | |~~~~~~~~~~~~~~~|-> file transferred back and stored on localhost storage
after |
how many |
failed |
attempts |
having |
how high |
recurring|
costs |
for any |
next |
model|
recompute|
step(s) |
|
|
All |
that |
( at what overall |
[TIME]-domain |
& "HPC"-rental |
costs ) |
Tips :
- review and reduce, where possible, expensive data-item representations ( avoid using int64 where 8 bits are enough; packed bitmaps can help a lot )
- precompute on localhost all items that can be precomputed ( avoiding repetitive steps )
- join such "reduced" CSV-files, using a trivial O/S command, into a single input
- compress all data before any transport ( typically an order of magnitude or more saved on text-like CSV data )
- prefer to formulate your computing with algorithms that can stream-process items along the data-flow, i.e. not waiting to first load everything in-RAM before computing an average or similar trivially on-the-fly stream-computable values ( like
.mean(), .sum(), .min(), .max() or even .rsi(), .EMA(), .TEMA(), .BollingerBands() ... and many more alike ) - stream-computing formulations reduce the RAM-allocations, can be & shall be pre-computed ( once ) & minimise the [SERIAL]-one-after-another processing pipeline latency ( see the sketch right after this list )
- if indeed in need of pandas while fighting the overall physical-RAM ceiling, try smart
numpy-tools instead, where all the array syntax & methods remain the same, yet they can, by design, work without moving all data at once from disk into physical RAM ( using this has been my life-saving trick ever since, the more so when running many-model simulations & HyperParameterSPACE optimisations on a few tens of GB of data on 32-bit hardware )
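A minimal sketch of the first few tips combined ( the file name, column names and dtypes below are illustrative assumptions, not your real schema ) : read the single, O/S-joined CSV in chunks with reduced dtypes and update stream-computable aggregates on the fly, so the RAM footprint stays at ~ one chunk instead of 25+ GB :

import numpy as np
import pandas as pd

CSV_PATH = "joined_input.csv"                    # the single input made by the O/S-level join ( illustrative name )
USE_COLS = [ "id", "value" ]                     # read only the columns really needed       ( illustrative names )
DTYPES   = { "id":    np.int32,                  # avoid int64 / float64 where narrower types are enough
             "value": np.float32 }

count = 0
total = 0.0
v_min =  np.inf
v_max = -np.inf

for chunk in pd.read_csv( CSV_PATH,              # stream-process chunk-by-chunk, never all-at-once in RAM
                          usecols   = USE_COLS,
                          dtype     = DTYPES,
                          chunksize = 1_000_000 ):
    count += len( chunk )
    total += float( chunk["value"].sum() )
    v_min  = min( v_min, chunk["value"].min() )
    v_max  = max( v_max, chunk["value"].max() )

print( "mean = %f  min = %f  max = %f  over %d rows" % ( total / count, v_min, v_max, count ) )

The same chunked loop extends to EMA-like indicators, as each needs only a small carried-over state, never the whole history in RAM, and pd.read_csv also reads a gzip-compressed input directly ( compression is inferred from a .gz suffix ), which plays well with the compress-before-transport tip above.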
For more details on the numpy-tools direction just mentioned - RAM-protecting, memory-mapped np.ndarray processing, with all the smart numpy-vectorised and other high-performance-tuned tricks - read more in this ( a minimal usage sketch follows the docstring excerpt below ) :
>>> print( np.memmap.__doc__ )
Create a memory-map to an array stored in a *binary* file on disk.
Memory-mapped files are used for accessing small segments of large files
on disk, without reading the entire file into memory. NumPy's
memmap's are array-like objects. (...)
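A minimal usage sketch, under the assumption that the ( "reduced" ) data was first converted, once, into a flat binary file of a known dtype & shape ( the file name and the shape below are illustrative only ) :

import numpy as np

# a one-off conversion of the CSV data into data_f32.bin ( float32, row-major ) is assumed to exist already

N_ROWS, N_COLS = 600_000_000, 10                 # illustrative shape, must match the stored binary ( ~ 24 GB )

mm = np.memmap( "data_f32.bin",                  # illustrative file name
                dtype = np.float32,
                mode  = "r",                     # read-only mapping, nothing gets loaded up front
                shape = ( N_ROWS, N_COLS ) )

col_mean = mm[:, 3].mean()                       # reductions run over the mapping, never allocating
window   = mm[1_000_000:2_000_000, :]            # a full in-RAM copy; slices behave like ordinary ndarrays

print( col_mean, window.std() )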
Last but not least, note that pd.concat allocates additional space so as to copy its output, so you need roughly twice the space of the data being concatenated. Also, computing machines like a common PC have swap memory - an additional amount of virtual memory stored on ( much, much slower ) storage devices; when there is not enough physical RAM, the OS becomes very slow and may start killing processes ( the critical case ). Calling del df right after the loop that builds it helps reduce the memory footprint, yet that loop stays slow anyway, because of the file-storage latencies and the CSV parsing.
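If a single big dataframe is nevertheless unavoidable, a minimal sketch of that minimum-damage pattern ( the file mask and the all-numeric float32 dtype are placeholder assumptions ) : build the per-file pieces, concatenate exactly once and release the pieces immediately afterwards :

import gc
import glob
import pandas as pd

pieces = [ pd.read_csv( fname, dtype = "float32" )        # reduced dtypes, as per the tips above
           for fname in sorted( glob.glob( "reduced_*.csv" ) ) ]

big = pd.concat( pieces, ignore_index = True )            # pd.concat builds its own copy of all the data,
del pieces                                                # so drop the per-file pieces right afterwards
gc.collect()                                              # ( the ~ 2x peak during the concat itself remains )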