I recently switched my data tooling from xarray to polars, and I now use pl.DataFrame.to_torch() to generate the tensors for training my PyTorch model. The data source is Parquet files.
To avoid forking child processes, I use torch.multiprocessing.spawn to start my training processes; however, the run crashed with this:
/home/username/.conda/envs/torchhydro1/bin/python3.11 -X pycache_prefix=/home/username/.cache/JetBrains/IntelliJIdea2024.3/cpython-cache /home/username/.local/share/JetBrains/IntelliJIdea2024.3/python-ce/helpers/pydev/pydevd.py --multiprocess --qt-support=auto --port 29781 --file /home/username/torchhydro/experiments/train_with_era5land_gnn_ddp.py
Console output is saving to: /home/username/torchhydro/experiments/results/train_gnn_ddp.txt
[20:38:51] DEBUG No module named 'forge' signatures.py:43
DEBUG No module named 'forge' signatures.py:43
[20:38:52] DEBUG Using selector: EpollSelector selector_events.py:54
……
DEBUG Using fontManager instance from font_manager.py:1580
/home/username/.cache/matplotlib/fontlist-v390.json
update config file
!!!!!!NOTE!!!!!!!!
-------Please make sure the PRECIPITATION variable is in the 1st location in var_t setting!!---------
If you have POTENTIAL_EVAPOTRANSPIRATION, please set it the 2nd!!!-
!!!!!!NOTE!!!!!!!!
-------Please make sure the STREAMFLOW variable is in the 1st location in var_out setting!!---------
[20:39:04] DEBUG No module named 'forge' signatures.py:43
DEBUG No module named 'forge' signatures.py:43
[20:39:06] DEBUG Using selector: EpollSelector selector_events.py:54
……
DEBUG Using fontManager instance from font_manager.py:1580
/home/username/.cache/matplotlib/fontlist-v390.json
……
Torch is using cuda:0
[2024-12-12 20:48:08,931] torch.distributed.distributed_c10d: [INFO] Using backend config: {'cuda': 'nccl'}
[W CUDAAllocatorConfig.h:30] Warning: expandable_segments not supported on this platform (function operator())
using 8 workers
Pin memory set to True
0%| | 0/22986 [00:00<?, ?it/s]
[20:48:40] DEBUG No module named 'forge' signatures.py:43
DEBUG No module named 'forge' signatures.py:43
[20:48:41] DEBUG Using selector: EpollSelector selector_events.py:54
……
DEBUG Using fontManager instance from font_manager.py:1580
/home/username/.cache/matplotlib/fontlist-v390.json
[20:49:28] DEBUG No module named 'forge' signatures.py:43
DEBUG No module named 'forge' signatures.py:43
[20:49:29] DEBUG Using selector: EpollSelector selector_events.py:54
……
DEBUG Using fontManager instance from font_manager.py:1580
/home/username/.cache/matplotlib/fontlist-v390.json
[20:50:19] DEBUG No module named 'forge' signatures.py:43
DEBUG No module named 'forge' signatures.py:43
[20:50:20] DEBUG Using selector: EpollSelector selector_events.py:54
……
DEBUG Using fontManager instance from font_manager.py:1580
/home/username/.cache/matplotlib/fontlist-v390.json
[20:51:12] DEBUG No module named 'forge' signatures.py:43
DEBUG No module named 'forge' signatures.py:43
[20:51:13] DEBUG Using selector: EpollSelector selector_events.py:54
……
DEBUG Using fontManager instance from font_manager.py:1580
/home/username/.cache/matplotlib/fontlist-v390.json
[20:52:07] DEBUG No module named 'forge' signatures.py:43
DEBUG No module named 'forge' signatures.py:43
[20:52:09] DEBUG Using selector: EpollSelector selector_events.py:54
……
[20:52:13] DEBUG CACHEDIR=/home/username/.cache/matplotlib __init__.py:341
DEBUG Using fontManager instance from font_manager.py:1580
/home/username/.cache/matplotlib/fontlist-v390.json
[20:53:11] DEBUG No module named 'forge' signatures.py:43
DEBUG No module named 'forge' signatures.py:43
[20:53:12] DEBUG Using selector: EpollSelector selector_events.py:54
……
DEBUG Using fontManager instance from font_manager.py:1580
/home/username/.cache/matplotlib/fontlist-v390.json
[20:55:12] DEBUG No module named 'forge' signatures.py:43
DEBUG No module named 'forge' signatures.py:43
[20:55:14] DEBUG Using selector: EpollSelector selector_events.py:54
……
[20:55:19] DEBUG CACHEDIR=/home/username/.cache/matplotlib __init__.py:341
DEBUG Using fontManager instance from font_manager.py:1580
/home/username/.cache/matplotlib/fontlist-v390.json
Traceback (most recent call last):
File "/home/username/.local/share/JetBrains/IntelliJIdea2024.3/python-ce/helpers/pydev/pydevd.py", line 1570, in _exec
pydev_imports.execfile(file, globals, locals) # execute the script
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/username/.local/share/JetBrains/IntelliJIdea2024.3/python-ce/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "/home/username/torchhydro/experiments/train_with_era5land_gnn_ddp.py", line 171, in <module>
test_run_model()
File "/home/username/torchhydro/experiments/train_with_era5land_gnn_ddp.py", line 56, in test_run_model
mp.spawn(gnn_train_worker, args=(world_size, config_data, None), nprocs=world_size, join=True)
File "/home/username/.conda/envs/torchhydro1/lib/python3.11/site-packages/torch/multiprocessing/spawn.py", line 241, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/username/.conda/envs/torchhydro1/lib/python3.11/site-packages/torch/multiprocessing/spawn.py", line 197, in start_processes
while not context.join():
^^^^^^^^^^^^^^
File "/home/username/.conda/envs/torchhydro1/lib/python3.11/site-packages/torch/multiprocessing/spawn.py", line 140, in join
raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with signal SIGKILL
python-BaseException
Traceback (most recent call last):
File "/home/username/.conda/envs/torchhydro1/lib/python3.11/multiprocessing/spawn.py", line 132, in _main
self = reduction.pickle.load(from_parent)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
_pickle.UnpicklingError: pickle data was truncated
python-BaseException
/home/username/.conda/envs/torchhydro1/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 30 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
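For context, the spawn call that the traceback points at (line 56 of my script) looks roughly like this. The worker body is elided and the contents of config_data are a placeholder:

```python
import torch.multiprocessing as mp

# Reconstructed from the traceback above; the worker body and the exact
# contents of `config_data` are placeholders, not the real code.
def gnn_train_worker(rank, world_size, config_data, checkpoint):
    # mp.spawn passes the process rank first, followed by everything in `args`
    pass

def launch(world_size=2, config_data=None):
    # With the "spawn" start method, everything in `args` is pickled and
    # sent to each child process.
    mp.spawn(gnn_train_worker, args=(world_size, config_data, None),
             nprocs=world_size, join=True)

if __name__ == "__main__":
    launch()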
Now I have two problems:
First, why does the _pickle.UnpicklingError appear?
Second, after the progress bar printed 0%| | 0/22986 [00:00<?, ?it/s], there are 7 more "……"s in my process log, which means this DEBUG block was repeated 8 or 9 times. I have set num_workers of the PyTorch DataLoader to 8; is this related to num_workers?
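For reference, the DataLoader is configured roughly like this. TensorDataset here stands in for my real parquet-backed dataset; the settings match the log above ("using 8 workers", "Pin memory set to True"):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# TensorDataset is a stand-in for the real dataset; the loader settings
# mirror the log output above.
def make_loader():
    dataset = TensorDataset(torch.arange(100.0).reshape(50, 2))
    return DataLoader(dataset, batch_size=10, num_workers=8, pin_memory=True)

if __name__ == "__main__":
    # Iterating starts the 8 worker processes.
    batches = [xb for (xb,) in make_loader()]
```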
This problem only started after I switched to polars, so I suspect it comes from polars itself, or from some bad interaction between the threads of polars and PyTorch.
But how can I find out why the UnpicklingError occurs, and how can I fix it? I hope for your reply.