
I recently switched my data tooling from xarray to Polars, and I use pl.DataFrame.to_torch() to generate tensors for training my PyTorch model. The data source is Parquet files.

To avoid forking child processes, I use torch.multiprocessing.spawn to start my training processes; however, the process crashed with this:

/home/username/.conda/envs/torchhydro1/bin/python3.11 -X pycache_prefix=/home/username/.cache/JetBrains/IntelliJIdea2024.3/cpython-cache /home/username/.local/share/JetBrains/IntelliJIdea2024.3/python-ce/helpers/pydev/pydevd.py --multiprocess --qt-support=auto --port 29781 --file /home/username/torchhydro/experiments/train_with_era5land_gnn_ddp.py 
Console output is saving to: /home/username/torchhydro/experiments/results/train_gnn_ddp.txt
[20:38:51] DEBUG    No module named 'forge'                     signatures.py:43
           DEBUG    No module named 'forge'                     signatures.py:43
[20:38:52] DEBUG    Using selector: EpollSelector          selector_events.py:54
           ……
           DEBUG    Using fontManager instance from         font_manager.py:1580
                    /home/username/.cache/matplotlib/fontl                     
                    ist-v390.json                                               
update config file
!!!!!!NOTE!!!!!!!!
-------Please make sure the PRECIPITATION variable is in the 1st location in var_t setting!!---------
If you have POTENTIAL_EVAPOTRANSPIRATION, please set it the 2nd!!!-
!!!!!!NOTE!!!!!!!!
-------Please make sure the STREAMFLOW variable is in the 1st location in var_out setting!!---------
[20:39:04] DEBUG    No module named 'forge'                     signatures.py:43
           DEBUG    No module named 'forge'                     signatures.py:43
[20:39:06] DEBUG    Using selector: EpollSelector          selector_events.py:54
           ……
           DEBUG    Using fontManager instance from         font_manager.py:1580
                    /home/username/.cache/matplotlib/fontl                     
                    ist-v390.json                                               
……
Torch is using cuda:0
[2024-12-12 20:48:08,931] torch.distributed.distributed_c10d: [INFO] Using backend config: {'cuda': 'nccl'}
[W CUDAAllocatorConfig.h:30] Warning: expandable_segments not supported on this platform (function operator())
using 8 workers
Pin memory set to True
  0%|          | 0/22986 [00:00<?, ?it/s]
[20:48:40] DEBUG    No module named 'forge'                     signatures.py:43
           DEBUG    No module named 'forge'                     signatures.py:43
[20:48:41] DEBUG    Using selector: EpollSelector          selector_events.py:54
           ……
           DEBUG    Using fontManager instance from         font_manager.py:1580
                    /home/username/.cache/matplotlib/fontl                     
                    ist-v390.json                                               
[20:49:28] DEBUG    No module named 'forge'                     signatures.py:43
           DEBUG    No module named 'forge'                     signatures.py:43
[20:49:29] DEBUG    Using selector: EpollSelector          selector_events.py:54
           ……
           DEBUG    Using fontManager instance from         font_manager.py:1580
                    /home/username/.cache/matplotlib/fontl                     
                    ist-v390.json                                               
[20:50:19] DEBUG    No module named 'forge'                     signatures.py:43
           DEBUG    No module named 'forge'                     signatures.py:43
[20:50:20] DEBUG    Using selector: EpollSelector          selector_events.py:54
           ……
           DEBUG    Using fontManager instance from         font_manager.py:1580
                    /home/username/.cache/matplotlib/fontl                     
                    ist-v390.json                                               
[20:51:12] DEBUG    No module named 'forge'                     signatures.py:43
           DEBUG    No module named 'forge'                     signatures.py:43
[20:51:13] DEBUG    Using selector: EpollSelector          selector_events.py:54
           ……
           DEBUG    Using fontManager instance from         font_manager.py:1580
                    /home/username/.cache/matplotlib/fontl                     
                    ist-v390.json                                               
[20:52:07] DEBUG    No module named 'forge'                     signatures.py:43
           DEBUG    No module named 'forge'                     signatures.py:43
[20:52:09] DEBUG    Using selector: EpollSelector          selector_events.py:54
           ……
[20:52:13] DEBUG    CACHEDIR=/home/username/.cache/matplotlib   __init__.py:341
           DEBUG    Using fontManager instance from         font_manager.py:1580
                    /home/username/.cache/matplotlib/fontl                     
                    ist-v390.json                                               
[20:53:11] DEBUG    No module named 'forge'                     signatures.py:43
           DEBUG    No module named 'forge'                     signatures.py:43
[20:53:12] DEBUG    Using selector: EpollSelector          selector_events.py:54
           ……
           DEBUG    Using fontManager instance from         font_manager.py:1580
                    /home/username/.cache/matplotlib/fontl                     
                    ist-v390.json                                               
[20:55:12] DEBUG    No module named 'forge'                     signatures.py:43
           DEBUG    No module named 'forge'                     signatures.py:43
[20:55:14] DEBUG    Using selector: EpollSelector          selector_events.py:54
           ……
[20:55:19] DEBUG    CACHEDIR=/home/username/.cache/matplotlib   __init__.py:341
           DEBUG    Using fontManager instance from         font_manager.py:1580
                    /home/username/.cache/matplotlib/fontl                     
                    ist-v390.json                                               
Traceback (most recent call last):
  File "/home/username/.local/share/JetBrains/IntelliJIdea2024.3/python-ce/helpers/pydev/pydevd.py", line 1570, in _exec
    pydev_imports.execfile(file, globals, locals)  # execute the script
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/username/.local/share/JetBrains/IntelliJIdea2024.3/python-ce/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "/home/username/torchhydro/experiments/train_with_era5land_gnn_ddp.py", line 171, in <module>
    test_run_model()
  File "/home/username/torchhydro/experiments/train_with_era5land_gnn_ddp.py", line 56, in test_run_model
    mp.spawn(gnn_train_worker, args=(world_size, config_data, None), nprocs=world_size, join=True)
  File "/home/username/.conda/envs/torchhydro1/lib/python3.11/site-packages/torch/multiprocessing/spawn.py", line 241, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/username/.conda/envs/torchhydro1/lib/python3.11/site-packages/torch/multiprocessing/spawn.py", line 197, in start_processes
    while not context.join():
              ^^^^^^^^^^^^^^
  File "/home/username/.conda/envs/torchhydro1/lib/python3.11/site-packages/torch/multiprocessing/spawn.py", line 140, in join
    raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with signal SIGKILL
python-BaseException
Traceback (most recent call last):
  File "/home/username/.conda/envs/torchhydro1/lib/python3.11/multiprocessing/spawn.py", line 132, in _main
    self = reduction.pickle.load(from_parent)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
_pickle.UnpicklingError: pickle data was truncated
python-BaseException
/home/username/.conda/envs/torchhydro1/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 30 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

Now I have two problems:

First, why does the _pickle.UnpicklingError appear?

Second, after the line 0%|          | 0/22986 [00:00<?, ?it/s], there are 7 ……s in my process log, which means this DEBUG sequence was repeated 8 or 9 times! I set num_workers of the PyTorch DataLoader to 8; is this problem connected with num_workers?

This problem started after I switched to Polars, so I suspect it comes from Polars, or from some bad interaction between Polars threads and PyTorch.

But how can I find out why the UnpicklingError occurs and solve it? Hoping for your reply.

1 Answer

It was a mistake to filter the polars.DataFrame and convert the result to a torch.Tensor inside __getitem__ of the torch Dataset. Converting the whole DataFrame to a tensor up front solved the problem.
