Hello, I want to try training with 3 GPUs. Could you help me set up the repository for this?
I naively changed the trainer's devices to 3, but it threw errors.
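For reference, the change looks roughly like the snippet below. This is only a minimal, self-contained sketch: the module and dataloader are placeholders, not the repo's actual model or pretrain.py, and I am assuming the ddp_spawn strategy that appears in the warning and traceback further down.

# Minimal sketch of the change: the only edit was setting devices=3 on the Trainer.
# DummyModule and the TensorDataset below are placeholders, not the repo's model/data.
import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class DummyModule(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 1)

    def training_step(self, batch, batch_idx):
        (x,) = batch
        # Placeholder loss, just so the step returns a scalar with gradients.
        return self.layer(x).mean()

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)


if __name__ == "__main__":
    train_loader = DataLoader(TensorDataset(torch.randn(64, 32)), batch_size=8)
    trainer = pl.Trainer(
        accelerator="gpu",
        devices=3,             # changed this to 3; this is the only edit I made
        strategy="ddp_spawn",  # the strategy the warning and traceback below refer to
        max_epochs=1,          # placeholder value
    )
    trainer.fit(DummyModule(), train_dataloaders=train_loader)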
Here are the logs:
INFO - 2023-10-14 21:59:55,428 - distributed_c10d - Rank 2: Completed store-based barrier for key:store_based_barrier_key:1 with 3 nodes.
INFO - 2023-10-14 21:59:55,437 - distributed_c10d - Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 3 nodes.
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2]
LOCAL_RANK: 2 - CUDA_VISIBLE_DEVICES: [0,1,2]
| Name | Type | Params
-------------------------------------------------------------
0 | loss_func | ChamferDistanceL2 | 0
1 | group_devider | Group | 0
2 | mask_generator | Mask | 0
3 | MAE_encoder | TransformerWithEmbeddings | 21.8 M
4 | MAE_decoder | TransformerWithEmbeddings | 7.1 M
5 | increase_dim | Conv1d | 37.0 K
-------------------------------------------------------------
29.0 M Trainable params
0 Non-trainable params
29.0 M Total params
116.023 Total estimated model params size (MB)
miniconda3/envs/exppointmae/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:224: UserWarning: num_workers>0, persistent_workers=False, and strategy=ddp_spawn may result in data loading bottlenecks. Consider setting persistent_workers=True (this is a limitation of Python .spawn() and PyTorch)
rank_zero_warn(
Training: 0it [00:00, ?it/s]INFO - 2023-10-14 21:59:56,950 - backend - multiprocessing start_methods=fork,spawn,forkserver, using: spawn
Traceback (most recent call last):
File "pretrain.py", line 146, in <module>
main(args)
File "pretrain.py", line 134, in main
trainer.fit(model, train_dataloaders=train_loader)
File "miniconda3/envs/exppointmae/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 770, in fit
self._call_and_handle_interrupt(
File "miniconda3/envs/exppointmae/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 721, in _call_and_handle_interrupt
return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
File "miniconda3/envs/exppointmae/lib/python3.8/site-packages/pytorch_lightning/strategies/launchers/spawn.py", line 78, in launch
mp.spawn(
File "miniconda3/envs/exppointmae/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "miniconda3/envs/exppointmae/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
while not context.join():
File "miniconda3/envs/exppointmae/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 160, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/miniconda3/envs/exppointmae/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
fn(i, *args)
File "miniconda3/envs/exppointmae/lib/python3.8/site-packages/pytorch_lightning/strategies/launchers/spawn.py", line 101, in _wrapping_function
results = function(*args, **kwargs)
File "miniconda3/envs/exppointmae/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 811, in _fit_impl
results = self._run(model, ckpt_path=self.ckpt_path)
File "miniconda3/envs/exppointmae/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1236, in _run
results = self._run_stage()
File "miniconda3/envs/exppointmae/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1323, in _run_stage
return self._run_train()
File "miniconda3/envs/exppointmae/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1353, in _run_train
self.fit_loop.run()
File "miniconda3/envs/exppointmae/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 203, in run
self.on_advance_start(*args, **kwargs)
File "miniconda3/envs/exppointmae/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 254, in on_advance_start
self.trainer._call_callback_hooks("on_train_epoch_start")
File "miniconda3/envs/exppointmae/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1636, in _call_callback_hooks
fn(self, self.lightning_module, *args, **kwargs)
File "miniconda3/envs/exppointmae/lib/python3.8/site-packages/pytorch_lightning/callbacks/lr_monitor.py", line 170, in on_train_epoch_start
logger.log_metrics(latest_stat, step=trainer.fit_loop.epoch_loop._batches_that_stepped)
File "miniconda3/envs/exppointmae/lib/python3.8/site-packages/pytorch_lightning/utilities/rank_zero.py", line 32, in wrapped_fn
return fn(*args, **kwargs)
File "miniconda3/envs/exppointmae/lib/python3.8/site-packages/pytorch_lightning/loggers/wandb.py", line 382, in log_metrics
self.experiment.log({**metrics, "trainer/global_step": step})
File "miniconda3/envs/exppointmae/lib/python3.8/site-packages/pytorch_lightning/loggers/base.py", line 41, in experiment
return get_experiment() or DummyExperiment()
File "miniconda3/envs/exppointmae/lib/python3.8/site-packages/pytorch_lightning/utilities/rank_zero.py", line 32, in wrapped_fn
return fn(*args, **kwargs)
File "miniconda3/envs/exppointmae/lib/python3.8/site-packages/pytorch_lightning/loggers/base.py", line 39, in get_experiment
return fn(self)
File "miniconda3/envs/exppointmae/lib/python3.8/site-packages/pytorch_lightning/loggers/wandb.py", line 354, in experiment
self._experiment = wandb._attach(attach_id)
File "miniconda3/envs/exppointmae/lib/python3.8/site-packages/wandb/sdk/wandb_init.py", line 762, in _attach
raise UsageError("problem")
wandb.errors.UsageError: problem
wandb: Waiting for W&B process to finish... (failed 1).
wandb:
wandb: You can sync this run to the cloud by running:
wandb: wandb sync ExpPoint-MAE/wandb/offline-run-20231014_215940-ghfv9t5o
wandb: Find logs at: ./wandb/offline-run-20231014_215940-ghfv9t5o/logs
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/miniconda3/envs/exppointmae/lib/python3.8/multiprocessing/spawn.py", line 116, in spawn_main
exitcode = _main(fd, parent_sentinel)
File "miniconda3/envs/exppointmae/lib/python3.8/multiprocessing/spawn.py", line 126, in _main
self = reduction.pickle.load(from_parent)
_pickle.UnpicklingError: pickle data was truncated
(exppointmae) ExpPoint-MAE$ Traceback (most recent call last):
File "<string>", line 1, in <module>
File "miniconda3/envs/exppointmae/lib/python3.8/multiprocessing/spawn.py", line 116, in spawn_main
exitcode = _main(fd, parent_sentinel)
File "miniconda3/envs/exppointmae/lib/python3.8/multiprocessing/spawn.py", line 126, in _main
self = reduction.pickle.load(from_parent)
_pickle.UnpicklingError: pickle data was truncated
miniconda3/envs/exppointmae/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 30 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '