Are you using a Cloud GPU server for this?
~$500k rig on site in a vfx company? improbable…
not to mention storage and networking and power…
but maybe…
but improbable…
Hey @philm @cristhiancordoba yes it is local, including 800Gb fiber connection and hot storage. You need and you want to be local: some clients don't allow, or don't want, their data/footage going elsewhere to a 3rd-party vendor, where it can end up in datasets or they lose control of it. In those cases, local is the only way to control and lock the client's data.
I posted some time ago, if you scroll up on this post, my other, older GPU server, 8x H200s.
It was recently sold to a university.
It is a thing, and I confess, I'm going way too far haha
I love it.
but your situation is no longer as part of a vfx company.
so there’s that.
Has anyone succeeded running Tunet on macOS (Sonoma 14.7.6)? MacBook Pro M1 Max.
Getting various errors with imports, like this one:
Traceback (most recent call last):
  File "train.py", line 33, in <module>
    from torch.amp import GradScaler, autocast
ImportError: cannot import name 'GradScaler' from 'torch.amp' (/opt/miniconda3/envs/tunet/lib/python3.8/site-packages/torch/amp/__init__.py)
Changing the import to from torch.cuda.amp import GradScaler, autocast led to other errors:
Error initializing DDP: module 'torch._C' has no attribute '_cuda_setDevice'. Check DDP environment variables (RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT, LOCAL_RANK) and NCCL/Gloo.
Traceback (most recent call last):
  File "train.py", line 1206, in <module>
    train(config)
  File "train.py", line 578, in train
    setup_ddp(); rank = get_rank(); world_size = get_world_size()
  File "train.py", line 62, in setup_ddp
    torch.cuda.set_device(local_rank)
  File "/opt/miniconda3/envs/tunet/lib/python3.8/site-packages/torch/cuda/__init__.py", line 408, in set_device
    torch._C._cuda_setDevice(device)
AttributeError: module 'torch._C' has no attribute '_cuda_setDevice'
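For what it's worth, both errors come from the same version mismatch: GradScaler only moved into torch.amp in newer PyTorch releases, while older builds (like the 2.2.2 wheel mentioned below) expose it under torch.cuda.amp. A version-tolerant import sketch (the fallback order is my suggestion, not part of the Tunet repo):

```python
# Try the newer location first, fall back to the older one.
# torch.cuda.amp still imports fine on CPU-only builds; it only
# fails at runtime if you actually try to use a CUDA device.
try:
    from torch.amp import GradScaler, autocast  # newer PyTorch
except ImportError:
    from torch.cuda.amp import GradScaler, autocast  # older PyTorch
```

That way the same train.py runs on both old and new torch installs without hand-editing the import.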
The formatting is weird, sorry.
grad_scaler.py is there by the way:
/opt/miniconda3/envs/tunet/lib/python3.8/site-packages/torch/cuda/amp/grad_scaler.py
So the first message makes sense?
Did I get the wrong install maybe?
when pip installing:
Downloading torchvision-0.17.2-cp38-cp38-macosx_10_13_x86_64.whl.metadata (6.6 kB)
Downloading torchaudio-2.2.2-cp38-cp38-macosx_10_13_x86_64.whl.metadata (6.4 kB)
Downloading torch-2.2.2-cp38-none-macosx_10_9_x86_64.whl (150.6 MB)
The error indicates you are trying to use the Linux version on Mac.
Download the multios one, or use this git clone directly, just added:
git clone --branch multios --single-branch
macOS does work, but I was very sad with the speed of the training; it's crazy slow.
After searching more, I learned that not even Apple uses Apple Silicon for training.
It's just too slow for such a task.
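For anyone else testing on a Mac: PyTorch exposes Apple Silicon GPUs through the MPS backend. A minimal device-selection sketch (the fallback order is my assumption, not Tunet's actual code):

```python
import torch

# Prefer CUDA, then Apple's MPS backend (M-series GPUs), then CPU.
# The getattr guard keeps this working on torch builds that predate
# the mps backend entirely.
if torch.cuda.is_available():
    device = torch.device("cuda")
elif getattr(torch.backends, "mps", None) is not None and torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")
```

Even with MPS picked up correctly, expect the training speed complaints above to hold: it runs, it's just slow.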
Thanks Thiago @tpo !
Makes sense, I forgot to specify the branch when cloning.
Re-installing now.
hmm, this time I did
git clone --branch multios --single-branch https://github.com/tpc2233/tunet.git
… but still getting this error:
Traceback (most recent call last):
  File "train.py", line 30, in <module>
    from torch.amp import GradScaler, autocast
ImportError: cannot import name 'GradScaler' from 'torch.amp' (/opt/miniconda3/envs/tunet/lib/python3.8/site-packages/torch/amp/__init__.py)
Could it be something with pip install torch torchvision torchaudio getting the wrong version? grad_scaler.py is not directly under torch in my conda env, but inside the cuda folder.
Alright, I just changed the import again in train.py (but this time in the properly cloned repo) and it seems to be working (from torch.cuda.amp import GradScaler, autocast):
13:17:59 [INFO] Epoch[1] Step[330] (330/500), L1:0.0433(Avg:0.1289), LR:1.0e-04, T/Step:0.817s (D:0.001 T:0.003 C:0.170)
13:18:03 [INFO] … (D:0.001 T:0.002 C:0.155)
13:18:03 [INFO] Epoch[1] Step[335] (335/500), L1:0.0863(Avg:0.1284), LR:1.0e-04, T/Step:0.817s (D:0.001 T:0.002 C:0.155)
13:18:07 [INFO] … (D:0.001 T:0.001 C:0.163)
13:18:07 [INFO] Epoch[1] Step[340] (340/500), L1:0.0692(Avg:0.1276), LR:1.0e-04, T/Step:0.818s (D:0.001 T:0.001 C:0.163)
13:18:11 [INFO] … (D:0.001 T:0.001 C:0.160)
13:18:11 [INFO] Epoch[1] Step[345] (345/500), L1:0.0893(Avg:0.1268), LR:1.0e-04, T/Step:0.819s (D:0.001 T:0.001 C:0.160)
Not sure if AMP is supported on all M chips.
Try disabling AMP in your config if you are using it.
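One way to make that guard explicit in a training script (a sketch, assuming AMP is only wanted on CUDA; this is not Tunet's actual config handling):

```python
import torch
from torch.cuda.amp import GradScaler

# Enable AMP only when a CUDA device is actually available.
# GradScaler(enabled=False) turns scaler.scale()/step()/update()
# into pass-throughs, so the training loop stays the same either way.
use_amp = torch.cuda.is_available()
scaler = GradScaler(enabled=use_amp)
```

On a CPU or MPS setup this quietly runs without mixed precision instead of erroring out.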
Guessing amp is True by default?
Using a simple config, without setting amp to anything. Seems to be working now (?)
If the results are not as expected I’ll try to set it to False.
EDIT:
the info was in the terminal:
Optimizer: AdamW | AMP Enabled: False
I got it to install, run and convert on Mac and Windows, but it’s not doing what it should.
Ruling out the Mac: on a Windows 10 RTX 4090, with datasets at 1164x1620 (crops) and AMP turned on, I'm getting this error:
16:47:09 [WARNING] AMP requested but device is CPU. AMP disabled.
16:47:09 [INFO] Optimizer: AdamW | AMP Enabled: False
Then on CPU it’s like 9s per step
Is this something to set on the windows machine (CPU/GPU) ? What am I doing wrong?
Has anybody else run into this problem?
I followed finnschi's solution in the issues and it worked.
about a second per step on this 4090, at 1164x1620 (crops)
@finnjaeger is that you? (finnschi)
finnschi’s solution ===>
conda activate tunet
pip uninstall torch torchvision torchaudio -y
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
verify that its working:
python -c "import torch; print(torch.cuda.is_available()); print(torch.cuda.get_device_name(0)); print(torch.version.cuda)"
===========
(tunet) C:\Users\stefan\tunet>python -c "import torch; print(torch.cuda.is_available()); print(torch.cuda.get_device_name(0)); print(torch.version.cuda)"
True
NVIDIA GeForce RTX 4090
11.8
yea thats me, had to do a lot more things to get AMP to work and to get my s/step down. I can send you my fork if you want, but idk if I made things better or worse
Send pls, would like to see if it reduces my training time.
Curious to know where the AOV frames are initially generated? I ran the archive and it works as expected, but obviously the results for AOV separation are less than ideal when piped a new clip. So new training needs to be done, but I'm not sure where to create the different AOVs to do the training on?