Releasing TUNET - An ML training tool

Hello everyone!

I'm excited to release TUNET — a Source/Destination ML training system focused on scalable compute and the ability to export models for artists.

I've been using TUNET in multiple real projects. It's designed to work similarly to CopyCat or simpleML, but optimized for GPU farms or dedicated machines, so the workstation stays free while training runs elsewhere, and you keep control over model settings, training data and so on.

Once training is complete, the model can be:

  • Run directly with Python inference
  • Exported as a Nuke .cat file
  • Exported as a Flame ONNX model with its JSON sidecar (a quick sanity check of the ONNX export is sketched below)
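
For illustration only (not from the TUNET docs): a minimal way to sanity-check an exported ONNX model with onnxruntime before handing it to Flame. The file names below are placeholders.

# Quick sanity check of an exported ONNX model; assumes the `onnxruntime`
# and `numpy` packages. File names are placeholders, not TUNET conventions.
import json
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("tunet_export.onnx", providers=["CPUExecutionProvider"])

# Inspect the input the exporter baked in (name, shape, dtype).
inp = session.get_inputs()[0]
print(inp.name, inp.shape, inp.type)

# Push a dummy frame through, substituting 1 for any dynamic dimension.
shape = [d if isinstance(d, int) else 1 for d in inp.shape]
dummy = np.random.rand(*shape).astype(np.float32)
out = session.run(None, {inp.name: dummy})[0]
print("output shape:", out.shape)

# The JSON sidecar carries the metadata Flame reads; just confirm it parses.
with open("tunet_export.json") as f:
    print(list(json.load(f).keys()))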

Link:

I've tried to streamline the whole project as much as possible. I'm still working on that…

Here is a how-to guide for training:

The Flame converter is still WIP and doesn't work with all models yet. The Nuke converter is working.

14 Likes

Legend.

1 Like

Thank you @cnoellert, glad you liked it.
@ALan, I missed your post :(

Amazing!!! Thanks, @tpo!

1 Like

All good, I figured out how to run on multiple GPUs. Although a single A6000 (non-Ada) is faster than 3x P6000.

1 Like

Yay, niceee Alan! Yeah, DDP is set up and torchrun does the job, nice discovery :)
I haven't posted about it yet because I'd like a lightweight, non-YAML setup for multi-GPU; going to publish it tomorrow on a separate branch.

Thanks Andyyy!

Actually, I do not think multiple GPUs are working.

torchrun --standalone --nnodes=1 --nproc_per_node=2 train.py --config /root/tunetest/train_template_simple.yaml

The log output just sits here:
2025-04-08 02:57:37,182 [RK0][INFO] Checkpoint configuration compatible.
2025-04-08 02:57:37,846 [RK0][INFO] Resuming training from iteration 1501.
2025-04-08 02:57:37,950 [RK0][INFO] Starting training loop from Global Step: 1501 (Epoch: 4)
2025-04-08 02:57:37,950 [RK0][INFO] Starting Epoch 4 (Previous epoch duration: 0.00s) (Dataset Pass approx 4)

If I re-run with --nproc_per_node=1, everything progresses as normal. I've tested the same procedure on a cloud GPU machine with 2x A100.

see attached screen recording

Nice @ALan, and yes, that is expected. When running multi-GPU, the current dataloader gets stuck on the data distribution.
To run this properly on multiple GPUs you will need the multi-GPU branch, which I'm going to publish tomorrow; I'll let you know here.
I'll keep the main branch for single GPU for now.
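
For anyone curious why a single-GPU dataloader stalls under DDP: each rank needs its own shard of the dataset, which is what PyTorch's DistributedSampler provides. Below is a generic sketch of that pattern, illustrative only and not the actual TUNET multi-GPU code; it would be launched with torchrun the same way as the commands above.

# Generic PyTorch pattern for sharding a dataset across DDP ranks.
# Illustrative only; the TUNET multi-GPU branch may do this differently.
import torch
import torch.distributed as dist
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

def main():
    # torchrun sets RANK / WORLD_SIZE / LOCAL_RANK for each process.
    dist.init_process_group(backend="nccl")

    # Stand-in for the real source/destination patch dataset.
    dataset = TensorDataset(torch.randn(1000, 3, 64, 64), torch.randn(1000, 3, 64, 64))

    sampler = DistributedSampler(dataset, shuffle=True)
    loader = DataLoader(dataset, batch_size=2, sampler=sampler, num_workers=2)

    for epoch in range(3):
        sampler.set_epoch(epoch)  # keeps shuffling consistent across ranks each epoch
        for src, dst in loader:
            pass  # forward/backward would run here on this rank's shard only

    dist.destroy_process_group()

if __name__ == "__main__":
    main()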

Meanwhile, if you grab the Advanced template, you will have more controls to put all that VRAM to use!

2 Likes

Would you like to come on Logik Live and show it? :wink:

12 Likes

That's so sweet, let's do it @andymilkis. There are some tricks for training that I could show.

5 Likes

can’t wait to see this one :slight_smile:

2 Likes

I left the training overnight; it got to 18,500 steps (epoch 38). Not sure what a good amount is to stop at and judge the results.

2 Likes

Hey @ALan, opening the .jpg is a good way to check. Or you can just grab the latest.pth and run Python inference on the whole plate.
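
For reference, a rough sketch of what running Python inference from a checkpoint can look like. The network class and checkpoint layout below are placeholders, not TUNET's actual API, so the bundled inference script is what to use in practice.

# Rough sketch only: load a trained checkpoint and run one frame through it.
# TinyNet and the checkpoint keys are stand-ins for the real architecture/config.
import torch
import torch.nn as nn

class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 3, 3, padding=1),
        )
    def forward(self, x):
        return self.body(x)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = TinyNet().to(device).eval()

ckpt = torch.load("latest.pth", map_location=device)
state = ckpt.get("model", ckpt) if isinstance(ckpt, dict) else ckpt
# model.load_state_dict(state)  # only valid if the class matches the trained model

# Stand-in for a full-res plate frame as a 1 x C x H x W float tensor.
frame = torch.rand(1, 3, 1080, 1920, device=device)
with torch.no_grad():
    out = model(frame)
print(out.shape)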

MULTI-GPU support is published.
Just download again from GitHub and you're good to go.

Instructions are in the repo, or below:

Run the trainer on 2 GPUs:
torchrun --standalone --nnodes=1 --nproc_per_node=2 train_multigpu.py --config /path/to/your/config.yaml

Run the trainer on 4 GPUs:
torchrun --standalone --nnodes=1 --nproc_per_node=4 train_multigpu.py --config /path/to/your/config.yaml

Run the trainer on 8 GPUs:
torchrun --standalone --nnodes=1 --nproc_per_node=8 train_multigpu.py --config /path/to/your/config.yaml

The batch size is per GPU, so if you set a batch size of 2 and run on 8 GPUs, you effectively train with a global batch of 16.
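
In code terms (a generic sketch using the WORLD_SIZE variable that torchrun exports, not TUNET's exact variable names):

# Effective (global) batch size under DDP = per-GPU batch x number of processes.
import os

per_gpu_batch = 2                                  # batch size set in the config (per GPU)
world_size = int(os.environ.get("WORLD_SIZE", 1))  # 8 when using --nproc_per_node=8
print("samples per optimizer step:", per_gpu_batch * world_size)  # -> 16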

4 Likes

OMG!!!

@tpo I’ve gotten the multi-gpu version working, but it is actually faster to train on a single GPU than on multiple GPUs. I’ve tested with 4090 and A100.

2x4090 was about 4.5 seconds/step

2025-04-08 23:11:28,412 [RK0][INFO] Epoch[1], Step[200] (200/500), L1:0.0262(Avg:0.0304), LR:1.0e-04, T/B:4.111s (D:3.283,C:0.806)
2025-04-08 23:11:32,816 [RK0][INFO] Epoch[1], Step[205] (205/500), L1:0.0297(Avg:0.0303), LR:1.0e-04, T/B:4.404s (D:3.591,C:0.791)
2025-04-08 23:11:36,917 [RK0][INFO] Epoch[1], Step[210] (210/500), L1:0.0366(Avg:0.0304), LR:1.0e-04, T/B:4.101s (D:3.292,C:0.787)
2025-04-08 23:11:41,344 [RK0][INFO] Epoch[1], Step[215] (215/500), L1:0.0314(Avg:0.0304), LR:1.0e-04, T/B:4.427s (D:3.613,C:0.792)
2025-04-08 23:11:45,398 [RK0][INFO] Epoch[1], Step[220] (220/500), L1:0.0460(Avg:0.0306), LR:1.0e-04, T/B:4.053s (D:3.242,C:0.789)
2025-04-08 23:11:49,458 [RK0][INFO] Epoch[1], Step[225] (225/500), L1:0.0245(Avg:0.0307), LR:1.0e-04, T/B:4.060s (D:3.245,C:0.793)

1x4090 is about 2.5 seconds/step

2025-04-08 23:08:06,870 [RK0][INFO] Epoch[1] Step[355/500] L1:0.0518 LR:1.00e-04 T/Step:2.588s (D:2.109 P:0.000 C:0.436)
2025-04-08 23:08:09,051 [RK0][INFO] Epoch[1] Step[360/500] L1:0.0530 LR:1.00e-04 T/Step:2.181s (D:1.702 P:0.000 C:0.435)
2025-04-08 23:08:11,232 [RK0][INFO] Epoch[1] Step[365/500] L1:0.0600 LR:1.00e-04 T/Step:2.181s (D:1.702 P:0.000 C:0.435)
2025-04-08 23:08:13,415 [RK0][INFO] Epoch[1] Step[370/500] L1:0.0849 LR:1.00e-04 T/Step:2.182s (D:1.703 P:0.000 C:0.435)

Nice, glad you got it working on your end Alan!

Yes and no.

Let's say you have a batch size of 6 in your config.
With 1x GPU, every step the model computes 6 samples.
With 2x GPUs, every step the model now computes 12.
So you are likely to have your model ready a lot faster (not just 2x) because during training the model sees more samples in the same amount of time.
Makes sense? Sometimes it also leads to better renders.

In terms of seconds/step there are a lot of factors in the GPU <> GPU communication, like PCIe bandwidth (8x vs 16x lanes, 3.0 vs 4.0) and so on. Those will affect your step time heavily, since each GPU needs to sync with the others.
I don't have the numbers in my head right now, but I remember it being about the same speed as 1x GPU on my end. But I'm using NVLink, so it's not a fair comparison.
Going to give it a try on a PCIe machine tomorrow.

@tpo

Yeah, all the machines I've tested on are PCIe-based.

Today I was using TensorDock since Vultr was having some issues.

I get that 2 (GPUs) x 6 (batch) = 12 is better than 1 x 6 = 6…
but this assumes the sec/step is the same across n GPUs, which according to the log it is not. The log indicates that it takes around 2x the amount of time with 2 GPUs. Maybe the log reporting is inaccurate, and reports the step time as cumulative rather than elapsed?

I think that’s expected since it’s a PCIe machine Alan!

Also, you're not actually slower; you're a tiny bit faster.

Let me break it down using a batch size of 6 as an example:

  • With 1 GPU: it takes 2.5 seconds per step for a batch of 6, so about 5.0 seconds to get through 12 samples.
  • With 2 GPUs: it takes 4.5 seconds per step for a batch of 12 (6 per GPU).
    So for the same 12 samples, you're actually about 0.5 seconds faster when using 2 GPUs.

The proper benchmark here would be:
Compare 1 GPU handling a batch of 12 vs. 2 GPUs each handling a batch of 6 (totaling 12).
Does that make sense?
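
Putting the measured numbers from the logs into that framing (simple arithmetic, not extra benchmarks):

# Throughput from the step times reported above.
single = {"samples_per_step": 6,  "sec_per_step": 2.5}   # 1x 4090, batch 6
dual   = {"samples_per_step": 12, "sec_per_step": 4.5}   # 2x 4090, batch 6 per GPU

for name, cfg in (("1 GPU", single), ("2 GPUs", dual)):
    print(name, round(cfg["samples_per_step"] / cfg["sec_per_step"], 2), "samples/sec")
# 1 GPU  -> 2.4 samples/sec
# 2 GPUs -> 2.67 samples/sec; per 12 samples: 2 x 2.5s = 5.0s vs 4.5s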

1 Like

I understand. I didn't realize that step progress is relative to batch size. I thought a step was absolute. I'll try that test.