Releasing TUNET - An ML training tool

Hello everyone!

I'm excited to release TUNET — a Source/Destination ML training system focused on scalable compute and the ability to export models for artists.

I've been using TUNET in multiple real projects. It's designed to work similarly to CopyCat or simpleML, but optimized for GPU farms or dedicated machines, so the workstation stays free while training runs elsewhere, and you keep control over model settings, training data and so on.

Once training is complete, the model can be:

  • Run directly with Python inference
  • Exported as a Nuke .cat file
  • Exported as a Flame ONNX model with its JSON sidecar (a quick sanity check of the ONNX export is sketched below)
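
For illustration only (not from the TUNET docs): a minimal way to sanity-check an exported ONNX model with onnxruntime before handing it to Flame. The file names below are placeholders.

# Quick sanity check of an exported ONNX model; assumes the `onnxruntime`
# and `numpy` packages. File names are placeholders, not TUNET conventions.
import json
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("tunet_export.onnx", providers=["CPUExecutionProvider"])

# Inspect the input the exporter baked in (name, shape, dtype).
inp = session.get_inputs()[0]
print(inp.name, inp.shape, inp.type)

# Push a dummy frame through, substituting 1 for any dynamic dimension.
shape = [d if isinstance(d, int) else 1 for d in inp.shape]
dummy = np.random.rand(*shape).astype(np.float32)
out = session.run(None, {inp.name: dummy})[0]
print("output shape:", out.shape)

# The JSON sidecar carries the metadata Flame reads; just confirm it parses.
with open("tunet_export.json") as f:
    print(list(json.load(f).keys()))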

Link:

I've tried to streamline the whole project as much as possible. I'm still working on that…

Here is a how-to guide for training:

The Flame converter is still WIP and doesn't work with all models yet. The Nuke converter is working.

14 Likes

Legend.

1 Like

Thank you @cnoellert, glad you liked it.
@ALan, I missed your post :(

Amazing!!! Thanks, @tpo!

1 Like

All good, I figured out how to run on multiple GPUs. Although a single A6000 (non-Ada) is faster than 3x P6000.

1 Like

Yay, niceee Alan! Yeah, DDP is set up and torchrun does the job, nice discovery :)
I haven't posted about it yet because I'd like a lightweight, non-YAML setup for multi-GPU; going to publish it tomorrow on a separate branch.

Thanks Andyyy!

Actually, I do not think multiple GPUs are working.

torchrun --standalone --nnodes=1 --nproc_per_node=2 train.py --config /root/tunetest/train_template_simple.yaml

The log output just sits here:
2025-04-08 02:57:37,182 [RK0][INFO] Checkpoint configuration compatible.
2025-04-08 02:57:37,846 [RK0][INFO] Resuming training from iteration 1501.
2025-04-08 02:57:37,950 [RK0][INFO] Starting training loop from Global Step: 1501 (Epoch: 4)
2025-04-08 02:57:37,950 [RK0][INFO] Starting Epoch 4 (Previous epoch duration: 0.00s) (Dataset Pass approx 4)

If I re-run with --nproc_per_node=1, everything progresses as normal. I've tested the same procedure on a cloud GPU machine with 2x A100.

see attached screen recording

Nice @ALan, and yes, that is expected. When running multi-GPU, the current dataloader gets stuck on the data distribution.
To run this properly on multiple GPUs you will need the multi-GPU branch, which I'm going to publish tomorrow; I'll let you know here.
I'll keep the main branch for single GPU for now.
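
For anyone curious why a single-GPU dataloader stalls under DDP: each rank needs its own shard of the dataset, which is what PyTorch's DistributedSampler provides. Below is a generic sketch of that pattern, illustrative only and not the actual TUNET multi-GPU code; it would be launched with torchrun the same way as the commands above.

# Generic PyTorch pattern for sharding a dataset across DDP ranks.
# Illustrative only; the TUNET multi-GPU branch may do this differently.
import torch
import torch.distributed as dist
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

def main():
    # torchrun sets RANK / WORLD_SIZE / LOCAL_RANK for each process.
    dist.init_process_group(backend="nccl")

    # Stand-in for the real source/destination patch dataset.
    dataset = TensorDataset(torch.randn(1000, 3, 64, 64), torch.randn(1000, 3, 64, 64))

    sampler = DistributedSampler(dataset, shuffle=True)
    loader = DataLoader(dataset, batch_size=2, sampler=sampler, num_workers=2)

    for epoch in range(3):
        sampler.set_epoch(epoch)  # keeps shuffling consistent across ranks each epoch
        for src, dst in loader:
            pass  # forward/backward would run here on this rank's shard only

    dist.destroy_process_group()

if __name__ == "__main__":
    main()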

Meanwhile, if you grab the Advanced template, you will have more controls to put all that VRAM to use!

2 Likes

Would you like to come on Logik Live and show it? :wink:

12 Likes

That's so sweet, let's do it @andymilkis. There are some tricks for training that I could show.

5 Likes

can’t wait to see this one :slight_smile:

2 Likes

I left the training overnight; it got to 18,500 steps (epoch 38). Not sure what a good amount is to stop at and judge the results.

2 Likes

Hey @ALan, opening the .jpg is a good way to check. Or you can just grab the latest.pth and run Python inference on the whole plate.
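
For reference, a rough sketch of what running Python inference from a checkpoint can look like. The network class and checkpoint layout below are placeholders, not TUNET's actual API, so the bundled inference script is what to use in practice.

# Rough sketch only: load a trained checkpoint and run one frame through it.
# TinyNet and the checkpoint keys are stand-ins for the real architecture/config.
import torch
import torch.nn as nn

class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 3, 3, padding=1),
        )
    def forward(self, x):
        return self.body(x)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = TinyNet().to(device).eval()

ckpt = torch.load("latest.pth", map_location=device)
state = ckpt.get("model", ckpt) if isinstance(ckpt, dict) else ckpt
# model.load_state_dict(state)  # only valid if the class matches the trained model

# Stand-in for a full-res plate frame as a 1 x C x H x W float tensor.
frame = torch.rand(1, 3, 1080, 1920, device=device)
with torch.no_grad():
    out = model(frame)
print(out.shape)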

MULTI-GPU support is published.
Just download again from GitHub and you're good to go.

Instructions are in the repo, or below:

Run the trainer on 2 GPUs:
torchrun --standalone --nnodes=1 --nproc_per_node=2 train_multigpu.py --config /path/to/your/config.yaml

Run the trainer on 4 GPUs:
torchrun --standalone --nnodes=1 --nproc_per_node=4 train_multigpu.py --config /path/to/your/config.yaml

Run the trainer on 8 GPUs:
torchrun --standalone --nnodes=1 --nproc_per_node=8 train_multigpu.py --config /path/to/your/config.yaml

The batch size is per GPU, so if you set a batch size of 2 and run on 8 GPUs, you effectively train with a global batch of 16.
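
In code terms (a generic sketch using the WORLD_SIZE variable that torchrun exports, not TUNET's exact variable names):

# Effective (global) batch size under DDP = per-GPU batch x number of processes.
import os

per_gpu_batch = 2                                  # batch size set in the config (per GPU)
world_size = int(os.environ.get("WORLD_SIZE", 1))  # 8 when using --nproc_per_node=8
print("samples per optimizer step:", per_gpu_batch * world_size)  # -> 16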

4 Likes

OMG!!!

@tpo I’ve gotten the multi-gpu version working, but it is actually faster to train on a single GPU than on multiple GPUs. I’ve tested with 4090 and A100.

2x4090 was about 4.5 seconds/step

2025-04-08 23:11:28,412 [RK0][INFO] Epoch[1], Step[200] (200/500), L1:0.0262(Avg:0.0304), LR:1.0e-04, T/B:4.111s (D:3.283,C:0.806)
2025-04-08 23:11:32,816 [RK0][INFO] Epoch[1], Step[205] (205/500), L1:0.0297(Avg:0.0303), LR:1.0e-04, T/B:4.404s (D:3.591,C:0.791)
2025-04-08 23:11:36,917 [RK0][INFO] Epoch[1], Step[210] (210/500), L1:0.0366(Avg:0.0304), LR:1.0e-04, T/B:4.101s (D:3.292,C:0.787)
2025-04-08 23:11:41,344 [RK0][INFO] Epoch[1], Step[215] (215/500), L1:0.0314(Avg:0.0304), LR:1.0e-04, T/B:4.427s (D:3.613,C:0.792)
2025-04-08 23:11:45,398 [RK0][INFO] Epoch[1], Step[220] (220/500), L1:0.0460(Avg:0.0306), LR:1.0e-04, T/B:4.053s (D:3.242,C:0.789)
2025-04-08 23:11:49,458 [RK0][INFO] Epoch[1], Step[225] (225/500), L1:0.0245(Avg:0.0307), LR:1.0e-04, T/B:4.060s (D:3.245,C:0.793)

1x4090 is about 2.5 seconds/step

2025-04-08 23:08:06,870 [RK0][INFO] Epoch[1] Step[355/500] L1:0.0518 LR:1.00e-04 T/Step:2.588s (D:2.109 P:0.000 C:0.436)
2025-04-08 23:08:09,051 [RK0][INFO] Epoch[1] Step[360/500] L1:0.0530 LR:1.00e-04 T/Step:2.181s (D:1.702 P:0.000 C:0.435)
2025-04-08 23:08:11,232 [RK0][INFO] Epoch[1] Step[365/500] L1:0.0600 LR:1.00e-04 T/Step:2.181s (D:1.702 P:0.000 C:0.435)
2025-04-08 23:08:13,415 [RK0][INFO] Epoch[1] Step[370/500] L1:0.0849 LR:1.00e-04 T/Step:2.182s (D:1.703 P:0.000 C:0.435)

Nice, glad you got it working on your end Alan!

Yes and no.

Let's say you have a batch size of 6 in your config.
With 1x GPU, every step the model computes 6 samples.
With 2x GPUs, every step the model now computes 12.
So you are likely to have your model ready a lot faster (not just 2x) because during training the model sees more samples in the same amount of time.
Makes sense? Sometimes it also leads to better renders.

In terms of seconds/step there are a lot of factors in the GPU <> GPU communication, like PCIe bandwidth (8x vs 16x lanes, 3.0 vs 4.0) and so on. Those will affect your step time heavily, since each GPU needs to sync with the others.
I don't have the numbers in my head right now, but I remember it being about the same speed as 1x GPU on my end. But I'm using NVLink, so it's not a fair comparison.
Going to give it a try on a PCIe machine tomorrow.

@tpo

Yeah, all the machines I've tested on are PCIe-based.

Today I was using TensorDock since Vultr was having some issues.

I get that 2 (GPUs) x 6 (batch) = 12 is better than 1 x 6 = 6…
but this assumes the sec/step is the same across n GPUs, which according to the log it is not. The log indicates that it takes around 2x the amount of time with 2 GPUs. Maybe the log reporting is inaccurate, and reports the step time as cumulative rather than elapsed?

I think that’s expected since it’s a PCIe machine Alan!

Also, you're not actually slower; you're a tiny bit faster.

Let me break it down using a batch size of 6 as an example:

  • With 1 GPU: it takes 2.5 seconds per step for a batch of 6, so about 5.0 seconds to get through 12 samples.
  • With 2 GPUs: it takes 4.5 seconds per step for a batch of 12 (6 per GPU).
    So for the same 12 samples, you're actually about 0.5 seconds faster when using 2 GPUs.

The proper benchmark here would be:
Compare 1 GPU handling a batch of 12 vs. 2 GPUs each handling a batch of 6 (totaling 12).
Does that make sense?
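
Putting the measured numbers from the logs into that framing (simple arithmetic, not extra benchmarks):

# Throughput from the step times reported above.
single = {"samples_per_step": 6,  "sec_per_step": 2.5}   # 1x 4090, batch 6
dual   = {"samples_per_step": 12, "sec_per_step": 4.5}   # 2x 4090, batch 6 per GPU

for name, cfg in (("1 GPU", single), ("2 GPUs", dual)):
    print(name, round(cfg["samples_per_step"] / cfg["sec_per_step"], 2), "samples/sec")
# 1 GPU  -> 2.4 samples/sec
# 2 GPUs -> 2.67 samples/sec; per 12 samples: 2 x 2.5s = 5.0s vs 4.5s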

1 Like

I understand. I didn't realize that step progress is relative to batch size. I thought a step was absolute. I'll try that test.