“Inference time is around 2 hours for 8 seconds video running on 192gb of Vram”
But are those really the two extreme ends of the spectrum: on one end the basic 8-bit SDR MP4 workflow, and on the other full-res, full-dynamic-range output?
It seems that there may be some middle ground, and we don't have to solve all the problems equally.
The first step would be to get to 10-bit rather than 8-bit. It doesn't have to be 16f or 32f.
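To make that gap concrete, here's a minimal sketch (the names and data are made up for illustration, not tied to any particular pipeline) comparing the worst-case rounding error of 8-bit vs 10-bit quantization on normalized pixel values:

```python
import numpy as np

# Stand-in for one channel of a normalized float frame.
frame = np.linspace(0.0, 1.0, 100_000, dtype=np.float32)

def max_quant_error(x, bits):
    levels = (1 << bits) - 1            # 255 for 8-bit, 1023 for 10-bit
    q = np.round(x * levels) / levels   # round-trip through integer codes
    return np.abs(q - x).max()          # worst-case error in [0, 1]

print(f"8-bit max error:  {max_quant_error(frame, 8):.6f}")   # ~1/510
print(f"10-bit max error: {max_quant_error(frame, 10):.6f}")  # ~1/2046, 4x finer
```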
The second would be to handle other color spaces (log / linear). That's trickier, since the encoding has to match the model's training data. It could be worked around with transforms if there's sufficient bit depth (see #1).
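Here's a rough sketch of why bit depth is the enabler there. The curve below is a deliberately simplified log function, not any real camera transfer function (ARRI LogC, S-Log3, etc. each have their own formulas); the point is only that a log-to-linear round-trip survives 10-bit storage much better than 8-bit:

```python
import numpy as np

A = 1023.0  # arbitrary curve constant, purely for this illustration

def to_log(linear):
    return np.log2(linear * A + 1.0) / np.log2(A + 1.0)

def to_linear(log):
    return (np.power(2.0, log * np.log2(A + 1.0)) - 1.0) / A

linear = np.linspace(0.0, 1.0, 100_000, dtype=np.float64)

for bits in (8, 10):
    levels = (1 << bits) - 1
    stored = np.round(to_log(linear) * levels) / levels  # quantized log encoding
    err = np.abs(to_linear(stored) - linear).max()
    print(f"{bits}-bit log round-trip, max linear error: {err:.6f}")
```

The math side is the easy part; the real constraint is the one above, that the actual curve has to match what the model was trained on.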
Lastly, resolution: we need more than 720p, but it doesn't have to be 12K either.
How do we raise the bar so it integrates better, but remains reasonable to process, both time- and cost-wise?
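For a rough sense of the cost side, here's a quick back-of-envelope on raw per-frame buffer sizes (uncompressed RGB, so an upper bound on the in-memory working set, not on encoded file size):

```python
# Resolution x bit depth quickly dominates the working set.
resolutions = {"720p": (1280, 720), "1080p": (1920, 1080), "4K UHD": (3840, 2160)}
formats = {"8-bit": 1, "10-bit (16-bit container)": 2, "fp16": 2, "fp32": 4}

for rname, (w, h) in resolutions.items():
    for fname, bytes_per_ch in formats.items():
        mib = w * h * 3 * bytes_per_ch / 2**20
        print(f"{rname:7s} {fname:26s} {mib:6.1f} MiB/frame")
```

Even at 4K, fp32 is under 100 MiB per raw frame, which suggests the middle ground (10-bit, 1080p-to-4K) is a bounded cost rather than an open-ended one.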