Indeed. This may stem from the mental model most of us who don’t write code for ML tools have about what these tools do, how realistic fixing the last 20% is, and a mismatch between how these models get trained and the failure cases we actually hit.
Overall, this development curve is not linear, as one might initially think: a straight line that starts bottom left and goes 45 degrees up to the right. Instead it’s more like a logarithmic curve, which climbs quickly and then slows down, only reaching 100% at infinity. Every incremental percent gets harder and harder, and thus takes longer.
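To put toy numbers on that shape (the curve and the rate constant below are my own inventions, purely illustrative):

```python
import math

# Toy saturating curve: accuracy climbs fast, then each doubling of
# effort buys less and less. 'k' is an invented rate constant.
def accuracy(effort, k=0.5):
    return 100.0 * (1.0 - math.exp(-k * effort))

for effort in [1, 2, 4, 8, 16, 32]:
    print(f"effort {effort:>2}: {accuracy(effort):6.2f}%")
# Doubling effort from 1 to 2 buys ~24 points; from 16 to 32, ~0.03.
```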
In a simplistic view of ML tools, it’s a neural network that at some point has been shown pairs of data - an input and a ‘ground truth’ or matching result - and left to chew on them for a while. Eventually you can give it a new input, and it will produce the closest matching result and say, with fairly high probability, that’s what you just gave me. What sets ML tools apart from older procedural algorithms is that this works even with an input the model hasn’t seen before: it can tolerate similarity and respond with a probability. And it can do this quickly, and over very large data sets, depending on how the network is structured. Not unlike humans, who see letters and words, listen to parents explain what they mean, and eventually learn to read on their own. After all, neural networks are patterned after how our brains work (hence the name).
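In code, that simplistic view boils down to something like this (a minimal PyTorch sketch with made-up shapes and random data, not how any actual Flame or Nuke tool is built):

```python
import torch
import torch.nn as nn

# Toy network that learns a mapping from inputs to 'ground truth' results.
model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 64))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

inputs = torch.randn(1000, 64)    # stand-in for training inputs
targets = torch.randn(1000, 64)   # stand-in for matching 'ground truth'

for epoch in range(100):          # 'let it chew on them for a while'
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    optimizer.step()

# Inference: a never-seen input still yields the closest match it knows.
prediction = model(torch.randn(1, 64))
```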
When it comes to image processing, these algorithms don’t generally work on the whole image, but divide the image up into very small tiles and process them separately. They also pre-process the tiles through layers of convolution matrices that simplify the image into easier-to-recognize patterns.
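Sketched in the same vein (the tile size and channel counts are picked for illustration only):

```python
import torch
import torch.nn as nn

# A small stack of convolution layers, like the pre-processing described
# above. All shapes and channel counts here are invented.
features = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),   # pick up edges and colors
    nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=3, padding=1),  # combine into richer patterns
    nn.ReLU(),
    nn.MaxPool2d(2),                              # shrink, keep strongest responses
)

tile = torch.randn(1, 3, 64, 64)   # one 64x64 RGB tile cut from a larger image
patterns = features(tile)          # -> (1, 32, 32, 32): simplified pattern maps
```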
Where our expectations break down seems to be on two fronts:
As a Flame artist who has to do something to a shot, you rely on your experience and say ‘I’ve done this before, I know exactly which nodes I need to use to solve this’. That’s essentially what an ML tool does. However, we’ve all been in the situation where we apply our tried-and-true technique, only to hit a roadblock: that shot doesn’t track well, there is some distortion in it, you name it. As an experienced Flame artist you then say, well, crap, method A doesn’t work, but I had that once before, and method B worked instead. And on exceptionally hard shots you might finally get there with method H.
And if you don’t have that much experience yet, you may run out of ideas after D, so you get on Logik, ask, and learn a new method, until you too can make it to H and solve the shot. So as an artist you learned a new method. Note the word learning here.
But this is a higher level of learning than what ML tools do today. ML tools cannot go from A to B to C. They will just give you the best A they have and then stop. Famously, if a Waymo self-driving taxi in San Francisco encounters a sharp turn it doesn’t know how to handle, it simply stops in the middle of the street and waits for a rescue.
So when we hear these people say ‘the tools will get better’, they make it sound like you just have to train these tools a bit more, or tweak them, and they will get to 90 or 98%. But I’m afraid that’s the wrong perspective. The last 20% aren’t about more training of the same kind. The last 20% are about the ability to realize that no amount of extra training will solve this, that you need a different approach, and being trained to find different approaches. A kind of learning of the 2nd degree, not the 1st degree. And that is a whole different can of worms.
One day we may get there, but not with the current breed of tools. Those would be ML tools that build Flame setups, not tools that take image A and spit out image B (the image being maybe a roto mask). They would also have to be able to gauge the quality of the result, to see whether method B or C was good enough, or whether they have to build a new setup F to meet the quality bar. As human artists, we apply our sensibility to the result to see if it worked or didn’t. And we interact with the client, and their famously unstructured and at times illogical feedback. ML tools of the 2nd and 3rd degree would have to manage this whole process flow, not just map image A to image B.
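To make that ‘2nd degree’ loop concrete, here is a hypothetical sketch (every name and number in it is invented; the stubs stand in for real tools and real human judgment):

```python
import random

# Hypothetical meta-loop. Current ML tools implement only one 'method';
# the loop around it, and the quality gauge, is what the artist supplies.

def try_method(method, shot):
    # Stand-in for running one technique (a tracker, a keyer, a roto model).
    return {"method": method, "score": random.random()}

def gauge_quality(result):
    # Stand-in for the artist's (or client's) judgment of the result.
    return result["score"]

def solve_shot(shot, methods, quality_bar=0.9):
    for method in methods:                 # try A, then B, then C...
        result = try_method(method, shot)
        if gauge_quality(result) >= quality_bar:
            return result                  # good enough, stop here
    return None                            # out of ideas: ask on Logik

print(solve_shot("shot_042", ["A", "B", "C", "D"]))
```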
The second front, which I think is missing in the current batch of tools, is that the incremental training doesn’t have a good feedback loop. Much has been written about how AI tools can lead to bias and discrimination (in social/societal use cases), because they understand what they’ve been trained on but have blind spots for minorities and edge cases. If your training set is not representative of the problems you’re trying to solve, the tools will never give satisfactory answers. But as you need large training sets for high accuracy, it becomes increasingly difficult to find and construct good training sets that cover minorities and edge cases. By definition their numbers are limited. That’s another logarithmic problem: you need more of them, and there are fewer available.
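Some made-up numbers show why (both figures below are assumptions, not measurements):

```python
# If an edge case appears in 0.1% of shots and the model needs, say,
# 10,000 examples of it to handle it reliably, you must source 10 million
# shots overall just to cover that single case. Both numbers are invented.
edge_case_rate = 0.001
examples_needed = 10_000

shots_to_collect = examples_needed / edge_case_rate
print(f"{shots_to_collect:,.0f} shots to cover one edge case")  # 10,000,000
```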
By nature ML tools struggle at the edges of the spectrum. Flame artists, on the other hand, make their money solving the edge cases of post finishing, not the everyday simple stuff.
We all know that if something doesn’t work, you have to focus on the broken part, not the part that’s good. We file bugs on Flame for corner cases the devs haven’t thought about. Well, to improve the training of these models, ideally every time a model fails us, we would manually solve that case through other means, and then feed the original and the solve back into the training set, so in the future the model would know what to do.
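That feedback loop could be as simple as this sketch (all names are stand-ins; a real pipeline would obviously be more involved):

```python
# Sketch of the missing feedback loop. The idea: each failure gets
# solved by other means, then folded back into the training set so the
# next model version knows that corner case.

training_set = []   # accumulating (input, ground_truth) pairs

def solve_by_other_means(shot):
    # Stand-in for the artist solving the shot manually (method B...H).
    return f"manual_solve_of_{shot}"

def on_model_failure(shot):
    manual_solve = solve_by_other_means(shot)
    training_set.append((shot, manual_solve))   # failure becomes training data

on_model_failure("shot_with_motion_blur")
on_model_failure("face_at_45_degrees")
print(f"{len(training_set)} new corner cases for the next training run")
```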
But the way the tools get trained today, you have some folks doing the training with limited knowledge of the shots we’re solving. And they certainly don’t get feedback in volume about the edge cases, because there’s no method or incentive structure for that. So they just keep training on more of the ‘wrong’ or ‘generic’ stuff, meaning the models are unlikely to get better at the corner cases we encounter. We just get these black-box models, and then it’s take it or leave it. Which is why Nuke’s CopyCat is fascinating: it’s your training, not some black box.
As long as most of these tools chase consumers and Premiere PrEditors, these models aren’t good for high-end finishing and never will be. But the money and the wow factor are in the consumer tools, so the money will chase those, not our use cases.
Case in point is Flame’s semantic keyer. Eye whites only work reliably if it can see both ears in the shot. Apparently the training data had a lot of front-facing faces, but very few faces turned 45 degrees. There’s no reason in principle that ML tools couldn’t figure out faces with only a single ear visible. Now, it may be that the error becomes unacceptably high because there are too many false positives/negatives in the current setup of the network and processing. But more likely it was a training data set issue.
So when we hear ‘the tools will get better with time’, that would only be true if there were a concerted effort to take the failing edge cases and play ‘whack-a-mole’, eliminating them one by one until most are no longer edge cases. But I don’t really see that happening with the current breed of ML models and the people who develop them.
Until then, we are fascinated, we try these tools out until they fail us, and then they might become a tentative A on our list, but we quickly move to B and onward. As humans we have the capability of doing so. This is why they pay us the big bucks, as the saying goes.
Having said all this, what Adobe showed off is definitely inspiring and a massive achievement. It can make our lives easier on a few simple shots, but it will not fundamentally change our jobs, because in the end humans still have a higher level of training than these point-solution ML tools of the current generation. None of us has to fear for our jobs. The same cannot be said for VO artists and some other trades, whose processes current-generation ML tools can most definitely do as well, faster, and thus cheaper than humans.
You must keep running to stay ahead of the curve. Stay curious. Keep learning.