Just played around with StableDiffusion for a bit. The results were some of the better ones I’ve seen from the various text-to-image models I’ve tried. The pace of progress in these AI models is impressive, but I don’t worry about my job security at all.
I think there is something about a computer generating an image from whole cloth that captivates the mind of the average person. It lends itself to a certain level of anthropomorphism, since until very recently the generation of images was wholly the province of humans. This can lead us to believe that AI is “coming for our jobs,” as if these models were just workers in another country our jobs could be outsourced to. They may be coming for our jobs, but not because they can create 512x512 stills based on text prompts.
It seems to me that in this initial phase of AI text-to-image generation, the best use case as it applies to our jobs is matte painting. It’s pretty cool to enter a prompt and receive back an image of that prompt, but Google image search already does the same thing. Almost everything in the world has been photographed a million times from every angle, in every lighting situation, in every season. I’d still much rather go look for a real photograph than mash every photoreal, sharp, detailed, etc. modifier I can think of into a text-to-image model and still receive something I would generously describe as “painterly.”
There is an inherent limitation to any algorithm, and that is the data it was trained on. That makes it hard for me to believe AI will ever be able to compete with humans on a creative level. Whereas a human has the capacity to imagine something that has never existed before, an algorithm is unable to create something that isn’t derived from the data it has been fed.
My favorite prompts to feed into these image generators are usually something along the lines of:
- The most complex thing ever
- The most intricately detailed object in existence
- An object that doesn’t exist
- Something that has never been thought of before
- A photo of nothing
While those prompts are so open-ended that you can imagine a person coming up with nearly anything, these image generators usually return images that are clearly derived from some of the most intricate and detailed objects ever created by people. Think baroque sculpture and architecture, or meticulously crafted gold vases and platters. Or they just completely fall over and generate mush. StableDiffusion seems to like generating meaningless text when it can’t come up with anything else.
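If anyone wants to try these themselves, here’s roughly what running them looks like: a minimal sketch using Hugging Face’s diffusers library. The checkpoint name and settings here are my assumptions, not a recommendation.

```python
# Minimal sketch: feeding the prompts above to Stable Diffusion
# via Hugging Face's diffusers library. Checkpoint and settings
# are assumptions for illustration.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # assumed public checkpoint
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")  # needs a CUDA-capable GPU

prompts = [
    "The most complex thing ever",
    "The most intricately detailed object in existence",
    "An object that doesn't exist",
    "Something that has never been thought of before",
    "A photo of nothing",
]

for i, prompt in enumerate(prompts):
    # Default output is a single 512x512 still, as noted above.
    image = pipe(prompt, num_inference_steps=50).images[0]
    image.save(f"prompt_{i}.png")
```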
From a financial standpoint, though, I don’t imagine many execs are thinking, “Damn, how can we get our concepting costs down?” and the Alpaca example shows that AI is still just a tool for an artist to use. Actual art direction isn’t going anywhere, nor are the clients who want to change this or that meaningless pixel so they can feel useful.
The actual practical uses of AI are getting far less attention, in my mind, because they aren’t as flashy. We’ve already outsourced all of the jobs that will be the first to become automated, for the exact reason they were outsourced in the first place: they are labor-intensive, time-consuming, and relatively simple tasks. Roto, cleanup, and camera tracking are the most financially viable applications of AI software in the short term, and while these tasks are mostly simple, automating them is not. I saw my first demo of AI roto six or seven years ago and it’s still not in widespread production anywhere that I’ve seen. If it’s another six or seven years until all roto is generated through AI software, then what’s the runway for completely replacing far more complicated tasks?
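To make “AI roto” concrete, the demos I’ve seen amount to roughly this: run an off-the-shelf segmentation model per frame and treat its mask as a rough matte. A minimal sketch of that idea follows; the model choice and frame path are my assumptions, and everything that makes real roto hard (temporal consistency, sub-pixel edges, motion blur) is exactly what’s missing.

```python
# Rough sketch of "AI roto": per-frame semantic segmentation used
# as a starting matte. Off-the-shelf model, no temporal consistency
# or edge refinement -- the hard parts of production roto.
import torch
from torchvision.io import read_image
from torchvision.models.segmentation import (
    deeplabv3_resnet50,
    DeepLabV3_ResNet50_Weights,
)

weights = DeepLabV3_ResNet50_Weights.DEFAULT
model = deeplabv3_resnet50(weights=weights).eval()
preprocess = weights.transforms()

frame = read_image("frame_0001.png")   # hypothetical frame path
batch = preprocess(frame).unsqueeze(0)

with torch.no_grad():
    logits = model(batch)["out"]       # (1, num_classes, H, W)

# Class 15 is "person" in the VOC label set this model uses.
matte = logits.softmax(dim=1)[0, 15]   # soft matte in [0, 1]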
Perhaps this is naive, but the technical hurdles seem so great that I think most people here will be retired by the time there is any substantial threat to their jobs. The computing power alone may be a limiting factor for many of the things we do.
I frequently think about Gravity when AI is being discussed. When it was released, I remember seeing a show-and-tell that discussed render times. The compute power was astronomical, and that was when the computer was explicitly being told exactly what to do. Now imagine the computer had to do the creative part in addition to all the rendering. Most of these text-to-image generators take several seconds to a minute to generate one very low-res still image. Now complicate that by asking for video, and then complicate it even more by asking for 4K. The resources just don’t exist to entirely replace our jobs. There is a tendency in these conversations to underestimate how much of the “computing” goes on in our brains, since it’s difficult to quantify in exact terms.
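Back-of-envelope, just to put numbers on it. The per-still time here is an assumption in the range of what I’ve seen, and the linear pixel scaling is a deliberately naive simplification:

```python
# Back-of-envelope scaling from one 512x512 still to a feature in 4K.
# All inputs are assumptions for illustration, not measurements.
seconds_per_still = 30        # assumed: one 512x512 image today
base_pixels = 512 * 512
uhd_pixels = 3840 * 2160      # ~32x the pixels of a 512x512 still

fps = 24
runtime_minutes = 120
frames = fps * runtime_minutes * 60   # 172,800 frames in a 2-hour film

# Naive assumption: generation time scales linearly with pixel count.
seconds_per_uhd_frame = seconds_per_still * uhd_pixels / base_pixels
total_days = seconds_per_uhd_frame * frames / 86400

print(f"{seconds_per_uhd_frame:.0f} s/frame")  # ~950 s per 4K frame
print(f"{total_days:,.0f} days of compute")    # ~1,900 days on one box
```

Roughly five years of single-machine compute for one film, before a single revision, and that’s being generous about the per-frame time.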
As for what kind of AI/ML implementations I’d like to see in Flame, none of it is creative.
- Object identification for roto, automated cleanup, and depth passes, with a Cryptomatte-style selection interface (a rough sketch of the depth-pass idea follows this list)
- Some module akin to OpenFX or Matchbox that you could load bespoke trained models into, like the ML Timewarp
- Camera tracking with depth passes and lens distortion corrections
- Some kind of automated paint in-fill
- Depth- and object-aware defocusing
- Something like Nuke’s CopyCat
- Degrain and regrain nodes
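On the depth-pass items, here’s a rough sketch of the kind of output I mean, using the published MiDaS monocular depth model via torch.hub. The frame path is hypothetical, and the integration into Flame is the part that doesn’t exist:

```python
# Sketch: generating a per-frame depth pass with MiDaS via torch.hub,
# the kind of output a depth-aware defocus or selection tool could use.
import cv2
import numpy as np
import torch

midas = torch.hub.load("intel-isl/MiDaS", "DPT_Large")  # published model
midas.eval()
midas_transforms = torch.hub.load("intel-isl/MiDaS", "transforms")
transform = midas_transforms.dpt_transform

img = cv2.imread("frame_0001.png")              # hypothetical frame path
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)

with torch.no_grad():
    prediction = midas(transform(img))
    # Resize the low-res prediction back to the frame's resolution.
    depth = torch.nn.functional.interpolate(
        prediction.unsqueeze(1),
        size=img.shape[:2],
        mode="bicubic",
        align_corners=False,
    ).squeeze()

np.save("depth_0001.npy", depth.numpy())  # raw depth pass for downstream use
```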