“Ok, Veo3. Please make a video of a human counting to ten out loud.”
So much for that, LOL.
The people hoping to save money with AI are probably as bad at counting, so it all fits.
hahahaha
it’s an ex-X
Not sure if it got deleted or what. They left out 5 in the second prompt, but it didn’t make much difference.
It’s still online here:
@Sinan - Pretty soon TLDR will become TLDW.
Agreed.
TL;DW: This one confirms what we’ve been talking about. And it also has bloopers showing they created 1000+ videos and edited them.
Doesn’t this process just seem like the worst kind of drudgery? It’s not like being an editor or supervisor working on a professionally shot spot or show. It’s like being an editor or supervisor working on a show that was shot by a bunch of chimpanzees who got their hands on a camera and every once in a while shot Shakespeare. Just weeding out and filtering through junk junk junk ok junk junk ok junk fixable junk ok good junk could be better junk junk junk. Like man, what a job.
It’s Potemkin production. Just offloading labor that is more and more demeaning and tedious onto humans, while the generative AI is hailed as some magically efficient and time-saving process that puts an end to human toil.
Bring back Johnny 5!
Note the robot’s bubbling feet.
Sorry for being a little late in responding…
Yes, all written and spoken human communication is a loop: signifieds flattened to signifiers, transmitted and received as signifiers, then decompressed back to signifieds on the receiving end. Our transmission modality for the precise exchange of low-entropy data has been signifiers shared with other humans (who hopefully know what the signs mean) for maybe 100,000 years or so, but @Ryland, you basically said as much already.
I only point it out again because what we’re arguably looking at is the intersection of semiotics and information theory:
Sender → encoder → channel → decoder → receiver
Both are using signifiers connected to a channel… the difference is that an AI decodes the incoming signifiers/data into latent space, while the human is transmitting signifieds compressed into signifiers.
On both ends, the quality of transmission requires lowering the entropy of the system to as exacting a quantity as possible; otherwise, in the case of AI, the signal isn’t decompressed correctly into latent space and the calico cat you imagined playing the bagpipes is a tabby playing an organ.
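For what it’s worth, the “entropy” here lines up with Shannon’s definition (a sketch on my part, with X standing in for the set of interpretations the receiver could plausibly decode from the prompt):

H(X) = -\sum_i p(x_i) \log_2 p(x_i)

Every extra constraint (a still, a sketch, a line of script) concentrates p(x_i) on fewer interpretations and drives H(X) down.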
Like you pointed out, you can augment with images and sketches and whatever makes sense in order to lower the entropy further. Sounds familiar, right? It’s because that’s exactly what humans do today to decrease the entropy of these kinds of multimodal conversations when dealing with other humans, because humans suffer from a very similar problem to what AI does, although not nearly to the same extent. Humans have a shared collective experience which we use to decompress our signifieds, but just because we have a shared collective experience doesn’t mean we’ve experienced that shared experience in exactly the same way. Two enemies on opposite sides of the same battle will have experienced some aspects of the same war very similarly and other aspects completely opposite from each other. They have a collective understanding of war, but only up to a point. Granularity matters, and unprompted mutual understanding can only go so far in decompressing the signs. At some point, the sender needs to refine the communication to break through the barrier and make the data more precise.
So if we translate this to something more on topic: a director’s treatment with no images doesn’t get across the exact vision, but you still comprehend what the story is about. Stills undeniably convey a huge amount of context regarding the visual aspects. Those cool GIFs that take an eternity to load and play correctly give clarity to aspects pertaining to movement. But without the script, it’s just a fever dream of mood-board masturbation with no exact context, and while context can be interpreted by the individual, the entropy of the system has increased. It doesn’t mean that these other non-linguistic aspects aren’t important (they obviously are), but a narrative requires narration. Without it, entropy goes way, way up and our calico cat becomes a ’68 Mustang convertible from a reference pic shot car2car at magic hour on page 132 of 495.
This is the very long way of saying there is no escaping the low-bandwidth curse of human communication, regardless of whether that communication is to another human or to an AI. Words remain our primary form of conveying data, and the fidelity of the transmission matters more and more when the entity on the other end of the line isn’t human and doesn’t have a shared understanding to aid in decompressing words into meanings and eventually into images which become shots. That’s not to say that sketches aren’t crucial in the end. AI requires in-depth, high-fidelity narratives but also benefits greatly from context, the way actors need good scripts accompanied by prompts from their director about “what their motivation is,” and the way visual reference data helps a DOP understand a director’s vision.
I imagine, in my tiny brain, that French people view ChatGPT differently.
perhaps cat jay pay tay…
it’s not semiotics - it’s full-otics

My wife, a native Swedish speaker, refers to it as the GeeBeeTeeShat.
Chinese room. That’s what AI is at the moment, and will be for quite a while, I think.
Completely dig your post here.
From my computer science background, this is a perfect and deep explanation of the problem, one that is fascinating and warrants further reading.
From a visual artist’s background, it’s the best explanation yet of why asking clients for visual references is always fraught with frustration. They send you an image or an IMDb link. You use it, and during the review they’re puzzled why you didn’t understand their reference.
Yet, what they liked in the image was that nicely saturated yellow, while you were focused on the contrast ratio and highlight treatment of the image.
You may have been better off without the reference, because you wouldn’t have set the expectation that you would understand what they’re giving you.
Happened so many times. Last year a DP tried to impress the director with his sophistication by referencing Caravaggio for our color session, but then was too busy doing other things to attend our reviews. When we did the final watch-back he lost it and wanted to re-do it all, because we had picked up on different elements of his reference rather than what he meant. But schedule and budget weren’t on his side, so he was told to leave.
These days, I always say ‘send me a reference and point out what it is that you like about that reference’. That solves the problem halfway, but not fully.
Of course with AI there’s no opportunity to ask that question during training, nor is there later in the process. The models just barf things out with the utter confidence of science, yet are so wrong.
In some way, maybe calling them hallucinations is wrong. They’re not imagining something we didn’t say; they are just associating poorly in a low-information environment. And human communication, and all the prompting, cannot entirely overcome the information misalignment.
I want Gemini to explain what it sees in the video and send that to Veo to generate a video, and then explain what it sees in that video and send it to Veo again… ad nauseam
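A toy sketch of that loop, purely illustrative: describe_video and generate_video below are hypothetical stand-ins, not real Gemini or Veo calls, and they just simulate the lossy round trip.

import random

def describe_video(video: str) -> str:
    # Hypothetical captioner: lossy, drops a random word on each pass.
    words = video.split()
    if len(words) > 3:
        words.pop(random.randrange(len(words)))
    return " ".join(words)

def generate_video(prompt: str) -> str:
    # Hypothetical generator: invents a detail the prompt never asked for.
    extras = ["extra finger", "bubbling feet", "organ music", "tabby cat"]
    return prompt + ", " + random.choice(extras)

def telephone_loop(seed: str, rounds: int = 5) -> str:
    # Describe -> generate, repeatedly. Each pass compounds what the
    # captioner dropped and what the generator hallucinated in.
    video = seed
    for i in range(rounds):
        prompt = describe_video(video)   # lossy compression into words
        video = generate_video(prompt)   # lossy decompression back into "pixels"
        print(f"round {i + 1}: {video}")
    return video

telephone_loop("calico cat playing the bagpipes at magic hour")

A few rounds in, the seed is unrecognizable.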
AI rot, generation loss, deepfried truth…
