Wow! If you read/watch/listen to only one thing about AI, this ought to be it. This is from The Daily, The New York Times’s podcast.
https://www.nytimes.com/2024/04/16/podcasts/the-daily/ai-data.html
The transcript from NYTimes.com follows:
This transcript was created using speech recognition software. While it has been reviewed by human transcribers, it may contain errors. Please review the episode audio before quoting from this transcript and email transcripts@nytimes.com with any questions.
Michael Barbaro
From “The New York Times,” I’m Michael Barbaro. This is “The Daily.”
[MUSIC PLAYING]
Today, a “Times” investigation shows how, as the country’s biggest technology companies raced to build powerful new artificial intelligence systems, they bent and broke the rules from the start.
My colleague Cade Metz on what he uncovered.
[MUSIC PLAYING]
It’s Tuesday, April 16th.
Cade, when we think about all the artificial intelligence products released over the past couple of years, including, of course, these chatbots we’ve talked a lot about on the show, we so frequently talk about their future capabilities, their influence on society, jobs, our lives. But you recently decided to go back in time to AI’s past, to its origins, to understand the decisions that were made, basically, at the birth of this technology. So why did you decide to do that?
Cade Metz
Because if you’re thinking about the future of these chatbots, that is defined by their past. The thing you have to realize is that these chatbots learn their skills by analyzing enormous amounts of digital data.
Michael Barbaro
Mm-hmm.
Cade Metz
So what my colleagues and I wanted to do with our investigation was really focus on that effort to gather more data. We wanted to look at the type of data these companies were collecting, how they were gathering it, and how they were feeding it into their systems.
Michael Barbaro
And when you all undertake this line of reporting, what do you end up finding?
Cade Metz
We found that three major players in this race, OpenAI, Google, and Meta, locked into this competition to develop better and better artificial intelligence, were willing to do almost anything to get their hands on this data, including ignoring, and in some cases violating, corporate rules and wading into a legal gray area as they gathered it.
Michael Barbaro
Basically, cutting corners.
Cade Metz
Cutting corners left and right.
Michael Barbaro
OK, let’s start with OpenAI, the flashiest player of all.
Cade Metz
The most interesting thing we found is that in late 2021, as OpenAI, the startup in San Francisco that built ChatGPT, was pulling together the fundamental technology that would power that chatbot, they ran out of data, essentially.
Michael Barbaro
Hmm.
Cade Metz
They had used just about all the respectable English-language text on the internet to build this system. And just let that sink in for a bit.
Michael Barbaro
I mean, I’m trying to let that sink in. They basically, like a Pac-Man on an old game, just consumed almost all the English words on the internet, which is kind of unfathomable.
Cade Metz
Wikipedia articles by the thousands, news articles, Reddit threads, digital books by the millions. We’re talking about hundreds of billions, even trillions of words.
Michael Barbaro
Wow.
Cade Metz
So by the end of 2021, OpenAI had no more English-language text that they could feed into these systems. But their ambitions were such that they wanted even more.
[MUSIC PLAYING]
So here, we should remember that if you’re gathering up all the English language text on the internet, a large portion of that is going to be copyrighted.
Michael Barbaro
Right.
Cade Metz
So if you’re one of these companies gathering data at that scale, you are absolutely gathering copyrighted data, as well.
Michael Barbaro
Which suggests that, from the very beginning, a company like OpenAI, with ChatGPT, was starting to bend, even break, the rules.
Cade Metz
Yes. They are determined to build this technology, and thus they are willing to venture into what is a legal gray area.
Michael Barbaro
So given that, what does OpenAI do once it, as you said, runs out of English-language words to mop up and feed into this system?
Cade Metz
So they get together, and they say, all right, what are our other options here? And they say, well, what about all the audio and video on the internet? We could transcribe all the audio and video, turn it into text, and feed that into our systems.
Michael Barbaro
Interesting.
Cade Metz
So a small team at OpenAI, which included its president and co-founder Greg Brockman, built a speech-recognition technology called Whisper, which could transcribe audio files into text with high accuracy.
Michael Barbaro
Hmm.
Cade Metz
And then they gathered up all sorts of audio files from across the internet, including audiobooks, podcasts —
Michael Barbaro
Oy.
Cade Metz
— and most importantly, YouTube videos.
Michael Barbaro
Hmm, of which there’s a seemingly endless supply, right? Fair to say maybe tens of millions of videos.
Cade Metz
According to my reporting, at least one million hours of YouTube videos were scraped off of that video-sharing site and fed into this speech-recognition system in order to produce new text for training OpenAI’s chatbot. And YouTube’s terms of service do not allow a company like OpenAI to do this. YouTube, which is owned by Google, explicitly says you are not allowed to, in internet parlance, scrape videos en masse from across YouTube and use those videos to build a new application.
That is exactly what OpenAI did. According to my reporting, employees at the company knew that it broke YouTube’s terms of service, but they resolved to do it anyway.
Michael Barbaro
So, Cade, this makes me want to understand what’s going on over at Google, which, as we have talked about on the show in the past, is itself thinking about and developing its own artificial intelligence model and product.
Cade Metz
Well, as OpenAI scrapes up all these YouTube videos and starts to use them to build their chatbot, according to my reporting, some employees at Google, at the very least, are aware that this is happening.
Michael Barbaro
They are?
Cade Metz
Yes. Now, when we went to the company about this, a Google spokesman said it did not know that OpenAI was scraping YouTube content, and said the company takes legal action over this kind of thing when there’s a clear reason to do so. But according to my reporting, at least some Google employees turned a blind eye to OpenAI’s activities because Google was also using YouTube content to train its AI.
Michael Barbaro
Wow.
Cade Metz
So if they raise a stink about what OpenAI is doing, they end up shining a spotlight on themselves. And they don’t want to do that.
Michael Barbaro
I guess I want to understand what Google’s relationship is to YouTube. Because of course, Google owns YouTube. So what is it allowed or not allowed to do when it comes to feeding YouTube data into Google’s AI models?
Cade Metz
It’s an important distinction. Because Google owns YouTube, it defines what can be done with that data. And Google argues that it has a right to that data, that its terms of service allow it to use that data. However, because of that copyright issue, because the copyright to those videos belongs to you and me, lawyers I’ve spoken to say people could take Google to court and try to determine whether or not those terms of service really allow Google to do this. There’s another legal gray area here where, although Google argues that it’s OK, others may argue it’s not.
Michael Barbaro
Of course, what makes this all so interesting is, you essentially have one tech company, Google, keeping another tech company, OpenAI’s, dirty little secret about basically stealing from YouTube, because it doesn’t want people to know that it, too, is taking from YouTube. And so these companies are essentially enabling each other as they simultaneously seem to be bending or breaking the rules.
Cade Metz
What this shows is that there is this belief, and it has been there for years within these companies, among their researchers, that they have a right to this data because they’re on a larger mission to build a technology that they believe will transform the world.
Michael Barbaro
Hmm.
Cade Metz
And if you really want to understand this attitude, you can look at our reporting from inside Meta.
Michael Barbaro
And so what does Meta end up doing, according to your reporting?
Cade Metz
Well, like Google and other companies, Meta had to scramble to build artificial intelligence that could compete with OpenAI. Mark Zuckerberg is calling engineers and executives at all hours, pushing them to acquire the data that is needed to improve the chatbot.
Michael Barbaro
Mm-hmm.
Cade Metz
And at one point, my colleagues and I got hold of recordings of these Meta executives and engineers discussing this problem: how they could get their hands on more data, and where they should try to find it. And they explored all sorts of options.
They talked about licensing books, one by one, at $10 a pop and feeding those into the model.
Michael Barbaro
Wow.
Cade Metz
They even discussed acquiring the book publisher Simon & Schuster and feeding its entire library into their AI model. But ultimately, they decided all that was just too cumbersome, too time-consuming. And on the recordings of these meetings, you can hear executives talk about how they were willing to run roughshod over copyright law, ignore the legal concerns, and go ahead and scrape the internet and feed this stuff into their models.
They acknowledged that they might be sued over this. But they talked about how OpenAI had done this before them, and how they, Meta, were just following what they saw as a market precedent.
Michael Barbaro
Interesting. So they go from having conversations like, should we buy a publisher that has tons of copyrighted material, which suggests that they’re very conscious of the legal terrain and what’s right and what’s wrong, to saying, nah, let’s just follow the OpenAI model, that blueprint, and do what we want to do, what we think we have a right to do, which is to kind of just gobble up all this material across the internet.
Cade Metz
It’s a snapshot of that Silicon Valley attitude that we talked about. Because they believe they are building this transformative technology, because they are in this intensely competitive situation where money and power are at stake, they are willing to go there.
Michael Barbaro
But what that means is that there is, at the birth of this technology, a kind of original sin that can’t really be erased.
[MUSIC PLAYING]
Cade Metz
It can’t be erased, and people are beginning to notice. And they are beginning to sue these companies over it. These companies have to have this copyrighted data to build their systems. It is fundamental to their creation. If a lawsuit bars them from using that copyrighted data, that could bring down this technology.
[MUSIC PLAYING]
Michael Barbaro
We’ll be right back.
So, Cade, walk us through these lawsuits that are being filed against these AI companies based on the decisions they made early on to gather data as they did, and the chances that they could result in these companies not being able to get the data they so desperately say they need.
Cade Metz
These suits are coming from a wide range of places. They’re coming from computer programmers who are concerned that their computer programs have been fed into these systems. They’re coming from book authors who have seen their books being used. They’re coming from publishing companies. They’re coming from news corporations like “The New York Times,” which, incidentally, has filed a lawsuit against OpenAI and Microsoft.
Michael Barbaro
Mm-hmm.
Cade Metz
News organizations that are concerned over their news articles being used to build these systems.
Michael Barbaro
And here, I think it’s important to say, as a matter of transparency, Cade, that your reporting is separate from that lawsuit. That lawsuit was filed by the business side of “The New York Times,” by people who are not involved in your reporting or in this “Daily” episode, just to get that out of the way.
Cade Metz
Exactly.
Michael Barbaro
I’m assuming that you have spoken to many lawyers about this, and I wonder if there’s some insight that you can shed on the basic legal terrain. I mean, do the companies seem to have a strong case that they have a right to this information, or do companies like the “Times,” who are suing them, seem to have a pretty strong case that, no, that decision violates their copyrighted materials?
Cade Metz
Like so many legal questions, this is incredibly complicated. It comes down to what’s called fair use, which is a part of copyright law that determines whether companies can use copyrighted data to build new things. And there are many factors that go into this. There are good arguments on the OpenAI side. There are good arguments on “The New York Times” side.
Copyright law says that you can’t take my work, reproduce it, and sell it to someone. That’s not allowed. But what’s called fair use does allow companies and individuals to use copyrighted works in part. They can take snippets of them. They can take copyrighted works and transform them into something new. That is what OpenAI and others are arguing they’re doing.
But there are other things to consider. Does that transformative work compete with the individuals and companies that supplied the data and owned the copyrights?
Michael Barbaro
Interesting.
Cade Metz
And here, the suit between “The New York Times” company and OpenAI is illustrative. If “The New York Times” creates articles that are then used to build a chatbot, does that chatbot end up competing with “The New York Times”? Do people end up going to that chatbot for their information, rather than going to the “Times” website and actually reading the article? That is one of the questions that will end up deciding this case and cases like it.
Michael Barbaro
So what would it mean for these AI companies for some, or even all of these lawsuits to succeed?
Cade Metz
Well, if these tech companies are required to license the copyrighted data that goes into their systems, if they’re required to pay for it, that becomes a problem for these companies. We’re talking about digital data the size of the entire internet.
Michael Barbaro
Mm-hmm.
Cade Metz
Licensing all that copyrighted data is not necessarily feasible. We quote the venture capital firm Andreessen Horowitz in our story where one of their lawyers says that it does not work for these companies to license that data. It’s too expensive. It’s on too large a scale.
Michael Barbaro
Hmm, it would essentially make this technology economically impractical.
Cade Metz
Exactly. So a ruling by a jury or a judge against OpenAI could fundamentally change the way this technology is built. In the extreme case, these companies are no longer allowed to use copyrighted material in building these chatbots. And that means they have to start from scratch. They have to rebuild everything they’ve built. So this is something that not only imperils what they have today; it imperils what they want to build in the future.
Michael Barbaro
And conversely, what happens if the courts rule in favor of these companies and say, you know what, this is fair use? You were fine to have scraped this material, and you can keep borrowing this material into the future, free of charge.
Cade Metz
Well, one significant roadblock drops for these companies. And they can continue to gather up all that extra data, including images and sounds and videos and build increasingly powerful systems. But the thing is, even if they can access as much copyrighted material as they want, these companies may still run into a problem.
Pretty soon they’re going to run out of digital data on the internet.
Michael Barbaro
Hmm.
Cade Metz
That human-created data they rely on is going to dry up. They’re using up this data faster than humans create it. One research organization estimates that by 2026, these companies will run out of viable data on the internet.
Michael Barbaro
Wow. Well, in that case, what would these tech companies do? I mean, where are they going to go if they’ve already scraped YouTube, if they’ve already scraped podcasts, if they’ve already gobbled up the internet and that altogether is not sufficient?
Cade Metz
What many people inside these companies, including Sam Altman, the chief executive of OpenAI, will tell you is that they will turn to what’s called synthetic data.
Michael Barbaro
And what is that?
[MUSIC PLAYING]
Cade Metz
That is data generated by an AI model that is then used to build a better AI model. It’s AI helping to build better AI. That, ultimately, is the vision they have for the future: that they won’t need all this human-generated text. They’ll just have the AI build the text that will feed future versions of AI.
[MUSIC PLAYING]
Michael Barbaro
So they will feed the AI systems the material that the AI systems themselves create. But is that really a workable, solid plan? Is that considered high-quality data? Is that good enough?
Cade Metz
If you do this on a large scale, you quickly run into problems. As we all know, as we’ve discussed on this podcast, these systems make mistakes. They hallucinate. They make stuff up. They show biases that they’ve learned from internet data. And if you start using the data generated by the AI to build new AI, those mistakes start to reinforce themselves.
Michael Barbaro
Right.
Cade Metz
The systems start to get trapped in these cul-de-sacs where they end up not getting better but getting worse.
Michael Barbaro
What you’re really saying is, these AI machines need the unique perfection of the human creative mind.
Cade Metz
Well, as it stands today, that is absolutely the case. But these companies have grand visions for where this will go. And they feel, and they’re already starting to experiment with this, that if you have an AI system that is sufficiently powerful, if you make a copy of it, if you have two of these AI models, one can produce new data, and the other one can judge that data.
It can curate that data as a human would. It can provide the human judgment, so to speak. So as one model produces the data, the other one can judge it, discard the bad data, and keep the good data. And that’s how they ultimately see these systems creating viable synthetic data. But that has not happened yet, and it’s unclear whether it will work.
Michael Barbaro
It feels like the real lesson of your investigation is that if you have to allegedly steal data to feed your AI model and make it economically feasible, then maybe you have a pretty broken model. And if you then need to create fake data as a result, which, as you just said, kind of undermines AI’s goal of mimicking human thinking and language, then maybe you really have a broken model.
And so that makes me wonder if the folks you talk to, the companies that we’re focused on here, ever ask themselves the question, could we do this differently? Could we create an AI model that just needs a lot less data?
Cade Metz
They have thought about other models for decades. The thing to realize here is that that is much easier said than done. We’re talking about creating systems that can mimic the human brain. That is an incredibly ambitious task. And after struggling with that for decades, these companies have finally stumbled onto something that they feel works, that is a path to that incredibly ambitious goal.
And they’re going to continue to push in that direction. Yes, they’re exploring other options, but those other options aren’t working.
Michael Barbaro
Hmm.
Cade Metz
What works is more data and more data and more data. And because they see a path there, they’re going to continue down that path. And if there are roadblocks there, and they think they can knock them down, they’re going to knock them down.
Michael Barbaro
But what if the tech companies never get enough or make enough data to get where they think they want to go, even as they’re knocking down walls along the way? That does seem like a real possibility.
Cade Metz
If these companies can’t get their hands on more data, then these technologies, as they’re built today, stop improving.
[MUSIC PLAYING]
We will see their limitations. We will see how difficult it really is to build a system that can match, let alone surpass the human brain.
[MUSIC PLAYING]
These companies will be forced to look for other options, technically. And we will see the limitations of these grandiose visions that they have for the future of artificial intelligence.
[MUSIC PLAYING]
Michael Barbaro
OK, thank you very much. We appreciate it.
Cade Metz
Glad to be here.
[MUSIC PLAYING]