Wow! If you read/watch/listen to only one thing about AI, this ought to be it. This is from The Daily, The New York Times’s podcast.
https://www.nytimes.com/2024/04/16/podcasts/the-daily/ai-data.html
The transcript from NYTimes.com follows:
This transcript was created using speech recognition software. While it has been reviewed by human transcribers, it may contain errors. Please review the episode audio before quoting from this transcript and email transcripts@nytimes.com with any questions.
Michael Barbaro
From “The New York Times,” I’m Michael Barbaro. This is “The Daily.”
[MUSIC PLAYING]
Today, a “Times” investigation shows how, as the country’s biggest technology companies raced to build powerful new artificial intelligence systems, they bent and broke the rules from the start.
My colleague Cade Metz on what he uncovered.
[MUSIC PLAYING]
It’s Tuesday, April 16th.
Cade, when we think about all the artificial intelligence products released over the past couple of years, including, of course, these chatbots we’ve talked a lot about on the show, we so frequently talk about their future capabilities, their influence on society, jobs, our lives. But you recently decided to go back in time to AI’s past, to its origins, to understand the decisions that were made, basically, at the birth of this technology. So why did you decide to do that?
Cade Metz
Because if you’re thinking about the future of these chatbots, that is defined by their past. The thing you have to realize is that these chatbots learn their skills by analyzing enormous amounts of digital data.
Michael Barbaro
Mm-hmm.
Cade Metz
So what my colleagues and I wanted to do with our investigation was really focus on that effort to gather more data. We wanted to look at the type of data these companies were collecting, how they were gathering it, and how they were feeding it into their systems.
Michael Barbaro
And when you all undertake this line of reporting, what do you end up finding?
Cade Metz
We found that three major players in this race, OpenAI, Google, and Meta, locked into this competition to develop better and better artificial intelligence, were willing to do almost anything to get their hands on this data, including ignoring, and in some cases violating, corporate rules and wading into a legal gray area as they gathered it.
Michael Barbaro
Basically, cutting corners.
Cade Metz
Cutting corners left and right.
Michael Barbaro
OK, let’s start with OpenAI, the flashiest player of all.
Cade Metz
The most interesting thing we found is that in late 2021, as OpenAI, the startup in San Francisco that built ChatGPT, was pulling together the fundamental technology that would power that chatbot, they ran out of data, essentially.
Michael Barbaro
Hmm.
Cade Metz
They had used just about all the respectable English-language text on the internet to build this system. And just let that sink in for a bit.
Michael Barbaro
I mean, I’m trying to let that sink in. They basically, like a Pac-Man on an old game, just consumed almost all the English words on the internet, which is kind of unfathomable.
Cade Metz
Wikipedia articles by the thousands, news articles, Reddit threads, digital books by the millions. We’re talking about hundreds of billions, even trillions of words.
Michael Barbaro
Wow.
Cade Metz
So by the end of 2021, OpenAI had no more English-language text that they could feed into these systems. But their ambitions were such that they wanted even more.
[MUSIC PLAYING]
So here, we should remember that if you’re gathering up all the English language text on the internet, a large portion of that is going to be copyrighted.
Michael Barbaro
Right.
Cade Metz
So if you’re one of these companies gathering data at that scale, you are absolutely gathering copyrighted data, as well.
Michael Barbaro
Which suggests that, from the very beginning, a company like OpenAI, with ChatGPT, was starting to bend, even break, the rules.
Cade Metz
Yes. They are determined to build this technology, and thus they are willing to venture into what is a legal gray area.
Michael Barbaro
So given that, what does OpenAI do once it, as you said, runs out of English-language words to mop up and feed into this system?
Cade Metz
So they get together, and they say, all right, what are our other options here? And they say, well, what about all the audio and video on the internet? We could transcribe all the audio and video, turn it into text, and feed that into our systems.
Michael Barbaro
Interesting.
Cade Metz
So a small team at OpenAI, which included its president and co-founder Greg Brockman, built a speech-recognition technology called Whisper, which could transcribe audio files into text with high accuracy.
Michael Barbaro
Hmm.
Cade Metz
And then they gathered up all sorts of audio files from across the internet, including audiobooks, podcasts —
Michael Barbaro
Oy.
Cade Metz
— and most importantly, YouTube videos.
Michael Barbaro
Hmm, of which there’s a seemingly endless supply, right? Fair to say maybe tens of millions of videos.
Cade Metz
According to my reporting, at least one million hours of YouTube videos were scraped off of that video-sharing site and fed into this speech-recognition system in order to produce new text for training OpenAI’s chatbot. And YouTube’s terms of service do not allow a company like OpenAI to do this. YouTube, which is owned by Google, explicitly says you are not allowed to, in internet parlance, scrape videos en masse from across YouTube and use those videos to build a new application.
That is exactly what OpenAI did. According to my reporting, employees at the company knew that it broke YouTube’s terms of service, but they resolved to do it anyway.
Michael Barbaro
So, Cade, this makes me want to understand what’s going on over at Google, which, as we have talked about on the show in the past, is itself thinking about and developing its own artificial intelligence model and product.
Cade Metz
Well, as OpenAI scrapes up all these YouTube videos and starts to use them to build their chatbot, according to my reporting, some employees at Google, at the very least, are aware that this is happening.
Michael Barbaro
They are?
Cade Metz
Yes. Now, when we went to the company about this, a Google spokesman said it did not know that OpenAI was scraping YouTube content, and said the company takes legal action over this kind of thing when there’s a clear reason to do so. But according to my reporting, at least some Google employees turned a blind eye to OpenAI’s activities because Google was also using YouTube content to train its AI.
Michael Barbaro
Wow.
Cade Metz
So if they raise a stink about what OpenAI is doing, they end up shining a spotlight on themselves. And they don’t want to do that.
Michael Barbaro
I guess I want to understand what Google’s relationship is to YouTube. Because of course, Google owns YouTube. So what is it allowed or not allowed to do when it comes to feeding YouTube data into Google’s AI models?
Cade Metz
It’s an important distinction. Because Google owns YouTube, it defines what can be done with that data. And Google argues that it has a right to that data, that its terms of service allow it to use that data. However, because of that copyright issue, because the copyright to those videos belongs to you and me, lawyers I’ve spoken to say people could take Google to court and try to determine whether or not those terms of service really allow Google to do this. There’s another legal gray area here where, although Google argues that it’s OK, others may argue it’s not.
Michael Barbaro
Of course, what makes this all so interesting is, you essentially have one tech company, Google, keeping another tech company, OpenAI’s, dirty little secret about basically stealing from YouTube, because it doesn’t want people to know that it, too, is taking from YouTube. And so these companies are essentially enabling each other as they simultaneously seem to be bending or breaking the rules.
Cade Metz
What this shows is that there is this belief, and it has been there for years within these companies, among their researchers, that they have a right to this data because they’re on a larger mission to build a technology that they believe will transform the world.
Michael Barbaro
Hmm.
Cade Metz
And if you really want to understand this attitude, you can look at our reporting from inside Meta.
Michael Barbaro
And so what does Meta end up doing, according to your reporting?
Cade Metz
Well, like Google and other companies, Meta had to scramble to build artificial intelligence that could compete with OpenAI. Mark Zuckerberg is calling engineers and executives at all hours, pushing them to acquire the data that is needed to improve the chatbot.
Michael Barbaro
Mm-hmm.
Cade Metz
And at one point, my colleagues and I got hold of recordings of these Meta executives and engineers discussing this problem: how they could get their hands on more data, and where they should try to find it. And they explored all sorts of options.
They talked about licensing books, one by one, at $10 a pop and feeding those into the model.
Michael Barbaro
Wow.
Cade Metz
They even discussed acquiring the book publisher Simon & Schuster and feeding its entire library into their AI model. But ultimately, they decided all that was just too cumbersome, too time-consuming. And on the recordings of these meetings, you can hear executives talk about how they were willing to run roughshod over copyright law, ignore the legal concerns, and go ahead and scrape the internet and feed this stuff into their models.
They acknowledged that they might be sued over this. But they talked about how OpenAI had done this before them, and how they, Meta, were just following what they saw as a market precedent.
Michael Barbaro
Interesting. So they go from having conversations like, should we buy a publisher that has tons of copyrighted material, which suggests that they’re very conscious of the legal terrain and what’s right and what’s wrong, to saying, nah, let’s just follow the OpenAI model, that blueprint, and do what we want to do, what we think we have a right to do, which is to kind of just gobble up all this material across the internet.
Cade Metz
It’s a snapshot of that Silicon Valley attitude that we talked about. Because they believe they are building this transformative technology, because they are in this intensely competitive situation where money and power are at stake, they are willing to go there.
Michael Barbaro
But what that means is that there is, at the birth of this technology, a kind of original sin that can’t really be erased.
[MUSIC PLAYING]
Cade Metz
It can’t be erased, and people are beginning to notice. And they are beginning to sue these companies over it. These companies have to have this copyrighted data to build their systems. It is fundamental to their creation. If a lawsuit bars them from using that copyrighted data, that could bring down this technology.
[MUSIC PLAYING]
Michael Barbaro
We’ll be right back.
So, Cade, walk us through these lawsuits that are being filed against these AI companies based on the decisions they made early on to gather data as they did, and the chances that they could result in these companies not being able to get the data they so desperately say they need.
Cade Metz
These suits are coming from a wide range of places. They’re coming from computer programmers who are concerned that their computer programs have been fed into these systems. They’re coming from book authors who have seen their books being used. They’re coming from publishing companies. They’re coming from news corporations like “The New York Times,” which, incidentally, has filed a lawsuit against OpenAI and Microsoft.
Michael Barbaro
Mm-hmm.
Cade Metz
News organizations that are concerned over their news articles being used to build these systems.
Michael Barbaro
And here, I think it’s important to say, as a matter of transparency, Cade, that your reporting is separate from that lawsuit. That lawsuit was filed by the business side of “The New York Times,” by people who are not involved in your reporting or in this “Daily” episode, just to get that out of the way.
Cade Metz
Exactly.
Michael Barbaro
I’m assuming that you have spoken to many lawyers about this, and I wonder if there’s some insight that you can shed on the basic legal terrain. I mean, do the companies seem to have a strong case that they have a right to this information, or do companies like the “Times,” who are suing them, seem to have a pretty strong case that, no, that decision violates their copyrighted materials?
Cade Metz
Like so many legal questions, this is incredibly complicated. It comes down to what’s called fair use, which is a part of copyright law that determines whether companies can use copyrighted data to build new things. And there are many factors that go into this. There are good arguments on the OpenAI side. There are good arguments on “The New York Times” side.
Copyright law says that you can’t take my work, reproduce it, and sell it to someone. That’s not allowed. But what’s called fair use does allow companies and individuals to use copyrighted works in part. They can take snippets of them. They can take copyrighted works and transform them into something new. That is what OpenAI and others are arguing they’re doing.
But there are other things to consider. Does that transformative work compete with the individuals and companies that supplied the data and owned the copyrights?
Michael Barbaro
Interesting.
Cade Metz
And here, the suit between “The New York Times” company and OpenAI is illustrative. If “The New York Times” creates articles that are then used to build a chatbot, does that chatbot end up competing with “The New York Times”? Do people end up going to that chatbot for their information, rather than going to the “Times” website and actually reading the article? That is one of the questions that will end up deciding this case and cases like it.
Michael Barbaro
So what would it mean for these AI companies for some, or even all of these lawsuits to succeed?
Cade Metz
Well, if these tech companies are required to license the copyrighted data that goes into their systems, if they’re required to pay for it, that becomes a problem for these companies. We’re talking about digital data the size of the entire internet.
Michael Barbaro
Mm-hmm.
Cade Metz
Licensing all that copyrighted data is not necessarily feasible. We quote the venture capital firm Andreessen Horowitz in our story where one of their lawyers says that it does not work for these companies to license that data. It’s too expensive. It’s on too large a scale.
Michael Barbaro
Hmm, it would essentially make this technology economically impractical.
Cade Metz
Exactly. So a ruling by a jury or a judge against OpenAI could fundamentally change the way this technology is built. In the extreme case, these companies are no longer allowed to use copyrighted material in building these chatbots. And that means they have to start from scratch. They have to rebuild everything they’ve built. So this is something that not only imperils what they have today; it imperils what they want to build in the future.
Michael Barbaro
And conversely, what happens if the courts rule in favor of these companies and say, you know what, this is fair use? You were fine to have scraped this material, and you can keep borrowing this material into the future, free of charge.
Cade Metz
Well, one significant roadblock drops for these companies. And they can continue to gather up all that extra data, including images and sounds and videos and build increasingly powerful systems. But the thing is, even if they can access as much copyrighted material as they want, these companies may still run into a problem.
Pretty soon they’re going to run out of digital data on the internet.
Michael Barbaro
Hmm.
Cade Metz
That human-created data they rely on is going to dry up. They’re using up this data faster than humans create it. One research organization estimates that by 2026, these companies will run out of viable data on the internet.
Michael Barbaro
Wow. Well, in that case, what would these tech companies do? I mean, where are they going to go if they’ve already scraped YouTube, if they’ve already scraped podcasts, if they’ve already gobbled up the internet and that altogether is not sufficient?
Cade Metz
What many people inside these companies, including Sam Altman, the chief executive of OpenAI, will tell you is that they will turn to what’s called synthetic data.
Michael Barbaro
And what is that?
[MUSIC PLAYING]
Cade Metz
That is data generated by an AI model that is then used to build a better AI model. It’s AI helping to build better AI. That, ultimately, is the vision they have for the future: that they won’t need all this human-generated text. They’ll just have the AI build the text that will feed future versions of AI.
[MUSIC PLAYING]
Michael Barbaro
So they will feed the AI systems the material that the AI systems themselves create. But is that really a workable, solid plan? Is that considered high-quality data? Is that good enough?
Cade Metz
If you do this on a large scale, you quickly run into problems. As we all know, as we’ve discussed on this podcast, these systems make mistakes. They hallucinate. They make stuff up. They show biases that they’ve learned from internet data. And if you start using the data generated by the AI to build new AI, those mistakes start to reinforce themselves.
Michael Barbaro
Right.
Cade Metz
The systems start to get trapped in these cul-de-sacs where they end up not getting better but getting worse.
Michael Barbaro
What you’re really saying is, these AI machines need the unique perfection of the human creative mind.
Cade Metz
Well, as it stands today, that is absolutely the case. But these companies have grand visions for where this will go. And they feel, and they’re already starting to experiment with this, that if you have an AI system that is sufficiently powerful, if you make a copy of it, if you have two of these AI models, one can produce new data, and the other one can judge that data.
It can curate that data as a human would. It can provide the human judgment, so to speak. So as one model produces the data, the other one can judge it, discard the bad data, and keep the good data. And that’s how they ultimately see these systems creating viable synthetic data. But that has not happened yet, and it’s unclear whether it will work.
Michael Barbaro
It feels like the real lesson of your investigation is that if you have to allegedly steal data to feed your AI model and make it economically feasible, then maybe you have a pretty broken model. And if you then need to create fake data as a result, which, as you just said, kind of undermines AI’s goal of mimicking human thinking and language, then maybe you really have a broken model.
And so that makes me wonder if the folks you talk to, the companies that we’re focused on here, ever ask themselves the question, could we do this differently? Could we create an AI model that just needs a lot less data?
Cade Metz
They have thought about other models for decades. The thing to realize here is that that is much easier said than done. We’re talking about creating systems that can mimic the human brain. That is an incredibly ambitious task. And after struggling with that for decades, these companies have finally stumbled onto something that they feel works, that is a path to that incredibly ambitious goal.
And they’re going to continue to push in that direction. Yes, they’re exploring other options, but those other options aren’t working.
Michael Barbaro
Hmm.
Cade Metz
What works is more data and more data and more data. And because they see a path there, they’re going to continue down that path. And if there are roadblocks there, and they think they can knock them down, they’re going to knock them down.
Michael Barbaro
But what if the tech companies never get enough or make enough data to get where they think they want to go, even as they’re knocking down walls along the way? That does seem like a real possibility.
Cade Metz
If these companies can’t get their hands on more data, then these technologies, as they’re built today, stop improving.
[MUSIC PLAYING]
We will see their limitations. We will see how difficult it really is to build a system that can match, let alone surpass the human brain.
[MUSIC PLAYING]
These companies will be forced to look for other options, technically. And we will see the limitations of these grandiose visions that they have for the future of artificial intelligence.
[MUSIC PLAYING]
Michael Barbaro
OK, thank you very much. We appreciate it.
Cade Metz
Glad to be here.
[MUSIC PLAYING]