Hitting the AI wall

by CBC

About This Episode

54:08 minutes

Canadian English

Copyright © CBC 2024

Speaker 8 (0s - 33.7s)

I'm Dr. Brian Goldman, host of the CBC podcast, The Dose. Each week, we answer health questions in a smart and sometimes counterintuitive way you won't hear anywhere else. Like, what's the least amount of exercise I can do to get the benefits? Which psychedelics can improve my mental health? And how can I check for cancer if I don't have a family doctor? Top experts help me bring you what you need to know in plain language in about 20 minutes. Find The Dose on the CBC Listen app or wherever you get your podcasts. This is a CBC podcast.

Speaker 7 (37.84s - 40.08s)

Hi, I'm Nora Young. This is Spark.

Speaker 9 (42.88s - 111.64s)

Over the years, we've talked a lot about the data-driven turn in AI, and how a deep learning approach has given us everything from image recognition to ChatGPT. But what about the ongoing ethical questions about the kinds of data machines are learning on? And beyond that, what if we're starting to run out of data? This time, tracking the data limits of AI. Ever since ChatGPT took off, Google, Meta and OpenAI have been in a race to build ever more powerful generative AI systems, systems that rely on enormous amounts of data to train them, especially the kind of human-created, high-quality data that large language models like ChatGPT need to produce impressive results. But now there's concern that these companies are running out of data to train their new large language models. That high-quality, human-produced information is finite, and that the internet isn't the endless source of data they once thought it was.

Speaker 3 (112.34s - 122.48s)

I think that there's a real reason to think that we've maybe reached a period of diminishing returns. So a year ago, it looked like we were maybe on an exponential. Things were rising really

Speaker 9 (122.48s - 136.2s)

fast. This is Gary Marcus. He's a cognitive scientist and leading voice in artificial intelligence. He's the author of Rebooting AI: Building Artificial Intelligence We Can Trust, and the forthcoming book, Taming Silicon Valley: How We Can Ensure

Speaker 3 (136.2s - 151.86s)

That AI Works for Us. Well, I think of large language models as being like bulls in a china shop. They're wild, reckless beasts that do amazing things, but we don't really know how to control them. Back in 2022, Gary warned that we were nearing this deep learning data

Speaker 9 (151.86s - 158.5s)

wall. And he's also written a lot about the limits of large language models. They're not very good

Speaker 3 (158.5s - 174.2s)

at reasoning. They're not very good at planning. They hallucinate, or confabulate might be a better word, frequently. And there's also an issue that they're very greedy about data. And we're running up, I think, against the fact that people have already used essentially every bit of data they can get their hands on.

Speaker 9 (175.24s - 195.96s)

A recent piece in the New York Times reported that a team at OpenAI, which included President Greg Brockman, had actually collected and transcribed over a million hours of YouTube videos to train their GPT-4. Last year, Meta also reportedly discussed acquiring Simon & Schuster to gain access to the publishing house's long-form works.

Speaker 3 (196.68s - 233.44s)

I mean, there's almost a desperation about trying to get more data, and there's not that much more good data. You can always make up bad data. You can have ChatGPT, which hallucinates or confabulates, make up data, but some of that data is not going to be any good. So there's actually a concern about kind of polluting the internet with bad information. If you plotted things on a graph, on your favorite benchmark, how well are we doing? None of them are perfect. But if you took whatever your favorite one is and looked at, like, the difference between 2020 and 2021, you'd see a huge difference. And a huge difference between 2022 and 2023, and you'd say, hey, we're in this period of exponential returns.

Speaker 9 (234.36s - 266.52s)

But that growth hasn't really been sustained. Gary says that GPT-4, which came out in March 2023, was a huge and impressive leap. Since then, there have been several competing models with huge financial investment, time investment, and massive amounts of data. But they're not really any better. While generative AI may have reached a point of diminishing returns, Gary says that doesn't mean AI itself is going to be indefinitely stuck. But it does mean we'll need to come up with new approaches to how we train these systems.

Speaker 3 (267.96s - 366.26s)

My view is this has been a productive path, but also a blind alley in a certain way. The whole notion of these systems is that you statistically predict what people would say in certain circumstances based on experience, but these systems have always been poor at outliers, at cases that are different from what they've been trained on before. We saw this whole movie before with driverless cars, where I and a couple of other people pointed out in 2016 that you have outliers with driverless cars, unfamiliar circumstances, and that the kinds of techniques we know how to build in AI now are just not that good at those. And so we said, you know, driverless cars might not be as imminent as you thought. And lots of people got excited. Investors put in $100 billion. But at the end of the day, there are still lots of unpredictable circumstances, weird placements of traffic cones or people with hand-lettered signs, that the driverless cars still don't do very well with. And I think we're seeing the same thing with large language models. If you ask a question a lot of people have asked before, you're probably all set. But if it's subtly different from a question that's been asked before, they might miss that subtlety. And it's not clear that the generative AI systems are ever going to be able to deal with the unfamiliar in an effective and systematic way. That doesn't mean no approach to AI will ever get there. So I think we're in this blind alley where it's all statistical approximation. And we need systems that are in fact based on facts and reasoning. Neural networks that are popular right now are basically, like, good at something that's a little bit like intuition. But they're bad at the deliberate stuff. They really can't reason reliably. They can't plan reliably. We need some other approach to do that. So just to return to this idea of a sort of limitation on

Speaker 9 (366.26s - 375.42s)

the training data, I know that some companies are experimenting with the idea of synthetic data. So can you first explain what synthetic data is? Sure, you make stuff up. So a great example of

Speaker 3 (375.42s - 452.44s)

this is, I mean, really, truly, I didn't mean to ridicule the idea. I mean, it's actually a good idea as far as it can take you, but it doesn't take you far enough sometimes. So a classic example, I would say, is in driverless cars. Around 2016 or so, people started realizing they didn't have enough data from actual cars, and they started making up data in video games like Grand Theft Auto, and sometimes their own version of that. So you would have, you know, a simulated car in some, you know, weird circumstance and try to get data from that in order to feed the system. There's a whole company that's, I think, Canadian-based that's trying to do that. And there are probably multiple companies. They're trying to do this in various ways. And I would say it's helped, but I would say it hasn't helped enough. And it's partly because you don't know which data to simulate. In the real world, there are many, many instances where nobody anticipates the data that you might need. So if you can anticipate exactly what people are going to need, you could do that. It would be a really stupid use of a large language model to make it do arithmetic because they're just not very good at it. But you could say, well, they're not very good at it, but if I give them more data, they'll be better. And so you could synthesize all the math data that you want in principle and you could improve it to some extent. But for example, if you're dealing with irrational numbers, there's just never going to be enough synthetic data, and you're not really going to solve that problem that way.

Speaker 9 (453.48s - 457.82s)

Synthetic data has been compared to the computer science version of inbreeding. What do you make of

Speaker 3 (457.82s - 493.1s)

that analogy? I think there's something even more like inbreeding, which is what Ernie Davis and I once called the echo chamber effect, which is having the models train on their own output or having, you know, Google train on OpenAI's output. So it is a kind of inbreeding that's going on, where these models are making synthetic data and then training on that. And so errors get in there. Like, a crazy one was somebody asked one of these systems, I might get the details wrong, but I think asked OpenAI how many African countries begin with the letter K, and it said none. And then, you know, sorry about that, Kenya. Yeah.

Speaker 0 (493.28s - 561.6s)

And then Google trained on OpenAI's output. So that's a kind of inbreeding where the one system's training on the other, and the whole quality of the information ecosphere is going down, because then other people ask and that error percolates. Again, these are kind of like contrived test examples. We call them red teaming. But they're so easy to generate that we're sure that they're happening in the real world, which parenthetically points to something else, which is transparency. We don't actually know how these systems get used in the real world because the companies don't want to share it. And governments should actually be demanding logs. Like, for example, do people use these systems to make decisions about jobs, loans, prison sentences? There was just a study that showed, in carefully controlled circumstances, if you speak to them in African American English, you get a different set of answers than if you speak to them in standard English. So we know this from the lab. We would like to know whether this happens in the real world. We don't have that transparency right now. So the examples I give you are a little contrived, but they show in principle, you know, this kind of inbreeding thing that we call the echo chamber effect and so forth. So we know from kind of doing science as best we can on the limited data that's available, that there are all these serious problems and that we don't know how far they go in

Speaker 9 (561.6s - 601.72s)

the actual world. Just to throw out one case where we do know in the actual world, there was a piece in the New York Times today showing that in the case of child porn, there's so much of it being created by generative AI that one of the nonprofits, I guess, that tracks it is overwhelmed now, because suddenly there's just so much out there. So sometimes we have some way of measuring in the real world what's going on and sometimes we don't. Yeah. But this is what I've wondered: even if we're not using sort of specifically synthetic data to train, if we have these systems that are generating content and that's filling the internet, doesn't that mean a lot of the data that gets used to train next generations of models

Speaker 3 (601.72s - 693.66s)

isn't going to be human-created anyway? Well, I mean, what's happening is the companies are stealing from each other. And so the stuff that they're stealing is no longer pure. I mean, we always had problems with people generating misinformation for political reasons and so forth. But the situation has gotten worse because there is this mad craze for more data. So one of the ways in which people get data now is they use each other's models. And the terms of service tell them not to do that, but they've all violated each other's terms of service. So, you know, YouTube doesn't say that OpenAI can use their data, but apparently GPT-4, maybe Sora, were trained on it. So you have this kind of mad mess of recycling each other's data, rather than what you really want, which is, like, authentic human-created data from, like, the New York Times, ideally licensed, you know, where some human writer has written an article, some fact-checking team has verified it, or you want, you know, the Britannica, where there was hard work, or Wikipedia. They are taking Wikipedia, but they're taking all this other garbage, too. And I mean, there is this old saying in computer science. Like, somebody should remember this. Garbage in, garbage out, right? And the proportion of garbage is going up. You are listening to Spark.

Speaker 10 (694.36s - 706.14s)

Everything is a sort of a fun house. Nothing is as it ordinarily is, and all possibilities are open to exploration. This is Spark.

Speaker 11 (706.34s - 707.1s)

From CBC.

Speaker 10 (711.14s - 718.52s)

I'm Nora Young, and today on Spark,

Speaker 9 (718.6s - 740.94s)

we're talking about the limitations of our current approach to data-intensive AI and the ways AI giants are trying to get around the data wall. Right now, my guest is Gary Marcus, a cognitive scientist and founder of Robust.AI and Geometric Intelligence. He says there's both an underlying technical problem and a business problem when it comes to all the competition and hype around AI right now.

Speaker 3 (741.76s - 895.84s)

The technical problem is the kind of AI that we know how to build now, which I think will look laughable 30 years from now, like old, you know, flip phones look a little bit laughable to us now. It just is very greedy in terms of how much data it uses. And I pointed this out in 2018. I think people ignored me, but that's now coming home to roost. It is changing the moral fiber of these companies, and it's maybe, you know, leading to the diminishing returns, and so may undermine the whole project. So on the technical side, these systems just aren't as efficient with data as human children. You know, I have a nine- and an 11-year-old. Show them something once, and they understand it. They can put it to use. You know, you show them the rules of a new game and they get it. These systems need a lot of data for most of what they do. And I don't think that's anywhere near the limit of what we could do with AI. It's just the limit of what we know how to do with AI today. Just like, you know, we didn't know how to build efficient gasoline engines or electric motors once upon a time. And we learned to make things more efficiently, sometimes by changing the entire structure. In this case, I think the entire algorithm is just not the right way to do things efficiently. It's just built as a way of mimicking things, not as a way of deeply comprehending things. And the reason my kids are so much more efficient is they build models of the world and how it works, causal models of, you know, what supports their weight or why this thing works this way in this game. And these systems just aren't really doing that. So there's a technical limitation that then drives a business thing. So the business thing's complicated. It starts with the fact that people think there's a lot of money to be made, which may not actually be true. We might want to talk about that. But there is a widespread belief, that many people are acting on, that there's a ton of money to be made. And so people are, you know, rushing. They want to be first or more prominent. They want to be Coca-Cola rather than Pepsi. And so that's driving things. And then the fact that there's no known method for doing better besides getting more data has led to this mad dash for data, which has led to, you know, a lot of copyright infringement, to companies doing a lot of really shady things. And so a bunch of these companies actually started out wanting to do AI ethically and responsibly. And now they're kind of like screwing artists and writers left, right and center. They've kind of lost their moral compass. And a lot of the loss of that moral compass has really been driven around the mad dash for data. Like, they've kind of forgotten where they came from and what they're supposed to do. Like, I have lost my faith in a number of companies over the last year and a half. And a lot of it is the things that they have done to try to get ahead in this race.

Speaker 9 (896.5s - 902.92s)

So what would it take for generative AI to make real progress from where we are today if there's a diminishing return?

Speaker 3 (903.16s - 984.96s)

My view is that generative AI is not, to paraphrase Star Wars, the droids we're looking for, that generative AI is almost like a mirage. I mean, you can use it for some things, but a lot of things that people wanted to use it for are not reliable. And I think AI is much harder than a lot of people think. Like, I don't think it's an impossible problem. You know, our brains are essentially computers. I know a lot of people get mad, but I think that's correct. But our brains, you know, do a lot of amazing things. They also make mistakes. They could be improved upon. But our brains are capable of approaching new problems adaptively and flexibly. That's what I think the center of intelligence is. This particular algorithm just isn't. It's popular, but I think it's on the wrong track. I think when we look 20 years from now, look back at 2024, we're going to say, well, you know, in that era, people figured out one thing, which is how amazing AI could be, how it could spectacularly transform our lives, but they didn't really know how to do it. In fact, they spent too much time on that one thing and kind of stifled research into anything else. They put in, you know, billions and billions of dollars. And this other thing that got developed in 2030 or whatever it is, you know, I wish they could have developed it sooner, because if we had this technology in, you know, 2025, instead of waiting until 2035, a lot of lives could have been saved, because it was so good at, you know, solving medicine and so forth. But people were obsessed with the wrong tool.

Speaker 9 (989.62s - 1006.56s)

They didn't recognize it was the wrong tool. You've argued for something more like a hybrid approach. Do you think that that's the path forward, where we're using generative AI for the things that generative AI is good at, and we're using things that have more of a semantic understanding of the world around them, together in the same system? Or that we triage problems and separate, you know, this is a generative AI problem, this is not?

Speaker 3 (1006.68s - 1180.68s)

I think we need to do a lot of that. I wrote in 2018 about deep learning, which is, you know, what generative AI is a form of. I said it's one tool among many. We shouldn't throw it away, but, you know, we have to understand a large complement of tools. It's like if somebody was building a house and they discovered power screwdrivers, and they were like, these are amazing. But that doesn't mean you want to forget that you have hammers and chisels, and you might need to build a custom tool for this one thing that you do a lot. I mean, that's kind of what's happening right now. It's like the best power screwdriver ever invented. It really is amazing. I mean, I'm often criticizing it, but it's amazing. There's no question about it, it's amazing. But the question is, is it the right tool for the job, and which jobs is it the right tool for? And ultimately, if you want a general intelligence that can be like the Star Trek computer, that's reliable, you can trust it with whatever kind of problem you want to pose, you're going to need something that has a broader array of tools. And I love that you use the word semantic. It's not common in these kinds of conversations. But it's right. The semantics, the comprehension, the meaning in generative AI is very limited. And classical AI, although it's limited in other ways, symbolic AI, is better at representing semantics, the meanings of things, reasoning about those relationships. And we're certainly going to need elements of both. I don't think that's enough. I wrote an article called The Next Decade in AI, which came out just before the pandemic. And the argument I made there was that we need this thing, the hybrids, called neurosymbolic AI, but that that's itself only part of the solution. So we also need a lot of knowledge. We need better reasoning techniques. We need our systems to build models of the world in the way that you do when you go to a movie and you learn about each character and their motivations and what their setting is; you build an internal model of what's going on there. Current systems don't really do that in a careful and robust way. So you can't kind of ask them what's going on. They can't work on that. So I said, we need to tackle four different problems. One of them is this hybrid that you're talking about and that I devote a lot of my career to. And even on the hybrid, I would say, you know, we kind of sort of know what that might look like, but not exactly. There's a lot of best practices we have to learn. And we're kind of mostly ignoring that right now. There was a very nice paper by DeepMind last year, a neurosymbolic approach to math problems called AlphaGeometry, that could solve some International Math Olympiad problems. So there's a bit of work in that area, but it's underfunded compared to the rest. So we've probably as a field put in close to $100 billion, certainly well over 50, on the neural network side. And the rest of it's getting like 2% of that or something like that. You could think, like, an investor wants to diversify their holdings. You want some stocks. You want some bonds. You want some real estate. Right now there's an intellectual monoculture in AI where only one idea is being pursued hard, and that idea is generative AI. We need some other ideas to flourish before we get to, I think, AI that we can trust and that really is transformative in the way that we're all hoping.

Speaker 9 (1181.04s - 1187.62s)

So do you think that, given that, hitting a kind of data wall might be a good thing, at least temporarily? Yeah.

Speaker 3 (1187.74s - 1250.02s)

I mean, there is a sense in which I think that's right. You know, right now people are resisting. They're saying, well, give it another year, another two years. Some people may, you know, kind of stick to the wrong horse for a really long time. We'll see. But I think hitting a wall might actually turn out to be good, in just the way that you're saying, because it might force us to a more reliable, more trustworthy substrate for AI. There's a saying or a phrase in the field that the current stuff that we have, they're called foundation models, but they're a terrible foundation, right? The point of a foundation in a house is you build the rest on it and you know that it's going to be stable. And what we have now is an unstable foundation. So if what it takes to get people to widely acknowledge the instability of that foundation is a period of slower progress, so that we kind of finally say, hey, we're not quite doing this right, what else can we do? Then, yeah, a short-term slowdown might lead to a longer-term acceleration and a longer-term, more stable way of doing AI. A lot of people think that I hate AI, and it's not true. It's not at all true.

Speaker 9 (1250.02s - 1316.92s)

You hate it. I really don't, right? I mean, I built an AI company and sold it. I've been working on it since I was eight years old. Like, I actually love AI. It's been, you know, most of my discretionary time, thinking about AI. I mostly don't even do this for pay. I mostly just want the world to be in the right place. But I really do kind of hate the way that generative AI has been positioned. Like, as a lab curiosity, it's fine. People should look at different approaches. But it is so much sucking the life from everything else, and it is so unreliable, that it's just not a good way to do AI. And AI is, like, instead of like saving lives, it's mostly in the near term going to be used to surveil people. Like, OpenAI wants to suck up all your documents and your calendar entries. And like, it's going to be like the greatest surveillance tool ever made. But that's not why I went into AI. OpenAI CEO Sam Altman said at a conference last year that we were coming to an end of the era where we keep relying on these giant data models, and that we'd make them better in other ways. So do you think that the kinds of limitations in the current approaches to generative AI are acknowledged within the AI community?

Speaker 3 (1317.24s - 1342.72s)

Well, I mean, it's hilarious that he said that, because when I first said that in 2022, he posted on Twitter a meme about my article, Deep Learning Is Hitting a Wall, saying, God, give me the strength, or something like that, of the mediocre deep learning skeptic. So, like, he came after me hard for saying this stuff, but I think he's come around. I think a few people have come around. I think people who have really looked at the problem of what intelligence is almost uniformly recognize how far away we actually are.

Speaker 9 (1343.28s - 1368.02s)

Gary, thanks so much for your insights on this. Sure. My pleasure. Gary Marcus is a cognitive scientist, entrepreneur, and professor emeritus at New York University. His forthcoming book is called Taming Silicon Valley. It's out September 24th, 2024. You are listening to Spark.

Speaker 11 (1368.8s - 1375.8s)

Democratizing culture to me means not just letting us shout into the void of the internet.

Speaker 6 (1376.6s - 1377.48s)

This is Spark.

Speaker 11 (1377.92s - 1391.56s)

With Nora Young on CBC Radio. On last week's show about tech and music, Enongo Lumumba-Kasongo

Speaker 9 (1391.56s - 1422.42s)

talked about technological transformation in the history of hip-hop. Enongo is an associate professor of music at Brown University. We had such an engaging talk, but we didn't have time for it all. So we decided to play more from that conversation, especially because it speaks directly to how data gathered from hip-hop artists' work is used by generative AI and the ethical problems that poses. It also lets us reflect not only on how AI challenges what music is for, but also the importance of lived human experiences.

Speaker 5 (1429.26s - 1430.26s)

Good thing our music prof is also a rapper.

Speaker 9 (1435.22s - 1440.26s)

And I go by the name Sammus when I'm performing.

Speaker 4 (1443.5s - 1443.7s)

I started making beats in high school.

Speaker 5 (1446.64s - 1447.5s)

In part, I wanted to score a video game because I love video games.

Speaker 4 (1451.54s - 1454.94s)

And so my older brother showed me how to make beats on my laptop. And from there, I started making these sort of little songs.

Speaker 5 (1455.28s - 1458.9s)

And then eventually that expanded into me, rapping over those songs.

Speaker 4 (1459.16s - 1472s)

You know, I wasn't formally musically trained. So I felt like, okay, I know how to make beats and I have my voice. What can I do? And so rap became this really awesome mode for me to be able to share things that I was thinking were important.

Speaker 5 (1476s - 1483.08s)

In 2022, Enongo wrote a piece for Public Books where she explored the emergence of high-

Speaker 9 (1483.08s - 1495.6s)

tech blackface and digital blackface. The idea that digital technologies allow non-black people to adopt the personas of black artists online. One of the examples she highlights is the case of FN Meka.

Speaker 4 (1496.62s - 1562.32s)

So FN Meka had this almost like Icarus tale, rise and fall. So a set of kind of creative technologists, or really only one sort of entrepreneur and another creative technologist, I think around 2019, 2020, started developing the idea to create a kind of rap avatar who would take on rap or hip-hop mannerisms, and music, and be sort of the first, quote, unquote, AI rapper. And I say AI rapper in quotes because it was not actually ever made clear how AI was being engaged in this context, but it was clearly important for the developers of this character to place AI in dialogue with the way that this character was being developed. There was a recognition that this signals, at the very least, that there's a kind of innovation happening here that other musicians and record labels will want to sort of invest in. And so this character of FN Meka started putting out music.

Speaker 5 (1566.22s - 1572.58s)

Which we later learned was actually recorded by a black rapper named Kyle the Hooligan.

Speaker 4 (1574.96s - 1579.1s)

He was sort of voicing the character but was not properly compensated.

Speaker 1 (1579.64s - 1581.58s)

And this was the voice of FN Meka.

Speaker 4 (1581.66s - 1625.34s)

And he was sort of developing a presence online on Instagram and on TikTok, kind of performing this prototypical rap persona where, you know, he has lots of cars and lots of jewelry. And questions started to emerge around who is the creative force behind this avatar, right? And I think part of that awareness has been this understanding in the digital age that stepping into black personhood is particularly kind of easy through some of the forms of the digital world. And so there was already a kind of caution and suspicion on the part of listeners and, you know, folks who would be in that space.

Speaker 9 (1631s - 1645.56s)

Despite those suspicions and its ethically dubious origins, FN Meka's popularity continued to grow, with over one billion views on TikTok and millions of followers. And then in 2022, the AI rapper was signed to Capitol Records. The first time an AI-generated musical artist was signed to a major record label.

Speaker 4 (1645.56s - 1684.3s)

And was subsequently dropped within months of being signed, because so many people responded with concerns about what sort of image of a rapper this avatar was conveying. And again, questions about transparency. Who is making decisions about who this AI or avatar rapper is, sort of how he moves through the space and how he's understood? I think there's a lot of healthy suspicion that this was sort of a cash grab that was not invested in the actual communities from which the art form and even the mannerisms were sort of coming from.

Speaker 9 (1684.58s - 1692.68s)

Yeah, yeah. And you've argued that this is part of a long history of black sound. Can you dig into that a little bit for me? Absolutely. So Matthew

Speaker 4 (1692.68s - 1818.26s)

D. Morrison, who's a musicologist, really brilliant thinker, has asked for us to think about the context of how we engage with the work and material of black musical artists in our contemporary moment by thinking back to the formation of the music industry, particularly within the U.S. context. And so he asks us to think about the emergence of blackface minstrelsy, which is this racist theatrical form that emerges in the 1820s and involves the performance caricaturing of enslaved Africans as well as free black folks by white performers who would don black face paint and step into these caricatures of these figures. And it was a way not just to express kind of fear and revulsion around, you know, relationships to black folks in the U.S. It was also a way to transgress and play with some of the sort of gendered and class hierarchies that were emerging at that time as well. And so I think that dialectic is really important to note, because when we think about digital blackface, it's not about sort of just mocking or playing with representations of blackness that are about demeaning black folks, right? In a lot of ways these representations are ways that non-black people can play with transgression or trying new modes of expression without having to sort of deal with the consequences of what that might look like, without doing so in the body of a figure that is commonly understood as transgressive just as a matter of fact. And so there's a kind of play that's happening there that's really harmful, because folks get to step in and out of presentations and performances of black modes of expression and thought without having to deal with how being black shapes one's life outside

Speaker 9 (1818.26s - 1835.26s)

of that context. You know, it seems to me that in the sort of popular conversation around this, there's been a lot of focus on extremely high-profile artists, people like Drake or The Weeknd, you know, whose voices and likenesses are being used. But ultimately, who do you think really stands to lose in all this?

Speaker 4 (1835.76s - 1891s)

I mean, it's interesting because, like you said, the way in which this is sort of unfolding, the people who are at the moment the most vulnerable, when I think about these kind of AI voice filters where folks are able to really sound, you know, like audio deepfakes, to really step into the sound of a Drake or The Weeknd. You know, because they have this kind of cultural cachet built into the timbre of their voice. It enables people to step in and to generate capital and clout because their voice means something. So for an artist who's just starting out, their voice doesn't mean what Drake's voice means. Just the sound of it, right? Just the sound of it is doing something important. And so I think in many ways, artists who are, you know, at that sort of upper echelon, they're really vulnerable because their voice, A, is everywhere.

Speaker 9 (1891.16s - 1893.26s)

Yeah. A lot of training data there.

Speaker 4 (1893.26s - 1895.8s)

So much. There's so much material.

Speaker 9 (1896.34s - 2053.18s)

And B, their voice has a kind of value pop-culturally. I mean, I think about the ways that when an artist features on another artist's track, the excitement about hearing these two voices be in conversation, because this voice is meaningful to us. So it's not as, I think, overtly destructive in the more DIY spaces, or the spaces where an artist hasn't yet developed a voice or a timbre of a voice that's recognizable. But again, I think how that impacts artists who are sort of on the underground is that when we think about the possibilities for how working musicians can build a life, it's very, very difficult at this moment to be a working artist. I think every single rapper friend that I have, or music, you know, just more generally folks who work in music, they have like five hustles. I mean, I myself am a professor and I'm also a rapper. And, you know, I value and appreciate being in academia and having these conversations. And in part, this has been a strategy to be able to build a sustainable art practice, because were I to just be actively pursuing music, I would be subject to the whims of the market. And that's a really, really difficult position to be in as an artist. And as an artist who doesn't want to just make whatever is profitable on the radio. Like, it's a really, really difficult position to be in. And so with the advent of AI in the music space, again, I think about questions of risk and who can afford to absorb the risk of creating new kinds of sounds or trying to make it. My worry is that artists who are just starting out, or who are creeping around the DIY basement space, don't even see a possibility or a way forward, because what the sort of large record labels do impacts what the middle-tier record labels do and who they invest in. And if the sort of Warner Music Groups of the world are reflecting the message that it's not really worth investing in real human artists, and instead maybe what we should do is invest in tools that enable us to take on the personhood of artists, artists who we don't then have to be accountable to in the ways that we have to be accountable to human artists. You know, I can see that impacting the decision making on the part of everyone else sort of in the music industry. So I think I'm worried about the culture around how we view the work of being a musician, that it's devalued in this process.

Speaker 4 (2053.18s - 2060.14s)

And that devaluation actually significantly impacts who sees themselves as being able to

Speaker 9 (2060.14s - 2070.84s)

pursue a life as an artist. Yeah. Well, just from a technical point of view, I mean, what do you make of their ability to replicate sounds from different genres, different forms

Speaker 4 (2070.84s - 2169.66s)

of music? I think that the tools that I've engaged with, there's a range of levels of sophistication. So, for example, if I were to go into ChatGPT and say, write me a rhyme in the style of Sammus, you know, myself, it'll generate this pretty mundane, childish rhyme that has a really, you know, not a particularly innovative rhyme scheme. There's not sort of, like, metrical complexity to it. And the material itself reflects sort of, like, a shadow of who I am as a rapper generally, based on what exists in the world. So a lot of my music deals with metaphors around technology and video games. And so there's some reflection of that being important to me. But it's very unspecific and not particularly compelling. However, with some of these sort of tools that allow folks to, you know, use AI to create a filter for a particular person's voice, so they can rap as themselves and then sort of put this filter on so that it becomes, as we've heard, Drake or The Weeknd, that enables you to step into the kind of flow and real expressive qualities of what makes a rap song a rap song, or what makes a rap interesting. So the level of sophistication there, I think, is troubling, and does sort of, like, on a technical level, I think we're moving into a space where it will become really, really difficult to kind of figure out who's authoring what.

Speaker 0 (2169.74s - 2171.22s)

And actually, it's really interesting.

Speaker 4 (2171.32s - 2233.58s)

We're seeing that happen right now with Drake, who's in a bit of a beef with a number of different artists. And very, very recently, a track was released. And a real discourse online was, is this diss track an AI track? Like, did Drake actually write this diss track? And there's so many implications around that. You know, if Drake says, I didn't write this track, like if it is an AI track, the next thing that he writes will be compared to this other AI track. So as an artist, he's kind of having to interface with this shadow version of himself. But then there's also the misinformation elements of this, where with a diss track, or in the context of a beef, this can have real implications for people's relationships with the other people in the music industry or with their peers. And if it's not clear whether this was generated by some outside force or by the artists themselves, it can start to get really challenging interpersonally. So we already see how it's manifesting in the public sphere.

Speaker 9 (2233.92s - 2251.82s)

Yeah. I mean, historically, people have used songwriting as ways to sort of, you know, document their lives to work through their feelings and their thoughts. Does generative AI for music come into conflict with that history? And the importance of just lived human experience in that type of storytelling?

Speaker 4 (2252.32s - 2393.74s)

Absolutely. And I think that there's a particular way in which the rap context is interesting to study, because within the world of rap, the sort of, like, subjectivity of the rapper is so critical to our understanding and love of or engagement with that person. So, like, the rapper saying, this is me, this is my story. Even if it's not, right? Even if there is embellishment, which of course, for all artists, we're telling stories. So some artists are more committed to kind of telling the story of their life in a way that really reflects sort of the events of it. And other artists have more of a sort of playful relationship with their sense of truth. But within the rap context, there's very much a sort of understanding that what you present is who you are. So much so that the practice of ghostwriting is frowned upon, right? That's just not something you do. And in other songwriting contexts, you know, we know Beyoncé has a team of songwriters. We know that other artists work with songwriters, and what we expect of them or desire of them is that they implement or use their own capacity as a performer to give the song life or infuse their story with it. But with the rap context, there really is an expectation that the rapper does all of that sort of labor of writing and performing and being. So when you bring in these tools of generative AI that really question authorship, it kind of throws the whole hip-hop project into question. Like, what do we think is the most important value in this space? Is it okay to have a person who is a really incredible performer, but their words that they're performing have come from a context that is not of their lived experience? I think in this moment, many sort of rap fans would say that's unacceptable. But I also think a growing number of people who are getting familiar with these tools would argue that that's actually, that's okay. It's okay to sort of play with authorship in new ways. And maybe we don't have to be so beholden to that mode of being. So, yeah, it definitely pulls apart, I think, some of the central tenets of what we think of as being constitutive of, like, rap music. Yeah.

Speaker 9 (2394.16s - 2394.6s)

Fascinating.

Speaker 4 (2394.74s - 2396.44s)

Enongo, thanks so much for your insights on this.

Speaker 9 (2396.82s - 2398.14s)

Thank you so much for having me.

Speaker 4 (2399.18s - 2403.12s)

Enongo Lumumba-Kasongo is assistant professor of music at Brown University,

Speaker 9 (2403.58s - 2406.52s)

chief rap officer at Glow Up Games, and a rapper.

Speaker 4 (2408s - 2409.54s)

Hello, I'm Jess Milton.

Speaker 9 (2409.92s - 2413.86s)

For 15 years, I produced The Vinyl Cafe with the late, great Stuart McLean.

Speaker 1 (2414.36s - 2459.28s)

Every week, more than 2 million people tuned in to hear funny, fictional, feel-good stories about Dave and his family. We're excited to welcome you back to the warm and welcoming world of The Vinyl Cafe with our new podcast, Backstage at the Vinyl Cafe. Each week, we'll share two hilarious stories by Stuart, and for the first time ever, I'll tell you what it was like behind the scenes. Subscribe for free wherever you get your podcasts. I'm Nora Young, and today on Spark, we're talking about some of the limits in how we use data in training AI,

Speaker 9 (2459.68s - 2464.3s)

and how we might think differently about how we create, train, and use these systems.

Speaker 2 (2464.86s - 2466.16s)

Models are what they eat.

Speaker 9 (2466.4s - 2468.78s)

They ultimately regurgitate the data that you show them.

Speaker 2 (2468.86s - 2474.18s)

So if you show them high-quality data, they're going to be high-quality. If you show them low-quality data, they're going to be low-quality.

Speaker 9 (2474.82s - 2486.4s)

This is Ari Morcos. He's the CEO and co-founder of a data selection tool startup called Datology AI, which he formed after a career working at Meta Platforms and Google's DeepMind unit.

Speaker 2 (2487.16s - 2511.4s)

We help companies train better models faster by optimizing the quality of the data that they train on. So at a high level, we can exploit other models to describe the relationships between billions of data points and use those models to identify what data are good, bad, redundant, etc. But ultimately, it's a lot of various algorithms that take into account the relationships between data points to figure this out.
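[Editor's aside: a rough, hypothetical illustration of the kind of embedding-based curation described here, using representations from some existing model to flag near-duplicate examples. This is a toy sketch, not Datology AI's actual algorithm; the random embeddings, similarity threshold, and greedy strategy are all stand-in assumptions.]

```python
# Toy sketch: use embeddings to find redundant (near-duplicate) training examples.
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for "exploiting other models": pretend these rows are embeddings
# produced by some pretrained encoder, one row per training example.
embeddings = rng.normal(size=(1000, 64))
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

# Inject some near-duplicates so there is redundancy to find.
embeddings[500:520] = embeddings[0] + 0.01 * rng.normal(size=(20, 64))
embeddings[500:520] /= np.linalg.norm(embeddings[500:520], axis=1, keepdims=True)

def greedy_dedup(vecs, threshold=0.95):
    """Keep an example only if it is not too similar to anything already kept."""
    kept = []
    for i, v in enumerate(vecs):
        if not kept or np.max(vecs[kept] @ v) < threshold:
            kept.append(i)
    return kept

kept = greedy_dedup(embeddings)
print(f"kept {len(kept)} of {len(embeddings)} examples "
      f"({len(embeddings) - len(kept)} flagged as redundant)")
```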

Speaker 9 (2512.04s - 2525.08s)

In 2022, Ari co-authored a landmark paper called Beyond Neural Scaling Laws, which challenges the widespread notion that more data equals better models. Not all data are created equal.

Speaker 2 (2528.38s - 2584.66s)

Some data teach the model a lot, and some data teach the model a little. The amount of information you learn from a piece of data also depends on how much data you've seen already. So if you've seen a little bit of data, then the next data point is probably going to teach you something new. But if you've seen a ton of data already, then that next data point is probably not going to teach you something new, because it's likely to be similar to something you've seen before. And in many datasets, we observe this distribution where most of the data is focused on a pretty small set of concepts. And then you have this long tail of more esoteric concepts that are really the most informative for the model and teach the model the most. But naively, if you were to just train on all the data or just acquire as much data as possible, those long tail data points that are really informative would be massively underrepresented in the data set. I mean, this comes up commonly in a lot of different use cases. And ultimately, what's important to get models that are really high quality is to identify what are the most informative data points. What's the data that teaches the model the most? And enrich your data sets so that those data points are most prevalent in training.
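[Editor's aside: to make the long-tail point concrete, here is a minimal, hypothetical sketch of enriching rare concepts by clustering embeddings and sampling inversely to cluster size. The clustering method, cluster count, and budget are illustrative assumptions, not the procedure from Beyond Neural Scaling Laws or Datology's product.]

```python
# Toy sketch: upweight the long tail so rare concepts survive data curation.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)

# Toy embeddings: one big, redundant "head" concept plus a sparse long tail.
head = rng.normal(loc=0.0, scale=0.3, size=(900, 16))
tail = rng.normal(loc=5.0, scale=1.0, size=(100, 16))
embeddings = np.vstack([head, tail])

# Discover "concepts" without labels by clustering the embeddings.
clusters = KMeans(n_clusters=20, n_init=10, random_state=0).fit_predict(embeddings)

# Weight each example inversely to its cluster's size, so rare concepts
# are not drowned out by the redundant head of the distribution.
sizes = np.bincount(clusters, minlength=20)
weights = 1.0 / sizes[clusters]
weights /= weights.sum()

budget = 200  # how many examples we can afford to train on
curated = rng.choice(len(embeddings), size=budget, replace=False, p=weights)
print("long-tail examples in the curated set:", int(np.sum(curated >= 900)))
```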

Speaker 9 (2585.08s - 2596.12s)

So what are the practical implications of looking at, you know, the data that tells you not the thousand times the chicken crossed the road, but the one time the chicken didn't cross the road? What is that actually giving you in practical terms?

Speaker 2 (2596.68s - 2669.46s)

Yeah, that's ultimately what teaches the model to be robust and to be able to generalize to lots of different situations. There's another huge practical implication of this, which is that it dramatically slows down training and makes training far more expensive, to get much worse models. Because what happens as a result of this is that most data that a model is looking at doesn't teach it anything at all. But it costs money, it costs compute, to look at that data, and it takes time. And ultimately, we're in a regime now where we have so much data that no model is actually learning everything about the data that's presented to it. We decide to stop training a model because we ran out of money. We have a budget for how much we're willing to spend to train a model, and we run out of that. So by optimizing the quality of the data that goes into a model, what you're effectively doing is making it so that the model learns faster. And if the model learns faster, that provides what we call a compute multiplier, but that leads to what is also called a quality multiplier, because if the model learns faster, then you can get to the same performance much faster, but you can also get to much better performance in the same budget. So this is ultimately critical to getting models that work robustly across lots of situations and which we can train in a cost-effective way.
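[Editor's aside: a back-of-the-envelope sketch of the "compute multiplier" framing. If training cost scales roughly with model size times tokens seen (the common ~6·N·D rule of thumb), a curated dataset one-third the size cuts training FLOPs by about 3x at equal quality. The model and token counts below are made up for illustration.]

```python
# Training cost is roughly proportional to model size times tokens seen
# (the common ~6 * N * D rule of thumb). Numbers are illustrative only.
def train_flops(params, tokens):
    return 6.0 * params * tokens

baseline = train_flops(params=7e9, tokens=2e12)      # full, uncurated corpus
curated = train_flops(params=7e9, tokens=2e12 / 3)   # curated set one-third the size
print(f"compute multiplier from curation: {baseline / curated:.1f}x")
```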

Speaker 9 (2669.82s - 2673.68s)

So how does this thinking inform what you're doing at Datology AI?

Speaker 2 (2674.48s - 2745s)

Yeah. So ultimately, our goal at Datology is to make curating high-quality data easy for everyone. This is a frontier research problem. As you noted, kind of in many ways, my company is based off of this paper that we had in 2022, Beyond Neural Scaling Laws. But there's a ton of nuance and challenge in how you do this. And this is an area where there's been very little published research in general. This is ultimately the secret sauce that divides the best models from the average models. Data quality really is everything. Most of the big frontier model companies are using the same architecture. Ultimately, what differentiates the quality of the model is which data they show it. But of course, they're strongly disincentivized to share with anybody how they do that, because that is a secret sauce. So what that means is, if you wanted to train your own model, you would not have access to this really critical part of the AI infrastructure stack, which is really quite challenging and difficult and has a lot of nuance in how you identify this data at scale automatically. So that's what we do at Datology. We make that easy for everybody by automatically curating massive data sets, up to petabytes, in order to make the data as high quality and informative as possible and make models train much faster and to much better performance.

Speaker 9 (2745.62s - 2760.5s)

But doesn't the entire sort of big data machine learning project rely on kind of probabilistic outcomes of large amounts of, you know, even sort of messy data? Like I understand the importance of the outliers, the long tail, but don't we need to know what mostly happens as well?

Speaker 2 (2760.5s - 2774.54s)

This gets into this notion of redundancy. And redundancy is actually good, to a point. And different concepts have different amounts of complexity, which means that they need different amounts of redundancy. So I'll give you an example. Imagine trying to understand elephants versus dogs.

Speaker 9 (2775.08s - 2779.96s)

Elephants are pretty stereotyped, right? They're all gray. They all have wrinkly skin. They all have big

Speaker 2 (2779.96s - 2793.76s)

floppy ears. There are bigger and smaller elephants, African and Asian, respectively. But ultimately, most elephants are pretty similar to one another. Whereas dogs, you have tons of variation. So the amount of redundancy that I need in order to understand what an elephant is is much

Speaker 0 (2793.76s - 2803.86s)

smaller than the amount of redundancy that I need in order to understand what a dog is. So if I were to use the right amount of redundancy for elephants, for dogs, then I'd end up doing

Speaker 2 (2803.86s - 2849.1s)

very well on elephants, but I would not fully understand dogs in my model. And if I were to do the opposite, I would understand dogs perfectly well, but I would have wasted a ton of compute looking and learning about elephants far beyond where I need to. So the challenge here is that you absolutely need redundancy about the common concepts, but you need the appropriate amount of redundancy for a given complexity. So what we have to do, given a massive data set that's unlabeled, that doesn't have, it doesn't say this is an elephant or this is a dog, it's just, here's a bunch of data. We have to identify automatically what are those concepts, figure out how complicated each of those concepts is. And then based off of that, determine the right amount of data to remove from each of those concepts, in addition to removing the right data there, because obviously even within a concept of elephants, not all elephant data is equally

Speaker 9 (2849.1s - 2869.44s)

informative. Some is going to be better than others. One of the things we've talked about on the show in the past is not only the cost of training these things, but the environmental cost of these very, very data-intensive models like deep learning. Do you think this approach has potential to address just the straight-up energy cost of this approach to computing? Absolutely. And I think that's a big part of our mission as well

Speaker 2 (2869.44s - 2945.76s)

is to help with the compute cost of these models, both on the training side, but also on the inference side. During training, by reducing the amount of data you need to train models on, we can reduce it currently by 2 to 4x, and we're getting better at that every day. So that already means that you can now train a model with 2 to 4x less environmental impact, which is obviously significant. But one of the things that we can also do with higher quality data is train smaller models to the same performance. And in the scheme of things, ultimately, models are actually going to be run in what's called inference, which is when you're actually using a model in deployment or something like that, far more often than they're going to be used in training. And if you deploy a model to inference that's bigger than it needs to be, because it didn't see high quality data, then that's a massively increased environmental and compute cost as well. So better quality data both helps to cut training costs of models, but also helps you to train models that are smaller and better optimized, so that the inference cost at deployment time is also much lower, which is very helpful from a business standpoint, but also clearly has massive environmental impact. You are listening to Spark.

Speaker 0 (2946.1s - 2949.1s)

The idea that we're somehow making proto-humans

Speaker 11 (2949.1s - 2952.66s)

and that may approach or exceed us on some mythical scale of intelligence

Speaker 0 (2952.66s - 2956.04s)

or decide they don't need us anymore, there's no they there.

Speaker 11 (2956.62s - 2969.96s)

This is Spark from CBC. I'm Nora Young.

Speaker 0 (2970.04s - 2973.22s)

Today on Spark, we're talking about the data limitations of some AI,

Speaker 9 (2973.6s - 3006.38s)

and whether the way around the data wall is to focus on data quality rather than quantity. Right now, my guest is AI researcher Ari Morcos. His company, Datology AI, is building tools to improve data selection, which could help lower the amount of data needed to train these systems. One reason we wanted to talk to you is that we've been hearing about concerns that data-hungry AI like large language models will hit a cap of good quality training data. So if we don't rethink how to train these systems, do you think large language models are going to hit a plateau?

Speaker 2 (3006.88s - 3021.6s)

I think there's a ton more we can do by just coming up with better quality metrics for our existing data sets. Obviously, more data is better given the same quality. But if we look at the models that we have right now, they're still getting better with more data.

Speaker 0 (3021.86s - 3051.1s)

They're not converging yet, even on the data that we've already shown them. So there's a lot of gains still to be had from showing the model higher quality data more times over so that it learns it. Think about how you might do flashcards if you're trying to study for a test, right? You put all the different questions on your flashcards, and then when you get one correct, you take it out of the pack. When you get it incorrect, you put it at the back, and then you see it over and over again. So doing things where we actually present the data that's most difficult for the model or that teaches the model the most

Speaker 2 (3051.1s - 3070.16s)

multiple times is still an area where I think we can get a ton of gains and one that we've just really barely exploited. For a number of cultural reasons, the field of machine learning has largely ignored studying data. Part of that is because data has often been viewed as kind of boring or the plumbing in many cases.
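[Editor's aside: a toy sketch of the "flashcards" idea Morcos describes above, in which examples the model currently gets wrong come back more often and mastered examples are sampled less. The fake per-example losses and the loss-decay step are purely illustrative assumptions; this is not a real trainer or Datology's method.]

```python
# Toy sketch: sample training examples in proportion to their current loss.
import numpy as np

rng = np.random.default_rng(2)
losses = rng.exponential(scale=1.0, size=1000)  # stand-in for per-example loss

def next_batch(losses, batch_size=32):
    """Sample a batch with probability proportional to current loss."""
    probs = losses / losses.sum()
    return rng.choice(len(losses), size=batch_size, replace=False, p=probs)

for step in range(100):
    batch = next_batch(losses)
    # Pretend the optimizer step reduces loss on the examples it just saw,
    # like taking a flashcard out of the front of the pack.
    losses[batch] *= 0.9

print("mean loss after loss-weighted sampling:", round(float(losses.mean()), 3))
```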

Speaker 0 (3070.46s - 3075.92s)

Part of it is also that in a lot of the competition style machine learning research, data is viewed as a given.

Speaker 2 (3076.04s - 3113.5s)

So it's like, given a data set, how can you create a model that's going to do the best on that data set? And as a result of that, the field is mostly focused on advances in modeling rather than advances in data. A metaphor I like for this is that there's this tree that's barren, that's surrounded by a bunch of professors prodding their grad students to climb this barren, thorny tree to reach up to find a shriveled apple that is some slight improvement in a modeling advance. Where meanwhile, just out of sight, there's a lush orchard of trees that are literally dropping fruit onto the floor in the realm of ways we can better improve data.

Speaker 0 (3113.5s - 3116.64s)

So I think this is an area that just has been so massively

Speaker 2 (3116.64s - 3153.02s)

understudied relative to its potential impact that I think that even if we hit the limits of what's available with respect to public data, there's still far more we can do by making better use of the data that we already have. I'll also note that the data that's in public is a heavy minority of the total data that's present in the world, right? The majority of data is private. So there's also a lot of opportunities, I think, to get that private data and exploit that. And I think that's one of the things that a lot of businesses are thinking now: hey, we're sitting on these hoards of data that could be really valuable. How can we use that to make models better for ourselves?

Speaker 9 (3154.02s - 3161.34s)

And presumably a lot of companies are concerned about their proprietary data outside of their proprietary walls as well, right?

Speaker 2 (3161.56s - 3165.68s)

Absolutely. They want to make sure that that advantage doesn't get

Speaker 9 (3165.68s - 3187.54s)

ceded to, you know, everyone. Right. How widespread a problem do you think this sort of potential data shortage is? Like, much of the conversation has been about ChatGPT and large language models, but is this sort of issue with growing data potentially kind of an existential issue for a deep learning approach to AI in general? How broad are we talking about here?

Speaker 2 (3188.1s - 3247.48s)

Yeah, I actually don't think the data shortage is as big of an issue as people make it out to be in general. And in large part, that's for the reasons we've been discussing: that there's just a lot more we can do by making better use of the data we have available. And I think if you go to companies, many enterprises have too much data. They have petabytes or exabytes of data that they've been collecting, most of which is mostly useless because it's not very high quality. And the problem is, right, that they don't know, how do I make the best use of that data? How do I find the data that's actually going to teach me the most? But I think for the largest frontier models that you see coming out of OpenAI, ultimately the path forward is going to be to try to acquire more high-quality data, right? They've started doing a lot of licensing deals with various data providers in order to acquire new data that has some sort of quality guarantee. And then also by pushing forward a lot of research to do better at identifying the right data, of course, which they will not share with anybody else.

Speaker 9 (3248.98s - 3251.04s)

Ari, thanks so much for your insights on this.

Speaker 2 (3251.3s - 3252.68s)

Absolutely. Thank you for having me.

Speaker 9 (3253.96s - 3308.76s)

Ari Morcos is an AI researcher and the founder of Datology AI. You've been listening to Spark. The show is made by Michelle Parise, Samraweet Yohannes, Megan Cardi, and me, Nora Young. And by Gary Marcus, Enongo Lumumba-Kasongo, and Ari Morcos. Subscribe to Spark on the free CBC Listen app or your favorite podcast app. I'm Nora Young. Go to cbc.ca/podcasts.