Scaling LLMs and Accelerating Adoption With Aidan Gomez at Cohere - Gradient Dissent: Conversations on AI

![rw-book-cover|200x400](https://wsrv.nl/?url=https%3A%2F%2Fartwork.captivate.fm%2F25fd1181-b46e-459b-85a5-d397eec4cdcf%2FJDLDW81K-wlJoAWL7ZnxLdTp.jpg&w=100&h=100)

## Metadata

- Author: **Gradient Dissent: Conversations on AI**
- Full Title: Scaling LLMs and Accelerating Adoption With Aidan Gomez at Cohere
- Category: #podcasts
- URL: https://share.snipd.com/episode/1ffdc463-f514-4968-9b90-b75ac1577f62

## Highlights

- SSMs: Finding the Middle Ground Between Transformers and RNNs

  Key takeaways:
  - SSMs aim to find a middle ground between transformers and traditional RNNs/LSTMs.
  - SSMs provide an internal memory that can be read from and written to.
  - Scalability is a key feature of SSMs: they can be parallelized across thousands of accelerators.
  - The success of SSMs may depend on the development of software tooling by the community.

  Transcript:

  Speaker 1: I'm kidding. I'm being facetious. So with SSMs, the idea is that we're trying to cut some middle ground in between transformers, which are fully auto-regressive; they attend over the entire past sequence. And then on the other end of the spectrum are LSTMs or RNNs, which have a state, and they need to memorize in order to remember the past. So SSMs are trying to find this middle ground where, yeah, you have some window within which you can do lookup, but for everything that's outside of that, you can rely on an internal memory that you can read from and write to. Okay, so doing that, which sounds a lot like a middle ground between the two, doing that while also being extremely scalable, so you can parallelize it across thousands or tens of thousands of accelerators. And so it's trying to strike that middle ground. I think its success is going to be predicated on whether the community builds tooling for it.
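The recurrence described above can be sketched minimally. This is a generic diagonal linear state-space update, not the specific architecture discussed in the episode; the decay values, dimensions, and the function name `ssm_scan` are illustrative assumptions:

```python
def ssm_scan(x, a, b, c):
    """Minimal diagonal linear state-space recurrence (illustrative sketch):
        h_t[i] = a[i] * h_{t-1}[i] + b[i] * x_t   (write to the state)
        y_t    = sum_i c[i] * h_t[i]              (read from the state)
    The fixed-size state h is the 'internal memory' that is written and read
    at every step. Because the update is linear, the same computation can be
    expressed as a parallel scan or a long convolution, which is what lets
    SSMs be parallelized across many accelerators at training time.
    """
    n = len(a)
    h = [0.0] * n
    ys = []
    for x_t in x:  # sequential form, shown for clarity
        h = [a[i] * h[i] + b[i] * x_t for i in range(n)]
        ys.append(sum(c[i] * h[i] for i in range(n)))
    return ys

# Toy run: 3-dim state, scalar inputs, 5 time steps.
a = [0.9, 0.5, 0.1]   # per-channel decay: older inputs fade at different rates
b = [1.0, 1.0, 1.0]
c = [0.3, 0.3, 0.3]
print(ssm_scan([1.0, 0.0, 0.0, 0.0, 0.0], a, b, c))
# The impulse response decays over time: the state "remembers" the first
# input with gradually fading strength, unlike attention's exact lookup.
```

The window-plus-memory idea Gomez describes corresponds to hybrids that combine local attention (exact lookup within a window) with a recurrence like this for everything older.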
  Because obviously the folks at Hugging Face with the Transformers library, and many others, have built incredible software tooling for Transformers. They make it trivial to scale from 10 million parameters up to a trillion parameters nowadays. They've made that trivial, which is tons and tons of work at the software level. For SSMs, for state-space models, that just doesn't exist today. None of it exists. There's no mature software platform for scaling these things. So I could see a world where Transformers get replaced by SSMs as the software support for them gets more mature, and in our next generation of models we lose that context window constraint that Transformers have, where it's like: you have this many tokens, and for anything outside of that, sorry, I have no idea about it, I've never seen it. ([Time 0:08:30](https://share.snipd.com/snip/7744c4e1-fb3a-4381-a44e-0925f7373336))

- Possible architectures for large language models

  Key takeaways:
  - There are many possible architectures that could result in performance similar to existing large language models.
  - Sequence structure is needed to train sequence models.
  - Joint optimization between software and hardware has produced today's successful large language models.

  Transcript:

  Speaker 1: I guess.

  Speaker 2: Do you think almost any architecture, trained for long enough with enough parameters, would have this property that it would improve over time? And then really what you're looking for is just something where you can do backpropagation, distributed, quickly. Is that what you're saying?

  Speaker 1: I do believe that there are a lot of possible architectures that would be fast, efficient, and result in the performance that we're seeing from the current large language models. There are some things you can't do: you can't literally just scale up an MLP with ReLUs, because that would be applied point-wise, right? It would just be a bag-of-words model. You wouldn't be able to learn relationships between words. You do need sequence structure.
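The point-wise-MLP objection can be made concrete with a toy sketch: if the same function is applied to every token independently and the results are pooled, the model cannot see word order at all. The one-dimensional "embeddings" and the stand-in "MLP" below are made-up illustrations, not any real model:

```python
def pointwise_mlp_pool(tokens, embed, f):
    """Apply the same function f to each token independently, then pool.
    With no cross-token interaction, the output is invariant to word order:
    any permutation of the same tokens yields the same value. This is why
    purely point-wise scaling gives only a bag-of-words model; some sequence
    structure (attention, recurrence, a state) is needed to learn
    relationships between words.
    """
    return sum(f(embed[t]) for t in tokens)

embed = {"dog": 1.0, "bites": 2.0, "man": 3.0}  # toy 1-d 'embeddings'
f = lambda v: v * v + 1.0                        # stand-in point-wise 'MLP'

a = pointwise_mlp_pool(["dog", "bites", "man"], embed, f)
b = pointwise_mlp_pool(["man", "bites", "dog"], embed, f)
print(a == b)  # True: "dog bites man" and "man bites dog" are indistinguishable
```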
  You need to train sequence models. But so long as you're not breaking that, or severely compromising it, I think there's a huge swath of models that would perform equivalently well and would scale equivalently well. And we've mostly just had this joint optimization between our software, like the transformers and the frameworks that train them, and even our hardware, like the routines that we support within CUDA and within our accelerators. It's been feeding back on itself for a little while now. And so at the moment, there might be a local minimum where it's hard to break off of transformers, because there's just so much software and hardware support for transformers. It's so heavily optimized. ([Time 0:13:56](https://share.snipd.com/snip/9f557ef5-085a-4354-bc6a-b9a268df3d32))

- Scaling Breakthroughs Needed for Large Language Models

  Key takeaways:
  - Language model performance may hit a wall with data for the base model.
  - Another scaling breakthrough may be necessary.
  - Models are operating at or above average human performance on a wide variety of tasks.
  - Data from average humans is no longer valuable once a model performs as well as an average human.

  Transcript:

  Speaker 2: I guess when I look at the performance of large language models, it's hard not to infer this exponential curve and expect that in a year they're going to be even more amazing than they are today. But if your view is that we hit this wall with data for the base model, is that wrong then? Are we on the verge of needing a totally new approach?

  Speaker 1: I think we are approaching the need for another scaling breakthrough. I guess I don't know how close we are to it, but it definitely feels like it's coming up. We're starting to hit the limits: these models are increasingly operating at average human performance or above on such a wide selection of tasks that you can no longer rely on the average human as a source of data to improve your model.
  Obviously, once your model performs as well as an average human, it's no longer valuable to collect data from average humans, because they don't add to the model's performance. ([Time 0:23:40](https://share.snipd.com/snip/73af194c-8ef0-4ec2-ae9c-b0b6efc308de))

- Working on Language Models and the Prospect of AGI

  Key takeaways:
  - The experience of working on large language models has pulled forward his belief about when they get interesting.
  - The lofty goal of AGI is exciting and makes the work feel monumental in importance.
  - The focus is on making models more useful, which is on the critical path towards AGI.

  Transcript:

  Speaker 2: So it sounds like your experience of working on these large language models has pulled up, by decades, your belief about when they get, like, interesting. It has for me too, to be honest. And I guess it makes me feel like AGI, which isn't clearly defined, and there are aspects of it that seem incredibly important for the world, it makes me think that that really could happen in our lifetime when you sort of plot forward the current trend. And I have to ask: is that top of mind for you and your work?

  Speaker 1: I don't think so. I don't spend an outsized amount of my time thinking about AGI. I do spend an outsized amount of my time thinking about how to make models more useful, and I think that's along the critical path towards AGI. We're going to build a lot of useful stuff that's going to make people way more efficient, way more productive. But the lofty goal of AGI, I think it's exciting and it's super salient. It's very easy to get sucked into it, and it certainly makes your work feel monumental in terms of importance. But I don't think you need AGI for this technology to be extremely impactful.
  And I think there's going to be a lot of, ([Time 0:36:08](https://share.snipd.com/snip/ff9f96d2-51c7-4d8c-8f60-b8b1337b7dfa))

- The Significance of ChatGPT's Release and the Future of Conversation in Technology

  Key takeaways:
  - The release of ChatGPT in November was a significant moment for most people, as it was the first time they had a compelling conversation with a computer.
  - For those in the field building these models, it can be easy to become accustomed to the technology, but for most people it was a massive leap forward.
  - This moment unlocked the potential for conversation to become the default way of interacting with technology, rather than navigating menus.

  Transcript:

  Speaker 1: It's important to remember that at the end of November, when ChatGPT came out, that was, for most people who interacted with that product, the first time they had a compelling conversation with silicon. Every other moment in their life, they'd only ever experienced that with people. And I think for those of us who are in the field building these models, it can be like the frog in the pot, where nothing is ever surprising; it's all one small step from the step behind. But for most people, that was the first time they had a conversation with a computer. That was the first time a human talked to a piece of silicon. So I think it's important to remember how massive of a leap that is, and also to think about what that unlocks. I think it's going to become much, much more common that the default way you interact with a product or a piece of technology is through conversation. Instead of having to go through ten layers of menus or whatever to find the thing that you want to do, you're just going to have a conversation with that agent, and it has access to the ability to effect the change that you're asking it to do. ([Time 0:42:24](https://share.snipd.com/snip/b31efd24-c830-404a-b7a2-b1fab7c583ec))