>> Good, okay. Cool. So I'm Ronnie and I want to talk with you about classifying data. Building traditional classifiers is often pretty hard because you need large labeled datasets, and I'll propose another way where we can avoid labeled training sets. So what I want to do is show you the core techniques first, and afterwards we'll have a look at an application where we actually run this in production. >> Good. So let's get started here. >> So what I prepared is a little bit of data. Because it's hard to see, I will read it out, I guess. Maybe we do it like this. Okay, so we have a little bit of sample data here, which is customer feedback for the Spotify app. The first one says the user values offline song downloads, the second one is a user frustrated with missing album art.
And the last one is a user who wants more language options on Spotify. So that's the data that we want to classify. And then we have a simple prompt for this where we say: classify our customer feedback into one of these categories, and we have feature requests, product issues and positive feedback. And then we ask to get this back in a JSON format. So if you run this, you see it actually does pretty well. The first one, which was about downloads, right? That was positive product feedback. And the next one was a product issue with the album art, right? So we get the right things back. But if you look at this, I think you can maybe spot a couple of issues here.
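For illustration, a minimal sketch of the kind of baseline prompt being described, assuming OpenAI's Python SDK; the model name, wording and sample IDs are assumptions, not the speaker's actual code:

```python
# Baseline: send everything verbosely and ask for a JSON array back.
import uuid
from openai import OpenAI

client = OpenAI()

feedback = [
    {"id": str(uuid.uuid4()), "text": "User values offline song downloads."},
    {"id": str(uuid.uuid4()), "text": "User is frustrated with missing album art."},
    {"id": str(uuid.uuid4()), "text": "User wants more language options on Spotify."},
]

prompt = (
    "Classify our customer feedback into one of these categories: "
    "feature request, product issue, positive feedback.\n"
    "Return a JSON array of objects with the fields 'id' and 'category'.\n\n"
    + "\n".join(f"{item['id']}: {item['text']}" for item in feedback)
)

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```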
First of all, it's kind of slow: it takes one and a half seconds, and it also uses a lot of tokens. So one of the simple things we can do here is cut down on all the structure. Instead of an array, we could for example just ask for a map back. Now you see it's already considerably fewer tokens that we use, and it should also get a little bit faster. But we can do better, right? For example, we know it's only a handful of categories anyway, so there's no point in spelling them out over and over. So another thing we can do is say: instead of getting 'feature request' back, we just get an 'r' back, an 'i' back, an 'f' back; we don't want the whole category names,
just the shortcuts. So if we do this, things start to get a bit more compact on the right side. Now one thing we still have is of course the IDs. We have these UUIDs that allow us to express billions of different things, but we know that with the token limits we can never put more than a couple of hundred data points into the LLM anyway. So what we can do is translate these in our application beforehand into, for example, a simple counter, and when we get the result back, we translate it back into our internal IDs. And what you can see now is a much more compact representation of the output. In our testing this normally speeds things up by a factor of two to three, approximately, and the amount of tokens normally goes down by about 80% with these simple optimizations.
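Put together, a hedged sketch of these optimizations, a map instead of an array, single-letter category codes, and counter IDs translated back to UUIDs afterwards, could look like this; again, the exact letters, model and wording are illustrative assumptions:

```python
# Optimized variant: map output instead of an array, single-letter categories,
# and counter IDs that are translated back to internal UUIDs afterwards.
import json
from openai import OpenAI

client = OpenAI()

CATEGORY_CODES = {"r": "feature request", "i": "product issue", "f": "positive feedback"}

def classify_compact(feedback):
    # Map internal UUIDs to small counters so the prompt and the response stay short.
    counter_to_id = {str(n): item["id"] for n, item in enumerate(feedback, start=1)}

    prompt = (
        "Classify each feedback item as r (feature request), i (product issue) "
        "or f (positive feedback). Return a JSON object mapping item number to letter.\n\n"
        + "\n".join(f"{n}: {item['text']}" for n, item in enumerate(feedback, start=1))
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    compact = json.loads(response.choices[0].message.content)  # e.g. {"1": "f", "2": "i"}

    # Translate counters and letter codes back into internal IDs and full category names.
    return {counter_to_id[n]: CATEGORY_CODES.get(code, code) for n, code in compact.items()}
```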
Now you might say that's an interesting idea, using LLMs for classification, but does it work in the real world? And I can say yes, it does. So this is our application where we actually use this. We convert customer calls into product insights. That might be support calls, sales calls, user testing, those kinds of things. And we take out roughly 30-second clips which are basically feedback on the product. So we can see these clips here, for example. And these blue labels here are exactly the classifications which you saw before. For example, here we have product feedback and there we have, for example, a pain point.
Now the cool thing is that we create structured data out of our formerly unstructured data, which means we can, for example, push it into a dashboard. So now we can see that we have something like 2,000 of these video clips and we can see that, I don't know, 30% of them are, for example, pain points. Now 30% of 2,000 is still quite a big number, so we can't just put that in front of a user to read through. So what we additionally do is aggregate them. For example, a lot of people are frustrated with the ads on Spotify. They all have the same issue, so now we know what the cluster is, but which items are actually part of the cluster? That again is a classification problem, and here we use the same approach, and I think you can also see more clearly where the upside of this approach is.
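A hedged sketch of how the same compact-prompt idea could cover cluster membership, where the cluster names are dynamic and can differ from customer to customer; the function name, model and example clusters are assumptions for illustration, not the production pipeline:

```python
# Cluster assignment with the same compact-prompt tricks; cluster names are plain
# strings in the prompt, so they can change per customer without retraining anything.
import json
from openai import OpenAI

client = OpenAI()

def assign_to_clusters(clips, clusters):
    # Number both clips and clusters so the model only has to echo small integers back.
    cluster_list = "\n".join(f"{i}: {name}" for i, name in enumerate(clusters, start=1))
    clip_list = "\n".join(f"{i}: {clip}" for i, clip in enumerate(clips, start=1))
    prompt = (
        "Clusters:\n" + cluster_list + "\n\n"
        "Assign each clip to the best-matching cluster, or 0 if none fits. "
        "Return a JSON object mapping clip number to cluster number.\n\n" + clip_list
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    mapping = json.loads(response.choices[0].message.content)
    return {int(clip): int(cluster) for clip, cluster in mapping.items()}

# A new or renamed cluster needs no retraining -- just pass a different list next time.
assignments = assign_to_clusters(
    ["The ads interrupt every other song", "Great playlists for running"],
    ["Frustrated with ads", "Wants more podcasts", "Positive feedback on playlists"],
)
```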
Before, we had fixed categories, so we knew beforehand this could be a feature request or a positive product feedback item. Now these categories are actually dynamic, right? They can be different from customer to customer, new classes might be created, existing ones might change, and so on and so forth. And in none of these situations do you have to retrain any models; you just plug the data in and it classifies the data for you. Good, that's what I wanted to share with you. If you want to try it out, you can go to next.co, sign up there and try it out.
And I guess we still have some time for questions. >> [APPLAUSE] >> How do you go about creating those subcategories, like the frustration with the ads? >> Okay, good, so since I only had five minutes I skipped this topic. The way we create them is that we take the data, which is basically the clips, and then we do some clustering of them. Of course it's a lot of data, so then you go into batching and sampling, and those are the two main mechanisms we use for that. >> Got it, thanks. >> This is very similar to the question he asked, but what I find with this kind of mechanism is that creating these kinds of topics is more like topic modeling, right? It's kind of difficult to cluster them and ask the LLM to create these topics by itself.
Do you have a seed for the topics, or do you have a method by which you limit the type of topics? >> Yeah, sure, I can reflect on a couple of these things. One of the things we learned is that having more homogeneous input data definitely helps a lot, right? And for that we can use again these classifications that I showed you in the beginning. If we say, create classes around feature requests, and we only feed in feature requests, the quality is way better than if you just give it whatever you have. I think the other challenge is that you want the output to be more homogeneous too, right? You want to avoid having one item be, I don't know, 'the button should be five pixels to the right' and another one 'the UX sucks, redo everything', right? That's just a totally different level of granularity and that's not really helpful.
So we found that giving examples works pretty well for steering it in the right direction. And I think the same also holds for how the output is written, where you also want it to be more homogeneous. Again, we found examples to be the best way of steering towards more similar classes. >> Thanks a lot for the presentation, super cool to see the classification use cases. I was wondering how you do evaluations, especially if you consider different models and different frameworks for the same use case; maybe a smaller model would work more efficiently for a specific class of classification problems.
>> So what I showed you here was OpenAI. We work mainly with enterprise customers, so it's mainly OpenAI and then Azure; they all love it when it's hosted in the EU. I mean, you could use whatever model for this, I think. We stick with OpenAI mainly for quality, and it's the easiest for our customers to accept. >> And the evaluation part, so for example, you could use GPT-3.5, or, I don't know, there are also deterministic approaches like semantic router. If you've tried them, I would just be curious. >> So maybe on the evaluation part, that's certainly one problem, right? And I think one of the issues you're seeing is that even if you set the temperature, if you control for this, the output is still actually quite different, right? So it's not like you give the same input and get the same output back.
So this might, I don't know, have an effect on the differences in, for example, the things that we find. So what we're doing is we run regression tests on them. They have a certain variance window, and if a result goes outside that variance window, somebody looks into it. And of course, if we update, for example, the model, we run those same tests against the new model and see what comes out. >> Thank you for your presentation. I was wondering, like the YouTube video, you transcribe the video, I guess, and then do you put the whole video into the prompt, or do you chunk it in some way? Do you do any chunking? >> So what you saw before, these highlights, right? Those are basically parts of the videos, because the problem you have with the videos is that there are a lot of people talking about how they ended up in the meeting.
That's not really interesting, right? So we basically produce these highlights, which are kind of aggregated nuggets of data, and they become the representation of the videos. >> So I guess the highlights are the chunks, sort of. >> Yeah, kind of. >> So can you tell us something about how those highlights are created then? >> So this is a pipeline that we run this through. Just on a high level, the first step is typically to find out what the topics in this video are, for example; it can also be text. Then afterwards we look for example quotes in these videos that reflect those topics, and then you mix in things like whether you want these clips to be nice to view, for example, so they shouldn't be too short, and so on and so forth.
So there are these other aspects that we mix into this. >> Cool, thank you. >> I have a question about your dynamic tags. I want to know, when you generate two tags that are meant to be the same thing but the names come out slightly differently, how do you collect them together? >> So that's a really good question and it's not a totally trivial problem, right? Because you have your initial set, and then how do you incrementally create new ones? I think we now have a mixture: it works incrementally for a while, and from time to time you have to actually recompute them and then, for example, suggest merging them.
So one of the things we're doing is that when we incrementally create them, we provide the existing ones. But of course then you also steer in that direction. In our space, for example, we tried to create them pretty early on so that new users discover the feature. But the problem is, if they then upload 100 times the amount of data, the tags might not fit any longer. So at that point we might recreate them. >> Okay. >> From scratch. >> That seems tricky, so I wanted to ask. >> Yeah. >> I'm curious how you measure your own quality, when the user comes in and they check the topics or check the quality and so on.
How do you know you're doing well? >> So we're using a couple of things. One is, and that goes back a bit to the regression testing, we do have manually labeled data, if you want, and with new generations of models we also check our own stuff against it to see how it compares. Of course you can't just compare it one-to-one, but then we use LLMs again to find out, for example, whether we cover all the topics that are supposed to be found there. So I would say that's one part. The other thing we do is, well, basically we give this to the user, but what the users do is they share it within their organization, so they might bring it to the developers, the designers, the marketing people.
And so for this we track whether people are watching this stuff and whether they share it again, and that's also an indication. Now of course it can also be that the source material was just good to start with and we simply made the best out of it, so it wasn't really us. But at least it gives some kind of indication. >> Very quick one. I'm curious if you can segment by product features of the customers, right? Take Spotify for example, so it's like podcasts versus, I don't know, videos or whatever, right? And the second is, I would love to see which types of customers are complaining about which kinds of problems.
I'm wondering if you're thinking about that and what's coming next. >> So for the first one, where you're looking for the product area: that's actually something we do for nearly every customer. But one thing we actually learned is that LLMs are really overkill for this in most cases and just go all over the place, because in most cases the users are literally talking about the thing, right? So often, if you just do a normal search, it's much faster, much cheaper and often way more precise than using LLMs for that. And then the second one was, I think- >> Based on customer segments. >> Yeah, exactly. So basically you can split the data, since we have more metadata around it.
For example, we have the clips annotated with which account the video came from, and for the accounts you might know, I don't know, whatever it is, say the geography or the industry or whatever. And then when we show, for example, the dashboards for your product's data, you can use these dimensions again to split the data. >> All right, thank you so much, Roni. That was really great, and also great questions. Thank you, everyone. This is why we want to keep demos at five minutes, so that there is more time for questions. We want the community to actually drive what is explained. So yeah, that was really great. Thank you.
>> Thank you.