
Speedrun blooper reel of a year of AI app development.

This video shows how a simple glitch can have big repercussions. An app update, intended to improve its functionality, instead caused it to produce incorrect job recommendations. A reminder that the smallest details can have a big impact.
00:00

Hey everyone, my name is Lucas Meier. I am not a firefighter. Imagine a job interview where one of the components is actually an interview with a psychologist. Now imagine software, an app, that takes a recording of a conversation like that and turns it into a report.

00:28

You could imagine that a prompt for an app like that would look kind of like this. You would throw in a bunch of example reports. You would throw in a transcript of the actual interview that happened. Maybe some private notes from the psychologist who did the interview.

00:48

And maybe some generic instructions on what the report should look like, how long it should be, things like that. So I have this app in production, and it works kind of great, except that from time to time we have this problem where it misgenders the candidate. It would say "him" when it's really her, or "her" when it's him, because we don't actually feed any of the real names into the language model.
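As a rough illustration, not the speaker's actual prompt, a minimal Python sketch of the prompt structure described above (the tag names and wording here are assumptions) might look like this:

```python
# Minimal sketch (illustrative, not the production app) of the prompt structure
# described above: example reports + interview transcript + the psychologist's
# private notes + generic instructions about the report.

def build_report_prompt(example_reports, transcript, private_notes, instructions):
    """Assemble the report-generation prompt from its four parts."""
    examples = "\n\n".join(
        f"<example_report>\n{r}\n</example_report>" for r in example_reports
    )
    return (
        f"{examples}\n\n"
        f"<transcript>\n{transcript}\n</transcript>\n\n"
        f"<notes>\n{private_notes}\n</notes>\n\n"
        f"{instructions}"
    )

prompt = build_report_prompt(
    example_reports=["...example report 1...", "...example report 2..."],
    transcript="...transcript of the interview...",
    private_notes="...private notes from the psychologist...",
    instructions="Write a report of roughly two pages in the same style as the examples.",
)
```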

01:15

So I thought I would fix that, because normally in the interview there are some clues to figure out at least what the gender is. So I try to fix it in the prompt. Let's do it together. I add this: analyze the transcript, try to figure out what the name and the gender of the candidate are, and write the results in <candidate> tags.

01:36

Like all the cool kids do with the chain-of-thought thing, where you first have it write some things out so that it remembers them later, and no one really knows why it works, but it's how we do these things. So, who thinks this would work? No one? Okay, thanks, thank you.
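A hedged sketch of what that added instruction could look like; the <candidate> tag follows the talk, but the exact phrasing is illustrative:

```python
# Sketch of the chain-of-thought fix described above. The wording is illustrative;
# the <candidate> tag is the one mentioned in the talk.

GENDER_INSTRUCTION = (
    "Before writing the report, analyze the transcript and try to figure out "
    "what the name and the gender of the candidate are. Write your conclusion "
    "inside <candidate>...</candidate> tags, then write the report using those details."
)

# Appended to the generic report instructions from the earlier sketch.
instructions = "Write a report of roughly two pages in the same style as the examples."
instructions = instructions + "\n\n" + GENDER_INSTRUCTION
```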

01:59

I also thought it would work, because I shipped it. But then I got a phone call from one of my clients. They were saying that they were using it for a job applicant for head of HR, and that they were getting a report that was really recommending that this person would be a great fireman.

02:20

And then I got another phone call, about a job opening for a marketing associate, where she was also being recommended as a great fireman. So what happened? Any ideas? "Typo."

02:38

"Typo." Okay, so. "Misgendered." "Misgendered." So what happened is, we actually asked the LLM to write something, right? What do LLMs do when you ask them to write something and they don't know it? Number one?

02:57

"Make it up." "Make it up." Oh, here we go: make it up. Number two, though, is: look at your examples. And it turns out that in the transcripts from these two clients who were complaining, there were no clues whatsoever about the gender or the name of the person.

03:24

So what did the LLM do? It's actually not that crazy. It just took a name from the examples. It wrote: candidate, Meneer van Doorn. But because of the way that chain of thought works, it has now convinced itself that its job is to write a report for Meneer van Doorn.

03:45

Who is already in the examples, because Meneer van Doorn was one of the example reports, and he was applying to be a fireman. Okay, so you would think: easy fix. Maybe not an easy fix. Let's try to fix it.

04:03

Let's add this in the middle: when it's not clear from the transcript what the name or gender is, always use Mevrouw de Vries. Who thinks that's a good fix? Like, one, two. What could go wrong? Like, should I ship this? Any idea what's going to go wrong if we ship this? Nothing? So "nothing" is actually a pretty good answer.
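Again as an illustrative sketch, not the actual prompt, the fallback rule added in the middle might read something like this (the placeholder name is the one from the talk; the wording is assumed):

```python
# Sketch of the second attempted fix: a fallback rule added in the middle of the
# prompt. The placeholder name is the one from the talk; the wording is assumed.

FALLBACK_RULE = (
    "If the name or the gender of the candidate is not clear from the transcript, "
    "always use 'Mevrouw de Vries' in the <candidate> tags."
)
```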

04:36

But I think what I learned from this whole endeavor is that the thing you need to check, if you make a fix like this, is whether there happens to be a Mevrouw de Vries in your examples. So this whole adventure is a reminder to myself that if you make software like this, which I do, you really need to be intimately familiar with all the examples in your prompt. Because if anything in the case that you're actually using it for right now happens to overlap too much with an example, then the LLM is going to lean into that example a hundred times too much, and you'll get bullshit like this where everyone is an amazing fireman.
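That takeaway lends itself to a cheap automated guard. This is a sketch under assumptions, not the speaker's code: before shipping a placeholder name or a prompt change, check whether it already appears in any example report.

```python
# A cheap guard for the lesson above (a sketch, not the speaker's code): check that
# a placeholder name does not already occur in any of the example reports, since
# overlap makes the model lean far too hard on that one example.

def overlaps_with_examples(name, example_reports):
    """Return True if `name` appears in any example report (case-insensitive)."""
    needle = name.lower()
    return any(needle in report.lower() for report in example_reports)

example_reports = [
    "...example report for a firefighter candidate...",
    "...example report for another candidate...",
]

# Fails loudly before shipping if the fallback name collides with an example.
assert not overlaps_with_examples("Mevrouw de Vries", example_reports), (
    "Fallback name appears in an example report; pick a different placeholder."
)
```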

05:21

That's it for me. Go ship something cool, and on to the next speaker. Oh, maybe, do I get a question? I don't know. One question. One burning question.

05:38

One burning question. "So your clients conduct job interviews with psychologists? Is that a thing?" Yes, that is a thing. If you're doing a job interview for, I don't know, a CTO, some CEO, but also sometimes just for, like, a fireman.

06:02

As part of the interview process, they have a sort of psych-eval thing. It's like an external company where they send the candidate, and they just want, you know, someone else to say, okay, this person is not all fucked up, I think you're good.

06:18

And they use this software to make their job easier.