14 minutes

AI Photobooth

An AI photobooth prototype, originally a side project for a friend's wedding, uses Stable Diffusion 1.5 to randomly transform photos so no props are needed. It runs locally on a laptop with an RTX 3070, protecting faces and hair with a CLIPSeg text mask and keeping composition with a depth ControlNet, built on the imaginAIry library.
00:00

So here, as you have maybe seen when you arrived, this is an AI photobooth prototype. It was actually a little side project for a friend, for his wedding, like a year ago. It was a normal photobooth, the typical one at weddings now, where you have some props and half masks, you take a group picture, and it prints it with the date of the event and all that. And then I said, well, maybe another nice weekend project is, instead of having to use props, use AI to modify the pictures in a random way, just for fun. So yeah, it's also using Stable Diffusion, so let's run it. See? Sorry, you are going to be there.

01:03

I think once the focus-- OK, I'll be right here. Let's see how it goes. So let's see, the first generation takes a little while because it has to load the model. Everything runs locally on this laptop; it has an RTX 3070. And this is Stable Diffusion 1.5, which is the oldest model, but it's the fastest. I care more about the speed, because people get bored waiting for something to show up. And anyway, when you print it, it's like this: you take three pictures and you put the original and the transformed one, and you cannot really see the details.

01:55

So let's see what comes up. Supposedly it's pop art. Something that is very important is to try to keep the faces, so it doesn't do something like this. So how do I do that? I use masking, and to detect the faces it uses the CLIPSeg model, where you can tell it what you want to select just by text. This is the prompt I use: "female face" or "male face" or "person face" or "face" or "hair", as one string. It's better to be that specific; when I just put "face", especially with more than one person, it wasn't getting it exactly right.
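
A minimal sketch of this text-prompted masking step, using the CLIPSeg model through Hugging Face transformers (the checkpoint, threshold, and the way the per-prompt heatmaps are combined are my assumptions about how such a mask could be built, not the exact photobooth code):

```python
# Sketch: text-prompted face/hair masking with CLIPSeg (assumed setup, not the photobooth's exact code).
import torch
from PIL import Image
from transformers import CLIPSegProcessor, CLIPSegForImageSegmentation

processor = CLIPSegProcessor.from_pretrained("CIDAS/clipseg-rd64-refined")
model = CLIPSegForImageSegmentation.from_pretrained("CIDAS/clipseg-rd64-refined")

image = Image.open("photo.jpg").convert("RGB")
prompts = ["female face", "male face", "person face", "face", "hair"]

inputs = processor(text=prompts, images=[image] * len(prompts), padding=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # one low-resolution heatmap per text prompt

# Merge all heatmaps and threshold them into a single "protect this region" mask.
heat = torch.sigmoid(logits)
mask = (heat.max(dim=0).values > 0.35).float()  # 0.35 is an arbitrary threshold
# (in practice the low-res mask still has to be resized back to the photo's resolution)
```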

02:58

So if you can be more specific about what you want to match, it helps. And then I also added the hair, because otherwise it was just getting the oval of the face, and if someone has very distinctive hair, it's better to keep it. It makes it more personal. Then I wanted to keep the shape of the picture, like the general composition. For that, I use ControlNets. I tried a few: depth, OpenPose, Canny, and to be honest, in the end, the best one was just depth.

03:43

Because when you use Canny, it picks up a lot of noise. If someone has a funky t-shirt, or the background has a lot of stuff going on, then the result becomes a mess, especially in the background. It's like a soup. I just want to keep the general position, and in general, depth was working better just to keep the overall composition of the photo. The problem is, in many cases, the hands look horrible. They look like morphing aliens with two fingers; Stable Diffusion is famous for this kind of problem with hands. Let's see if it shows the difference. If I uncomment this, it will show the different stages that the image goes through.
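
To see why Canny picks up so much clutter compared to depth, here is a rough comparison of the two conditioning images, using OpenCV for the edges and the MiDaS model from torch.hub for the depth map (the thresholds and the small MiDaS variant are arbitrary choices, not necessarily what the booth uses):

```python
# Sketch: compare Canny vs. depth conditioning images for the same photo (assumed parameters).
import cv2
import numpy as np
import torch

image = cv2.imread("photo.jpg")

# Canny edges: every t-shirt print and busy background becomes an edge the ControlNet tries to keep.
edges = cv2.Canny(image, 100, 200)
cv2.imwrite("canny.png", edges)

# MiDaS depth: only the coarse scene geometry survives, which keeps the composition without the clutter.
midas = torch.hub.load("intel-isl/MiDaS", "MiDaS_small")
midas.eval()
transform = torch.hub.load("intel-isl/MiDaS", "transforms").small_transform

rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
with torch.no_grad():
    pred = midas(transform(rgb))
    pred = torch.nn.functional.interpolate(
        pred.unsqueeze(1), size=rgb.shape[:2], mode="bicubic", align_corners=False
    ).squeeze()

depth = pred.numpy()
depth = (255 * (depth - depth.min()) / (depth.max() - depth.min())).astype(np.uint8)
cv2.imwrite("depth.png", depth)
```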

04:45

So let's try that. [INAUDIBLE] OK, now maybe I will be the subject. That's it. I mean, it's going to be a bit scary, I think. [LAUGHTER] So let's-- [INAUDIBLE] [INAUDIBLE] Yeah. Let's see what comes out. All right, so we have a first generation. It's kind of slow, but that's the model loading; after that, it usually just takes the time to fill the bar.

05:33

So the next one, once the application has started, is quite all right. So I think by now it should be generated. Look, there's the result-- ah, maybe because I enabled the debug mode, we cannot see it. Here are the different stages. Almost ready, supposedly. There it is. Yeah. So here are the different stages. This is the-- well, let's see the final one first.

06:21

So this is what came out, as you can see. Yeah. [LAUGHTER] Nice job there. I also don't know what prompt it was. This was a superhero, maybe, I don't know. I have like-- [INAUDIBLE] Yeah, that one's it. Yeah, you're right. I have like 20 different prompts here that get picked randomly. So then it gets the depth.

06:50

This is using MiDaS, I believe, for the depth. Since I was too close, it made it kind of flat, but when people are a bit farther from the camera, it really gets the depth-- if the arms are forward, it can really pick that up. And this is the mask, so it protected my hair and face. Then I made it a bit softer, so the blending is not so harsh. So that's what you get there. [LAUGHTER] And yeah, this was the result.
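
The softening step is essentially a blur of the binary mask before compositing the original face and hair back over the generated image; a small sketch of the idea, where the blur radius is an arbitrary assumption:

```python
# Sketch: soften the face/hair mask so original and generated images blend without a hard seam.
from PIL import Image, ImageFilter

original = Image.open("original.jpg").convert("RGB")
generated = Image.open("generated.jpg").convert("RGB").resize(original.size)
mask = Image.open("mask.png").convert("L").resize(original.size)  # white = face/hair region to protect

# Blur the hard mask so the protected pixels fade into the generated image instead of leaving a seam.
soft_mask = mask.filter(ImageFilter.GaussianBlur(radius=8))  # radius is an arbitrary choice

# Where the mask is white, keep the original pixels; where it is black, keep the generated ones.
result = Image.composite(original, generated, soft_mask)
result.save("blended.jpg")
```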

07:34

So at first I was trying to do this with the diffusers library directly, and it was really difficult to join the different stages, like doing the masking on one side and the depth on another, because some models have different input sizes. After researching a bit, I found this library that is simply amazing, so I really recommend it if you want to experiment. It's called imaginAIry. It has pipelines for a lot of things, even for video now, and these guys are adding new features all the time. You can also change the models.
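
For reference, the basic Python usage of imaginAIry looks roughly like this, as far as I recall its README; the exact imports and argument names may differ between versions:

```python
# Rough sketch of the imaginAIry Python API; argument names may vary by version.
from imaginairy import imagine, ImaginePrompt

prompts = [
    ImaginePrompt(
        "pop art portrait, bold colors",   # one of the ~20 random style prompts
        init_image="photo.jpg",            # the picture the booth just took
        init_image_strength=0.4,           # how much of the original to keep (value is a guess)
    )
]

for result in imagine(prompts):
    result.save("transformed.jpg")
```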

08:30

You can use Stable Diffusion XL, and I think it's going to be Stable Diffusion 3 now. Very easily, you can use all the different ControlNets just by some settings, like the depth one. You can do style changes, the edit ControlNet, and everything really, really easily. That's what you can see here: I just create the ControlNet inputs and create the prompt. Here I have the prompt and the negative prompt. That doesn't really work great, because I put "no deformed hands", "not too many fingers", and it seems it doesn't care about that. Here, if I wanted to, for example, I could have depth and Canny.

09:24

I could add Canny here. You can even set the strength of each ControlNet there, and I would just add it here to the array, and then it would do Canny as well. This is for the mask: very easily, you can say keep what is in the mask, or the other way around, edit what is in the mask. Also, there is a feature for fixing faces that uses another model. I tried it; it makes faces look better, but then it modifies the faces of the actual people, so when people look at it, it's like they go WTF.
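
Putting those pieces together, a photobooth-style prompt could look roughly like this; the class and argument names (ControlInput, control_inputs, mask_prompt, mask_mode, fix_faces) are assumptions from memory of imaginAIry's documentation and may not match the installed version:

```python
# Sketch of a photobooth-style prompt: depth ControlNet plus a text mask that protects faces and hair.
# Class and argument names here are assumptions and may differ between imaginAIry versions.
from imaginairy import imagine, ImaginePrompt
from imaginairy.schema import ControlInput

prompt = ImaginePrompt(
    "pop art portrait, bold colors",
    negative_prompt="deformed hands, too many fingers",  # in practice this didn't help much
    init_image="photo.jpg",
    init_image_strength=0.3,
    control_inputs=[
        ControlInput(mode="depth", strength=1.0),
        # ControlInput(mode="canny", strength=0.5),  # adding a second entry would enable Canny as well
    ],
    mask_prompt="face OR hair",  # CLIPSeg-style text mask
    mask_mode="keep",            # keep what is inside the mask; "replace" would edit it instead
    fix_faces=False,             # the face-fixing model changes people's identity, so leave it off
)

for result in imagine([prompt]):
    result.save("transformed.jpg")
```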

10:00

Actually, sometimes it's nice, sometimes it's like, who is this person? So in the end, I preferred to leave that out, but that was another feature. And yeah, there's really a lot of other stuff to play with very easily. You can do it directly from the Python API, or there's also a command line, so even easier, you can just use command-line parameters. And yeah, that was the little demo. We can play with it later if you want; it will be there with an actual printer, so you can take the picture home.
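
The command-line version would be roughly the following; the flag names are recalled from the imaginAIry CLI docs and should be treated as approximate:

```
imagine "pop art portrait, bold colors" \
  --init-image photo.jpg --init-image-strength 0.3 \
  --control-mode depth \
  --mask-prompt "face OR hair" --mask-mode keep
```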

10:36

If you want it, that is. A nice memory or a nightmare, depending on how it goes. [LAUGHTER] So yeah, that's it. Thanks. [APPLAUSE] Yeah. Did you try to mask the hands as well? No, but I could add a prompt for the hands. Yeah, actually, it could be a good idea. Well, I wanted it to really transform the photo rather than recreate it; I didn't want it to be too similar to the original one, because it's for fun, right? But yeah, maybe I could have tried the mask for the hands, and it would be less creepy, perhaps.

11:19

Yes, hopefully it would understand the hand, right? Because sometimes it happens that it doesn't know it's the arm, so it makes it part of the body or something else, and then suddenly there is a hand appearing there. Maybe that would also make it weird, I don't know. But yeah, it would be very easy; I could add "hand", and I suppose it would work. Anything else? A quick one. Did you use this laptop at your friend's wedding, or another machine? No, for my friend's wedding it was actually different.

11:54

It was on a Raspberry Pi, because it didn't need any AI. Just taking photos, printing them, all of that I could do on a Raspberry Pi, so it was even more self-contained. They just said the photobooth needed to be there, and I just brought a Raspberry Pi and a couple of cables in my pocket, so it was a very easy setup. So the idea for this one, I suppose you could do it API-based, and then if the Raspberry Pi can connect to Wi-Fi or get an internet connection, maybe that's enough. But I wanted it to be fully self-contained. Because if you are-- I don't know if I would ever use this at a wedding.

12:37

It was just for fun this time. But if you are in the countryside or something, maybe you don't have a good signal, so I wanted it to be self-contained. Yeah. Amongst all the AI art generators-- you also have Midjourney and other competitors-- is Stable Diffusion the only one which offers an API? It's one of the few that are open source, where you have the weights, you can download them, and you can use it locally. Midjourney you can only use through the Discord bot. They have an API--

13:19

Well, an API, but you cannot run it locally. Yeah, it's only remote. There are some others. There is Kandinsky, which is like a Russian model. I tried that one also, but it was quite slow on my laptop, so I didn't use it, although it has better quality than Stable Diffusion 1.5. Yeah. But in general, Stable Diffusion is what started the open-source image generation models, and now there are more of them, like Stable Diffusion XL, Stable Cascade, and now Stable Diffusion 3 Medium that was released recently.

14:01

So it's kind of the most developed, and that's why there are also so many extensions, like ControlNet and LoRAs-- it has the widest community support. That's why. Then there is DALL-E from OpenAI; again, you can only use the API. So those only work if you have internet access and want to pay for it. All right. Thank you. So thank you, everyone. [APPLAUSE]