
[Question] Training a camera position controlnet ? #699

Open
arthurwolf opened this issue Aug 23, 2024 · 7 comments

Comments

@arthurwolf

arthurwolf commented Aug 23, 2024

Hello!

Thanks for the amazing project.

I'm often in the situation where I've generated a scene I really like, but I'd like to rotate the camera a bit more to the right, or zoom in, or put the camera a bit higher up, etc.

Currently, the only way I've found to do this would be to generate a 3D model of the scene (possibly automatically from a controlnet-generated depth map?), rotate that, generate a new depth map, and use that to regenerate the image.

But:

    1. This is cumbersome/slow, and
    2. This would only let me move the camera by small amounts at a time.

Another option somebody suggested was training LoRAs on specific angles, and having many LoRAs for many different angles/camera positions. But again, that's pretty cumbersome (and a lot of training). Also, I'm not even sure that would work.

Or train a single LoRA, but with a dataset that matches many different angle "keywords" to images taken from many different positions? As you can see, I'm a bit lost.

I figured what I really want to do is manipulate the part of the model's "internal conception" of the scene that defines its rotation (if there is such a thing...). There has to be some set of weights that defines whether we look at a subject from the front or the back, whether a face is seen from the side or at three quarters, etc.

So my question is, would it be possible to create a controlnet that would do this?

The main problem I see is that ControlNet training, as described in

https://github.com/lllyasviel/ControlNet/blob/main/docs/train.md

takes images as an input.

But the input in my case wouldn't be an image, it would be an angle.

So my best guess at how to do this (and it is likely completely wrong) would be:

  1. Take a 3D scene.
  2. Render it at a specific angle/zoom/camera position.
  3. Take that rendered image, and a text description of the camera position: angle212 height1.54 etc. Or maybe (angle 0.3) (height 0.25)? I.e. play on the strength of the tokens? Something like that.
  4. Add each pair of rendered image and corresponding position text to the dataset (completely ignoring the "black and white" input image).
  5. Generate thousands, train (a rough sketch of such a render loop follows the example below).

[Example image: isometric 3D grocery store scene (isometric-scene-with-3d-grocery-store-shop-vector-25647334)]

grocery store, shop, vector graphic, rotation-0.5, height-0.5, distance-0.3, sunrotation-0.2, sunheight-0.5, sundistance-1.0, 
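
For concreteness, here's a minimal sketch of how steps 2-4 might look with Blender's Python API (bpy). It assumes a scene.blend file that already contains an active camera and a subject near the origin; the parameter ranges, file names and caption wording are all made up for illustration.

```python
# Run inside Blender:  blender scene.blend --background --python render_dataset.py
# Hypothetical sketch: scene.blend, the output paths and the 0..1 -> world-space ranges are all assumptions.
import json
import math
import os
import random

import bpy
from mathutils import Vector

scene = bpy.context.scene
cam = scene.camera                              # assumes the .blend already has an active camera
out_dir = bpy.path.abspath("//renders")
os.makedirs(out_dir, exist_ok=True)

def place_camera(rotation, height, distance):
    """Map normalized 0..1 caption values onto an orbit around the origin."""
    angle = rotation * 2.0 * math.pi
    radius = 2.0 + distance * 8.0               # arbitrary, scene-specific range
    cam.location = Vector((radius * math.cos(angle),
                           radius * math.sin(angle),
                           0.5 + height * 4.0))
    # Aim the camera at the origin (cameras look down their local -Z axis).
    direction = Vector((0.0, 0.0, 0.0)) - cam.location
    cam.rotation_euler = direction.to_track_quat('-Z', 'Y').to_euler()

captions = {}
for i in range(5000):                           # "generate thousands"
    params = {k: round(random.random(), 2) for k in ("rotation", "height", "distance")}
    place_camera(**params)
    filename = f"{i:05d}.png"
    scene.render.filepath = os.path.join(out_dir, filename)
    bpy.ops.render.render(write_still=True)
    # Caption mirrors the token scheme above, e.g. "rotation-0.42 height-0.17 distance-0.88".
    captions[filename] = "grocery store, shop, vector graphic, " + \
        ", ".join(f"{k}-{v}" for k, v in params.items())

with open(os.path.join(out_dir, "captions.json"), "w") as f:
    json.dump(captions, f, indent=2)
```

The same normalized values end up verbatim in the captions, so the tokens the model sees at training time match the scheme in the example above.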

Would this work? Does it have any chance to? If not, is there any way to do this that would work?

Would a single 3D scene (or even a dumb cube on a plane) work, or do I need a large variety of scenes?

I would love some kind of input/feedback/advice on this.

Thanks so much to anyone who takes the time to reply.

Cheers.

@arthurwolf arthurwolf changed the title [Question] Training a rotation controlnet [Question] Training a camera position controlnet ? Aug 23, 2024
@JohnnyRacer

@arthurwolf This is most likely not possible with the ControlNet implementation in this repo, since it relies on converting the guidance into an image that the UNet can understand; it doesn't really make sense to convert camera data such as rotation and position into an image format like you mentioned. It would make more sense to generate an image and then control the camera position with NeRF (neural radiance fields) or an image-to-3D model like https://huggingface.co/stabilityai/TripoSR. Directly using the diffusion model to control camera angles would probably require some kind of text conditioning on camera position data to guide the model.
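
For reference, "text conditioning on camera position data" could be as simple as baking the camera tokens into the prompt of a model fine-tuned on such captions. A hypothetical diffusers call (the checkpoint name is invented) might look like this:

```python
import torch
from diffusers import StableDiffusionPipeline

# "someuser/sd15-camera-tokens" is a placeholder for a checkpoint fine-tuned on camera-token captions.
pipe = StableDiffusionPipeline.from_pretrained(
    "someuser/sd15-camera-tokens", torch_dtype=torch.float16
).to("cuda")

# The camera tokens ride along in the ordinary text prompt.
image = pipe("grocery store, shop, vector graphic, "
             "rotation-0.5, height-0.5, distance-0.3").images[0]
image.save("grocery_rotation_0.5.png")
```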

@arthurwolf

arthurwolf commented Dec 14, 2024 via email

@JohnnyRacer

That's similar to creating a dataset for a NeRF model, which is very challenging since it requires a massive amount of high-quality 3D renders (you don't want the diffusion model's image quality to degrade from conditioning on low-quality renders) with a variety of novel views for each of them. I think it would be easier to add the positions as captions, from a VLM or a camera position estimator (OpenCV and such), to an existing text-to-image dataset (LAION-2B, for example) rather than generating novel views from 3D renders. The latter would require an immense dataset of 3D scenes and objects (Objaverse or something similar) to prevent overfitting, and even then the model could overfit to the actual visual content of the 3D scenes rather than the camera angles you want it to be conditioned on (remember, the diffusion model has absolutely no idea what you are trying to make it learn; it simply learns whatever patterns in the data are most prominent).
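
As a rough illustration of the VLM route (the model choice and prompt wording here are just assumptions, not a recommendation), one could ask an off-the-shelf vision-language model to append a viewpoint phrase to each existing caption:

```python
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"   # example VLM; any capable VLM would do
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

def viewpoint_caption(image_path: str, existing_caption: str) -> str:
    """Append a short camera-viewpoint description to an existing caption."""
    image = Image.open(image_path).convert("RGB")
    prompt = ("USER: <image>\nDescribe the camera viewpoint of this photo in a few words "
              "(e.g. eye-level front view, high-angle three-quarter view). ASSISTANT:")
    inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, torch.float16)
    out = model.generate(**inputs, max_new_tokens=20)
    viewpoint = processor.decode(out[0], skip_special_tokens=True).split("ASSISTANT:")[-1].strip()
    return f"{existing_caption}, {viewpoint}"

# e.g. viewpoint_caption("laion_000123.jpg", "grocery store, shop, vector graphic")
```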

@arthurwolf

arthurwolf commented Dec 14, 2024 via email

@JohnnyRacer

That is very interesting as a dataset generation technique, but I am curious how you would ensure there is enough visual variety in the dataset for the model to maintain its image generation quality across different prompts, since underrepresented subjects will end up with significantly lower quality when the model has to generalize from a small sample.

@arthurwolf

arthurwolf commented Dec 14, 2024 via email

@JohnnyRacer

It would be cool to see a large captioned dataset paired with a structured output for the attributes you would want to control (like the ones you mentioned earlier: camera_z_123, sunlit, sun_distance_25, etc.). The tricky part is still generating enough 3D scenes and avoiding image degradation, since diffusion models trained on generated outputs tend to degrade in image quality. It sounds like a promising approach, however.
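
A sketch of what one such record could look like (the attribute names are placeholders echoing the tokens mentioned earlier):

```python
# Hypothetical dataset record: a free-text caption plus structured controls for the same image.
record = {
    "image": "renders/00042.png",
    "caption": "grocery store, shop, vector graphic",
    "controls": {
        "camera_rotation": 0.5,
        "camera_height": 0.5,
        "camera_distance": 0.3,
        "sun_rotation": 0.2,
        "sun_height": 0.5,
        "sun_distance": 1.0,
    },
}
```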
