[Question] Training a camera position ControlNet? #699
@arthurwolf This is most likely not possible with the ControlNet implementation in this repo, since it relies on converting the guidance into an image that the UNet can understand; it doesn't really make sense to convert camera position data such as rotation and position into an image format like you mentioned. It would make more sense to generate an image and then control the camera position with NeRF (neural radiance fields) or an image-to-3D model like https://huggingface.co/stabilityai/TripoSR. Directly using the diffusion model to control camera angles would probably require some kind of text conditioning on camera position data to guide the model.
Thanks a lot for the feedback.
My current plan is to create a dataset by generating a large set of 3D renders of scenes in various orientations, with associated keywords/tokens for each of the parameters (camera_z_123, sunlit, sun_distance_25, etc.), publish the dataset, and then let somebody more competent turn it into a usable tool (and/or, if nobody does, try to learn how to do it myself, but start with the dataset since that's required no matter what).
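For concreteness, here's roughly the kind of render loop I have in mind (a sketch using Blender's bpy API; the scene setup, paths, and token names are placeholders, not a finished script):

```python
# Sketch of the dataset render loop (run inside Blender). Assumes a .blend file
# with a camera named "Camera" and a scene already set up; token names and
# bucket sizes are placeholders for whatever convention the dataset settles on.
import bpy
import json
import math
import random

scene = bpy.context.scene
cam = bpy.data.objects["Camera"]
records = []

for i in range(1000):
    # Sample a camera position on a sphere around the origin.
    radius = random.uniform(4.0, 10.0)
    azimuth = random.uniform(0.0, 2.0 * math.pi)
    elevation = random.uniform(0.1, 1.2)  # radians above the horizon
    cam.location = (
        radius * math.cos(azimuth) * math.cos(elevation),
        radius * math.sin(azimuth) * math.cos(elevation),
        radius * math.sin(elevation),
    )
    # Aiming the camera back at the origin is left to a Track To constraint
    # configured in the .blend file.

    scene.render.filepath = f"/tmp/renders/{i:05d}.png"
    bpy.ops.render.render(write_still=True)

    # Quantized caption tokens encoding the camera parameters.
    records.append({
        "file": scene.render.filepath,
        "caption": (
            f"camera_r_{int(radius)} "
            f"camera_az_{int(math.degrees(azimuth))} "
            f"camera_el_{int(math.degrees(elevation))}"
        ),
    })

with open("/tmp/renders/captions.json", "w") as f:
    json.dump(records, f, indent=2)
```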
That's similar to creating a dataset for a NeRF model, which is very challenging since it requires a massive amount of high-quality 3D renders (you don't want the diffusion model's image quality to degrade from conditioning on low-quality renders) with a variety of novel views. I think it would be easier to add the positions as a caption, from a VLM or a camera position estimator (OpenCV and such), on an existing text-to-image dataset (LAION-2B, for example) rather than generating novel views from 3D renders. The render route would require an immense dataset of 3D scenes and objects (Objaverse or something similar) to prevent overfitting, and even then the model could overfit to the actual *visual* information of the 3D scene rather than the camera angles you want it to be conditioned on (remember, the diffusion model has absolutely no idea what you are trying to make it learn; it simply learns the most prominent patterns in the data).
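To be clear about what I mean by a camera position estimator, something along these lines (a rough sketch with OpenCV's solvePnP, assuming you already have 2D-3D keypoint correspondences and camera intrinsics for each image, which in practice is the hard part):

```python
# Sketch: estimate a camera pose for an existing image and turn it into caption
# tokens. The correspondences, intrinsics, and token names below are placeholders.
import cv2
import numpy as np

# Hypothetical known 3D points (corners of a 1 m square on the ground) and the
# pixel coordinates where they appear in the image.
object_points = np.array([[0, 0, 0], [1, 0, 0], [1, 1, 0], [0, 1, 0]], dtype=np.float32)
image_points = np.array([[320, 400], [620, 410], [600, 250], [340, 240]], dtype=np.float32)
camera_matrix = np.array([[1000.0, 0.0, 512.0],
                          [0.0, 1000.0, 512.0],
                          [0.0, 0.0, 1.0]])
dist_coeffs = np.zeros(5)

ok, rvec, tvec = cv2.solvePnP(object_points, image_points, camera_matrix, dist_coeffs)
if ok:
    rotation, _ = cv2.Rodrigues(rvec)  # 3x3 rotation matrix
    yaw_deg = np.degrees(np.arctan2(rotation[1, 0], rotation[0, 0]))
    pose_tokens = f"camera_yaw_{int(yaw_deg)} camera_z_{int(tvec[2, 0])}"
    # Append pose_tokens to the image's existing caption before training.
    print(pose_tokens)
```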
So the idea here is in fact not to generate high-quality renders, but to render crappy-quality renders, turn those into canny / lineart images, and feed those to a ControlNet to generate images in many styles (realistic, anime, etc.). I'm also planning to use IP-Adapter to "steal" the style of existing images.
I've already tested this render -> canny -> controlnet -> generation process, and it works very well: it follows the geometry/orientation closely, and it's pretty inexpensive to generate images in a lot of orientations and styles.
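Roughly, that pipeline looks like the following (a sketch with the diffusers library and the public canny ControlNet checkpoints; the model IDs, paths, and prompt are just examples, not my exact setup):

```python
# Sketch of the render -> canny -> ControlNet -> generation step, using diffusers.
# Model IDs are the standard public checkpoints; paths and prompt are examples.
import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import (
    ControlNetModel,
    StableDiffusionControlNetPipeline,
    UniPCMultistepScheduler,
)

# Edge map from a low-quality 3D render; only the geometry matters here.
render = cv2.imread("render_00042.png")
gray = cv2.cvtColor(render, cv2.COLOR_BGR2GRAY)
edges = cv2.Canny(gray, 100, 200)
control_image = Image.fromarray(np.stack([edges] * 3, axis=-1))

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)

image = pipe(
    "a cozy cabin interior, anime style, camera_az_45 camera_el_20",
    image=control_image,
    num_inference_steps=20,
).images[0]
image.save("styled_00042.png")
```

Swapping the style part of the prompt (or adding IP-Adapter on top) is what gives the same geometry in many different looks.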
That is very interesting as a dataset generation technique, but I am curious how you would ensure there is enough visual variety in the dataset for the model to maintain image generation quality across different prompts, since underrepresented images will have significantly lower quality due to the model trying to generalize from a small sample.
The plan is to take existing datasets (so, millions of images), have vision-language models like Florence-2 describe each image (focusing on the style), and then use that description data as the basis for generating the "style" of the generations.
That, plus static lists of "styles" (I have a list of about 100), plus seed-based randomness, plus inserting random tokens on top of that, should give me some pretty good variety. It does in "manual" tests.
That's one thing that's nice about diffusion models: it's pretty easy to get variation.
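Something like this is how I picture the prompt assembly (a sketch; the style list, extra tokens, and camera tokens are placeholders, and Florence-2 or any other VLM would supply the per-image style description):

```python
# Sketch of combining a VLM-derived style description, a static style list,
# seed-based randomness, and extra random tokens into a generation prompt.
import random

STYLES = ["watercolor", "anime", "photorealistic", "oil painting", "pixel art"]
EXTRA_TOKENS = ["soft lighting", "high detail", "film grain", "wide angle shot"]

def build_prompt(vlm_style_description: str, camera_tokens: str, seed: int) -> str:
    rng = random.Random(seed)  # seed-based, so each prompt is reproducible
    style = rng.choice(STYLES)
    extras = ", ".join(rng.sample(EXTRA_TOKENS, k=2))
    return f"{vlm_style_description}, {style}, {extras}, {camera_tokens}"

print(build_prompt(
    "a moody night-time street scene with neon reflections",
    "camera_az_45 camera_el_20",
    seed=123,
))
```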
It would be cool to see a large captioned dataset paired with structured outputs for the attributes you want to control (like the ones you mentioned earlier: camera_z_123, sunlit, sun_distance_25, etc.). The tricky part is still generating enough 3D scenes and avoiding image degradation, since diffusion models trained on generated outputs tend to degrade in image quality. It sounds like a promising approach, however.
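For what it's worth, this is the kind of record I'm picturing (a sketch; the field names just mirror the tokens mentioned above and are only an example schema):

```python
# Sketch of one dataset record: a free-text caption paired with the structured
# attributes, so either form of conditioning can be derived from the same data.
import json
from dataclasses import dataclass, asdict

@dataclass
class CameraRecord:
    image: str
    caption: str
    camera_z: float
    camera_azimuth_deg: float
    camera_elevation_deg: float
    sunlit: bool
    sun_distance: float

record = CameraRecord(
    image="renders/00042.png",
    caption="a small cabin on a hill at sunset, seen from above and behind",
    camera_z=123.0,
    camera_azimuth_deg=210.0,
    camera_elevation_deg=35.0,
    sunlit=True,
    sun_distance=25.0,
)
print(json.dumps(asdict(record), indent=2))
```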
Hello!
Thanks for the amazing project.
I'm often in the situation where I've generated a scene I really like, but I'd like to rotate the camera a bit more to the right, or zoom in, or put the camera a bit higher up, etc.
Currently, the only way I found to do this would be to generate a 3D model of the scene (possibly automatically from a ControlNet-generated depth map?), rotate that, generate a new depth map, and use that to regenerate the image.
But:
Another option somebody suggested was training LoRAs on specific angles and having many LoRAs for many different angles/camera positions. But again, that's pretty cumbersome (and a lot of training), and I'm not even sure it would work.
Or train a single LoRA, but with a dataset that matches many different angle "keywords" to many differently positioned images? As you can see, I'm a bit lost.
I figured that what I really want to do is manipulate the part of the model's "internal conception" of the scene that defines its rotation (if there is such a thing...). There has to be some set of weights that determines whether we look at a subject from the front or the back, whether a face is seen sideways or in three-quarter view, etc.
So my question is: would it be possible to create a ControlNet that does this?
The main problem I see is that ControlNet training, as described in
https://github.com/lllyasviel/ControlNet/blob/main/docs/train.md
takes images as input.
But the input in my case wouldn't be an image, it would be an angle.
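The only way I can picture squeezing an angle through that image input is something like painting the scalars into constant-valued channels (a purely speculative sketch on my part, not anything the repo provides):

```python
# Speculative sketch: pack camera scalars into a constant-valued RGB "hint" image
# shaped like the source images the training tutorial expects. Normalization
# ranges are arbitrary placeholders.
import numpy as np
from PIL import Image

def pose_to_hint(azimuth_deg: float, elevation_deg: float, distance: float,
                 size: int = 512) -> Image.Image:
    channels = np.stack([
        np.full((size, size), (azimuth_deg % 360.0) / 360.0),
        np.full((size, size), np.clip(elevation_deg / 90.0, 0.0, 1.0)),
        np.full((size, size), np.clip(distance / 50.0, 0.0, 1.0)),
    ], axis=-1)
    return Image.fromarray((channels * 255).astype(np.uint8))

hint = pose_to_hint(azimuth_deg=210, elevation_deg=35, distance=8)
hint.save("hint_00042.png")  # would play the role of the "source" conditioning image
```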
So my best guess of how to do this (and this is likely completely wrong), would be:
Would this work? Does it have any chance to? If not, is there any way to do this that would work?
Would a single 3D scene (or even a dumb cube on a plane) work, or do I need a large variety of scenes?
I would love some kind of input/feedback/advice on this.
Thanks so much to anyone who takes the time to reply.
Cheers.