Update pyramid_dit_for_video_gen_pipeline.py #100
Quasimondo wants to merge 2 commits into jy0205:main
Several optimizations that try to reduce memory allocations (so far only implemented for image-to-video)
|
Thank you for your contribution. I noticed that there are several changes in the file. Could you help me identify which are the critical ones related to memory leakage? I will merge them into the main branch. |
|
Oh yeah, I realize that I should have made this in smaller steps. There is one main improvement: the changes inside generate_i2v(), which pre-allocate the generated_latents tensor before the loop and thus avoid building a list that then needs to be concatenated. In there I also delete a few objects after their use. Not sure if it makes a difference, since garbage collection should take care of them, but I don't think it makes it worse either. The other, smaller change is to sample_block_noise(), which now generates that tensor directly on the GPU. Unfortunately it has to do it in float, since "cholesky_cusolver" is not implemented for 'BFloat16'. There are several places where I replaced torch.cat([xy]*2) with repeat_interleave(2, dim=0). Not sure if that does much, but it also does not seem to hurt. And there are one or two places where I changed a calculation to run in-place. |
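For readers following along, here is a minimal sketch of the two tensor patterns mentioned in the comment above. It is not the code from this PR: the function names, the (batch, channels, frames, height, width) layout, and the choice of dim=2 as the frame axis are all assumptions made for illustration. The last part shows that torch.cat([x]*2) and repeat_interleave(2, dim=0) order the duplicated rows differently whenever the batch size is larger than 1, so the swap is only a drop-in replacement for a batch of one.

```python
import torch


def collect_with_cat(chunks):
    """Baseline: keep every chunk in a list, then concatenate once at the end."""
    return torch.cat(list(chunks), dim=2)


def collect_preallocated(chunks, total_frames):
    """Allocate the output once and copy each chunk into its slice.

    Assumes (hypothetically) that every chunk has shape (B, C, T_i, H, W)
    and that the T_i sum to total_frames.
    """
    out = None
    offset = 0
    for chunk in chunks:
        if out is None:
            b, c, _, h, w = chunk.shape
            out = torch.empty(
                (b, c, total_frames, h, w),
                dtype=chunk.dtype,
                device=chunk.device,
            )
        t = chunk.shape[2]
        out[:, :, offset:offset + t] = chunk  # in-place write, no extra list
        offset += t
    return out


# Both strategies produce the same tensor.
chunks = [torch.randn(1, 4, t, 8, 8) for t in (1, 2, 2)]
same = torch.equal(collect_with_cat(chunks), collect_preallocated(chunks, 5))
print(same)  # True

# Duplicating along the batch axis: the row order differs for batch size > 1.
x = torch.arange(2).view(2, 1)
print(torch.cat([x] * 2).flatten().tolist())            # [0, 1, 0, 1]
print(x.repeat_interleave(2, dim=0).flatten().tolist())  # [0, 0, 1, 1]
print(x.repeat(2, 1).flatten().tolist())                 # [0, 1, 0, 1]
```

The pre-allocated version never holds both the list of chunks and the concatenated result at the same time, which is where any saving in peak memory would come from. Whether that saving is significant in the actual pipeline is not something this sketch can show.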
|
No good, sadly. Can't even make it past step 17 with this PR. It's using more and more memory every step until it uses all 24 GB and I run out. |
|
Well, can you run it without the patch, and does it work on your machine? |
|
Yes, it works with or without the patch, but in both cases it eventually runs out of memory and crashes. |
Implemented the pre-allocation of generated_latents also in the generate() method
|
Okay, it sounded like it did not work at all with the patch. Well, unfortunately this fix cannot work wonders. On my 24 GB I can do 31 frames at 384p, but I cannot do 768p at all (with or without the patch). |
|
Oh. Gotcha! |
Tested it locally on my RTX 3090 and it seemed to reduce memory leakage, so that subsequent runs were possible without the machine locking up.