
About music generation with perceiver-ar model #3

Open · feizc opened this issue Jun 29, 2022 · 6 comments

feizc commented Jun 29, 2022

Hi, @lucidrains

Thanks for the implementation of the Perceiver-AR model.
We conducted experiments on pop music generation at https://github.com/feizc/Perceiver-Music-Generation.
The results are encouraging; we are grateful to you :)

lucidrains (Owner) commented

🎶🤖😄

lucidrains (Owner) commented

@feizc how are you approaching the problem of generating starting from a length that is less than the prefix?

feizc commented Jun 30, 2022

> @feizc how are you approaching the problem of generating starting from a length that is less than the prefix?

Actually, I use a fixed length of conditional context, i.e., a prefix of prior music, to continue writing the next melody.

In my opinion, to start from zero, we can either use a special token like [pad] to fill out the prefix length, or use the decoder alone to generate an initial sequence and then continue generation conditioned on the latents.

I read the source code and found that the authors begin with zeros :)


```python
def gen_initial_events():
    # allocate an all-zero event buffer, one per device and batch element
    events = np.zeros([device_count, batch_size, max_events_length], np.int32)
    # mark the first position of every sequence with the start-of-sequence id
    events[:, :, 0] = dataset.SOS_ID
    return events
```
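For what it's worth, a minimal sketch of the [pad] idea in PyTorch could look like the following (`PAD_ID`, `SOS_ID`, and `PREFIX_LEN` are hypothetical placeholders for illustration, not part of either repo's API):

```python
import torch

# hypothetical ids / sizes for illustration only
PAD_ID = 0        # assumed pad-token id
SOS_ID = 1        # assumed start-of-sequence id
PREFIX_LEN = 512  # assumed cross-attend prefix length

def pad_prompt_to_prefix(prompt):
    """Left-pad a (batch, t) prompt with PAD_ID until it reaches PREFIX_LEN,
    so a short (or empty) prompt can still fill the cross-attention prefix."""
    batch, t = prompt.shape
    if t >= PREFIX_LEN:
        return prompt
    pad = torch.full((batch, PREFIX_LEN - t), PAD_ID,
                     dtype=prompt.dtype, device=prompt.device)
    return torch.cat((pad, prompt), dim=-1)

# starting from scratch: a lone SOS token, padded out to the prefix length
prompt = torch.full((1, 1), SOS_ID, dtype=torch.long)
prefix = pad_prompt_to_prefix(prompt)  # shape (1, PREFIX_LEN)
```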

usryokousha commented

After reviewing the current implementation (autoregressive_wrapper), it seems you generate each subsequent token one at a time, as would be the case in most architectures. The authors of the Perceiver AR paper outlined a strided approach (typically with a stride the size of the self-attention sequence length) in which sampled tokens are cached up to a certain size and the buffer is then freed. Have you considered implementing this? The officially released perceiver-ar implementation is relatively easy to follow.
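A rough sketch of that strided loop, assuming a generic `model` callable that maps a (batch, seq) tensor of token ids to per-position logits (the actual signatures in both repos differ):

```python
import torch

@torch.no_grad()
def strided_generate(model, prime, seq_len, latent_len=1024):
    """Sketch of strided decoding as described in the Perceiver AR paper:
    tokens are sampled one at a time into a buffer of up to `latent_len`
    new positions; once the buffer fills, it is committed to the prefix
    and freed. In a real implementation the prefix key/values would be
    cached across the inner loop, which is where the speedup comes from;
    this naive version only shows the control flow."""
    out = prime
    while out.shape[-1] < seq_len:
        prefix = out  # context frozen for the duration of this stride
        budget = min(latent_len, seq_len - out.shape[-1])
        buffer = torch.empty((out.shape[0], 0), dtype=torch.long, device=out.device)
        for _ in range(budget):
            # logits for the last position of prefix + sampled buffer
            logits = model(torch.cat((prefix, buffer), dim=-1))[:, -1]
            sample = torch.multinomial(logits.softmax(dim=-1), 1)
            buffer = torch.cat((buffer, sample), dim=-1)
        out = torch.cat((prefix, buffer), dim=-1)  # stride done: commit and free
    return out
```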

lucidrains (Owner) commented

> After reviewing the current implementation (autoregressive_wrapper), it seems you generate each subsequent token one at a time, as would be the case in most architectures. The authors of the Perceiver AR paper outlined a strided approach (typically with a stride the size of the self-attention sequence length) in which sampled tokens are cached up to a certain size and the buffer is then freed. Have you considered implementing this?

noo not yet, i haven't implemented their special caching strategy at inference

but if i keep hearing more positive results, i may implement it! have to admit i was doubtful about the architecture initially

usryokousha commented

I'm curious to see how well this would work at inference, particularly when using a VQ-VAE / VQGAN to encode images. If you could decode in only several steps, that would really speed up generation. I suspect quality would suffer, but the paper's ImageNet results seem promising.
