
About music generation with perceiver-ar model #3

Open · feizc opened this issue Jun 29, 2022 · 6 comments

feizc commented Jun 29, 2022

Hi, @lucidrains

Thanks for the implementation of the Perceiver-AR model.
We conducted experiments on pop music generation at https://github.com/feizc/Perceiver-Music-Generation.
The results are encouraging; we are grateful to you :)

lucidrains (Owner) commented

🎶🤖😄

lucidrains (Owner) commented

@feizc how are you approaching the problem of generating starting from a length that is less than the prefix?

feizc commented Jun 30, 2022

> @feizc how are you approaching the problem of generating starting from a length that is less than the prefix?

Actually, I use a fixed length of conditional context, i.e., a prefix of prior music, to continue writing the next melody.

In my opinion, to start from zero, we can either use a special token like [pad] to fill out the prefix length, or use the decoder alone to generate an initial sequence and then continue generation conditioned on the latents.

I read the source code and found that the authors begin with zeros :)


```python
def gen_initial_events():
    # allocate an all-zero event buffer, one per device and batch element
    events = np.zeros([device_count, batch_size, max_events_length], np.int32)
    # mark the first position of every sequence with the start-of-sequence id
    events[:, :, 0] = dataset.SOS_ID
    return events
```
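For what it's worth, a minimal sketch of the [pad] idea in PyTorch could look like the following (`PAD_ID`, `SOS_ID`, and `PREFIX_LEN` are hypothetical placeholders for illustration, not part of either repo's API):

```python
import torch

# hypothetical ids / sizes for illustration only
PAD_ID = 0        # assumed pad-token id
SOS_ID = 1        # assumed start-of-sequence id
PREFIX_LEN = 512  # assumed cross-attend prefix length

def pad_prompt_to_prefix(prompt):
    """Left-pad a (batch, t) prompt with PAD_ID until it reaches PREFIX_LEN,
    so a short (or empty) prompt can still fill the cross-attention prefix."""
    batch, t = prompt.shape
    if t >= PREFIX_LEN:
        return prompt
    pad = torch.full((batch, PREFIX_LEN - t), PAD_ID,
                     dtype=prompt.dtype, device=prompt.device)
    return torch.cat((pad, prompt), dim=-1)

# starting from scratch: a lone SOS token, padded out to the prefix length
prompt = torch.full((1, 1), SOS_ID, dtype=torch.long)
prefix = pad_prompt_to_prefix(prompt)  # shape (1, PREFIX_LEN)
```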

usryokousha commented

After reviewing the current implementation (autoregressive_wrapper), it seems you generate each subsequent token one at a time, as would be the case in most architectures. The authors of the Perceiver AR paper outlined a strided approach (typically with a stride the size of the self-attention sequence length) in which sampled tokens are cached up to a certain size and the buffer is then freed. Have you considered implementing this? The officially released perceiver-ar implementation is relatively easy to follow.
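A rough sketch of that strided loop, assuming a generic `model` callable that maps a (batch, seq) tensor of token ids to per-position logits (the actual signatures in both repos differ):

```python
import torch

@torch.no_grad()
def strided_generate(model, prime, seq_len, latent_len=1024):
    """Sketch of strided decoding as described in the Perceiver AR paper:
    tokens are sampled one at a time into a buffer of up to `latent_len`
    new positions; once the buffer fills, it is committed to the prefix
    and freed. In a real implementation the prefix key/values would be
    cached across the inner loop, which is where the speedup comes from;
    this naive version only shows the control flow."""
    out = prime
    while out.shape[-1] < seq_len:
        prefix = out  # context frozen for the duration of this stride
        budget = min(latent_len, seq_len - out.shape[-1])
        buffer = torch.empty((out.shape[0], 0), dtype=torch.long, device=out.device)
        for _ in range(budget):
            # logits for the last position of prefix + sampled buffer
            logits = model(torch.cat((prefix, buffer), dim=-1))[:, -1]
            sample = torch.multinomial(logits.softmax(dim=-1), 1)
            buffer = torch.cat((buffer, sample), dim=-1)
        out = torch.cat((prefix, buffer), dim=-1)  # stride done: commit and free
    return out
```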

lucidrains (Owner) commented

> After reviewing the current implementation (autoregressive_wrapper), it seems you generate each subsequent token one at a time, as would be the case in most architectures. The authors of the Perceiver AR paper outlined a strided approach (typically with a stride the size of the self-attention sequence length) in which sampled tokens are cached up to a certain size and the buffer is then freed. Have you considered implementing this?

noo not yet, i haven't implemented their special caching strategy at inference

but if i keep hearing more positive results, i may implement it! have to admit i was doubtful about the architecture initially

usryokousha commented

I'm curious to see how well this would work at inference, particularly when using a VQ-VAE / VQGAN to encode images. If you could decode in only several steps, that would really speed up generation. I suspect quality would suffer, but the paper's ImageNet results seem promising.
