Add marlin int4 kernel #333

Merged
merged 10 commits into from
Oct 10, 2024

dacorvo (Collaborator) commented Oct 6, 2024

What does this PR do?

This adds a modified Marlin fp16/int4 kernel to the library and creates two new QTensor subclasses to use it:

  • MarlinInt4PackedTensor,
  • MarlinInt4WeightQBitsTensor.
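For readers unfamiliar with fp16/int4 kernels, the sketch below illustrates the three pieces of data such a kernel reads back: packed 4-bit weights, a per-group scale and a per-group zero-point. It is a minimal pure-Python illustration under stated assumptions: the function names are hypothetical, the packing is a plain two-nibbles-per-byte scheme, and none of it reflects the actual Marlin memory layout or the API of the classes listed above.

```python
# Illustrative only: plain nibble packing, NOT the permuted layout a real
# Marlin kernel expects, and NOT the library's API. All names are hypothetical.

def quantize_group(values, bits=4):
    """Affine-quantize one group of floats to unsigned ints in [0, 2**bits - 1]."""
    qmax = (1 << bits) - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / qmax or 1.0  # fall back to 1.0 for a constant group
    zero_point = round(-lo / scale)
    q = [min(qmax, max(0, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point


def pack_int4(q):
    """Pack pairs of 4-bit values into bytes, low nibble first."""
    if len(q) % 2:
        raise ValueError("need an even number of 4-bit values")
    return bytes(q[i] | (q[i + 1] << 4) for i in range(0, len(q), 2))


def unpack_int4(packed):
    """Inverse of pack_int4: recover the list of 4-bit values."""
    out = []
    for b in packed:
        out.extend((b & 0xF, b >> 4))
    return out


def dequantize_group(q, scale, zero_point):
    """Map quantized integers back to floats using the group's scale and zero-point."""
    return [(v - zero_point) * scale for v in q]


# One group of eight weights: quantize, pack, then read everything back.
group = [-1.0, -0.5, 0.0, 0.25, 0.5, 0.75, 1.0, 0.1]
q, scale, zero_point = quantize_group(group)
packed = pack_int4(q)
restored = dequantize_group(unpack_int4(packed), scale, zero_point)
```

A fused kernel performs the unpack and dequantize steps on the fly while computing the matrix product, so an error in reading back any of the three pieces shows up as wrong output values.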

The kernel has issues with the readback of weights, scales and zero-points as soon as parallelization increases. As a result, output features higher than 128 are corrupted when enough inputs are processed in parallel.

For this reason, the AWQ kernel is still used, even though its performance is lower as the number of tokens increases.

The code is nevertheless merged as is, and #332 has been created to investigate the issues.

dacorvo and others added 10 commits October 10, 2024 12:07
Original fix in vLLM project:

The reason for the crash was the inline PTX assembly that introduced
the async_copy with streaming behavior. The solution is to use the more
standard PTX for async_copy (without the fractional L2 policy for
"evict_first"). There is no performance difference between standard
async_copy PTX and the previous one.

This is to guarantee that the output of the Marlin kernels is similar to
the output obtained using dequantized weights.

Adding more tests revealed a bug in the Marlin int4 kernel when the
weights and inputs are large enough.
Failing configurations are marked as xfail.
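The kind of check described in these commit messages can be sketched as follows: run the kernel under test, compute a reference result with a plain matmul on dequantized weights, and report the first output feature that deviates. This is a hedged pure-Python sketch, not the test code from this PR. `fake_kernel` is a hypothetical stand-in that only mimics the reported symptom, and its thresholds (16 tokens, feature index 128) are arbitrary values chosen for illustration.

```python
def matmul(inputs, weights):
    """Reference matmul: inputs [tokens][in], weights [in][out] -> [tokens][out]."""
    out_features = len(weights[0])
    return [
        [sum(x * w[j] for x, w in zip(row, weights)) for j in range(out_features)]
        for row in inputs
    ]


def first_corrupted_feature(actual, expected, atol=1e-3):
    """Return the lowest output-feature index whose values differ, or None."""
    bad = [
        j
        for a_row, e_row in zip(actual, expected)
        for j, (a, e) in enumerate(zip(a_row, e_row))
        if abs(a - e) > atol
    ]
    return min(bad) if bad else None


def fake_kernel(inputs, weights, max_safe_tokens=16, safe_features=128):
    """Hypothetical stand-in for a buggy kernel; this is NOT Marlin.

    It returns the correct result for small batches and perturbs every
    output feature at index >= safe_features for larger ones.
    """
    out = matmul(inputs, weights)
    if len(inputs) > max_safe_tokens:
        for row in out:
            for j in range(safe_features, len(row)):
                row[j] += 1.0
    return out


# Deterministic toy data standing in for dequantized weights and activations.
in_features, out_features = 8, 160
weights = [
    [((i * 7 + j * 3) % 16 - 8) * 0.05 for j in range(out_features)]
    for i in range(in_features)
]


def make_inputs(tokens):
    return [[((t + i) % 5) * 0.1 for i in range(in_features)] for t in range(tokens)]


# Compare the kernel under test against the reference for two batch sizes.
results = {}
for tokens in (4, 32):
    x = make_inputs(tokens)
    results[tokens] = first_corrupted_feature(fake_kernel(x, weights), matmul(x, weights))
```

In a pytest suite, the batch sizes would typically be supplied through `pytest.mark.parametrize`, and a configuration known to fail could be wrapped in `pytest.param(..., marks=pytest.mark.xfail)` so that the suite stays green while the bug is investigated.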
dacorvo merged commit 852bb9c into main Oct 10, 2024
16 checks passed
dacorvo deleted the add_marlin_int4_kernel branch October 10, 2024 11:38