Large language models

Read more

[practice notes] Fine-grained inference

If the model.generate interface does not give you enough control, you can write your own inference loop with iterative forward passes, reusing the cached keys and values between steps. Here's how it's done:

import torch

prefix = "Mark Zuckerberg is"  # same as above
# keep the inputs on the same device as the model
batch = tokenizer(prefix, return_tensors='pt').to(model.device)
past_key_values = None
with torch.no_grad(), torch.autocast('cuda'):
  for i in range(50):
    outputs = model(**batch, use_cache=True, past_key_values=past_key_values)
    # divide logits by temperature 0.8, then sample from the resulting distribution
    probs = outputs.logits[0, -1].float().div(0.8).softmax(-1)
    token = torch.multinomial(probs, 1).view([])

    print(tokenizer.decode(token), end=' ', flush=True)
    past_key_values = outputs.past_key_values
    # feed back the token that was actually sampled (not the argmax),
    # so the printed text and the model's context stay in sync
    batch = dict(input_ids=token.reshape(1, 1),
                 attention_mask=torch.ones(1, past_key_values[0][0].shape[-2] + 1, device=token.device))
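To see what the temperature division in the loop does, here is a minimal pure-Python sketch of the same two steps: softmax over logits scaled by a temperature, then drawing one index from the result. The helper names `softmax_with_temperature` and `sample` and the toy logits are illustrative assumptions, not part of the original notes or of any library.

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw logits into probabilities after dividing by `temperature`."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample(probs, rng=random):
    """Draw one index according to `probs` (the role torch.multinomial plays above)."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.0]  # toy values, not real model outputs
sharp = softmax_with_temperature(logits, temperature=0.8)
plain = softmax_with_temperature(logits, temperature=1.0)
next_index = sample(sharp)
```

A temperature below 1 concentrates probability on the highest-scoring token, so `sharp[0]` is larger than `plain[0]`; a temperature above 1 flattens the distribution instead.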

[practice notes] How to optimize for inference

The code below converts training-optimized 8-bit weights into an inference-optimized layout. It should result in significantly faster inference with the same memory footprint. However, once you do this, you can no longer train the model: there is no way to un-convert after the first optimized forward pass!

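# assumes bitsandbytes was imported earlier as `import bitsandbytes as bnb`
model.config.use_cache = True  # keep key/value caches between generation steps
for module in model.modules():
    if isinstance(module, bnb.nn.Linear8bitLt):
        # one-way switch: the weights are converted on the next forward pass
        module.state.memory_efficient_backward = False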
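Whether the conversion pays off depends on your hardware, so it is worth timing a forward pass before and after. Below is a small sketch of a timing helper using only the standard library. The name `time_call` and its `sync` parameter are hypothetical, introduced here for illustration.

```python
import time


def time_call(fn, repeats=5, sync=None):
    """Return the best wall-clock time in seconds of `fn()` over `repeats` runs.

    `sync` is an optional zero-argument callable run right before each clock
    read, so that work still queued on an accelerator is included in the timing.
    """
    best = float("inf")
    for _ in range(repeats):
        if sync is not None:
            sync()
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()
        best = min(best, time.perf_counter() - start)
    return best


if __name__ == "__main__":
    # CPU-only stand-in workload; swap in a model forward pass in practice
    elapsed = time_call(lambda: sum(range(100_000)))
    print(f"best of 5: {elapsed * 1e3:.3f} ms")
```

For GPU code, pass `sync=torch.cuda.synchronize`, since CUDA kernels run asynchronously and the clock would otherwise stop before they finish. Because the conversion cannot be undone, take the "before" measurement first.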