From 17fdc2eba5838d776980532745ae62edc890767f Mon Sep 17 00:00:00 2001
From: Joshua David
Date: Mon, 15 Jul 2024 22:52:07 -0700
Subject: [PATCH] Update README.md to be more detailed

---
 README.md | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/README.md b/README.md
index 28013bd..9ebdef6 100644
--- a/README.md
+++ b/README.md
@@ -67,6 +67,30 @@ An in-depth look at the structural modifications and their implications for mode
 
 The **LongRoPE** model architecture is designed to extend the context window of large language models (LLMs) to over 2 million tokens, addressing the limitations of traditional Transformer architectures. The key innovation lies in the progressive extension strategy and the adjustment of positional embeddings.
 
+The LongRoPE model extends the context window of large language models beyond 2 million tokens. Key components include:
+
+1. Rotary Position Encoding (RoPE):
+   ```python
+   import torch
+   import torch.nn as nn
+
+   class RoPEPositionalEncoding(nn.Module):
+       def __init__(self, d_model, max_len=1000000, base=10000):
+           super().__init__()
+           self.d_model = d_model
+           self.max_len = max_len
+           self.base = base
+           # Per-dimension rotation frequencies: theta_i = base^(-2 * (i // 2) / d_model)
+           self.theta = torch.tensor([base ** (-2 * (i // 2) / d_model) for i in range(d_model)])
+
+       def forward(self, positions):
+           # One rotation angle per position and embedding dimension
+           angles = positions.unsqueeze(-1) * self.theta
+           # Stack cosine and sine components, then flatten the last two dimensions
+           sin_cos = torch.stack([angles.cos(), angles.sin()], dim=-1)
+           return sin_cos.view(*sin_cos.shape[:-2], -1)
+   ```
+
 ### Progressive Extension Strategy
 
 The architecture begins with a pre-trained LLM and extends its context window incrementally. Initially, the model is fine-tuned to handle a context length of 256k tokens. This progressive approach avoids the need for direct fine-tuning on extremely long texts, which are rare and computationally expensive to process. By gradually increasing the context length, the model can adapt more effectively to longer sequences.
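
The progressive extension strategy described above amounts to a staged fine-tuning schedule in which each stage targets a longer context window than the last, with token positions rescaled at each stage. The sketch below is illustrative only, not the repository's implementation: the `CONTEXT_SCHEDULE` values beyond the 256k stage, the `original_context` length, and the `interpolated_positions` helper are assumptions, and a simple uniform rescaling stands in for LongRoPE's actual positional-embedding adjustment.

```python
import torch

# Illustrative stages: the README names a 256k first stage; later values are assumed.
CONTEXT_SCHEDULE = [256_000, 1_024_000, 2_048_000]

def interpolated_positions(seq_len, target_context, original_context=4_096):
    """Rescale token positions so longer sequences stay within the original RoPE range."""
    scale = original_context / target_context  # < 1 when extending the window
    return torch.arange(seq_len, dtype=torch.float32) * scale

for target in CONTEXT_SCHEDULE:
    positions = interpolated_positions(seq_len=8, target_context=target)
    # Each stage would fine-tune on sequences encoded with these rescaled positions
    # before moving on to the next, longer target context.
    print(f"stage up to {target:,} tokens -> first positions {positions[:4].tolist()}")
```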