MetaForm is an open-source transformer model and library for high-performance natural language processing (NLP). It provides a robust, flexible foundation for building and experimenting with transformer-based architectures in both research and production environments.
- Modular Design: Easily customizable transformer blocks and models.
- Advanced Training: Built-in support for gradient checkpointing, mixed precision training, and distributed training.
- Scalability: Efficient memory management and parallelization for large-scale models.
- Flexibility: Tools for easy layer distribution across multiple devices, gradient aggregation, and parameter updates.
To install MetaForm, you can clone the repository and install the necessary dependencies:
```bash
git clone https://github.com/torinriley/metaform.git
cd metaform
pip install -r requirements.txt
```
Here’s a quick example of building a transformer model with MetaForm and running a forward pass:
```python
from metaform.transformer.core import TransformerModel
from metaform.tools.matrix import Matrix

# Model hyperparameters
embed_size = 512
num_heads = 8
ff_hidden_size = 2048
num_layers = 12
dropout = 0.1

model = TransformerModel(embed_size, num_heads, ff_hidden_size, num_layers, dropout)

# Dummy input: a batch of 64 embeddings
input_data = Matrix.random(64, embed_size)

# Forward pass
output = model.forward(input_data)
print("Output shape:", output.shape())
```
This flowchart visualizes the MetaForm Transformer Model as a Large Language Model (LLM) for sequence generation. It traces the core components from input preprocessing and positional encoding through the stacked transformer blocks and the multi-head attention mechanism, and it marks key decision points such as gradient checkpointing for memory optimization and End-of-Sequence (EOS) token generation, which determines when the model stops generating text. The decoder and softmax layers for token prediction lead to the final sequence output, giving a clear picture of the data flow through the MetaForm LLM.
- TransformerModel: The main transformer model class, allowing for the creation of custom transformer architectures with a flexible number of layers and attention heads.
- GradientCheckpointing: Utility for memory-efficient training through gradient checkpointing.
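For intuition, the sketch below illustrates the idea behind gradient checkpointing in plain Python: the forward pass keeps only the activations at segment boundaries, and the backward pass recomputes the activations inside each segment before backpropagating through it. This is a conceptual example with toy layers, not MetaForm's actual GradientCheckpointing API.

```python
# Conceptual sketch of gradient checkpointing (not MetaForm's API):
# keep only segment-boundary activations, recompute the rest during backward.
import math

# Toy "layers": (forward function, derivative of output w.r.t. input).
layers = [
    (lambda x: 2.0 * x,      lambda x: 2.0),
    (lambda x: math.tanh(x), lambda x: 1.0 - math.tanh(x) ** 2),
    (lambda x: x + 1.0,      lambda x: 1.0),
    (lambda x: x ** 2,       lambda x: 2.0 * x),
]

def forward_with_checkpoints(x, segment_size=2):
    """Forward pass that stores only the input of each segment."""
    checkpoints = [x]
    for i, (f, _) in enumerate(layers):
        x = f(x)
        if (i + 1) % segment_size == 0 and (i + 1) < len(layers):
            checkpoints.append(x)  # boundary activation = next segment's input
    return x, checkpoints

def backward_with_recompute(grad_out, checkpoints, segment_size=2):
    """Backward pass that recomputes activations segment by segment."""
    grad = grad_out
    for seg in reversed(range(len(checkpoints))):
        start = seg * segment_size
        end = min(start + segment_size, len(layers))
        # Recompute this segment's intermediate activations from its checkpoint.
        acts = [checkpoints[seg]]
        for f, _ in layers[start:end]:
            acts.append(f(acts[-1]))
        # Backpropagate through the segment using the recomputed activations.
        for i in reversed(range(start, end)):
            _, df = layers[i]
            grad = grad * df(acts[i - start])
    return grad  # d(output) / d(input)

y, ckpts = forward_with_checkpoints(0.5)
print("output:", y)
print("d_output/d_input:", backward_with_recompute(1.0, ckpts))
```

The trade-off is extra forward computation in exchange for storing far fewer activations, which is what lets deeper models fit in memory.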
The matrix module was designed and built from scratch and provides a wide range of functionality, including core matrix operations, algebraic manipulation, statistical analysis, and utility functions. It is broken down into multiple components that interact seamlessly to allow flexible matrix computation, manipulation, and analysis.
The `Matrix` class serves as the foundation of the module. It handles the creation of matrices and their basic operations, including addition, subtraction, and multiplication, as well as element-wise operations.
- Initialization: Initialize matrices with lists of lists.
- Matrix Operations: Supports matrix addition, subtraction, multiplication, and transpose.
- Dimension Handling: Automatically handles row and column operations, ensuring the integrity of matrix dimensions.
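To make these operations concrete, here is a short standalone sketch of addition, multiplication, and transposition written directly on lists of lists. It is illustrative only; the `Matrix` class wraps equivalent logic behind its own API.

```python
# Standalone illustration of the core operations on lists of lists
# (the Matrix class provides these behind its own interface).
def mat_add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def mat_mul(A, B):
    # Each output entry is the dot product of a row of A with a column of B.
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(mat_add(A, B))   # [[6, 8], [10, 12]]
print(mat_mul(A, B))   # [[19, 22], [43, 50]]
print(transpose(A))    # [[1, 3], [2, 4]]
```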
The `MatrixAlgebra` class provides algebraic operations such as calculating the determinant, finding the inverse of a matrix, and computing matrix minors.
- Determinant Calculation: Computes the determinant for matrices recursively.
- Inverse Calculation: Handles matrix inversion, including edge cases like 2x2 matrices.
- Minor and Cofactor: Provides the ability to calculate the minor of a matrix and use it for more complex algebraic operations.
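The recursive determinant can be sketched in standalone Python as a cofactor (Laplace) expansion along the first row, using minors as described above. This mirrors the approach, not necessarily MetaForm's exact `MatrixAlgebra` code.

```python
# Illustrative recursive determinant via cofactor expansion along the first row.
def minor(matrix, i, j):
    """The matrix with row i and column j removed."""
    return [row[:j] + row[j + 1:] for k, row in enumerate(matrix) if k != i]

def determinant(matrix):
    n = len(matrix)
    if n == 1:
        return matrix[0][0]
    if n == 2:  # 2x2 base case: ad - bc
        return matrix[0][0] * matrix[1][1] - matrix[0][1] * matrix[1][0]
    return sum((-1) ** j * matrix[0][j] * determinant(minor(matrix, 0, j))
               for j in range(n))

print(determinant([[4, 7], [2, 6]]))                     # 10
print(determinant([[1, 2, 3], [4, 5, 6], [7, 8, 10]]))   # -3
```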
The `MatrixRandom` class generates random matrices with specified dimensions and a specified range for the elements.
- Random Matrix Generation: Create matrices filled with random integers within a specified range.
- Flexible Dimensions: Supports generation for matrices of any size.
The `MatrixStatistics` class offers basic statistical analysis on matrices, including mean, variance, and standard deviation calculations.
- Mean: Computes the mean value of all elements in the matrix.
- Variance: Calculates the variance of the matrix values.
- Standard Deviation: Provides the standard deviation based on matrix elements.
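These statistics reduce to flattening the matrix and applying the usual formulas, as in the standalone sketch below (illustrative only; the `MatrixStatistics` API may differ).

```python
# Illustrative element-wise statistics over a list-of-lists matrix.
def flatten(matrix):
    return [x for row in matrix for x in row]

def mean(matrix):
    values = flatten(matrix)
    return sum(values) / len(values)

def variance(matrix):
    m, values = mean(matrix), flatten(matrix)
    return sum((x - m) ** 2 for x in values) / len(values)

def std_dev(matrix):
    return variance(matrix) ** 0.5

M = [[1, 2], [3, 4]]
print(mean(M), variance(M), std_dev(M))  # 2.5 1.25 ~1.118
```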
The `MatrixUtils` class includes utility functions to reshape, slice, and concatenate matrices.
- Reshape: Allows matrix reshaping, as long as the new dimensions account for the total number of elements (see the sketch after this list).
- Slicing: Enables extracting submatrices from larger matrices.
- Concatenation: Supports vertical and horizontal concatenation of matrices.
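Reshaping along these lines amounts to flattening the elements, checking that the new dimensions account for all of them, and regrouping into rows. The sketch below is a standalone illustration; the actual `MatrixUtils` signature may differ.

```python
# Illustrative reshape: flatten, validate the new shape, regroup into rows.
def reshape(matrix, rows, cols):
    flat = [x for row in matrix for x in row]
    if len(flat) != rows * cols:
        raise ValueError("new dimensions must match the total number of elements")
    return [flat[r * cols:(r + 1) * cols] for r in range(rows)]

M = [[1, 2, 3], [4, 5, 6]]
print(reshape(M, 3, 2))  # [[1, 2], [3, 4], [5, 6]]
print(reshape(M, 1, 6))  # [[1, 2, 3, 4, 5, 6]]
```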
Example usage combining these components:

```python
from matrix import Matrix
from algebra import MatrixAlgebra
from random import MatrixRandom
from statistics import MatrixStatistics
from utils import MatrixUtils

# Create a random 3x3 matrix
matrix = MatrixRandom.randint(3, 3, low=0, high=10)

# Calculate its determinant
det = MatrixAlgebra.determinant(matrix)

# Get the mean and variance of the matrix
mean = MatrixStatistics.mean(matrix)
variance = MatrixStatistics.variance(matrix)

# Reshape the matrix into a 1x9 matrix
reshaped_matrix = MatrixUtils.reshape(matrix, 1, 9)
```
- MultiHeadAttention: Implements the multi-head self-attention mechanism.
- FeedForward: A fully connected feedforward network layer.
- Normalization: Layer normalization to stabilize training.
In the MetaForm LLM, several layers process the input data and transform it into meaningful output, including the Multi-Head Attention mechanism, Feedforward Neural Networks, and Layer Normalization, among other key components. Below is a mathematical breakdown of how these layers function.
Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions. For each head, the attention mechanism computes scaled dot-product attention using the following equation:

$$
\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V
$$

Where:
- $Q$ (queries), $K$ (keys), and $V$ (values) are linear projections of the input embeddings.
- $d_k$ is the dimensionality of the keys.
- $\text{softmax}$ normalizes the attention scores.
In multi-head attention, multiple sets of $Q$, $K$, and $V$ matrices are learned, and each head computes its own attention output. These outputs are then concatenated and linearly transformed:

$$
\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)\,W_O
$$

Where:
- $W_O$ is the output projection matrix.
- $h$ is the number of attention heads.
Each attention head is calculated as:

$$
\text{head}_i = \text{Attention}\big(QW_i^Q,\; KW_i^K,\; VW_i^V\big)
$$

Where $W_i^Q$, $W_i^K$, and $W_i^V$ are the learned projection matrices for queries, keys, and values for each head.
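To ground these equations, here is a small NumPy sketch of scaled dot-product attention for a single head. NumPy is used only for brevity; MetaForm itself uses its own matrix tools rather than NumPy.

```python
# Scaled dot-product attention for one head (illustrative NumPy sketch).
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (seq_len, seq_len) similarity scores
    weights = softmax(scores, axis=-1)  # normalize over the keys
    return weights @ V                  # weighted sum of the values

seq_len, d_k = 4, 8
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((seq_len, d_k)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 8)
```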
After the multi-head attention mechanism, the output is passed through a Feedforward Neural Network (FFN). Each position in the sequence is processed independently by the FFN, which consists of two linear transformations with a ReLU activation in between:

$$
\text{FFN}(x) = \max(0,\, xW_1 + b_1)\,W_2 + b_2
$$

Where:
- $W_1$ and $W_2$ are weight matrices.
- $b_1$ and $b_2$ are bias terms.
- $\text{ReLU}(x) = \max(0, x)$ is the Rectified Linear Unit activation function.
The FFN is applied to each position in the sequence separately but with the same parameters, making it position-wise feedforward.
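As a minimal NumPy sketch, the position-wise FFN applies the same $W_1$, $b_1$, $W_2$, $b_2$ to every position (illustrative only, not MetaForm's implementation):

```python
# Position-wise feedforward network: FFN(x) = max(0, x W1 + b1) W2 + b2.
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    hidden = np.maximum(0.0, x @ W1 + b1)  # linear + ReLU
    return hidden @ W2 + b2                # second linear projection

seq_len, d_model, d_ff = 4, 512, 2048
rng = np.random.default_rng(0)
x  = rng.standard_normal((seq_len, d_model))
W1 = rng.standard_normal((d_model, d_ff)) * 0.02
b1 = np.zeros(d_ff)
W2 = rng.standard_normal((d_ff, d_model)) * 0.02
b2 = np.zeros(d_model)
print(feed_forward(x, W1, b1, W2, b2).shape)  # (4, 512)
```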
Layer normalization is applied after the multi-head attention and feedforward networks to stabilize and speed up training. It is computed as:

$$
\text{LayerNorm}(x) = \gamma \cdot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta
$$

Where:
- $\mu$ and $\sigma$ are the mean and standard deviation of the activations.
- $\epsilon$ is a small constant for numerical stability.
- $\gamma$ and $\beta$ are learned scale and shift parameters.
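The same computation can be sketched in a few lines of NumPy, normalizing each position over its feature dimension (illustrative, not MetaForm's Normalization layer):

```python
# Layer normalization over the feature dimension of each position.
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)   # per-position mean
    var = x.var(axis=-1, keepdims=True)   # per-position variance
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

seq_len, d_model = 4, 512
rng = np.random.default_rng(0)
x = rng.standard_normal((seq_len, d_model))
out = layer_norm(x, gamma=np.ones(d_model), beta=np.zeros(d_model))
print(out.mean(axis=-1), out.std(axis=-1))  # ~0 mean and ~1 std per position
```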
Each transformer block consists of multi-head attention followed by a feedforward network, with residual connections and layer normalization applied after both sub-layers:

Multi-head Attention:

$$
x' = \text{LayerNorm}\big(x + \text{MultiHead}(x, x, x)\big)
$$

Feedforward Network:

$$
y = \text{LayerNorm}\big(x' + \text{FFN}(x')\big)
$$
This architecture allows the transformer to capture complex dependencies between tokens in the sequence, facilitating effective sequence modeling for language tasks.
- Matrix Operations: Custom matrix operations designed to replace NumPy for matrix manipulation, supporting core linear algebra operations.
- Activation Functions: Implementation of activation functions like ReLU, Sigmoid, and Tanh.
- Memory Management: Utilities for efficient memory usage during training.
- Distributed Training: Tools for training models across multiple GPUs or machines.
- Mixed Precision: Support for mixed precision training to speed up computations while saving memory.
- Parallelization: Methods to distribute layers and operations across multiple devices for efficient parallel processing.
Below is an example of how to use MetaForm for a basic training loop:
```python
from metaform.transformer.core import TransformerModel
from metaform.tools.matrix import Matrix
from metaform.tools.training import TrainingLoop

# Initialize the model
model = TransformerModel(embed_size=512, num_heads=8, ff_hidden_size=2048, num_layers=12)

# Dummy input data
input_data = Matrix.random(64, 512)

# Forward pass
output = model.forward(input_data)

# Set up and run the training loop
training_loop = TrainingLoop(model)
training_loop.train(input_data)
```
Contributions are welcome! Please see the Contributing Guidelines for more information.
MetaForm is released under the MIT License. See the LICENSE file for details.