Can these methods be applied to llama model training? #49
Comments
Hello @AnonymityGithub, GFO is not designed to be used as an optimizer in deep-learning frameworks (like pytorch or tensorflow), because its optimization algorithms work without using the gradient (the first derivative) of a function evaluation. In theory it is still possible to use GFO for this task, by ignoring the derivative and using only the loss itself to adapt the weights of the neural network. The problem with this is: if you are able to get the derivative, you should use it to get better results. Leaving it out discards important information that would otherwise help your optimization algorithm find better solutions in less time. However, I think it would be interesting to expand my work in the future (because I really like writing optimization algorithms). Maybe I will implement a separate package for optimizers in deep-learning frameworks at some point. I will close this issue, but feel free to ask follow-up questions if needed. |
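To make "only using the loss itself to adapt the weights" concrete, here is a minimal sketch of a loss-only hill climber on a small linear-regression problem. It uses plain NumPy rather than GFO; the data, the `hill_climb` helper, and the step size are illustrative assumptions, not part of the library.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic linear-regression data (illustrative only).
X = rng.random((200, 5))
true_w = rng.random(5)
y = X @ true_w + 0.01 * rng.standard_normal(200)


def loss(w):
    """Mean squared error; the only signal the optimizer sees."""
    return float(np.mean((X @ w - y) ** 2))


def hill_climb(w, n_iter=2000, step=0.05):
    """Hypothetical helper: keep a random perturbation only if it lowers the loss."""
    best = loss(w)
    for _ in range(n_iter):
        candidate = w + step * rng.standard_normal(w.shape)
        candidate_loss = loss(candidate)
        if candidate_loss < best:
            w, best = candidate, candidate_loss
    return w, best


w0 = np.zeros(5)
initial_loss = loss(w0)
w_final, final_loss = hill_climb(w0)
print(f"initial loss {initial_loss:.4f} -> final loss {final_loss:.4f}")
```

Every iteration costs one full loss evaluation and uses no directional information, which is the disadvantage described above.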
Thanks for your reply. Using gradient-based methods to optimize the model is indeed the best solution. However, since the emergence of LLMs, gradient-based optimization methods require ever more memory, and gradient-free methods have been proposed to optimize such models. So I am curious whether these methods can be used on deep-learning models. |
Intriguing! I was not aware of this. I will do some research within the next few days. If you have some articles or other sources, you could post them here.
I would be very motivated to try implementing a short "Proof of Concept". Do you have an idea which deep-learning framework to use for this? |
I think pytorch+huggingface is better. Huggingface provides most of the deep-learning models. Some papers: MeZO: Fine-Tuning Language Models with Just Forward Passes. Thanks. |
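For readers unfamiliar with the "just forward passes" idea, the following is a rough sketch of a two-point zeroth-order update (SPSA-style) on a toy least-squares problem. It is not the MeZO implementation and does not reproduce that paper's memory handling; the function name `zeroth_order_step` and the values of `lr` and `eps` are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy least-squares problem standing in for a model and its loss.
X = rng.random((200, 5))
true_w = rng.random(5)
y = X @ true_w


def loss(w):
    """Mean squared error, evaluated with a forward computation only."""
    return float(np.mean((X @ w - y) ** 2))


def zeroth_order_step(w, lr=0.02, eps=1e-3):
    """One SPSA-style update: two loss evaluations, no backward pass."""
    z = rng.standard_normal(w.shape)  # random perturbation direction
    # Finite-difference estimate of the directional derivative along z.
    projected_grad = (loss(w + eps * z) - loss(w - eps * z)) / (2 * eps)
    return w - lr * projected_grad * z


w = np.zeros(5)
initial_loss = loss(w)
for _ in range(3000):
    w = zeroth_order_step(w)
final_loss = loss(w)
print(f"initial loss {initial_loss:.4f} -> final loss {final_loss:.6f}")
```

Unlike a pure accept/reject search, this update still follows an estimated descent direction, which may be why it scales better with the number of parameters.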
I did some reading on this topic and worked out a concept, how to apply GFO to a pytorch model. As a first step I would like to apply GFO to a simple model (a small CNN or MLP). |
Great! If successful, it can also be applied to more complex models. |
So, I did some tinkering with pytorch and managed to create a very simple example of how GFO could be used as a custom optimizer. Disclaimer: I don't use pytorch on a regular basis, so this code is probably of bad quality. It is just a showcase. About the results:
About the speed: About the very high initial loss: I would appreciate every bit of help or hints about this showcase. I ran this code with the following versions:
import torch
import torch.nn as nn
import numpy as np
from torch.utils.data import DataLoader, TensorDataset
from gradient_free_optimizers import (
    HillClimbingOptimizer,
    RandomSearchOptimizer,
    RepulsingHillClimbingOptimizer,
    PowellsMethod,
)

# Define a synthetic dataset
np.random.seed(42)
X = np.random.rand(1000, 20)
true_weights = np.random.rand(20, 1)
y = X @ true_weights + 0.1 * np.random.randn(1000, 1)
X = torch.Tensor(X)
y = torch.Tensor(y)

# Create a DataLoader
dataset = TensorDataset(X, y)
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)
num_epochs = 100


# Define a more complex neural network
class ComplexModel(nn.Module):
    def __init__(self):
        super(ComplexModel, self).__init__()
        self.network = nn.Sequential(
            nn.Linear(20, 64),
            nn.ReLU(),
            nn.Linear(64, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, x):
        return self.network(x)


# Initialize the model
model = ComplexModel()

# Define a loss function
criterion = nn.MSELoss()


# Define the custom optimizer with GFO
class GFOOptimizer(torch.optim.Optimizer):
    def __init__(self, params, model, dataloader, criterion, lr=1e-3):
        self.model = model
        self.dataloader = dataloader
        self.criterion = criterion
        self.lr = lr
        self.nth_iter = 0

        # Flatten the initial model parameters
        self.params = []
        for param in self.model.parameters():
            self.params.extend(param.data.cpu().numpy().flatten())

        # Define the search space
        self.search_space = {
            f"x{i}": np.arange(-1.0, 1.0, 0.1, dtype=np.float32)
            for i in range(len(self.params))
        }

        # Initialize the GFO optimizer
        self.optimizer = PowellsMethod(self.search_space)
        self.optimizer.init_search(
            objective_function=self.objective_function,
            n_iter=num_epochs * len(dataloader),
            max_time=None,
            max_score=None,
            early_stopping=None,
            memory=True,
            memory_warm_start=None,
            verbosity=[],
        )

        defaults = dict(lr=lr)
        super(GFOOptimizer, self).__init__(params, defaults)

    def objective_function(self, opt_params):
        opt_params_l = list(opt_params.values())

        # Set model parameters
        start = 0
        for param in self.model.parameters():
            param_length = param.numel()
            param.data = torch.tensor(
                opt_params_l[start : start + param_length]
            ).view(param.shape)
            start += param_length

        # Compute the loss
        total_loss = 0.0
        with torch.no_grad():
            for batch_X, batch_y in self.dataloader:
                outputs = self.model(batch_X)
                loss = self.criterion(outputs, batch_y)
                total_loss += loss.item()
        return total_loss / len(self.dataloader)

    def step(self, closure=None):
        if closure is not None:
            closure()

        # Use GFO to find the best parameters
        self.optimizer.search_step(self.nth_iter)
        best_params = self.optimizer.pos_best

        # Set the best parameters to the model
        start = 0
        for param in self.model.parameters():
            param_length = param.numel()
            param.data.copy_(
                torch.tensor(
                    best_params[start : start + param_length],
                    dtype=torch.float32,
                ).view(param.shape)
            )
            start += param_length

        self.params = best_params
        self.nth_iter += 1


# Initialize the custom optimizer
optimizer = GFOOptimizer(
    model.parameters(), model, dataloader, criterion, lr=0.01
)

# Training loop
for epoch in range(num_epochs):
    for batch_X, batch_y in dataloader:
        # Zero the gradients
        optimizer.zero_grad()
        # Forward pass
        outputs = model(batch_X)
        loss = criterion(outputs, batch_y)
        # Backward pass
        loss.backward()
        # Update the weights
        optimizer.step()
    # Print the loss for every epoch
    print(f"Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}")
|
OK. I am very happy to provide help; this may take some time. |
As mentioned before, I have identified two reasons so far for the bad result when using GFO as a pytorch optimizer. I am currently working on the slow performance of GFO and will continue focusing on this. Of course the goal is to achieve good performance for very high-dimensional search-spaces. This might even require some public API changes, for example in how the search-space is created:
n_dims = 10000
search_space = {
    f"x{i}": np.arange(-1.0, 1.0, 0.1, dtype=np.float32) for i in range(n_dims)
}
Creating a search-space this way gets really slow for >1000 dimensions, and each iteration slows down further as well. I will do some extensive performance testing within the next few days and weeks. Making this work will require a lot of effort, but I think it will pay off. GFO has some very powerful algorithms and it would be awesome to see them in action as a pytorch optimizer! |
Sorry, I've been too busy lately. I can provide a text classification baseline that I haven't adapted to GFO yet. The accuracy of the baseline is around 94%. |
portalocker==2.8.2 |
You can test the performance of GFO based on this baseline. In the future, I may also do some tests to evaluate the feasibility of GFO. |
Hey 👋, I saw this library on HN a few days ago, and then found this issue.
I tried playing with this a bit, and the most important thing I noticed was the line |
Hello @mosheduminer, welcome to this project.
You are correct! The values in the search-space are converted into integers at the optimizer level. So there are pos_best, values_best and params_best. Those are the current best variables. I used the name best_para for the parameters (a dict with the best values) of the overall best result of the entire search, after the optimization run ends. That is why the optimizer does not have this variable at that step. I was quite oblivious to this mistake. It should have been obvious to me, since I implemented this. I uploaded the pytorch optimizer example to the dev-branch (so users cannot see this incomplete example): If you like, you can add your contribution to the example. edit: |
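The difference between an integer position and the value it stands for can be illustrated without GFO. The sketch below uses the same kind of grid as the example; `position_to_value` is a hypothetical helper written for this illustration and is not part of the library's API.

```python
import numpy as np

# One array of allowed values per dimension, as in the example above.
search_space = {
    f"x{i}": np.arange(-1.0, 1.0, 0.1, dtype=np.float32) for i in range(3)
}


def position_to_value(position, search_space):
    """Hypothetical helper: map integer grid indices to the stored values."""
    return [
        float(grid[index])
        for index, grid in zip(position, search_space.values())
    ]


position = [0, 10, 19]  # integer indices, one per dimension
values = position_to_value(position, search_space)

print("position:", position)  # indices into each grid
print("values:  ", values)    # the weights those indices represent
```

Writing the raw indices into the model as weights would set parameters to integers far outside the intended range of -1.0 to 1.0, which is consistent with the very high loss reported earlier in the thread.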
Sure! For the moment though, normalizing pos_best, the loss is still way too high (though only double digits instead of some huge number), while the expected behavior is that the number should swiftly drop below 1.0. So there's more work needed here. |
@SimonBlanke after further experimenting (with the _pytorch_optimizer.py now), I think another possible issue is that in GFO, higher score (from Using BTW the example is missing the |
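The maximization issue raised here can be shown with a toy example. The `maximize` function below is a hypothetical stand-in for any score-maximizing optimizer (a plain exhaustive search, not a GFO algorithm); it only illustrates why the objective should return the negative loss.

```python
def loss(w):
    """Toy loss with its minimum at w = 0.3."""
    return (w - 0.3) ** 2


def maximize(score_fn, candidates):
    """Hypothetical stand-in for a score-maximizing optimizer."""
    best_w, best_score = None, float("-inf")
    for w in candidates:
        score = score_fn(w)
        if score > best_score:
            best_w, best_score = w, score
    return best_w


# Grid from -1.0 to 0.9 in steps of 0.1.
grid = [round(-1.0 + 0.1 * i, 1) for i in range(20)]

wrong = maximize(loss, grid)                # returns the worst weight
right = maximize(lambda w: -loss(w), grid)  # returns the best weight

print("maximizing the loss picks:", wrong)
print("maximizing the negative loss picks:", right)
```

Returning the raw loss to a maximizer drives the search toward the highest loss on the grid, so the sign flip is required before anything else can be evaluated fairly.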
Hello @mosheduminer, you are correct about the maximization in GFO. I hesitated to update the example, because I thought you might want to open a PR. This way your contribution is on record (which is nice to have).
yeah I accidentally used a dev branch for GFO that does not require the I focused on the performance of different algorithms in the last few days and found similar results. Many hill-climbing based algorithms barely improve the loss. I also saw this jumping-behaviour in the loss. I went through every algorithm and expected the particle swarm optimizer to perform the best. Because of the attraction behaviour towards better performing particles I thought this could "imitate" something like a gradient in this large search-space. It showed much less of a jumping behaviour, but also failed to converge. From my perspective the main problem is the extremely high dimensionality of the problem. And for high dim. problems a gradient becomes even more valuable. Just compare a gradient-based approach to e.g. the pattern-search:
So for even a small neural network with ~1000 parameters we get a huge search-space, which leads to basically no improvement of the loss when using GFO for a reasonable training-time. At the moment I cannot see even one algorithm in Gradient-Free-Optimizers, that would be fit for this kind of problem. Sequence-model-based optimizers unfortunately cannot be used, because of the large search-space. Nevertheless I find this topic very interesting, but it should be approached from another angle. I see several alternatives, how to proceed:
|
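As a rough back-of-the-envelope check of the search-space size, the sketch below counts the parameters of the ComplexModel posted in this thread and the number of grid points that follow from it. The figure of 20 values per dimension is an assumption based on the `np.arange(-1.0, 1.0, 0.1)` grid used in the example.

```python
import math

# Layer shapes of the ComplexModel in this thread:
# Linear(20, 64), Linear(64, 64), Linear(64, 1).
layers = [(20, 64), (64, 64), (64, 1)]

# Each Linear layer has n_in * n_out weights plus n_out biases.
n_params = sum(n_in * n_out + n_out for n_in, n_out in layers)

# Assumption: 20 grid values per parameter.
values_per_dim = 20

# The grid has values_per_dim ** n_params points; work in log10 to keep it readable.
log10_size = n_params * math.log10(values_per_dim)

print(f"{n_params} parameters, roughly 10^{log10_size:.0f} grid points")
```

Even with a coarse step of 0.1, the number of candidate weight vectors is far beyond anything a search without directional information could sample meaningfully.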
Perhaps it'd be worthwhile to use fewer input dimensions and see if everything works then, to make sure that everything is correct implementation-wise. In that case I'd happily make a PR to track work on that. I suppose that it'd be great if you incorporated GFO knowledge into an ML-specific project 🙂. |
@SimonBlanke The code below successfully optimizes the Neural Network to an MSE loss in the range of That said, an SGD optimizer would get to the same loss in less than 10 epochs (and each epoch is much faster with SGD), so I don't know about the value of this approach - I'm sure you and others with knowledge of DL would know more. Should I make a PR to the examples directory?
import torch
import torch.nn as nn
import numpy as np
from torch.utils.data import DataLoader, TensorDataset
from gradient_free_optimizers import PowellsMethod

# Define a synthetic dataset
np.random.seed(42)
X = np.random.rand(1000, 20)
true_weights = np.random.rand(20, 1)
y = X @ true_weights + 0.1 * np.random.randn(1000, 1)
X = torch.Tensor(X)
y = torch.Tensor(y)

# Create a DataLoader
dataset = TensorDataset(X, y)
dataloader = DataLoader(dataset, batch_size=64, shuffle=True)
num_epochs = 1500


# Define a more complex neural network
class ComplexModel(nn.Module):
    def __init__(self):
        super(ComplexModel, self).__init__()
        self.network = nn.Sequential(
            nn.Linear(20, 64),
            nn.ReLU(),
            nn.Linear(64, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, x):
        return self.network(x)


# Initialize the model
model = ComplexModel()

# Define a loss function
criterion = nn.MSELoss()


# Define the custom optimizer with GFO
class GFOOptimizer(torch.optim.Optimizer):
    def __init__(self, params, model, dataloader, criterion, lr=1e-3):
        self.model = model
        self.dataloader = dataloader
        self.criterion = criterion
        self.lr = lr
        self.nth_iter = 0

        # Flatten the initial model parameters
        self.initial_weights = {}
        self.params = []
        counter = 0
        for param in self.model.parameters():
            self.params.extend(param.data.cpu().numpy().flatten())
            for value in param.data.flatten():
                self.initial_weights[f"x{counter}"] = (
                    value.item()
                )  # Convert tensor value to Python scalar
                counter += 1

        # Define the search space
        self.search_space = {
            f"x{i}": np.arange(-1.0, 1.0, 0.1, dtype=np.float32)
            for i in range(len(self.params))
        }

        # Initialize the GFO optimizer
        self.optimizer = PowellsMethod(
            self.search_space, initialize={"warm_start": [self.initial_weights]}
        )
        self.optimizer.init_search(
            objective_function=self.objective_function,
            n_iter=num_epochs * len(dataloader),
            max_time=None,
            max_score=None,
            early_stopping=None,
            memory=True,
            memory_warm_start=None,
            verbosity=[],
        )

        defaults = dict(lr=lr)
        super().__init__(params, defaults)

    def objective_function(self, opt_params):
        opt_params_l = list(opt_params.values())

        # Set model parameters
        start = 0
        for param in self.model.parameters():
            param_length = param.numel()
            param.data = torch.tensor(opt_params_l[start : start + param_length]).view(
                param.shape
            )
            start += param_length

        # Compute the loss
        total_loss = 0.0
        with torch.no_grad():
            for batch_X, batch_y in self.dataloader:
                outputs = self.model(batch_X)
                loss = self.criterion(outputs, batch_y)
                total_loss += loss.item()
        return -total_loss / len(self.dataloader)

    def step(self, closure=None):
        if closure is not None:
            closure()

        # Use GFO to find the best parameters
        self.optimizer.search_step(self.nth_iter)
        # best_params = self.optimizer.pos_new
        best_params = self.optimizer.conv.position2value(self.optimizer.pos_new)
        # print("self.optimizer.score_new", self.optimizer.score_new)

        # Set the best parameters to the model
        start = 0
        for param in self.model.parameters():
            param_length = param.numel()
            # """
            param.data.copy_(
                torch.tensor(
                    best_params[start : start + param_length],
                    dtype=torch.float32,
                ).view(param.shape)
            )
            # """
            start += param_length

        self.params = best_params
        self.nth_iter += 1


# Initialize the custom optimizer
optimizer = GFOOptimizer(model.parameters(), model, dataloader, criterion, lr=0.01)
# optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# Training loop
for epoch in range(num_epochs):
    for batch_X, batch_y in dataloader:
        # Zero the gradients
        optimizer.zero_grad()
        # Forward pass
        outputs = model(batch_X)
        loss = criterion(outputs, batch_y)
        # Backward pass
        # loss.backward()
        # Update the weights
        optimizer.step()
    # Print the loss for every epoch
    print(f"Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}")

print("Training completed!")
|
Hello @mosheduminer, thank you for continuing this experiment. The result is in line with what I would expect for gradient-free-optimization algorithms. |
Thanks again to everyone who participated in this issue. I will close this issue for now. I will keep an eye on this topic and will probably start a new project for this soon. If there are new papers or other information relevant to this topic, we can post them here. |
It would be great if it could support BERT, LLaMA and other model training.