I am now into the attention network chapter, one of the toughest topics to understand so far, and the book explains it great. However, in the scaled dot-product example, it shows how scaling the product of keys and query by 100 skews the scores. I extended this example by using the actual key and query from the earlier example (on p262, so I could compute the dimension, which happens to be just 2) and compared the non-scaled (p275) and scaled versions side by side. But in the scaled case there still seems to be a big variance between prod and 100*prod. I was hoping to see similar results even though the product is multiplied by 100, or maybe I'm doing something wrong:
# This is a small example showing the difference it makes when the dot product
# is not scaled, and the resulting skew in the scoring values.
import copy
import numpy as np
import torch
import torch.optim as optim
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, Dataset, random_split, TensorDataset
import sys
sys.path.append('..')
from common.settings import *
from common.classes import *
from data_generation.square_sequences import generate_sequences
from stepbystep.v4 import StepByStep
from plots.chapter8 import plot_data
from plots.chapter9 import sequence_pred
import matplotlib.pyplot as plt

# Full sequence of corners, split into source and target
full_seq = torch.tensor([[-1, -1], [1, -1], [1, 1], [1, -1]]).float().view(1, 4, 2)
source_seq = full_seq[:, :2]
target_seq = full_seq[:, 2:]

# Encode the source sequence; the hidden states serve as both keys and values
torch.manual_seed(21)
encoder = Encoder(n_features=2, hidden_dim=2)
hidden_seq = encoder(source_seq)
values = hidden_seq
keys = hidden_seq

# The decoder's hidden state after one step serves as the query
torch.manual_seed(21)
decoder = Decoder(n_features=2, hidden_dim=2)
decoder.init_hidden(hidden_seq)
inputs = source_seq[:, -1:]
out = decoder(inputs)
query = decoder.hidden.permute(1, 0, 2)

# Non-scaled dot products between query and keys
products = torch.bmm(query, keys.permute(0, 2, 1))
print("Non scaled")
print(F.softmax(products, dim=-1))
print(F.softmax(100 * products, dim=-1))

# Scaled dot products (divided by the square root of the dimension)
print("Scaled")
dims = query.size(-1)
scaled_products = products / np.sqrt(dims)
print(F.softmax(scaled_products, dim=-1))
print(F.softmax(100 * scaled_products, dim=-1))
Your code seems to be perfectly right. I am guessing the skewed values you're concerned about are just the effect of the softmax function. Softmax does skew values a lot when you increase the scale of the inputs, even if the proportion between the two input values is the same.
If we take two values, say, 0.01 and 0.1, the second is 10x larger than the first, but softmax will return fairly similar results for both:
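A quick check (output values shown approximately):

import torch
import torch.nn.functional as F

print(F.softmax(torch.tensor([0.01, 0.1]), dim=-1))
# roughly tensor([0.48, 0.52]) -- the two probabilities stay close together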
However, if we multiply these values by 10, their proportion remains unchanged, but their overall level is 10x higher, thus affecting how softmax transforms them:
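Continuing the same check (output values shown approximately):

print(F.softmax(torch.tensor([0.1, 1.0]), dim=-1))
# roughly tensor([0.29, 0.71]) -- the larger input now takes a much bigger share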
So, it all boils down to the fact that the softmax function exponentiates the inputs in order to transform them into probabilities adding up to one: the overall scale of the inputs, and not just their proportion, determines how skewed the result is.
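To tie this back to your example: with hidden_dim equal to 2, scaling only divides the products by the square root of 2 (about 1.41), so multiplying them by 100 afterwards still makes the gaps in the exponent roughly 70x larger, and softmax skews accordingly. A small sketch with made-up product values (the actual ones depend on the randomly initialized encoder and decoder):

import numpy as np
import torch
import torch.nn.functional as F

products = torch.tensor([[0.3, -0.5]])  # hypothetical dot products
scaled = products / np.sqrt(2)          # divide by sqrt(d), with d = 2
print(F.softmax(scaled, dim=-1))        # mildly skewed
print(F.softmax(100 * scaled, dim=-1))  # still heavily skewed: the gap in the
                                        # exponent is now 100/sqrt(2) times larger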
Does this answer your question?
Yes, I think so. It may be interesting to pursue this path further, but I'd rather move on. I just wanted to check whether my understanding is correct through some sample code. BTW, this appears to be the simplest explanation of softmax: https://victorzhou.com/blog/softmax/