basenji_read overflow #196

Open
ElArquitectorgo opened this issue May 20, 2024 · 1 comment

Comments

@ElArquitectorgo

Hi,

I am trying to do a study similar to the Enformer study for my final thesis, and to do so I have downloaded 4505 ENCODE tracks.
When I ran the basenji_data script, I encountered the following warning message numerous times:

/mnt2/fscratch/users/ac_aux/vguirado/preprocess/bin/basenji_data_read.py:307: RuntimeWarning: overflow encountered in cast
  cov = self.cov_open.values(chrm, start, end, numpy=True).astype('float16')

The code:

#SBATCH --job-name=preprocess
#SBATCH --time=0-30:0
#SBATCH --mem=50G
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
##SBATCH --ntasks-per-node=1
#SBATCH --constraint=cal

...more sbatch things...

time python bin/basenji_data.py -s .9 -g data/hg38_gaps.bed -l 196608 --local -o data/human -p 128 -v .1 -w 128 data/genome.fa data/human_data.txt

I would like to know whether this can affect the generation of the TFRecords, since during training I am seeing extremely strange behavior, as shown below:

[Image: porlacara]

These runs resume from a checkpoint at epoch 80 and train 50 more epochs to 130 (first graph), resume from the epoch 130 checkpoint and train to 180 (second), and resume from 180 and train to 230 (third). Here I'm using a small subsample, but the same happens with the whole dataset (with worse loss).

My training code appears to be fine: I tried loading the public Enformer checkpoint and modifying the output to train on the same subset, and there I do get results. That is, I keep the already-trained trunk and add a single linear layer on top.

But training from scratch, and also including 1019 mouse tracks, the model is not able to learn anything. The R^2 values are 0 or negative no matter how many steps I train.

So I suspect the problem is in the generation of the TFRecords, but this was the only warning I found.

Thank you for your time.

@davek44
Contributor

davek44 commented May 28, 2024

Hi, it appears that the tracks you downloaded have values above the float16 max. You could change the code to use float32, or explicitly clip the values. All active development on this software is now here: https://github.com/calico/baskerville
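
For anyone hitting the same warning, here is a minimal sketch of the two options mentioned above, using plain NumPy. The helper name `to_float16_clipped` and the sample values are made up for illustration; this is not the actual basenji code. In the real script the change would go at the `.astype('float16')` line shown in the warning, and other parts of the pipeline may assume float16, which I have not checked.

```python
import numpy as np

# Largest finite value representable in float16 (65504.0).
FLOAT16_MAX = float(np.finfo(np.float16).max)


def to_float16_clipped(cov):
    """Clip values to the float16 range before casting, so nothing becomes inf.

    Hypothetical helper for illustration; not part of basenji.
    """
    cov = np.asarray(cov, dtype=np.float32)
    return np.clip(cov, -FLOAT16_MAX, FLOAT16_MAX).astype(np.float16)


# Made-up coverage values; 1e5 is above the float16 maximum.
cov = np.array([0.5, 120.0, 1e5], dtype=np.float32)

# What the current cast does: the large value overflows to inf
# (the warning is silenced here only to keep the demo quiet).
with np.errstate(over="ignore"):
    naive = cov.astype(np.float16)

# Option 1: clip explicitly, then cast to float16.
clipped = to_float16_clipped(cov)

# Option 2: keep float32, which holds these values without overflow.
as_float32 = cov.astype(np.float32)

print(naive, clipped, as_float32)
```

Clipping keeps the float16 storage size but caps extreme values; float32 preserves them at twice the memory. Either way, any `inf` values already written into targets could plausibly disturb the training loss, so regenerating the data after the change seems worth trying.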
