Issue while running vis_corex #15

Open
prasanna224 opened this issue Jun 12, 2019 · 3 comments


prasanna224 commented Jun 12, 2019

When I run vis_corex.py with the following arguments, the script fails with an error after 24 hours of run time.

Command:

python3 vis_corex.py /home/ppandey/dx_desc.csv --delimiter="|" --layers=32,16,8,1 --dim_hidden=3 --missing=-1e6 -c -b -v -o dxm --ram=72 --cpu=36

Sample File:

DX101|DX110|DX115|DX118|DX142|DX143|DX155|DX160|DX166|DX169|DX175|DX184|DX196|DX212|DX215|DX218|DX222|DX223|DX234|DX235|DX239|DX253|DX254|DX267|DX271|DX275|DX277|DX278|DX279|DX295|DX298|DX310|DX315|DX332|DX335|DX342|DX343|DX344|DX356|DX385|DX386|DX399|DX404
8|0|1|6|0|0|0|0|0|0|0|0|5|0|3|0|0|6|0|453|0|0|0|2|0|0|6|0|0|0|9|4|6|0|0|1|1|0|9|0|0|41|81
0|4|0|0|0|4|1|0|53|0|0|2|0|0|1|0|0|0|0|0|0|4|0|0|0|0|3|0|0|0|0|0|11|0|4|0|0|0|0|0|7|0|0
0|0|0|0|0|0|0|0|0|0|1|0|0|0|0|0|0|0|0|0|0|1|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0
0|0|0|0|1|0|0|0|0|0|0|0|0|0|9|0|0|3|0|0|0|0|0|0|0|0|0|2|0|0|2|0|25|0|0|0|0|0|0|0|2|0|0

Output:
`[-0. -0. -0. 0. 0. -0. 0. -0. -0. 0. 0. -0. 0. -0. nan -0.]
[ 0. 0. -0. 0. -0. -0. 0. -0. 0. 0. -0. -0. 0. 0. nan -0.]
[ 0. 0. 0. 0. 0. -0. -0. 0. 0. -0. -0. 0. -0. 0. nan -0.]

Overall tc: nan

Traceback (most recent call last):
  File "vis_corex.py", line 777, in <module>
    n_cpu=options.cpu, ram=options.ram).fit(X_prev))
  File "/home/usr/bio_corex/corex.py", line 171, in fit
    self.fit_transform(X)
  File "/home/usr/bio_corex/corex.py", line 220, in fit_transform
    self.__dict__ = best_dict
UnboundLocalError: local variable 'best_dict' referenced before assignment`

@gregversteeg
Owner

Oh, that's disappointing. The error is caused by the "nan" in the output for TC: the code tries to pick the best TC value, but "nan" is not comparable, so no best result is ever recorded. If you pass --verbose=2 you can see the TCs while it runs; then you might be able to spot a nan arising earlier and stop the run.
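To illustrate why a nan TC ends in an UnboundLocalError rather than a clearer message, here is a minimal sketch of a "keep the best run" loop. The function `pick_best` and its variable names are hypothetical and are not taken from corex.py; the sketch only assumes that the best result is assigned inside an `if tc > best_tc` branch.

```python
import math


def pick_best(tcs):
    """Return the index of the run with the largest TC (hypothetical helper)."""
    best_tc = -math.inf
    for index, tc in enumerate(tcs):
        # Any comparison with nan is False, so this branch is skipped for nan.
        if tc > best_tc:
            best_tc = tc
            best_index = index
    # If no run ever passed the comparison, best_index was never assigned.
    return best_index


print(pick_best([1.0, 3.0, 2.0]))

try:
    pick_best([float("nan")])
except UnboundLocalError as error:
    print("nan TC:", type(error).__name__)
```

With a single run whose TC is nan, the comparison never succeeds, which matches the traceback above.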
The question is what causes the "nan". Here are a few things to check:

  • Are there any missing or non-numeric values in your data file? You can fill in missing or non-numeric entries with some value (I used -1e6) and then set --missing=-1e6. I would suggest first trying a small, simple model (--layers=1 or --layers=2) while checking for issues with nans.
  • Extreme outliers could cause numerical overflow and nans.
  • If you really only have ~40 variables, you should use smaller models, for instance --layers=10,3,1. Then look at the TCs and try a larger model, such as --layers=12,4,1. Do the TCs for each layer go up or down? Usually the TCs go up until you reach some optimal size, then decrease again.
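The first check in the list above can be sketched with the standard library alone. The helper `find_bad_cells` is hypothetical (it is not part of bio_corex) and assumes a delimited text table with a header row, like the sample file in this issue.

```python
import csv
import io
import math


def find_bad_cells(text, delimiter="|"):
    """Return (row_number, column_name, raw_value) for each non-finite cell."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)
    bad = []
    for row_number, row in enumerate(reader, start=1):
        for name, raw in zip(header, row):
            try:
                value = float(raw)
            except ValueError:
                # Empty strings and labels such as "NA" end up here.
                bad.append((row_number, name, raw))
                continue
            if not math.isfinite(value):
                # "nan" and "inf" parse as floats but are not usable values.
                bad.append((row_number, name, raw))
    return bad


sample = "DX101|DX110|DX115\n8|0|1\n0||NA\nnan|4|0\n"
for cell in find_bad_cells(sample):
    print(cell)
```

Any cell reported here could be replaced with the placeholder value passed to --missing before running the model. The sample string is made up for the demonstration.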

Not an issue, but you should add the option --no_row_names, since your first column is not an index.

Another possibility for your dataset is to "bin" the data and treat it as discrete. For instance, you might map 0 to 0, 1 to 1, and any number greater than 1 to 2. Then run without the -c option (-c treats the data as continuous).
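As a rough sketch of that binning, assuming the values are non-negative integer counts, the mapping amounts to capping each value at 2. The function name `bin_count` is hypothetical.

```python
def bin_count(value):
    """Map a non-negative count to 0, 1, or 2 (2 means "greater than 1")."""
    return min(int(value), 2)


# First few values of the first data row from the sample file.
row = [8, 0, 1, 6, 0, 0]
print([bin_count(v) for v in row])
```

The binned table would then be written back out in the same delimited format and passed to vis_corex.py without -c.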

@gregversteeg
Owner

One other suggestion.

This looks like count data. I've always meant to include specific handling for count data, but haven't yet. One thing that works well for count data is to transform each value x to log_2(1+x). The 0's and 1's stay the same, but the long tail of high counts is compressed inward. This makes the numerical modeling easier by reducing outliers.
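A minimal sketch of this transform using only the standard library is below. The function name `log_count` is hypothetical, and the input is assumed to be a non-negative count.

```python
import math


def log_count(value):
    """Apply log2(1 + x) to a non-negative count."""
    return math.log2(1 + value)


# 0 and 1 are unchanged, while large counts such as 453 shrink to under 9.
for count in [0, 1, 3, 453]:
    print(count, log_count(count))
```

The transformed values are no longer integers, so they would be run as continuous data with -c.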

@prasanna224
Author

Thanks for your quick response. We will try the suggestions you have outlined here.
