Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding frequentist numerical tests + Integrate StatsAPI #67

Merged
merged 7 commits into from
Dec 29, 2022

Conversation

ShouvikGhosh2048
Copy link
Collaborator

This PR adds frequentist numerical tests, and also integrates StatsAPI into CRRao.

@ShouvikGhosh2048
Copy link
Collaborator Author

@sourish-cmi
Currently this PR tests frequentist models on ordinary formulae and 0 intercept formulae. The tests just check whether the coeftables match.

  • Should we test on other formulae/datasets?
  • Should we test on more (or all) functions in src/frequentist/getter.jl?

@sourish-cmi
Copy link
Collaborator

  1. We should test for predict
  2. Some possible formulas:
    i) mpg~hp+hp^2 + wt + wt*hp
    ii) log(mpg)~log(hp)+log(wt)
  3. Later we can add few more datasets

@@ -221,7 +221,7 @@ function predict(container::FrequentistRegression{:LogisticRegression}, newdata:
return Probit_Link.(z)
elseif (container.link == GLM.CauchitLink)
return Cauchit_Link.(z)
elseif (container.link == GLM.Cloglog)
elseif (container.link == GLM.CloglogLink)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made a change to src/frequentist/getter.jl - I think it should be CloglogLink instead of Cloglog.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay -- great. thanks.

@ShouvikGhosh2048
Copy link
Collaborator Author

  1. We should test for predict

    1. Some possible formulas:
      i) mpg~hp+hp^2 + wt + wt*hp
      ii) log(mpg)~log(hp)+log(wt)

    2. Later we can add few more datasets

I added predict and some formulae.

For Logistic, Negative binomial and Poisson should there be an example with log on the left?

@sourish-cmi
Copy link
Collaborator

  1. We should test for predict

    1. Some possible formulas:
      i) mpg~hp+hp^2 + wt + wt*hp
      ii) log(mpg)~log(hp)+log(wt)
    2. Later we can add few more datasets

I added predict and some formulae.

For Logistic, Negative binomial and Poisson should there be an example with log on the left?

No @ShouvikGhosh2048 for Logistic, Negative binomial and Poisson, we don't need a log on the left.

@sourish-cmi
Copy link
Collaborator

Perhaps we should check for AIC and BIC

> attach(mtcars)
> fit = lm(mpg ~ hp + wt + gear,mtcars)
> AIC(fit)
[1] 157.0528
> BIC(fit)
[1] 164.3815

Add it in the code and check if CRRao yields AIC = 157.0528 and BIC=164.3815

@ShouvikGhosh2048
Copy link
Collaborator Author

@sourish-cmi

using RDatasets, GLM, StatsBase

mtcars = dataset("datasets", "mtcars")
glm_model = lm(@formula(MPG ~ HP + HP^2 + WT + WT * HP), mtcars)
print(aic(glm_model))

gives 145.92297178935536

and

attach(mtcars)
fit = lm(mpg ~ hp + hp^2 + wt + wt * hp, mtcars)
AIC(fit)

gives 145.6109.

@sourish-cmi
Copy link
Collaborator

In your R-code, there is a syntax mistake.

fit = lm(mpg ~ hp + hp^2 + wt + wt * hp, mtcars)
AIC(fit)

The correct formula would be as follows:

> fit = lm(mpg ~ hp + I(hp^2) + wt + wt * hp, mtcars)
> AIC(fit)
[1] 145.923

@ShouvikGhosh2048
Copy link
Collaborator Author

library("Zelig")
data(turnout)
fit = glm(vote ~ age + I(age^2) + race + income * educate, family = binomial(link = "cloglog"), data = turnout)
sprintf("%.15f", AIC(fit))

gives 2022.884413390652753,
while

using GLM, RDatasets
turnout = dataset("Zelig", "turnout")
model = glm(@formula(Vote ~ Age + Age^2 + Race + Income * Educate), turnout, Binomial(), CloglogLink())
GLM.aic(model)

gives 2022.8844886368.

library("Zelig")
library(MASS)
data(sanction)
fit = glm.nb(num ~ target + coop + ncost, data = sanction)
sprintf("%.15f", AIC(fit))

gives 344.879549011806205,
while

using GLM, RDatasets
sanction = dataset("Zelig", "sanction")
model = glm(@formula(Num ~ Target + Coop + NCost), sanction, NegativeBinomial(), LogLink())
GLM.aic(model)

gives 365.8580428654269.

@sourish-cmi
Copy link
Collaborator

library("Zelig")
data(turnout)
fit = glm(vote ~ age + I(age^2) + race + income * educate, family = binomial(link = "cloglog"), data = turnout)
sprintf("%.15f", AIC(fit))

gives 2022.884413390652753, while

using GLM, RDatasets
turnout = dataset("Zelig", "turnout")
model = glm(@formula(Vote ~ Age + Age^2 + Race + Income * Educate), turnout, Binomial(), CloglogLink())
GLM.aic(model)

gives 2022.8844886368.

library("Zelig")
library(MASS)
data(sanction)
fit = glm.nb(num ~ target + coop + ncost, data = sanction)
sprintf("%.15f", AIC(fit))

gives 344.879549011806205, while

using GLM, RDatasets
sanction = dataset("Zelig", "sanction")
model = glm(@formula(Num ~ Target + Coop + NCost), sanction, NegativeBinomial(), LogLink())
GLM.aic(model)

gives 365.8580428654269.

Okay, it is an issue with GLM.jl. Can you please raise an issue in GLM.jl?

@ayushpatnaikgit
Copy link
Member

@mousum-github can you please check this?

@mousum-github
Copy link
Member

image

@mousum-github
Copy link
Member

mousum-github commented Nov 21, 2022

glm.nb estimates the \theta but glm with a Negative Binomial, you have to give theta and the default value is 1.
negbin is the corresponding function where \theta is estimated from the data.

@ShouvikGhosh2048
Copy link
Collaborator Author

ShouvikGhosh2048 commented Nov 22, 2022

@mousum-github

  1. For the Julia program in (2), what will the corresponding R program be?
  2. Do the programs in (1) match in what they should compute? If so, is the AIC difference significant?

@mousum-github
Copy link
Member

mousum-github commented Nov 22, 2022

@mousum-github

  1. For the Julia program in (2), what will the corresponding R program be?
    Not sure, you may try with negative binomial distribution as a family parameter in R glm function
  2. Do the programs in (1) match in what they should compute? If so, is the AIC difference significant?
    model = glm(@formula(Vote ~ Age + Age^2 + Race + Income * Educate), turnout, Binomial(), CloglogLink(); rtol=1.0E-16)
    GLM.aic(model)
    2022.8844131094252

@ShouvikGhosh2048
Copy link
Collaborator Author

library(Zelig)
library(MASS)
data(sanction)
fit = glm(num ~ target + coop + ncost, family = negative.binomial(theta=1), data = sanction)
sprintf("%.15f", AIC(fit))

gives 363.858042084877013, which is closer to Julia's value, but still significantly different.

  1. Should we remove this specific Cloglog example from tests? We can't set rtol in CRRao.
    The formula works for others links, and Cloglog tests with other formulae pass.
    We could also try other formulae with square/product terms until we get one which passes tests.

@sourish-cmi
Copy link
Collaborator

@ShouvikGhosh2048 for now, we can remove this example with cloglog. We can say on the front page that the test for cloglog link function is not completed yet. We have to move ahead with other tests. Particularly with the Bayesian test.

In this issue can you mention the completeness of the tests? or the cases that I have not considered yet !!

@ShouvikGhosh2048
Copy link
Collaborator Author

@ShouvikGhosh2048 for now, we can remove this example with cloglog. We can say on the front page that the test for cloglog link function is not completed yet. We have to move ahead with other tests. Particularly with the Bayesian test.

In this issue can you mention the completeness of the tests? or the cases that I have not considered yet !!

  1. I couldn't construct a R program for NegativeBinomial, so those tests are still incomplete.
  2. I can start working on Bayesian tests parallelly.

@sourish-cmi
Copy link
Collaborator

@ShouvikGhosh2048 @mousum-github

As discussed in yesterday's meeting we will focus on functional testing.

  1. @ShouvikGhosh2048 will develop the functional testing for the Bayesian module. For a given seed, the Bayesian module should reproduce the result.

  2. @mousum-github will correct the negative binomial model with negbin instead of glm and provide the test with R output

@sourish-cmi
Copy link
Collaborator

@ShouvikGhosh2048 Mousum da added the negative binomial test, and I approved and merged the test. Perhaps you pull the latest from the main branch. Please check.

@sourish-cmi
Copy link
Collaborator

@ShouvikGhosh2048 since you created separate PR for Bayesian tests. I believe now we can merge this PR. Please confirm.

@ShouvikGhosh2048
Copy link
Collaborator Author

I've updated the numerical tests for negative binomial. I haven't matched the AIC/BIC values with R - I'm not sure which values to pass to R's negbin to match Julia's negbin.

I tried:

library(aod)
library(Zelig)
data(sanction)
fit = negbin(num ~ 0 + target + coop + ncost, ~1, sanction)
AIC(fit)

This gives 344.896 compared to CRRao's 344.87955007518536.
Otherwise the PR is ready to be merged.

@sourish-cmi
Copy link
Collaborator

I've updated the numerical tests for negative binomial. I haven't matched the AIC/BIC values with R - I'm not sure which values to pass to R's negbin to match Julia's negbin.

I tried:

library(aod)
library(Zelig)
data(sanction)
fit = negbin(num ~ 0 + target + coop + ncost, ~1, sanction)
AIC(fit)

This gives 344.896 compared to CRRao's 344.87955007518536. Otherwise the PR is ready to be merged.

@ShouvikGhosh2048
okay, let me have a look and get back to you.

@sourish-cmi sourish-cmi added this to the Package Completeness milestone Dec 22, 2022
@sourish-cmi
Copy link
Collaborator

sourish-cmi commented Dec 22, 2022

I think it is okay - we can ignore it - it might be those denominator issues of AIC. For time being we can ignore it. I think we should go ahead and merge it. Please do let me know your thought.

This was linked to issues Dec 22, 2022
@ShouvikGhosh2048
Copy link
Collaborator Author

ShouvikGhosh2048 commented Dec 24, 2022

We can merge it for now (to make sure our frequentist models are consistent), but we should keep track of this.

@sourish-cmi
Copy link
Collaborator

I have filed the issue with the title AIC value for negbin regression does not matches with R. Otherwise, this PR is fine - so I am merging the PR with the main.

@sourish-cmi sourish-cmi reopened this Dec 28, 2022
Copy link
Collaborator

@sourish-cmi sourish-cmi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All codes are checked and passed. Closing the PR - while opening a new issue

@sourish-cmi sourish-cmi merged commit c51b875 into xKDR:main Dec 29, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Package completeness Testing strategy, some thoughts
4 participants