AbstractModel type & neural kernel network #94

Closed

Conversation

@HamletWantToCode
Contributor

HamletWantToCode commented Mar 6, 2020

Overview

This PR adds a neural kernel network and some additional functionality; it is an enhanced version of PR #92. While adding the neural kernel network, this PR is intended to preserve Stheno's current APIs, and it also gives the 2nd & 3rd points we discussed here a try.

A quick summary of the new functionality:

  1. abstract_model.jl:
    • AbstractModel: base type for all parametric types
    • parameters: function used to extract the parameters of a model into a 1-D array
    • dispatch!: function used to redistribute parameters back to the model
    • extract_gradient: function used to extract gradients from the NamedTuple returned by Zygote.jl
  2. neural_network/basic.jl:
    • LinearLayer: linear transformation with positive weights
    • ProductLayer: element-wise kernel multiplication
    • Chain: chains linear and product layers together to build a sum-product network
  3. gp/neural_kernel_network.jl:
    • Primitive: kernel container
    • Neural_Kernel_Network: the neural kernel network type

Usage

using Stheno
using Zygote

logl = log(0.4)
logσ² = log(1.3)

## kernel without constraints on its parameters
# k = σ² * stretch(Matern52(), 1 / l)
## kernel with constraints on its parameters
k = scale(stretch(Matern52(), logl, x->exp(-x)), logσ², x->exp(x))

gp = GP(k, GPC())

# extract parameters from GP model
ps = parameters(gp)

# assign new parameters back to GP model
new_ps = ps .+ randn(length(ps))
dispatch!(gp, new_ps)

# generate synthetic data from the GP
const x = randn(100)
σ²_n = 0.05                 # observation noise variance
gp_x = gp(x, σ²_n)          # finite-dimensional projection of the GP at x
const y = rand(gp_x)

# compute gradients w.r.t. the GP's parameters and flatten them
gs, = gradient(f->logpdf(f(x, σ²_n), y), gp)
grads = extract_gradient(gp, gs)
@assert length(grads) == length(ps)

An example of the neural kernel network can be found here.
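For the record, here is a rough, hypothetical sketch of how the pieces listed above might fit together; the constructor signatures (number of kernels, layer widths) are illustrative guesses rather than the PR's actual code, so please refer to the linked example for the real API.

using Stheno

# Primitive wraps the base kernels that the network combines (signature assumed here)
primitives = Primitive(EQ(), Matern52(), RQ(1.0))

# sum-product network: LinearLayer mixes kernels with positive weights,
# ProductLayer multiplies them element-wise (layer sizes are made up)
nn = Chain(LinearLayer(3, 4), ProductLayer(2), LinearLayer(2, 1))

# the resulting kernel is then used like any other Stheno kernel
nkn = Neural_Kernel_Network(primitives, nn)
gp = GP(nkn, GPC())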

To be discussed

  1. Necessity of adding AbstractModel: adding this type introduces a tree structure for our models, which makes it easy to extract and redistribute the parameters. The parameters & dispatch! functions are built on top of it, and together they let us use both Flux's stochastic optimizers and Optim.jl's traditional optimizers (see the sketch after this list). When dealing with realistic problems our models may contain a large number of parameters, and manually writing unpack and repack functions each time quickly becomes annoying.
  2. For the neural kernel network, use Flux's NN vs. develop our own: since the types of layers that can be used in a neural kernel network are limited, I suggest maintaining a minimal neural network module specifically for the neural kernel network feature.
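To make the first point concrete, here is a rough sketch of how parameters / dispatch! / extract_gradient could be hooked up to Optim.jl's BFGS (continuing the Usage example above; nlml and nlml_grad! are illustrative names, not part of this PR):

using Optim

function nlml(θ)
    dispatch!(gp, θ)                        # push the flat parameter vector back into the model
    return -logpdf(gp(x, σ²_n), y)          # negative log marginal likelihood
end

function nlml_grad!(g, θ)
    dispatch!(gp, θ)
    gs, = gradient(f -> -logpdf(f(x, σ²_n), y), gp)
    g .= extract_gradient(gp, gs)           # flatten the NamedTuple gradient
end

result = optimize(nlml, nlml_grad!, parameters(gp), BFGS())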

All tests pass on my laptop (macOS, Julia 1.3.0).

@codecov

codecov bot commented Mar 7, 2020

Codecov Report

Merging #94 into master will decrease coverage by 6.27%.
The diff coverage is 59.61%.


@@            Coverage Diff             @@
##           master      #94      +/-   ##
==========================================
- Coverage   88.55%   82.27%   -6.28%     
==========================================
  Files          24       27       +3     
  Lines         690      801     +111     
==========================================
+ Hits          611      659      +48     
- Misses         79      142      +63
Impacted Files                      Coverage Δ
src/composite/compose.jl            61.53% <ø> (ø) ⬆️
src/Stheno.jl                       100% <ø> (ø) ⬆️
src/composite/composite_gp.jl       92.59% <0%> (-7.41%) ⬇️
src/util/parameter_handler.jl       0% <0%> (ø)
src/gp/gp.jl                        88.23% <0%> (-11.77%) ⬇️
src/gp/mean.jl                      85.71% <60%> (-14.29%) ⬇️
src/gp/neural_kernel_network.jl     61.9% <61.9%> (ø)
src/neural_network/basic.jl         76.47% <76.47%> (ø)
src/gp/kernel.jl                    92.89% <84.21%> (-6.44%) ⬇️
... and 2 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0c2299b...38c5786. Read the comment docs.

@HamletWantToCode
Contributor Author

Hi @willtebbutt, just wondering what you think of this PR. If we are comfortable with this design, I'll move on.

@willtebbutt
Member

willtebbutt commented Mar 10, 2020

Hi @HamletWantToCode. Just getting around to looking at this now after being busy last week.

Firstly, thanks for putting so much work into this contribution.

Before I dive into reviewing specifics, I wonder if you could explain why Flux's @functor isn't sufficient for all of the parameter handling stuff that you've added?

See e.g. here (the Layer Helpers bit) and here.

@HamletWantToCode
Contributor Author

Hi @willtebbutt, thanks for the reply; I hope I'm not disrupting you.

Here is what I think:

Since Flux is built for neural networks, its @functor:

  1. Requires the trainable parameters inside a struct to be of type AbstractVecOrMat, while currently many kernel parameters in Stheno are scalars (e.g. PerEQ's l & RQ's α). Therefore, using @functor would require us to change some of Stheno's current implementation.

  2. Doesn't flatten parameters into an AbstractVector; instead, it extracts parameters into a Zygote.IdSet, so one still needs to write additional code to do the flattening and assemble the gradients. This inconvenience stands out when we use optimizers from Optim.jl, since they require all parameters to be packed into a 1-D array. Typically in Gaussian process modelling the dataset isn't that large, so we can use the BFGS method from Optim.jl, which usually has better convergence properties than, e.g., the ADAM optimizer.

The newly added parameters function solves these problems:

  1. It allows the parameters inside a struct to be scalars, vectors or matrices, so we don't need to change Stheno's internal implementation. We only need to implement the child & get_iparam interface for Stheno's current GPs and kernels, which in my opinion won't have any side effects. (Note: in this PR the type of the kernel parameters is temporarily changed to AbstractVecOrMat; this has nothing to do with the parameters method, it is only due to the current requirements of the dispatch! method.)

  2. It extracts all parameters into a 1-D array, which allows us to use Optim.jl seamlessly, and Flux's optimizers also work in this case (see the sketch below).
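A rough illustration of the difference (not code from this PR; Dense is just a stand-in Flux layer, and gp is the GP from the Usage example above):

using Flux, Stheno

layer = Dense(2, 3)
ps_flux = Flux.params(layer)   # Params: a collection of separate arrays (weight and bias)

# with this PR's interface everything is flattened into one 1-D array instead
θ = parameters(gp)             # a single flat vector, ready for Optim.jl
dispatch!(gp, θ .+ 0.1)        # ...and can be written back in one call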

@willtebbutt
Member

No problem at all. I just took longer than I would have liked to take a look at the work, and didn't want to give you the impression I'm not enthusiastic about what you're doing (I'm very excited about it!)

Your explanation seems very reasonable to me, and I think that I agree with the overall principle.

A couple of initial comments:

  • I think it would be better to avoid adding a new AbstractModel type entirely. We could have a system that just requires a model to implement an interface that we specify, rather than imposing the constraint that a model has to subtype a particular abstract type. This means that code that was written without this kind of parameter handling in mind can be extended to handle it without having to be re-written. Am I missing anything that means we really need a new type here?
  • Furthermore, could we modify the default behaviour slightly? In particular, it would be really nice if types that don't have any children or parameters don't have to have anything implemented. So instead of the fallback implementations of child and iparams erroring, could they return empty tuples?

@HamletWantToCode
Contributor Author

I totally agree with your idea.

avoid adding a new AbstractModel type entirely

Thanks for pointing this out. I reread my code, and AbstractModel does seem to be redundant; parameters, dispatch! & extract_gradient still work after getting rid of it. (I will verify this later.)

instead of the fallback implementations of child and iparams erroring, could they return empty tuples?

Yes, they currently just return empty tuples; you can run the following code to check:

eq = EQ()              # the EQ kernel doesn't have any parameters
Stheno.child(eq)       # returns (), an empty tuple
Stheno.get_iparam(eq)  # returns a 0-element Array{Union{},1}

@willtebbutt
Member

Thanks for pointing this out. I reread my code, and AbstractModel does seem to be redundant; parameters, dispatch! & extract_gradient still work after getting rid of it. (I will verify this later.)

Great!

Yes, they currently just return empty tuples; you can run the following code to check:

Ah, yes, I totally believe that's a thing. I would have preferred not to have to implement anything for EQ at all though. In the current implementation, EQ implements

get_iparam(::EQ) = Union{}[]
child(::EQ) = ()

It shouldn't be necessary to write this code at all!
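Something like the following generic fallbacks (just a sketch) would make that unnecessary, so parameter-free types like EQ need no methods at all:

get_iparam(::Any) = Union{}[]   # default: no trainable parameters
child(::Any) = ()               # default: no children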

@HamletWantToCode
Contributor Author

Hi @willtebbutt:

I think it would be better to avoid adding a new AbstractModel type entirely.

The AbstractModel type has been removed.

I would have preferred not to have to implement anything for EQ at all though.

Now you don't need to implement the child and get_iparam methods for EQ-like kernels.

I also added tests for the NKN, as we discussed in PR #95.

@willtebbutt
Member

@HamletWantToCode what are your plans for this PR now?

@HamletWantToCode
Contributor Author

As in our previous discussion, I still think it's better for Stheno to have a parameters function, and I'd like to continue modifying these APIs until they can be merged into master.

parameters should work for all parametric types in Stheno. It can currently handle everything inside the gp folder, and I was considering making it also work for the things in the composite folder. I need some time to figure out how composite GPs work in Stheno; also, I saw this here:

combine GPPP and pseudo-point approximations to do interesting things. This is a WIP -- it doesn't work properly yet.

I am also working on my thesis these days, and would like to use Stheno's composite GPs in my work :)

@willtebbutt
Member

Sorry for the slow response -- I've been trying to figure out how I want things to proceed.

On the one hand, I agree that Flux's API isn't completely ideal for Stheno.jl, in particular the need to always make parameters heap-allocated is very frustrating.

On the other I'm very reluctant to roll our own interface that's specific to this package because one of the main selling points of the package is its interoperability with other packages in the ecosystem. While I'm sure something could be devised that could be made to work with other packages, it's unclear how straightforwardly this would apply in general. Moreover, KernelFunctions.jl is being designed to play nicely with the Flux API.

On the whole, I think it's best that we go with the Flux API because, from a performance perspective, it's not too bad.

That being said, there's nothing to stop us creating a couple of types that help us with transforming input variables, but that themselves play nicely with Flux. We could open a separate issue to discuss that.

I am also working on my thesis these days, and would like to use Stheno's composite GPs in my work :)

Fantastic! Do you specifically need the pseudo-point stuff to work?

@HamletWantToCode
Contributor Author

Hi @willtebbutt, it took me some time to digest your comment ;)

I totally agree with your point about emphasizing the "interoperability with other packages in the ecosystem", and I think it would also be good to determine which packages should be included in that ecosystem; this will help us figure out how to modify the current API and make it clear what functionality should be added in the future.

Also, I have some questions.

I agree that Flux's API isn't completely ideal for Stheno.jl, in particular the need to always make parameters heap-allocated is very frustrating.

Could you explain how this works? Someone told me that @code_warntype can be used to determine whether the parameters in a struct are heap-allocated or not (when the output of @code_warntype contains information highlighted in red, it means those variables are heap-allocated). I used it on Flux's layers and params functions, and found that most of them are stack-allocated.

Moreover, KernelFunctions.jl is being designed to play nicely with the Flux API.

I played with KernelFunctions a little bit; it's good to see a package focusing on kernels. However, it seems that currently KernelFunctions and Stheno can't play with each other (though you've introduced BaseKernel): kernels in KernelFunctions.jl don't accept ColVecs, and Stheno's GP doesn't accept those kernels.

there's nothing to stop us creating a couple of types that help us with transforming input variables, but that themselves play nicely with Flux

While Stheno and KernelFunctions have nice APIs for input transformations (e.g. Stretched, Scaled), I think it's also important for us to pay attention to kernel hyperparameters, such as the scaling factor and the length scale, whose values more or less need to be constrained during optimization. Therefore, besides types performing input transformations, we may also need types that impose constraints on hyperparameters.
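For example, something along these lines (Positive and value are hypothetical names, just to illustrate the idea): store the unconstrained value and expose the constrained one through a bijection.

# hypothetical constraint type, not part of Stheno or KernelFunctions
struct Positive{T<:Real}
    unconstrained::T                      # optimised freely over the reals
end
value(p::Positive) = exp(p.unconstrained) # constrained value is always > 0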

Do you specifically need the pseudo-point stuff to work?

Since my dataset isn't that large (2K~5K points), I think it's OK to use an exact GP. I may use additive GP models and multi-output GP models, and I would like Stheno to work in these cases.
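For reference, the kind of composite model I have in mind (a sketch based on my understanding of Stheno's composite GP machinery):

using Stheno

gpc = GPC()
f1 = GP(EQ(), gpc)
f2 = GP(stretch(Matern52(), 0.5), gpc)
f3 = f1 + f2          # additive GP built from two component GPs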

@HamletWantToCode
Contributor Author

HamletWantToCode commented Apr 10, 2020

Let's close this PR for now; we can always come back to it when we need to.

@willtebbutt
Member

Could you explain how this works? Someone told me that @code_warntype can be used to determine whether the parameters in a struct are heap-allocated or not (when the output of @code_warntype contains information highlighted in red, it means those variables are heap-allocated). I used it on Flux's layers and params functions, and found that most of them are stack-allocated.

This isn't correct. @code_warntype shows you what types the compiler has inferred, and highlights when the compiler has not been able to infer a concrete type for a particular variable. That is, it can help you identify type instabilities, but it doesn't tell you anything more about stack vs heap. In our case I was just referring to the fact that you have to make all scalar values into length-1 vectors to make them work with Flux's parameter handling API. This is okay from a performance perspective because it tends not to be a bottleneck. It is, however, really ugly.
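To illustrate (a toy sketch; MyKernel is made up): to make a scalar hyperparameter visible to Flux's parameter handling, it ends up stored as a length-1 vector.

using Flux

# the length scale is stored as a 1-element vector purely so that
# Flux.params can collect it, even though it is conceptually a scalar
struct MyKernel{T}
    logl::Vector{T}   # [log(l)] rather than a plain scalar log(l)
end
Flux.@functor MyKernel

k = MyKernel([log(0.4)])
Flux.params(k)        # collects the length-1 vector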

I played with KernelFunctions a little bit; it's good to see a package focusing on kernels. However, it seems that currently KernelFunctions and Stheno can't play with each other (though you've introduced BaseKernel): kernels in KernelFunctions.jl don't accept ColVecs, and Stheno's GP doesn't accept those kernels.

Yeah, this is currently the case. I've been holding off on switching Stheno over to KernelFunctions until it's at a level of maturity that is equivalent to Stheno, and we've settled on an API for KernelFunctions. I suspect a couple of things will wind up being named slightly differently to Stheno's conventions, but I'm anticipating it being broadly similar eventually.
