GSoC_2019_project_usability
Let's make it easier for people to use and develop Shogun. For users, we would like to cover: user API & pipelining, parameters defaults and descriptions, exception handling, documentation & examples. In a second step, for scientists/developers, we would like to cover: plugin architecture, internal API, simplification.
- Sergey (github: lisitsyn, IRC: lisitsyn)
- Heiko (github: karlnapf, IRC: HeikoS)
- Wuwei (github: vinx13, IRC: wuwei)
- Gil (github: gf712, IRC: gf712)
Medium. The biggest challenge of this project is the vast scope -- you will touch a lot of Shogun's internals, both framework and ML code. Good planning required! Make sure that you pick a number of interesting topics in your application and make sure to show that you have a really good idea of what you want to achieve, the best usually being proof-of-concepts.
You need to know
- C++ software design (not just fast for loops!), Python
- How Shogun's interfaces work, aka SWIG.
- Shogun's new parameter framework, aka tags
- Exception handling
- Machine Learning basics
- Other ML libraries (with good APIs)
Write user stories! Those are pseudo-code examples of how the user interacts with the library, and what that should look like. Your application needs to contain a couple of those (and ideas on how to make them happen inside Shogun). Cover all the topics you want to address (see below), and try to be as precise as possible. See also below for more details.
Here are some sub-projects. We are open for more:
NOTE: A GSoC project will address multiple (or ideally all) of those topics.
We would like to put effort into cleaning and re-designing the current user API.
That is, the API that is accessed through SWIG and that is exposed via our examples. That is not the internal C++ API.
As a motivation, have a look at e.g. our very basic learner class: CMachine.
Observe how many methods there are, and how confusing this must seem to new users.
Your task is to simplify this. This will include: renaming existing classes and methods, adding new methods, re-factoring existing classes, maybe even adding new classes.
Most importantly, we want to make the new API very minimal.
Steps:
- Have a look at these notes for some initial ideas and examples.
- Write a few complete user stories (API usage example, see the notes for some examples) for common ML cases. This requires some research: what are fundamental ML tasks that should definitely be covered, how do other libraries do it?
- Turn your insights on what the API should look like into a summary: a class diagram for example.
- Come up with a set of API changes in Shogun required to serve the user stories. This will include adding/renaming/removing.
- Work incrementally, one "use-case" at a time.
Topics to cover:
- Clean up our learner base class CMachine, and make it follow the de-facto standard of fit/predict, see here for some ongoing work.
- Remove all "casting" methods from Shogun, they are not needed anymore since we have tags. I.e. remove all ::obtain_from_generic calls, remove apply_regression, apply_binary etc.
- Give the new Converter/Transformer classes some love: write examples/notebooks, use them in pipelines, see what happens, fix the errors. Make using them easy!
- Implement some of the operator chaining ideas from the notes. This is a bigger topic and will require some code logic that implements the ideas. Could take a few weeks but results in a really cool improvement.
- Remove all copy methods (that create a copy of an instance), but rather implement copy constructors and rely on clone for deep copies.
- Put in methods to change the interface for algorithms that support multiple APIs (see below)
- There is way more here to do, but you get the idea :)
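To make the fit/predict and operator-chaining topics above more concrete, here is a minimal sketch in plain Python. Every name in it (Transformer, Machine, Pipeline, Center, SignClassifier, and the use of the | operator for chaining) is a hypothetical stand-in chosen for illustration; it is not Shogun's API and not necessarily what the notes propose.

```python
# Hypothetical sketch only: none of these classes exist in Shogun.

class Machine:
    """Minimal learner interface: nothing but fit and predict."""

    def fit(self, X, y):
        raise NotImplementedError

    def predict(self, X):
        raise NotImplementedError


class Pipeline(Machine):
    """Applies transformers in order, then delegates to a final machine."""

    def __init__(self, transformers, machine=None):
        self.transformers = list(transformers)
        self.machine = machine

    def __or__(self, step):
        if self.machine is not None:
            raise ValueError("pipeline already ends in a machine")
        if isinstance(step, Transformer):
            return Pipeline(self.transformers + [step])
        return Pipeline(self.transformers, step)

    def fit(self, X, y):
        if self.machine is None:
            raise ValueError("pipeline has no final machine")
        for t in self.transformers:
            X = t.fit(X).transform(X)
        self.machine.fit(X, y)
        return self

    def predict(self, X):
        for t in self.transformers:
            X = t.transform(X)
        return self.machine.predict(X)


class Transformer:
    """A preprocessing step; `a | b` starts a pipeline."""

    def fit(self, X):
        return self

    def transform(self, X):
        raise NotImplementedError

    def __or__(self, step):
        return Pipeline([self]) | step


class Center(Transformer):
    """Toy transformer: subtracts the training mean from 1-D data."""

    def fit(self, X):
        self.mean = sum(X) / len(X)
        return self

    def transform(self, X):
        return [x - self.mean for x in X]


class SignClassifier(Machine):
    """Toy machine: predicts +1 for non-negative inputs, -1 otherwise."""

    def fit(self, X, y):
        return self

    def predict(self, X):
        return [1 if x >= 0 else -1 for x in X]


# User story: chain preprocessing and learner, then fit and predict.
pipeline = Center() | SignClassifier()
pipeline.fit([1.0, 2.0, 3.0], [-1, 1, 1])
predictions = pipeline.predict([0.0, 2.0, 5.0])
```

The point of the sketch is the shape of the user story: the user sees only fit and predict, and the chaining operator hides the order in which the steps are fitted and applied.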
An example for a clean GMM API using as_*
gmm = sg.GMM()
gmm.algorithm = "split_and_merge_em" # select the training algorithm by name ...
gmm.algorithm = "em" # ... or pick a different one
gmm.fit(features)
gmm.predict(features_test) # returns discrete labels, classification
gmm.as_classifier().predict(features_test) # same as above
gmm.as_distribution().predict(features_test) # returns the log-probability of each component for each data point
gmm.as_distribution().as_mixture().get_component(idx) # returns a Gaussian component
gmm.as_distribution().sample(100) # returns 100 samples from the mixture
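One way the as_* idea could be realised is with lightweight view objects that share the fitted state of the underlying model. The sketch below is plain Python under loud assumptions: ToyMixture, ClassifierView and DistributionView are hypothetical names, the data is 1-D, and the "fit" is a crude median split used as a stand-in, not EM and not Shogun's GMM.

```python
import math

# Hypothetical sketch only: not Shogun's GMM, and the fit is NOT real EM.


class ClassifierView:
    """View of the model that answers with discrete component labels."""

    def __init__(self, model):
        self.model = model

    def predict(self, X):
        labels = []
        for x in X:
            log_probs = self.model.component_log_probs(x)
            labels.append(log_probs.index(max(log_probs)))
        return labels


class DistributionView:
    """View of the same model that answers with log-probabilities."""

    def __init__(self, model):
        self.model = model

    def predict(self, X):
        return [self.model.component_log_probs(x) for x in X]


class ToyMixture:
    """Two unit-variance 1-D Gaussians with means set by a median split."""

    def fit(self, X):
        xs = sorted(X)
        half = len(xs) // 2
        lower, upper = xs[:half], xs[half:]
        self.means = [sum(lower) / len(lower), sum(upper) / len(upper)]
        return self

    def component_log_probs(self, x):
        # Log density of x under each unit-variance Gaussian component.
        const = -0.5 * math.log(2.0 * math.pi)
        return [const - 0.5 * (x - m) ** 2 for m in self.means]

    def as_classifier(self):
        return ClassifierView(self)

    def as_distribution(self):
        return DistributionView(self)

    def predict(self, X):
        # Default behaviour is classification, as in the example above.
        return self.as_classifier().predict(X)


mixture = ToyMixture().fit([0.0, 0.2, 9.8, 10.0])
labels = mixture.predict([0.0, 10.0])
log_probs = mixture.as_distribution().predict([0.0])
```

Because the views hold a reference to the model rather than a copy, switching between interfaces costs nothing and never requires retraining.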
Currently, Shogun's exception handling is not ideal. It is just the same ShogunException that is thrown and in some languages it causes the program to exit. Our error messages are sometimes good (if the developer was motivated), and sometimes quite bad -- they don't tell the user what she did wrong.
This part of the project is to introduce a small set of exceptions and populate Shogun with them (e.g. NotConverged or InvalidState) so in the code the following would be possible:
try:
    svm.train()
except shogun.NotConverged:
    ...
except shogun.InvalidState:
    ...
The next step is to connect them to the SWIG interfaces. Some initial work has been done as part of last year's GSoC project by Wuwei.
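As a rough illustration of what such a small exception set could look like from the Python side, here is a self-contained sketch. It assumes the new exceptions derive from a common ShogunException base so that existing catch-all handlers keep working; that design choice, the train function and its halving "solver" are all hypothetical and only mimic the behaviour described above.

```python
# Hypothetical sketch only: a pure-Python mock, not Shogun's bindings.


class ShogunException(Exception):
    """Common base, so `except ShogunException` still catches everything."""


class NotConverged(ShogunException):
    """The solver hit its iteration limit before reaching the tolerance."""


class InvalidState(ShogunException):
    """A method was called on an object that is not ready for it."""


def train(data, max_iterations=20, tolerance=1e-3):
    """Toy solver whose error halves each iteration; returns iterations used."""
    if not data:
        raise InvalidState(
            "train() needs non-empty training data; "
            "set the features before calling it"
        )
    error = 1.0
    for iteration in range(1, max_iterations + 1):
        error /= 2.0
        if error < tolerance:
            return iteration
    raise NotConverged(
        f"error {error:.4f} is still above tolerance {tolerance} after "
        f"{max_iterations} iterations; increase max_iterations"
    )


# The user can now react differently to each failure mode.
try:
    train([1.0], max_iterations=5)
    outcome = "converged"
except NotConverged:
    outcome = "retry with more iterations"
except InvalidState:
    outcome = "fix the setup"
```

Note that both messages say what went wrong and what the user can do about it, which is the other half of the problem described above.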
We would like to see all of Shogun's API covered in the meta examples (which also means they are integration tested). We currently lack examples (and cookbooks) for
- StringFeatures (see here for some initial work)
- Fast SVMs in Shogun
- Dimensionality reduction
- many more ...
This project will involve writing at least 2-3 cookbooks per week (other projects need 2 examples without a cookbook), to increase coverage.
Machine Learning algorithms crucially depend on well-chosen parameters. While you can tune them automatically with Shogun (takes long though), a user sometimes simply wants to run an algorithm out of the box. Therefore, Shogun's default parameters should be sensibly chosen by the people who know what they are doing: the developers. Furthermore, users might be interested in what the parameters do, so we need to make it easy to read their descriptions without opening a web-browser.
In this part of the project, you will
- Make sure (i.e. test with real-world examples) that the default parameters of Shogun are well-chosen. Compare the choices to other libraries. One that does a particularly good job is sklearn.
- We also would like to add a mechanism for automatically inferring parameters from data, similar again to sklearn's auto string that can be passed to (numerical or not) parameters. We would like to offer a similar option, but obviously have to stay type safe, so we need to think about a nice design pattern here (strategy pattern?).
- Implement a nice way to expose parameter documentation (currently done via doxygen, see the API) at runtime. This is likely to be done via the tags framework. We could for example see a Python script that reads parameter documents in tags and then makes sure they appear in the doxygen API. Example: help(svm) (we have that already, but it needs polish), help(svm.C). This should also work for all target interfaces. There is some work around parameter descriptions being done here.
- Update @brief descriptions of Shogun's algorithms (some are good, some are completely missing)
- Add nice eye candy to the interface languages, for example code completion for IPython (example)
- We would like to integrate / merge all existing sources of documentation into a single one: cookbooks, API, parameter docs should all use the same content in order to improve maintainability (TODO: explain this better)
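The two parameter ideas above (type-safe automatic inference and descriptions available at runtime) can be sketched together in a few lines of Python. This is only an illustration of the strategy pattern under stated assumptions: Parameter, Auto, resolve and describe are hypothetical names, not part of Shogun's tags framework, and the "1 / number of features" rule is just an example heuristic.

```python
from dataclasses import dataclass
from typing import Callable, Generic, Sequence, TypeVar, Union

# Hypothetical sketch only: not Shogun's tags framework.

T = TypeVar("T")


@dataclass(frozen=True)
class Auto(Generic[T]):
    """Strategy object: infers a value of type T from the training data."""

    infer: Callable[[Sequence[Sequence[float]]], T]
    rule: str  # human-readable description of the heuristic


@dataclass
class Parameter(Generic[T]):
    """A named parameter carrying its own description."""

    name: str
    description: str
    value: Union[T, Auto[T]]

    def resolve(self, X: Sequence[Sequence[float]]) -> T:
        # A fixed value is returned as is; a strategy is evaluated on X.
        if isinstance(self.value, Auto):
            return self.value.infer(X)
        return self.value

    def describe(self) -> str:
        if isinstance(self.value, Auto):
            shown = f"auto ({self.value.rule})"
        else:
            shown = repr(self.value)
        return f"{self.name}: {self.description} [default: {shown}]"


# Instead of passing the string "auto" to a float parameter, the default
# is a strategy that is guaranteed to produce a float.
width = Parameter(
    name="width",
    description="Kernel width.",
    value=Auto(lambda X: 1.0 / len(X[0]), "1 / number of features"),
)

X = [[0.0, 1.0, 2.0, 3.0], [4.0, 5.0, 6.0, 7.0]]
inferred = width.resolve(X)
fixed = Parameter("width", "Kernel width.", 2.0).resolve(X)
text = width.describe()
```

The same describe output could in principle feed both an interactive help call and the generated API docs, which is one way to keep the documentation sources in sync.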
Anything else that sucks about using Shogun? Put it in here :)
You like thinking about API design? You like things to be neat? You enjoy exploring existing code-bases? You like to have an impact on Shogun?
This project will massively improve Shogun's usability, and therefore has a potentially significant impact on the project's user-base. You will get exposed to a lot of Shogun's internals and have a say in design decisions that will shape the face of Shogun.