
Google Season of Docs 2019

Heiko Strathmann edited this page Apr 23, 2019 · 2 revisions

Shogun project ideas for GSoD'19

About Shogun

Shogun is a library for efficient and unified Machine Learning (ML). Its main distinguishing feature is that its C++ core can be used from a wide range of languages (Python, R, Java, Octave, etc.) via a unified API. The library contains many state-of-the-art implementations of modern ML algorithms, as well as many algorithms that are not part of other libraries. The project is one of the oldest ML libraries (started in 1999) and one of the largest ML codebases (see e.g. https://www.openhub.net/p/shogun for code statistics), with many scientific citations and users worldwide.

The state of Shogun's documentation

Shogun's documentation has been written by multiple generations of core developers, external developers, GSoC students, interns, etc., most of them on a volunteer basis. As such, it is heavily fragmented. We made a heavy effort in 2015 to restructure the documentation so that it reflects the different groups of users the project has. Those are:

  • data-scientists / end-users, who simply want to run an algorithm from their favourite language. For them, we have the cookbook API examples.
  • scientists, who want to use the Shogun framework to implement their own algorithms.
  • developers, who extend the framework, make algorithms more efficient, learn about ML coding, etc.

The classes of documentation we developed are:

These classes of documentation have proven helpful, but what is lacking is a systematic stream of energy going into their improvement, development, and maintenance. As a result, many of the above parts are incomplete. In addition, Shogun does not have a central user-guide style documentation that takes a new user by the hand, going from running her first model, to building the source code, to adding a new model using the framework, to changing/extending the framework itself.

Shogun's approach to documentation work

As with code contributions, all documentation changes are made (and reviewed) via pull requests on GitHub. All of our documentation is developed by the core devs and by external contributors such as students. For all documentation that is written by external contributors, we have a review process that

  • ensures coverage (new code needs to be documented, new parts of the framework need to be documented, GSoC projects need a real-world documented application, new algorithms need API examples, etc.)
  • ensures correct use of the English language, which is very important as many contributors are non-native speakers
  • iterates over produced documentation a handful of times to ensure it is clear

Shogun needs maintainable, lasting documentation

A big problem in documenting Shogun is its changing nature: as a consequence, its documentation becomes outdated incredibly fast. We have seen this multiple times in the past: a part of Shogun is documented in an isolated fashion (e.g. in a readme), which is useful for a limited amount of time, until it becomes obsolete. There are multiple ways to remedy this. The first is to write documentation that is as invariant as possible to changes in the code base. A simple example is to not write out method/function names in an API example, but rather to refer to their semantic meaning or, even better, to use a reference to a LaTeX-style label that is looked up automatically. Second, whenever code is part of documentation, it needs to be executable, and in particular part of a test that runs along with the usual CI, so that code snippets in the documentation are always up to date. Shogun's API cookbook follows this concept: a cookbook page consists of both a reStructuredText (.rst) file and a code script in which sections are surrounded by markers that can be referenced from the .rst file. As a result, our cookbook API examples can be part of the CI build, and so we know that all code shown works. See e.g. https://github.com/shogun-toolbox/shogun/pull/3078/files for an example.
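To illustrate the marker idea, here is a minimal Python sketch of how a documentation page could pull named sections out of an executable script. It is not Shogun's actual cookbook tooling: the marker syntax (`# [begin:name]` / `# [end:name]`), the `{{snippet:name}}` placeholder, and the function names `extract_snippets` and `render_page` are all assumptions made up for this example.

```python
import re

# Hypothetical marker and placeholder syntax, invented for this sketch.
BEGIN = re.compile(r"^\s*#\s*\[begin:(\w+)\]\s*$")
END = re.compile(r"^\s*#\s*\[end:(\w+)\]\s*$")
PLACEHOLDER = re.compile(r"^\{\{snippet:(\w+)\}\}$")


def extract_snippets(script):
    """Collect the lines between matching begin/end markers, keyed by name."""
    snippets = {}
    current = None
    for line in script.splitlines():
        begin, end = BEGIN.match(line), END.match(line)
        if begin:
            if current is not None:
                raise ValueError("nested section inside %r" % current)
            current = begin.group(1)
            snippets[current] = []
        elif end:
            if end.group(1) != current:
                raise ValueError("unexpected end marker %r" % end.group(1))
            current = None
        elif current is not None:
            snippets[current].append(line)
    if current is not None:
        raise ValueError("section %r is never closed" % current)
    return {name: "\n".join(lines) for name, lines in snippets.items()}


def render_page(page, snippets):
    """Replace each placeholder line with the snippet it names."""
    out = []
    for line in page.splitlines():
        ref = PLACEHOLDER.match(line.strip())
        if ref:
            # A KeyError here means the page references a missing section,
            # which is the kind of drift a CI run should catch.
            out.append(snippets[ref.group(1)])
        else:
            out.append(line)
    return "\n".join(out)


SCRIPT = """\
# [begin:create_data]
values = [1.0, 2.0, 3.0]
# [end:create_data]
# [begin:compute_mean]
mean = sum(values) / len(values)
# [end:compute_mean]
assert mean == 2.0
"""

PAGE = """\
First we create some data:
{{snippet:create_data}}
Then we compute its mean:
{{snippet:compute_mean}}
"""

# Running the whole script is what a CI job would do to check the snippets.
exec(SCRIPT, {})
rendered = render_page(PAGE, extract_snippets(SCRIPT))
print(rendered)
```

Because the page only names sections and never copies code, a change to the script either shows up in the rendered page automatically or fails loudly when a referenced section disappears.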

Contact

Please read our "Getting involved" guide for how best to get started working on Shogun: https://github.com/shogun-toolbox/shogun/wiki/Getting-involved

Project idea: An integrated Shogun user-guide

Sub-title: Integrate existing documentation, fill the gaps.

As outlined above, three groups interact with Shogun: end-users, scientists, and developers. Shogun's documentation reflects this split, which leaves the documentation as a whole fragmented. The goal of this project is to produce a Shogun user-guide that guides a person new to Shogun along their path from downloading the project for the first time, to using it for data-science tasks, to adding new algorithms to the framework, to changing the framework.

This user guide

  • integrates all existing documentation into a single guide with three parts that build on each other
  • fills the gaps in individual documentation parts (missing reference API, cookbook example, build/framework instructions)
  • removes redundancy across different documentation parts
  • is maintainable, as outlined above.

Possible sub-topics include:

  • Writing a quick-start section that makes installing Shogun and running first examples a matter of minutes
  • Reviewing and improving all API cookbook examples for unified style, detail level, and references
  • Reviewing and improving all IPython notebooks for unified style, detail level, and references
  • Working on improved separation of concerns in all documentation blobs: each documentation part should address a single topic, rather than information being spread unevenly across different types of documentation
  • Working on an installation/build guide that is testable (see maintainable documentation)
  • Improving the structure and content of https://shogun.ml/, in particular to address our three user groups: end-users, scientists, developers
  • Producing a roadmap for engaging the community in further documentation work, in order to funnel their manpower (similar to our good-first-issue tasks)
  • Adding images or graphics to enhance the textual explanations
  • Updating out-of-date references and refactoring content to follow the latest best practices