Merge pull request #321 from dpordomingo/extract-data-team-and-os
Extract data team and os
dpordomingo authored Nov 9, 2018
2 parents e43d7da + 966de0b commit 3b2b6e5
Showing 10 changed files with 291 additions and 892 deletions.
277 changes: 45 additions & 232 deletions hugo/data/projects.yml
@@ -1,256 +1,69 @@
#
# README :
#
# Two different kinds of projects are accepted:
# - without documentation -> the project card will point to the repository source code
# - having documentation generated by [docsrv](src-d/docs) -> the project card will point to the docs
#
# PROJECT WITHOUT DOCS schema:
#
# - name: short name of the project as it will appear in the card title (i.e. kmcuda)
# url: link to the repository source code (i.e. //github.com/src-d/kmcuda)
# desc: project description that will appear in the project card ('/projects' page)
# logo: (optional) svg name of the icon (under 'static/img/icons') without the '.svg' extension.
# (i.e. projects/bblfsh for the image under static/img/icons/projects/bblfsh.svg)
# If it's not provided, it will use a random codepill
#
# PROJECT HAVING DOCS schema:
#
# - name: (SAME AS ABOVE)
# hostname: hostname of the documentation (without the protocol, i.e. engine.sourced.tech)
# url: link to the docs root (it should have the format '//hostname', i.e. //engine.sourced.tech)
# desc: (SAME AS ABOVE) plus description shown in the hero section at the project documentation site
# repository: identifier of the Github project with the format 'owner/project-name' (i.e. src-d/engine)
# minVersion: the minimum version of the project release containing documentation to be served. (i.e. v0.0.11)
# This ensures that documentation is not generated/served for older releases
# languages: list of the languages whose API documentation will be generated
# (supported: python, cpp, scala, go)
# logo: (SAME AS ABOVE)
# logosmall: project nav icon (SAME AS DEFINED FOR $LOGO)
#
sections:

no_data_message: We still have things to polish here, soon to be released. Join our community to keep posted!
categories:
order: # all groups appearing here, will appear in the Landing
order: # only groups appearing here, will appear in the Landing in this order
- datasets
- models
- retrieval
- languages
- science
- demos
contents:
- applications

# ########## datasets ##########
collection:

datasets:
name: datasets
colors: {left: "#003ca1", right: "#656afa"}
desc: Output from our pipeline for source code analysis, our open datasets provide a ready-to-use baseline for your next machine learning and code analysis projects
title: Datasets
name: Datasets
projects:
# no datasets added so far as legacy ones didn't use the current pipeline

# ########## machine learning models ##########
- name: Public Git Archive
url: https://github.com/src-d/datasets/tree/master/PublicGitArchive

models:
name: models
colors: {left: "#e4415a", right: "#ff6d4c"}
desc: A selection of machine learning models trained using our tools over large datasets and ready to be used in your research or project with the supporting libraries
title: Models
name: Models
projects:
- name: id2vec
url: //github.com/src-d/models/blob/master/id2vec/92609e70-f79c-46b5-8419-55726e873cfc.md
desc: Source code identifier embeddings, where every identifier is represented by a dense vector; no splitting or stemming, later converted with quality loss
repository: src-d/models
- name: nbow
url: //github.com/src-d/models/blob/master/nbow/1e3da42a-28b6-4b33-94a2-a5671f4102f4.md
desc: Weighted bag-of-words where every word is a dense vector; trained over the code of the 140k top starred GitHub repositories
repository: src-d/models
- name: docfreq
url: //github.com/src-d/models/blob/master/docfreq/f64bacd4-67fb-4c64-8382-399a8e7db52a.md
desc: Document frequencies of code identifiers, i.e. how many projects contain a certain identifier after splitting & stemming; trained on 10M GitHub repos after de-duplication
repository: src-d/models
- name: topics
url: //github.com/src-d/models/blob/master/topics/c70a7514-9257-4b33-b468-27a8588d4dfa.md
desc: Topic modeling of Git repositories; trained over 10M GitHub repositories after de-duplication
repository: src-d/models

# ########## data retrieval tools ##########
- name: Topic Modeling
url: https://github.com/src-d/models#topics
- name: Identifier Embeddings
url: https://github.com/src-d/models#id2vec
- name: TF/IDF BoW
url: https://github.com/src-d/models#bow

retrieval:
name: data retrieval tools
colors: {left: "#317d19", right: "#4ecc7b"}
desc: A set of tools that allow you to discover, fetch, store, access, filter and extract features from just a single source code repository to tens of millions of repositories
title: Code Retrieval Tools
name: Code Retrieval
projects:
- name: engine
hostname: engine.sourced.tech
url: //engine.sourced.tech
desc: the source{d} engine combines data retrieval and language analysis tools for scalable pipelines that process any number of Git repositories for source code analysis
repository: src-d/engine
minVersion: v0.0.11
languages:
- python
- scala
logo:
logosmall:
- name: go-git
hostname: go-git.sourced.tech
url: //github.com/src-d/go-git
desc: go-git is a highly extensible Git implementation in pure Go language
repository: src-d/go-git
minVersion: v4.0.0
languages:
- go
logo:
logosmall:
- name: rovers
url: //github.com/src-d/rovers
desc: rovers is a service to retrieve repository URLs from multiple code repository hosting providers, similarly to a search engine crawler
repository: src-d/rovers
minVersion: v2.5.3
languages:
- go
- name: borges
url: //github.com/src-d/borges
desc: borges reads code repository URLs, then collects and stores them at large scale by using a producer-consumer architecture
repository: src-d/borges
minVersion: v0.7.1
languages:
- go
- name: śiva
hostname: siva.sourced.tech
url: //siva.sourced.tech
desc: śiva is an archiving format similar to TAR/ZIP, focused on allowing constant-time random file access, seekable access to contained files and concatenable files
repository: src-d/go-siva
minVersion: v1.1.3
languages:
- go

# ########## language analysis tools ##########
- name: go-git
url: https://github.com/src-d/go-git
- name: Rovers
url: https://github.com/src-d/rovers
- name: Borges
url: https://github.com/src-d/borges

languages:
name: language analysis tools
colors: {left: "#c2732a", right: "#f18406"}
desc: A toolset that enables you to identify with speed and precision programming languages from source code files and turn them into universal abstract syntax trees (UASTs)
title: Code Analysis Tools
name: Code Analysis
projects:
- name: babelfish
hostname: bblf.sh
url: //bblf.sh
desc: babelfish is a self-hosted server for universal source code parsing, turning code files into Universal Abstract Syntax Trees (UASTs)
repository: bblfsh/bblfshd
logo: projects/bblfsh
minVersion: v2.1.1
languages:
- go
- name: enry
hostname: enry.sourced.tech
url: //enry.sourced.tech
desc: enry is a faster source code file programming language detector based on github/linguist and toolbox that ignores binary or vendored files
repository: src-d/enry
minVersion: v1.5.2
languages:
- go
- name: babelfish tools
url: //github.com/bblfsh/tools
desc: babelfish tools are easy-to-use command line tools for simple code analysis, such as tokenizer, cyclomatic complexity, npath complexity, patch
repository: bblfsh/tools
languages:
- go

# ########## machine learning tools ##########
- name: Babelfish
url: https://doc.bblf.sh/
- name: Gitbase
url: https://github.com/src-d/gitbase
- name: Engine
url: https://github.com/src-d/engine
- name: Lookout
url: https://github.com/src-d/lookout

science:
name: machine learning tools
colors: {left: "#832fcc", right: "#c05cea"}
desc: Our ML tools range from feature extraction on top of source code abstract syntax trees to lightning-fast, large scale clustering algorithms running on GPUs
title: Machine Learning
name: Machine Learning
projects:
- name: ml
url: //github.com/src-d/ml
desc: sourced.ml provides a framework for Machine Learning on Source Code (MLoSC) over UASTs, including identifier embeddings, document frequencies, topic modeling and more
repository: src-d/ast2vec
minVersion: 0.3.5-alpha
languages:
- python
- name: modelforge
url: //github.com/src-d/modelforge
desc: modelforge is the foundation for storing and sharing machine learning models, with an extensible registry backend and using the ASDF storage format
repository: src-d/modelforge
minVersion: 0.3.1-alpha
languages:
- python
- name: kmcuda
url: //github.com/src-d/kmcuda
desc: kmcuda is a large-scale K-means and K-nn implementation that supports diverse distance metrics and can be accelerated using multiple NVIDIA GPUs (CUDA)
repository: src-d/kmcuda
minVersion: 6.2.0
languages:
- python
- cpp
- name: minhashcuda
url: //github.com/src-d/minhashcuda
desc: minhashcuda is a large-scale weighted MinHash implementation optimized for low memory and high speed by running on multiple NVIDIA GPUs (CUDA)
repository: src-d/minhashcuda
minVersion: 1.1.1
languages:
- cpp
- python
- name: wmd-relax
url: //github.com/src-d/wmd-relax
desc: wmd-relax is a large-scale Word Mover's Distance implementation optimized for speed by using google/or-tools that is compatible with spaCy
repository: src-d/wmd-relax
minVersion: v1.2.6
languages:
- python
- cpp

# ########## demos ##########

demos:
name: demos
colors: {left: "#ffba34", right: "#fff444"}
desc: Demos are use case examples based on our tech stack and which both help you to get started with them as well as provide real-world, concrete functionality
projects:
- name: dashboard
url: //github.com/bblfsh/dashboard
desc: babelfish dashboard is a visualization tool that uses the babelfish universal code parser to display UASTs and its details in a human-friendly manner
repository: bblfsh/dashboard
languages:
- js
- go
- name: vecino
url: //github.com/src-d/vecino
desc: vecino is a CLI app to discover the most similar Git repositories to the one provided through matching or synonymical source code identifiers
repository: src-d/vecino
languages:
- python
- name: tmsc
url: //github.com/src-d/tmsc
desc: tmsc is a CLI tool that applies topic modeling on source code to discover the topics of a repository the user provides
repository: src-d/tmsc
minVersion: 0.1.1-alpha
languages:
- python
- name: hercules
url: //github.com/src-d/hercules
desc: hercules (and its labours) calculates and displays various Git repository statistics as code burndown, developer ownership, file & developer copulas
repository: src-d/hercules
minVersion: v2
languages:
- go
- python
# further demos are either legacy or WIP

# ########## other projects not appearing in the landing ##########
- name: sourced.ml
url: https://github.com/src-d/ml

others:
name: others
colors: {left: "#888088", right: "#BBBBBB"}
desc: random and unrelated projects
applications:
title: Applications
name: Applications
projects:
- name: landing
hostname: landing.sourced.tech
url: //github.com/src-d/landing
desc: landing of source{d}
repository: src-d/landing
languages:
- js
- html
- css
- name: Gemini
url: https://github.com/src-d/gemini
- name: Hercules
url: https://github.com/src-d/hercules
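
For orientation, the fragment below is a rough sketch of how the slimmed-down entries in this diff appear to fit together after the change. Every key and value in it is taken from lines shown above. The nesting and indentation are assumptions, because this view of the diff does not preserve the original indentation or the added/removed markers. Only two of the categories are shown.

```yaml
# Assumed nesting: indentation is not recoverable from the diff view above.
categories:
  order: # only groups appearing here, will appear in the Landing in this order
    - datasets
    - models
  collection:
    datasets:
      title: Datasets
      name: Datasets
      projects:
        - name: Public Git Archive
          url: https://github.com/src-d/datasets/tree/master/PublicGitArchive
    models:
      title: Models
      name: Models
      projects:
        - name: Topic Modeling
          url: https://github.com/src-d/models#topics
```

Under this reading, each project entry carries only `name` and `url`, while the older per-project fields (`desc`, `repository`, `minVersion`, `languages`, `logo`) no longer appear.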