
GitHub-Crawler

A Python script to collect C/C++ projects with at least 50 stars from GitHub using its API.

This script crawls information and repositories from GitHub using the GitHub REST API (https://developer.github.com/v3/search/).

Given a query, the script downloads the ZIP archive of each repository returned by the query. It also generates a CSV file listing the queried repositories. For each query, GitHub returns a JSON response, which the script parses to extract repository information.
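As a minimal sketch of this workflow, the helpers below build a search URL and extract repository names, star counts, and ZIP URLs from a search response. The function names are illustrative, not taken from the script itself; only the GitHub API endpoint and response fields are real.

```python
import urllib.parse

# GitHub's repository search endpoint (see the REST API docs).
API = "https://api.github.com/search/repositories"

def build_search_url(query, page=1, per_page=100):
    """Build the URL for one page of search results (max 100 per page)."""
    params = urllib.parse.urlencode(
        {"q": query, "page": page, "per_page": per_page}
    )
    return f"{API}?{params}"

def parse_repositories(payload):
    """Extract (full_name, stars, zip_url) tuples from a parsed JSON response.

    Every GitHub repository can be downloaded as a ZIP of its default branch
    via the /archive/refs/heads/<branch>.zip URL.
    """
    repos = []
    for item in payload.get("items", []):
        zip_url = (
            f"https://github.com/{item['full_name']}"
            f"/archive/refs/heads/{item['default_branch']}.zip"
        )
        repos.append((item["full_name"], item["stargazers_count"], zip_url))
    return repos
```

To run a crawl, one would fetch each page URL (e.g. with `urllib.request.urlopen` or the `wget` package), feed the decoded JSON to `parse_repositories`, download each `zip_url`, and append the tuples to the CSV listing.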

The GitHub API returns at most 100 elements per page and up to 1,000 elements per query. To collect more than 1,000 elements, the main query should be split into multiple subqueries over different time windows via the constant SUBQUERIES (a list of subqueries).
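One common way to build such a list is to restrict each subquery to a date range with the search API's `created:` qualifier, so each window stays under the 1,000-result cap. A sketch, with a hypothetical helper name (the actual SUBQUERIES in the script may be written out by hand):

```python
def yearly_subqueries(base_query, start_year, end_year):
    """Split one query into per-year subqueries using the `created:` qualifier."""
    return [
        f"{base_query} created:{year}-01-01..{year}-12-31"
        for year in range(start_year, end_year + 1)
    ]

# Example: three time windows for C projects with at least 50 stars.
SUBQUERIES = yearly_subqueries("language:C stars:>=50", 2015, 2017)
```

If a single year still yields more than 1,000 results, the same idea applies with monthly or weekly windows.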

NOTE: please review GitHub's terms of service to make sure you do not violate any GitHub rules.

Dependencies

The dependencies you need include:
wget
simplejson

You need to specify OUTPUT_FOLDER and OUTPUT_TXT_FILE in getDataFromGithub.py.
The MINIMUM_PROJECT_NUM constant sets the minimum number of projects to collect.
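The constant names above come from the script; the values below are hypothetical placeholders showing how such a configuration block might look:

```python
# Hypothetical configuration values; adjust the paths to your environment.
OUTPUT_FOLDER = "./repos"          # where downloaded ZIP archives are stored
OUTPUT_TXT_FILE = "./repos.csv"    # CSV listing of the crawled repositories
MINIMUM_PROJECT_NUM = 1000         # minimum number of projects to collect
```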
