GitHub

This is a repository crawler which outputs gmeta.json files for indexing by Globus. It currently supports OAI, CKAN, MarkLogic, Socrata, CSW, DataStream, and OpenDataSoft repositories.

Configuration is split into two files. The first controls the operation of the indexer, and is located in conf/harvester.conf.

The list of repositories to be crawled is in conf/repos.json, in a structure like to this:

{
    "repos": [
        {
            "name": "Some OAI Repository",
            "url": "http://someoairepository.edu/oai2",
            "homepage_url": "http://someoairepository.edu",
            "set": "",
            "thumbnail": "http://someoairepository.edu/logo.png",
            "type": "oai",
            "update_log_after_numitems": 50,
            "enabled": true
        },
        {
            "name": "Some CKAN Repository",
            "url": "http://someckanrepository.edu/data",
            "homepage_url": "http://someckanrepository.edu",
            "set": "",
            "thumbnail": "http://someckanrepository.edu/logo.png",
            "type": "ckan",
            "repo_refresh_days": 7,
            "update_log_after_numitems": 2000,
            "item_url_pattern": "https://someckanrepository.edu/dataset/%id%",
            "enabled": true
        },
        {
            "name": "Some MarkLogic Repository",
            "url": "https://somemarklogicrepository.ca/search",
            "homepage_url": "https://search2.somemarklogicrepository.ca/",
            "item_url_pattern": "https://search2.somemarklogicrepository.ca/#/details?uri=%2Fodesi%2F%id%",
            "thumbnail": "http://somemarklogicrepository.ca/logo.png",
            "collection": "cora",
            "options": "odesi-opts2",
            "type": "marklogic",
            "enabled": true
        },
        {
            "name": "Some Socrata Repository",
            "url": "data.somesocratarepository.ca",
            "homepage_url": "https://somesocratarepository.ca",
            "set": "",
            "thumbnail": "http://somesocratarepository.ca/logo.png",
            "item_url_pattern": "https://data.somesocratarepository.ca/d/%id%",
            "type": "socrata",
            "enabled": true
        },
        {
            "name": "Some CSW Repository",
            "url": "https://somecswrepository.edu/geonetwork/srv/eng/csw",
            "homepage_url": "https://somecswrepository.edu",
            "item_url_pattern": "https://somecswrepository.edu/geonetwork/srv/eng/catalog.search#/metadata/%id%",
            "type": "csw",
            "enabled": true
        },
        {
            "name": "Some DataStream Repository",
            "url": "https://somedatastreamrepository.org/dataset/sitemap.xml",
            "homepage_url": "https://somedatastreamrepository.org/",
            "item_url_pattern": "https://somedatastreamrepository.org/dataset/%id%",
            "type": "datastream",
            "enabled": true
        },
        {
            "name": "Some OpenDataSoft Repository",
            "url": "https://someopendatasoftrepository.ca/api/datasets/1.0/search",
            "homepage_url": "someopendatasoftrepository.ca",
            "type": "opendatasoft",
            "item_url_pattern": "https://someopendatasoftrepository.ca/explore/dataset/%id%",
            "enabled": true
        },
        {
            "name": "Some Dataverse Repository",
            "url": "https://somedataverserepository.ca/api/dataverses/%id%/contents",
            "homepage_url": "somedataverserepository.ca",
            "type": "dataverse",
            "enabled": true
        }
    ]
}

Right now, supported OAI metadata types are Dublin Core ("OAI-DC" is assumed by default and does not need to be specified), DDI, and FGDC.

You can call the crawler directly, which will run once, crawl all of the target domains, export metadata, and exit, by using globus_harvester.py. You can also run it with --onlyharvest or --onlyexport if you want to skip the metadata export or crawling stages, respectively. You can also use --only-new-records to only export records that have changed since the last run. Supported database types are "sqlite" and "postgres"; the psycopg2 library is required for postgres support. Supported export formats in the config file are gmeta and xml; XML export requires the dicttoxml library.

Requires the Python libraries docopt, sickle, requests, owslib, sodapy, and ckanapi. Should work on 2.7+ and 3.x.

Name		Name	Last commit message	Last commit date
Latest commit History 714 Commits
admin		admin
conf		conf
data		data
harvester		harvester
initscripts		initscripts
schema		schema
sql		sql
templates		templates
.gitignore		.gitignore
LICENSE		LICENSE
Pipfile		Pipfile
globus_harvester.py		globus_harvester.py
index_admin.py		index_admin.py
readme.md		readme.md
regression.sh		regression.sh
requirements.txt		requirements.txt
restapi.py		restapi.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

About

Releases

Packages

Contributors 5

Languages

License

axfelix/frdr_harvest

Folders and files

Latest commit

History

Repository files navigation

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 5

Languages

Packages