Skip to content

Schema Extensions

Robert J. Gifford edited this page Oct 22, 2024 · 1 revision

GLUE Schema Extensions Overview

GLUE's core schema is designed to organize virus nucleotide data, serving as the foundation for the standard, built-in analysis functionality across all GLUE projects. While the core schema is essential, many analyses require additional, project-specific data items that are not part of the core schema. These auxiliary data objects often have structured relationships with the core schema, enhancing the value of the nucleotide data.

To accommodate these auxiliary data, GLUE allows schema extensions. Schema extensions enable the addition of new fields to existing objects, the creation of custom tables (new object types), and the establishment of custom relational links between tables. This flexible mechanism allows users to extend the database schema on a per-project basis, using a streamlined set of concepts and commands.

Potential Applications of Schema Extensions

Schema extensions in GLUE can be applied to model various types of data, including:

  • Host species of the virus, grouped taxonomically.
  • Patient-related data, such as sequences collected from the same patient at different times.
  • Associations between sequences originating from the same sample, e.g., different segments of a segmented virus.

Example: Lentivirus-GLUE m49 Schema Extension

The Lentivirus-GLUE project illustrates the use of schema extensions. In this project, several custom fields are added to the standard GLUE Sequence table. Additionally, each Sequence can be linked to a country of origin, a custom object type defined within the schema extension. Countries are further classified into geographical regions, based on the United Nations M49 standard. This standard, also known as the "UN M49," assigns numeric codes to countries, areas, and regions for statistical analysis and comparison. These codes are widely used in international data exchange, statistical reporting, and analysis. The M49 classification includes:

  • Country codes: Numeric codes assigned to individual countries.
  • Area codes: Codes assigned to regions or areas that may encompass multiple countries.
  • Regional groupings: Codes for groups of countries based on geographical, economic, or other criteria.

For example, the United States has an M49 code of 840, and the European Union has a code of 155. These codes facilitate the organization and comparison of statistical data across different countries and regions. Incorporating the M49 standard into GLUE projects allows for standardised analyses of virus sequence variation by global region of isolation.

The Entity-Relationship diagram below demonstrates the schema extension used in Lentivirus-GLUE.

Lentivirus-GLUE schema extensions

Key Aspects of Schema Extensions

The key elements involved in working with schema extensions are:

Defining Custom Tables and Fields

A project's schema is extended by executing commands in the Schema mode. In the Lentivirus-GLUE project, all schema extension commands are executed from the exampleSchemaExtensions.glue file, with M.49-related extensions handled in the m49SchemaExtension.glue file.

For example, to define custom tables for the regions, sub-regions, intermediate regions, and countries as per M.49, the following commands are used:

create custom-table m49_region
create custom-table m49_sub_region
create custom-table m49_intermediate_region
create custom-table m49_country

Each custom table can contain data objects that exist alongside standard GLUE objects like Sequences and Alignments.

Custom tables must have an ID field, which serves as the unique identifier for each row. Additional fields are created using the create field command. For instance, the m49_country table might be defined as follows:

table m49_country
  create field m49_code INTEGER
  create field display_name VARCHAR 100
  create field full_name VARCHAR 100
  create field is_ldc BOOLEAN
  create field is_lldc BOOLEAN
  create field is_sids BOOLEAN
  create field development_status VARCHAR 20
exit

Custom fields can also be added to standard GLUE object types, using the same create field command within the Table mode. In Lentivirus-GLUE, several custom fields are added to the Sequence table:

The Lentivirus-GLUE project contains a schema extension related which we will use for illustration. Firstly, various custom fields are added to the standard GLUE Sequence table. In addition Sequences may each be linked with a country of origin, which is defined as a custom object type. Countries are classified and grouped into geographical regions (also custom objects) based on the United Nations M.49 system. This could be used for example to analyse sequence variation in association with global region of origin. The schema extension is shown in the Entity-Relationship diagram below.

table sequence
  create field gb_gi_number VARCHAR
  create field gb_primary_accession VARCHAR
  create field gb_accession_version VARCHAR
  create field gb_locus VARCHAR
  create field gb_organism VARCHAR
  create field gb_create_date DATE
  create field gb_update_date DATE
  create field gb_taxonomy VARCHAR 200
  create field gb_pubmed_id VARCHAR
  create field length INTEGER
  create field isolate VARCHAR
  create field host_species VARCHAR 200
  create field genotype VARCHAR 10
  create field subtype VARCHAR 10
exit

Populating Custom Tables and Fields

Once defined, custom tables and fields can be populated with data using GLUE commands. For example, to add a new country "Ruritania" (ISO code RUR) to the m49_country table:

GLUE> project example
GLUE> create custom-table-row m49_country RUR
GLUE> custom-table-row m49_country RUR
GLUE> set field full_name 'Kingdom of Ruritania'
GLUE> set field m49_code 998
GLUE> set field display_name Ruritania
GLUE> set field development_status developed

Querying Custom Tables and Fields

You can query custom objects and fields using commands such as show property and list property. For example, to query the properties of Ruritania:

GLUE> show property development_status
GLUE> list property

To list custom objects, like M.49 sub-regions, and configure result columns:

GLUE> list custom-table-row m49_sub_region id m49_code display_name

Defining Custom Relational Links

Relational links model relationships between objects, such as linking a Sequence to its country of origin. Links are created using the create link command, specifying the source and destination tables:

create link sequence m49_country --multiplicity MANY_TO_ONE

Additional links can be defined, such as:

create link m49_country m49_region --multiplicity MANY_TO_ONE
create link m49_country m49_sub_region --multiplicity MANY_TO_ONE

These links align with the associations shown in the schema extension's entity-relationship diagram.

Source and Destination Link Names

Relational links have source and destination link names, which are used for referencing the link in commands. For example, to filter sequences by country of origin:

GLUE> list sequence --whereClause "m49_country.m49_region.id = 'europe'" sequenceID m49_country.display_name

This command selects sequences where the country belongs to the Europe region, displaying their sequence IDs and country names.

Populating Link Targets

Once a link is defined, objects in the project database can be associated via the link using specific GLUE commands in custom object mode.

Setting Link Targets on the "ONE" End

The set link-target command associates an object with the "ONE" end of a link. For example, to associate the country Ruritania with the Eastern Europe sub-region:

Mode path: /project/example
GLUE> custom-table-row m49_country RUR
OK
Mode path: /project/example/custom-table-row/m49_country/RUR
GLUE> set link-target m49_sub_region custom-table-row/m49_sub_region/eastern_europe 
OK
(1 CustomTableObject_example_m49_country updated)

GLUE does not enforce any minimum multiplicity on custom link targets, so objects can have no associated link target. To undo an association, use the unset link-target command:

Mode path: /project/example/custom-table-row/m49_country/RUR
GLUE> unset link-target m49_sub_region 
OK
(1 CustomTableObject_example_m49_country updated)

Adding Link Targets on the "MANY" End

The add link-target command adds an object to the "MANY" end of a link. For instance, to associate the Eastern Europe sub-region with Ruritania:

Mode path: /project/example
GLUE> custom-table-row m49_sub_region eastern_europe 
OK
Mode path: /project/example/custom-table-row/m49_sub_region/eastern_europe
GLUE> add link-target m49_country custom-table-row/m49_country/RUR
OK
(1 CustomTableObject_example_m49_sub_region updated)

Querying Link Targets

Querying the "ONE" End

To display the target on the "ONE" end of a link, use the show property command:

Mode path: /project/example/custom-table-row/m49_country/RUR
GLUE> show property m49_sub_region 
propertyValueResult
  property: m49_sub_region
  value: custom-table-row/m49_sub_region/eastern_europe

Querying the "MANY" End

To list the targets on the "MANY" end of a link, use the list link-target command:

Mode path: /project/example/custom-table-row/m49_sub_region/eastern_europe
GLUE> list link-target m49_country id display_name
+=====+==============+
| id  | display_name |
+=====+==============+
| BGR | Bulgaria     |
| BLR | Belarus      |
| CZE | Czechia      |
| HUN | Hungary      |
| MDA | Moldova      |
| POL | Poland       |
| ROU | Romania      |
| RUR | Ruritania    |
| RUS | Russia       |
| SVK | Slovakia     |
| UKR | Ukraine      |
+=====+==============+
CustomTableObject_example_m49_countrys found: 11

Alternatively, you can achieve a similar result using the list custom-table-row command with a --whereClause:

Mode path: /project/example
GLUE> list custom-table-row m49_country --whereClause "m49_sub_region.id = 'eastern_europe'" id display_name 

Bulk Population of Schema Extensions

While fine-grained commands are useful for testing and small-scale operations, bulk population of field values and link targets is often more practical for larger datasets.

Multi-Set Field Command

The multi-set field command allows setting a field value across a group of objects selected by a whereClause.

Populating from GenBank XML and Tabular Data

Field values and link targets can be populated from GenBank XML using the genbankXmlPopulator module or from tabular data using the textFilePopulator module.

Scripting for Bulk Operations

GLUE's scripting layer provides flexibility for bulk operations. For example, the following JavaScript program populates the m49_sub_region table using a tabular data file:

var subRegions;
glue.inMode("module/exampleTabularUtilityCsv", function() {
	subRegions = glue.tableToObjects(glue.command(["load-tabular", "tabular/m49_countries/m49_sub_regions.csv"]));
});
_.each(subRegions, function(subRegion) {
	var subRegionName = subRegion["Sub-region Name"].trim();
	var subRegionId = subRegionName.toLowerCase().replace(/[ -]/g, "_");
	var m49Code = subRegion["Sub-region Code"].trim();
	var regionName = subRegion["Region Name"].trim();
	var regionId = regionName.toLowerCase();
	glue.command(["create", "custom-table-row", "m49_sub_region", subRegionId]);
	glue.inMode("custom-table-row/m49_sub_region/"+subRegionId, function() {
		glue.command(["set", "field", "display_name", subRegionName]);
		glue.command(["set", "field", "m49_code", m49Code]);
		glue.command(["set", "link-target", "m49_region", "custom-table-row/m49_region/"+regionId]);
	});
});

This approach allows you to automate and efficiently populate schema extensions, ensuring consistency and reducing the potential for manual errors.


Clone this wiki locally