Commit: Move Documentation to docs. Add Workflow for release. Initial Version of README
Showing 12 changed files with 812 additions and 8 deletions.
@@ -0,0 +1,36 @@
name: Documentation

on:
  push:
    branches:
      - main
    tags: "v**"
    paths:
      - 'docs/**'
      - '.github/workflows/docs.yml'
  workflow_dispatch:

jobs:
  docs:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          token: ${{ secrets.SDQ_TOKEN }}
      - uses: actions/checkout@v4
        with:
          repository: ${{ github.repository }}.wiki
          path: wiki
          token: ${{ secrets.SDQ_TOKEN }}

      - name: Remove contents in Wiki
        working-directory: wiki
        run: ls -A1 | grep -v '.git' | xargs rm -r

      - name: Copy Wiki from Docs folder
        run: cp -r ./docs/. ./wiki

      - name: Deploy 🚀
        uses: stefanzweifel/git-auto-commit-action@v5
        with:
          repository: wiki
@@ -0,0 +1,27 @@
name: Deploy to GitHub
on:
  workflow_dispatch:
  release:
    types: [created, published]
jobs:
  publish-release-artifact:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: joshlong/java-version-export-github-action@v28
        id: jve

      - name: Java without Cache
        uses: actions/setup-java@v4
        with:
          java-version: ${{ steps.jve.outputs.java_major_version }}
          distribution: 'temurin'

      - name: Build Metrics
        run: mvn -U -B clean package

      - name: Attach CLI to Release on GitHub
        uses: softprops/action-gh-release@v2
        with:
          files: cli/target/metrics-cli.jar
          fail_on_unmatched_files: true
@@ -1,9 +1,27 @@
-# Metrics
-This repository contains tools to calculate several metrics.
+# ArDoCo: Metrics Calculator
+Welcome to the **ArDoCo Metrics Calculator** project! This tool provides functionality to calculate and aggregate **classification** and **rank metrics** for various machine learning and ranking tasks.

-## Metrics Module
+The [Wiki](https://github.com/ArDoCo/Metrics/wiki) contains all the necessary information to use the **ArDoCo Metrics Calculator** via multiple interfaces, including a library, REST API, and command-line interface (CLI).

-## CLI
+## Quickstart

-## REST
+To use this project as a Maven dependency, you need to include the following dependency in your `pom.xml` file:

+```xml
+<dependency>
+    <groupId>io.github.ardoco</groupId>
+    <artifactId>metrics</artifactId>
+    <version>${revision}</version>
+</dependency>
+```

+To use the CLI, run the following command:

+```shell
+java -jar metrics-cli.jar -h
+```

+To use the REST API via Docker, start the server with the following command:
+```shell
+docker run -it -p 8080:8080 ghcr.io/ardoco/metrics:latest
+```
@@ -0,0 +1,17 @@
<assembly xmlns="http://maven.apache.org/ASSEMBLY/2.1.1"
          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xsi:schemaLocation="http://maven.apache.org/ASSEMBLY/2.1.1 https://maven.apache.org/xsd/assembly-2.1.1.xsd">
    <id>cli</id>
    <formats>
        <format>jar</format>
    </formats>
    <includeBaseDirectory>false</includeBaseDirectory>
    <dependencySets>
        <dependencySet>
            <outputDirectory>/</outputDirectory>
            <useProjectArtifact>true</useProjectArtifact>
            <unpack>true</unpack>
            <scope>runtime</scope>
        </dependencySet>
    </dependencySets>
</assembly>
@@ -0,0 +1,43 @@
In addition to calculating individual metrics for classification and ranking tasks, the system supports the **aggregation** of results across multiple classifications or rank-based results. Aggregation methods allow users to compute overall metrics that represent the combined performance of several tasks.

## Aggregation Types

The following **Aggregation Types** are supported for both classification and rank metrics:

1. **Macro Average**: This type of aggregation computes the average of the metrics for each class or query, giving equal weight to each.
   - **Use Case**: Useful when all classes or queries are equally important, regardless of how many instances belong to each class.

2. **Micro Average**: This method aggregates by counting the total true positives, false positives, and false negatives across all classes or queries, then computes the metrics globally.
   - **Use Case**: Useful when classes or queries have an uneven number of instances and you want to prioritize overall accuracy over individual class performance.

3. **Weighted Average**: In this method, the average is computed with weights, typically proportional to the number of instances in each class or query.
   - **Use Case**: Useful when certain classes or queries are more important and should contribute more to the overall metrics.
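To make the difference between the three strategies concrete, here is a minimal, self-contained Java sketch. The `Counts` record and all names are illustrative only (not the ArDoCo API); it aggregates precision over several per-task confusion matrices.

```java
import java.util.List;

/** Illustration of macro, micro, and weighted averaging of precision (hypothetical names). */
public class AggregationSketch {

    /** Per-task (or per-class) confusion-matrix counts. */
    record Counts(int tp, int fp, int fn) {
        double precision() { return tp + fp == 0 ? 0.0 : (double) tp / (tp + fp); }
        int support() { return tp + fn; } // number of actual positive instances
    }

    /** Macro average: mean of the per-task precisions, every task weighted equally. */
    static double macroPrecision(List<Counts> tasks) {
        return tasks.stream().mapToDouble(Counts::precision).average().orElse(0.0);
    }

    /** Micro average: pool the raw counts first, then compute precision once globally. */
    static double microPrecision(List<Counts> tasks) {
        int tp = tasks.stream().mapToInt(Counts::tp).sum();
        int fp = tasks.stream().mapToInt(Counts::fp).sum();
        return tp + fp == 0 ? 0.0 : (double) tp / (tp + fp);
    }

    /** Weighted average: per-task precisions weighted by the number of instances (support). */
    static double weightedPrecision(List<Counts> tasks) {
        double totalSupport = tasks.stream().mapToInt(Counts::support).sum();
        return tasks.stream()
                .mapToDouble(t -> t.precision() * t.support() / totalSupport)
                .sum();
    }

    public static void main(String[] args) {
        // A small task (10 instances) and a large task (1000 instances).
        List<Counts> tasks = List.of(new Counts(9, 1, 1), new Counts(500, 400, 500));
        System.out.printf("macro=%.3f micro=%.3f weighted=%.3f%n",
                macroPrecision(tasks), microPrecision(tasks), weightedPrecision(tasks));
        // Macro treats both tasks equally; micro and weighted are dominated by the large task.
    }
}
```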
## Aggregation for Classification Metrics

The **AggregatedClassificationResult** class aggregates results from multiple classification tasks. It combines metrics like precision, recall, and F1-score across multiple classification results and calculates an overall score using one of the aggregation methods mentioned above.

Key Metrics Aggregated:
- **Precision**
- **Recall**
- **F1-Score**
- **Accuracy (if available)**
- **Specificity (if available)**
- **Phi Coefficient (if available)**
- **Phi Coefficient Max (if available)**
- **Phi Over Phi Max (if available)**

**Example:**
If you perform multiple classification tasks and want a single precision or recall score, the **macro average** would treat each classification equally, while the **weighted average** would account for the number of instances in each task.
## Aggregation for Rank Metrics

The **AggregatedRankMetricsResult** class aggregates results from multiple ranking tasks. It computes an overall **Mean Average Precision (MAP)**, **LAG**, and **AUC** by combining the results of each individual rank task.

Key Metrics Aggregated:
- **Mean Average Precision (MAP)**
- **LAG**
- **AUC (if available)**

**Example:**
For search or ranking tasks, you might aggregate the **MAP** scores of multiple queries to get a single performance measure for the ranking system across all queries.
@@ -0,0 +1,53 @@
The classification metrics calculator is responsible for computing various classification performance metrics based on input classifications and ground truth data.

## Input

1. **Classification**: A set of classified elements.
2. **Ground Truth**: A set representing the actual classification labels for comparison.
3. **String Provider Function (optional)**: A function that converts classification and ground truth elements into string representations for comparison purposes.
4. **Confusion Matrix Sum (optional)**: The sum of the confusion matrix values (true positives, false positives, etc.). Some metrics may not be calculated if this is not provided.

:warning: Classification result entries have to match the entries in the ground truth (via `equals`).
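As a rough illustration of how these inputs relate to the confusion matrix, the following Java sketch (hypothetical names, not the ArDoCo API) maps both sets to strings with a provider function, takes the intersection as true positives, the set differences as false positives and false negatives, and derives true negatives only when a confusion matrix sum is supplied.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.Function;

/** Hypothetical sketch: deriving TP/FP/FN (and optionally TN) from a classification
 *  set and a ground truth set. Not the ArDoCo API. */
public class ConfusionMatrixSketch {

    static <T> Set<String> asStrings(Set<T> elements, Function<T, String> stringProvider) {
        Set<String> result = new HashSet<>();
        elements.forEach(e -> result.add(stringProvider.apply(e)));
        return result;
    }

    public static void main(String[] args) {
        Set<String> classification = asStrings(Set.of("a", "b", "c"), Function.identity());
        Set<String> groundTruth    = asStrings(Set.of("b", "c", "d"), Function.identity());

        Set<String> truePositives = new HashSet<>(classification);
        truePositives.retainAll(groundTruth);      // classified AND in the ground truth

        Set<String> falsePositives = new HashSet<>(classification);
        falsePositives.removeAll(groundTruth);     // classified but not in the ground truth

        Set<String> falseNegatives = new HashSet<>(groundTruth);
        falseNegatives.removeAll(classification);  // in the ground truth but missed

        long confusionMatrixSum = 10;              // optional input: total number of candidates
        long trueNegatives = confusionMatrixSum
                - truePositives.size() - falsePositives.size() - falseNegatives.size();

        System.out.printf("TP=%d FP=%d FN=%d TN=%d%n",
                truePositives.size(), falsePositives.size(), falseNegatives.size(), trueNegatives);
    }
}
```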
## Supported Metrics

The system calculates a variety of standard classification metrics:

1. **Precision**: Measures the accuracy of the positive predictions.

   $$\text{Precision} = \frac{TP}{TP + FP}$$

   Where:
   - $TP$ is the number of true positives.
   - $FP$ is the number of false positives.

2. **Recall**: Also known as sensitivity, recall measures the ability to find all positive instances.

   $$\text{Recall} = \frac{TP}{TP + FN}$$

   Where:
   - $FN$ is the number of false negatives.

3. **F1-Score**: A harmonic mean of precision and recall, providing a single score that balances both concerns.

   $$F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

4. **Accuracy (optional)**: Measures the proportion of correctly predicted instances (if true negatives are provided).

   $$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

5. **Specificity (optional)**: Also called true negative rate, it measures the proportion of actual negatives that are correctly identified.

   $$\text{Specificity} = \frac{TN}{TN + FP}$$

6. **Phi Coefficient (optional)**: A measure of the degree of association between two binary variables.

   $$\Phi = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

7. **Phi Coefficient Max (optional)**: The maximum possible value for the phi coefficient.

8. **Phi Over Phi Max (optional)**: The ratio of the phi coefficient to its maximum possible value.
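The formulas above translate directly into code. The following Java sketch (illustrative only, not the ArDoCo API) computes them from an example confusion matrix; the phi-coefficient maximum is omitted because its formula is not spelled out here.

```java
/** Hypothetical sketch of the classification formulas above; not the ArDoCo API. */
public class ClassificationMetricsSketch {

    public static void main(String[] args) {
        double tp = 40, fp = 10, fn = 20, tn = 30; // example confusion matrix

        double precision   = tp / (tp + fp);
        double recall      = tp / (tp + fn);
        double f1          = 2 * precision * recall / (precision + recall);
        double accuracy    = (tp + tn) / (tp + tn + fp + fn);  // requires TN
        double specificity = tn / (tn + fp);                   // requires TN

        // Phi coefficient for the 2x2 confusion matrix.
        double phi = (tp * tn - fp * fn)
                / Math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn));

        System.out.printf(
                "precision=%.3f recall=%.3f f1=%.3f accuracy=%.3f specificity=%.3f phi=%.3f%n",
                precision, recall, f1, accuracy, specificity, phi);
    }
}
```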
Each result includes a human-readable format that logs the computed metrics for ease of debugging and verification.
@@ -0,0 +1,51 @@
Welcome to the **ArDoCo Metrics Calculator** project! This tool provides functionality to calculate and aggregate **classification** and **rank metrics** for various machine learning and ranking tasks.

This Wiki contains all the necessary information to use the **ArDoCo Metrics Calculator** via multiple interfaces, including a library, REST API, and command-line interface (CLI).

## 1. Classification Metrics

This section provides detailed information about how to calculate **classification metrics** such as precision, recall, F1-score, and more. The classification metrics are essential for evaluating the performance of classification models by comparing the predicted results with the ground truth.

[Read more about Classification Metrics](Classification-Metrics)

## 2. Rank Metrics

The rank metrics module helps you calculate metrics for ranked results, such as **Mean Average Precision (MAP)**, **LAG**, and **AUC**. These metrics are useful for evaluating ranking systems, search engines, or recommendation systems.

[Read more about Rank Metrics](Rank-Metrics)

## 3. Aggregation of Metrics

Aggregation allows you to compute an overall metric from multiple classification or ranking tasks. This can be useful when you want to combine results from several tasks to get a single evaluation score.

[Read more about Aggregation of Metrics](Aggregation-of-Metrics)

## 4. Usage

### 4.1 Usage via Library

The **ArDoCo Metrics Calculator** can be integrated into your project as a library. This section provides instructions for adding the project as a Maven dependency and examples of how to calculate metrics programmatically.

[Read more about Usage via Library](Usage-Via-Library)

### 4.2 Usage via REST API

The project offers a REST API for calculating metrics. You can send HTTP requests to the API to compute both classification and rank metrics, as well as aggregate results across tasks. Swagger documentation is provided for easy testing and interaction.

[Read more about Usage via REST API](Usage-Via-REST-API)

### 4.3 Usage via CLI

For users who prefer using a command-line interface, the project offers CLI commands for calculating and aggregating metrics. This section provides detailed instructions and examples on how to use the CLI for different tasks.

[Read more about Usage via CLI](Usage-Via-CLI)
@@ -0,0 +1,58 @@
The rank metrics calculator computes performance metrics for systems that provide ranked results, such as search engines or recommendation systems. These metrics are based on the comparison between the provided ranked results and the ground truth data.

## Input

1. **Ranked Results**: A list of sorted lists, where each list represents the ranked results for one query or item (with the most relevant items first).
2. **Ground Truth**: A set of items representing the correct or ideal results for the given queries or items.
3. **String Provider Function**: A function that converts the ranked results and ground truth elements into string representations for comparison purposes.
4. **Relevance-Based Input (optional)**: Contains relevance scores associated with each ranked result. This input is used for relevance-based calculations, allowing the ranking system to incorporate degrees of relevance.
## Supported Metrics

The rank metrics calculator computes the following key metrics:

1. **Mean Average Precision (MAP)**: This metric computes the average precision for each query and then averages those precision values over all queries. It provides a single score that summarizes the quality of the ranked results.

   $$\text{MAP} = \frac{1}{N} \sum_{i=1}^{N} \text{AveragePrecision}(i)$$

   Where:
   - $N$ is the number of queries.
   - $\text{AveragePrecision}(i)$ is the average of the precision scores at each relevant document for query $i$. It is calculated by considering only the positions where relevant items are retrieved and averaging the precision at those points.

   $$\text{AveragePrecision}(i) = \frac{\sum_{r=1}^{|retrieved_i|} (precision_i(r) \times relevant_i(r))}{|relevantLinks_i|}$$

   Where:
   - $|retrieved|$ is the number of retrieved links for a query
   - $r$ is the rank in the produced list
   - $precision(r)$ is the *precision* of the list if truncated after rank $r$
   - $relevant(r)$ is a binary function that determines whether the link at rank $r$ is valid (1) or not (0)
   - $|relevantLinks|$ is the total number of links that are relevant for this query according to the gold standard

2. **LAG**: LAG measures the distance (lag) between the position of relevant items in the ranked results and their ideal positions (i.e., as close to the top as possible). It helps assess how well the system ranks relevant items near the top.

   $$\text{LAG} = \frac{1}{N} \sum_{i=1}^{N} \text{Lag}(i)$$

   Where:
   - $\text{Lag}(i)$ is the average lag for query $i$.

   Lag measures how many incorrect links are retrieved above each correct link. For example, if the relevant item should ideally be at position 1 but is ranked at position 3, the lag for that item is 2. The lag is averaged over all relevant documents for query $i$ to compute $\text{Lag}(i)$. A code sketch after this list illustrates how MAP and LAG are computed.

3. **ROC (Receiver Operating Characteristic) Curve (optional)**

   The **ROC curve** is a graphical representation of a classification model's performance across different decision thresholds. It plots:

   - **True Positive Rate (TPR)**, or **Recall**, on the **y-axis**: $\text{TPR} = \frac{TP}{TP + FN}$, where $TP$ is the number of true positives and $FN$ is the number of false negatives.
   - **False Positive Rate (FPR)** on the **x-axis**: $\text{FPR} = \frac{FP}{FP + TN}$, where $FP$ is the number of false positives and $TN$ is the number of true negatives.

   Each point on the ROC curve corresponds to a different threshold used by the classifier to distinguish between positive and negative predictions. By adjusting the threshold, the TPR and FPR values change, and the ROC curve shows how well the classifier separates the positive from the negative class.

4. **Area Under Curve (AUC) of the Receiver Operating Characteristic (ROC) (optional)**: AUC measures the ability of the system to discriminate between relevant and non-relevant items. The AUC value ranges from 0 to 1, where 1 indicates perfect discrimination.

   $$\text{AUC} = \int_0^1 \text{TPR}(\text{FPR})\ d\text{FPR}$$

   Where $\text{TPR}$ is the true positive rate and $\text{FPR}$ is the false positive rate.
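To make the MAP and LAG definitions concrete, here is a compact Java sketch (hypothetical names, not the ArDoCo API). It evaluates ranked result lists against per-query ground-truth sets; ROC/AUC is omitted because it additionally requires relevance scores and thresholds.

```java
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of MAP and LAG over ranked results; not the ArDoCo API. */
public class RankMetricsSketch {

    /** Average precision of one ranked list against the relevant items for that query. */
    static double averagePrecision(List<String> ranked, Set<String> relevant) {
        double sum = 0;
        int hits = 0;
        for (int r = 1; r <= ranked.size(); r++) {
            if (relevant.contains(ranked.get(r - 1))) {  // relevant(r) == 1
                hits++;
                sum += (double) hits / r;                // precision of the list truncated at rank r
            }
        }
        return relevant.isEmpty() ? 0.0 : sum / relevant.size(); // divide by |relevantLinks|
    }

    /** Average lag of one ranked list: for each relevant item, count the incorrect items
     *  ranked above it; here the lag is averaged over the relevant items found in the list. */
    static double lag(List<String> ranked, Set<String> relevant) {
        int irrelevantSeen = 0;
        int found = 0;
        long totalLag = 0;
        for (String item : ranked) {
            if (relevant.contains(item)) {
                found++;
                totalLag += irrelevantSeen;
            } else {
                irrelevantSeen++;
            }
        }
        return found == 0 ? 0.0 : (double) totalLag / found;
    }

    public static void main(String[] args) {
        // Two queries: ranked results plus the ground-truth (relevant) items per query.
        List<List<String>> rankedPerQuery = List.of(
                List.of("x", "a", "b", "c"),   // relevant: a, c
                List.of("d", "y", "z", "e"));  // relevant: d, e
        List<Set<String>> relevantPerQuery = List.of(Set.of("a", "c"), Set.of("d", "e"));

        double map = 0, meanLag = 0;
        int n = rankedPerQuery.size();
        for (int i = 0; i < n; i++) {
            map += averagePrecision(rankedPerQuery.get(i), relevantPerQuery.get(i));
            meanLag += lag(rankedPerQuery.get(i), relevantPerQuery.get(i));
        }
        System.out.printf("MAP=%.3f LAG=%.3f%n", map / n, meanLag / n);
    }
}
```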