Description
In the test score calculation, the authors use the test distribution to scale the test score per node. This is not common practice in anomaly detection, since test statistics are usually unknown at evaluation time.
I refactored the code to use the validation set statistics to scale both the validation and test scores; however, there is a huge performance drop, as shown in the attached screenshot. The blue curve is the original implementation, with the test score scaled by the test distribution. The purple curve is scaled by the validation statistics. For the black curve, the test score is first scaled by the statistics of the first 1000 test samples (roughly 1/9 of the test set), gradually increased to 8/9, yet the F1 score barely changes. The final jump in the black curve occurred when I switched to the full test statistics instead of 8/9.
The problem persists with ROC-AUC as well: with the test score scaled by the validation statistics, the AUC is around 0.52 at the best epoch.

This work is the foundation of many GNN-for-time-series papers, so I urge attention to this issue.
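For clarity, the AUC above is the standard ROC-AUC over per-timestamp scores. A minimal sketch of how it can be computed, assuming the common max-over-nodes aggregation of the smoothed error scores and binary ground-truth labels (both variables here are hypothetical placeholders, not names from the repo):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# scores: (n_nodes, n_timestamps) smoothed error scores;
# labels: (n_timestamps,) binary ground truth (1 = anomaly).
point_scores = np.max(scores, axis=0)  # one anomaly score per timestamp
print(roc_auc_score(labels, point_scores))
```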
The scoring code in question:

```python
scores = get_err_scores(test_re_list, val_re_list)
normal_dist = get_err_scores(val_re_list, val_re_list)
```

```python
import numpy as np

def get_err_scores(test_res, val_res):
    test_predict, test_gt = test_res
    val_predict, val_gt = val_res  # unpacked but never used below

    # NOTE: the scaling statistics (median and IQR of the absolute error)
    # are computed from the *test* predictions; this is the leakage
    # described above.
    n_err_mid, n_err_iqr = get_err_median_and_iqr(test_predict, test_gt)

    test_delta = np.abs(np.subtract(
        np.array(test_predict).astype(np.float64),
        np.array(test_gt).astype(np.float64),
    ))
    epsilon = 1e-2

    err_scores = (test_delta - n_err_mid) / (np.abs(n_err_iqr) + epsilon)

    # Moving average over the current point and the 3 preceding ones;
    # the first before_num entries are left at zero.
    smoothed_err_scores = np.zeros(err_scores.shape)
    before_num = 3
    for i in range(before_num, len(err_scores)):
        smoothed_err_scores[i] = np.mean(err_scores[i - before_num:i + 1])
    return smoothed_err_scores
```
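For reference, this is a minimal sketch of the refactor described above: the median/IQR are estimated on the validation residuals only and then applied to the test deltas. The name `get_err_scores_val_scaled` is mine, and the sketch assumes the same `(predicted, ground_truth)` input layout as the quoted function:

```python
import numpy as np

def get_err_scores_val_scaled(test_res, val_res, epsilon=1e-2, before_num=3):
    test_predict, test_gt = test_res
    val_predict, val_gt = val_res

    # Scaling statistics come from the validation residuals only,
    # so no test-time information leaks into the score.
    val_delta = np.abs(np.array(val_predict, dtype=np.float64)
                       - np.array(val_gt, dtype=np.float64))
    err_mid = np.median(val_delta)
    q75, q25 = np.percentile(val_delta, [75, 25])
    err_iqr = q75 - q25

    test_delta = np.abs(np.array(test_predict, dtype=np.float64)
                        - np.array(test_gt, dtype=np.float64))
    err_scores = (test_delta - err_mid) / (np.abs(err_iqr) + epsilon)

    # Same moving-average smoothing as the original implementation.
    smoothed = np.zeros(err_scores.shape)
    for i in range(before_num, len(err_scores)):
        smoothed[i] = np.mean(err_scores[i - before_num:i + 1])
    return smoothed
```

With this variant, both calls above become `get_err_scores_val_scaled(test_re_list, val_re_list)` and `get_err_scores_val_scaled(val_re_list, val_re_list)`, so only validation statistics are ever used for scaling.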
