🧪 Evaluating a Model

This guide walks you through everything needed to run an evaluation job: prerequisites, configuration, launching evaluation, and understanding what happens at each stage.


✅ Prerequisites

🧪 Environment
Ensure all dependencies are installed. The pipeline requires:

  • PyTorch
  • Hugging Face Accelerate

📂 Data
Evaluation expects:

  • A CSV file with a predefined test split column
  • A root directory containing the image data

The exact CSV schema depends on the dataset (EMBED or CSAW).

โš™๏ธ Model configuration
Each model has a YAML configuration file under config/models/. For example, LMV-Net uses:

config/models/lmv_net.yaml

These files define model-specific hyperparameters that extend the base CLI arguments. If no YAML is found, the pipeline falls back to CLI defaults.

🧩 Registration model
ImgFeatAlign and LMV-Net require a pretrained MammoRegNet registration model. Set its path in config/config.py, using the entry that matches the dataset:

  • paths.csaw_path_saved_reg_model (CSAW)
  • paths.embed_path_saved_reg_model (EMBED)

๐Ÿ‹๏ธ Trained model checkpoint
A trained model checkpoint is required. You can use either:

  • The best checkpoint: best_model_risk_prediction_id-{id}.pth
  • The last epoch checkpoint: model_risk_prediction_training_id_{id}_last_epoch.pth

Select between them using --best_model True/False.


🚀 Running Evaluation

Each model has a dedicated shell script that sets all required arguments:

bash scripts/test_lmv_net.sh
bash scripts/test_imgfeatalign.sh
bash scripts/test_vmra_mar.sh
bash scripts/test_oa_breacr.sh
bash scripts/test_mirai.sh

👉 Use accelerate launch (instead of python) to enable multi-GPU evaluation.


🧾 CLI Arguments

🔴 Required

Argument            Description
--model             Model name (e.g. Mirai, OA-BreaCR, VMRA-MaR, ImgFeatAlign, LMV-Net)
--dataset           Dataset name: EMBED or CSAW
--csv_file          Path to CSV file containing the test split
--data_root         Root directory of image data
--path_out_dir      Directory where the trained model checkpoint is stored
--path_test_folder  Output directory for evaluation results and logs
--id_training       Training run ID used to resolve the checkpoint filename

🟡 Key Optional

Argument       Default  Description
--batch_size   20       Batch size per GPU
--num_workers  4        DataLoader worker processes
--best_model   -        If True, loads the best checkpoint; otherwise loads the last epoch
--seed         -        Random seed for reproducibility

⚙️ Configuration System

Argument loading occurs in two stages:

  1. CLI arguments are parsed first (including --model)
  2. Model YAML config is loaded from:
config/models/<model_name>.yaml

YAML values are added as CLI defaults and can always be overridden on the command line.
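
The two-stage loading described above can be sketched with argparse alone. This is a minimal illustration, not the pipeline's actual parser: the function names, the rule that maps a model name such as LMV-Net to lmv_net.yaml, and the handling of YAML keys that have no CLI flag are all assumptions. It also assumes the YAML holds simple int, float, or string values.

```python
import argparse
from pathlib import Path


def load_model_yaml(model_name, config_dir="config/models"):
    """Return the model's YAML settings as a dict, or {} when no file exists."""
    # Assumption: "LMV-Net" maps to "lmv_net.yaml" (lowercase, "-" -> "_").
    path = Path(config_dir) / (model_name.lower().replace("-", "_") + ".yaml")
    if not path.is_file():
        return {}  # no YAML found: the plain CLI defaults stay in effect
    import yaml  # PyYAML, imported lazily so the fallback path does not need it
    with path.open() as handle:
        return yaml.safe_load(handle) or {}


def parse_args(argv, yaml_loader=load_model_yaml):
    parser = argparse.ArgumentParser()
    parser.add_argument("--model", required=True)
    parser.add_argument("--batch_size", type=int, default=20)
    parser.add_argument("--num_workers", type=int, default=4)

    # Stage 1: parse the flags declared so far; unknown flags are set aside.
    known, _ = parser.parse_known_args(argv)

    # Stage 2: declare model-specific keys, install the YAML values as
    # defaults, then parse again so explicit command-line values still win.
    yaml_values = yaml_loader(known.model)
    for key, value in yaml_values.items():
        if key not in vars(known):
            parser.add_argument(f"--{key}", type=type(value))
    parser.set_defaults(**yaml_values)
    return parser.parse_args(argv)
```

Passing the loader as a parameter keeps the precedence rule (command line over YAML over built-in default) easy to check without touching the file system.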


🔄 Evaluation Pipeline

When evaluation starts, the following steps occur:

1️⃣ Argument Parsing & Setup

main_test.py:

  • Parses CLI arguments
  • Loads YAML config
  • Resolves the checkpoint path based on --id_training and --best_model:
best_model_risk_prediction_id-{id}.pth           # if --best_model True
model_risk_prediction_training_id_{id}_last_epoch.pth  # otherwise
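
As a rough sketch, the checkpoint resolution above amounts to choosing one of the two filename patterns. The function names below are hypothetical, and str_to_bool is an assumed helper: it is shown because argparse's type=bool treats any non-empty string, including "False", as true, so a flag passed as True/False needs explicit parsing.

```python
import argparse
from pathlib import Path


def str_to_bool(text):
    """Parse 'True'/'False' style values explicitly (hypothetical helper)."""
    lowered = text.strip().lower()
    if lowered in ("true", "1", "yes"):
        return True
    if lowered in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"expected True or False, got {text!r}")


def resolve_checkpoint(path_out_dir, id_training, best_model):
    """Build the checkpoint path from the run ID and the --best_model choice."""
    if best_model:
        name = f"best_model_risk_prediction_id-{id_training}.pth"
    else:
        name = f"model_risk_prediction_training_id_{id_training}_last_epoch.pth"
    return Path(path_out_dir) / name
```

For example, resolve_checkpoint("runs", 7, str_to_bool("False")) points at the last-epoch file for training run 7.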

2๏ธโƒฃ Accelerator Initialisation

A Hugging Face Accelerator is created, enabling:

  • Multi-GPU inference
  • Automatic device placement
  • Distributed tensor gathering across processes

3๏ธโƒฃ ๐Ÿ” Reproducibility

If --seed is provided, seeds are set for:

  • random
  • torch
  • torch.cuda

4๏ธโƒฃ ๐Ÿ“ฆ Data Loading

get_dataset_and_loader() creates a test DataLoader for the test split.

๐Ÿ‘‰ No augmentation is applied during evaluation.


5๏ธโƒฃ ๐Ÿง  Model Loading

Handled by load_model() in evaluate/test_utils.py:

  1. Model is instantiated via models/model_factory.py
  2. Checkpoint is loaded from disk (map_location="cpu")
  3. State dict is extracted; the following checkpoint layouts are supported:
       • Raw state dict
       • Dict with "model" or "state_dict" key
  4. module. prefixes are stripped (from DataParallel wrapping)
  5. Weights are loaded and model is set to eval() mode
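
The state-dict handling in the steps above can be illustrated on plain dictionaries. This is a sketch, not the body of load_model(): both function names are hypothetical, and the order in which the "model" and "state_dict" keys are tried is an assumption.

```python
def extract_state_dict(checkpoint):
    """Return the weight mapping from any of the supported checkpoint layouts."""
    if isinstance(checkpoint, dict):
        for key in ("model", "state_dict"):  # assumed lookup order
            if key in checkpoint:
                return checkpoint[key]
    return checkpoint  # already a raw state dict


def strip_module_prefix(state_dict, prefix="module."):
    """Drop the leading 'module.' that DataParallel adds to parameter names."""
    return {
        (name[len(prefix):] if name.startswith(prefix) else name): tensor
        for name, tensor in state_dict.items()
    }


# Typical use with PyTorch (not executed here):
#   checkpoint = torch.load(checkpoint_path, map_location="cpu")
#   model.load_state_dict(strip_module_prefix(extract_state_dict(checkpoint)))
#   model.eval()
```

Only a leading prefix is removed, so a parameter whose name merely contains "module." further along is left unchanged.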

The model and test loader are then wrapped with:

accelerator.prepare()

6๏ธโƒฃ ๐Ÿ” Inference Loop

Runs under torch.no_grad(). For each batch, the following are collected and gathered across all GPUs:

  • preds โ€” risk predictions from the primary risk head
  • event_times โ€” follow-up times
  • event_observed โ€” event indicators
  • densities โ€” breast density categories
  • cancer_types โ€” cancer type categories
  • races โ€” race IDs (EMBED dataset only)

7๏ธโƒฃ ๐Ÿ“Š Aggregation & Metric Computation

After inference, the main process:

  1. Concatenates all gathered tensors
  2. Computes the censoring distribution from event times and indicators
  3. Saves raw predictions and metadata to disk via save_model_results_to_file()
  4. Computes all metrics with bootstrapped 95% confidence intervals:

     Metric           Stratification
     C-index          Overall
     AUC (years 1–5)  Overall
     AUC              By breast density
     C-index          By breast density
     AUC              By cancer type
     C-index          By cancer type
     AUC              By race (EMBED only)
     C-index          By race (EMBED only)
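
To show what a bootstrapped 95% confidence interval around a C-index involves, here is a small pure-Python sketch. It uses Harrell's concordance index with a percentile bootstrap as a simple stand-in. Because the pipeline computes a censoring distribution, it may use a censoring-weighted estimator instead. The function names, the number of resamples, and the seed are all assumptions.

```python
import random


def harrell_c_index(risks, times, events):
    """Fraction of comparable pairs in which the earlier event has the higher risk."""
    concordant, comparable = 0.0, 0
    for i in range(len(risks)):
        if not events[i]:
            continue  # a censored subject cannot anchor a comparable pair
        for j in range(len(risks)):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # tied predictions count as half
    return concordant / comparable if comparable else float("nan")


def bootstrap_ci(metric, risks, times, events, n_boot=1000, alpha=0.05, seed=0):
    """Return (mean, lower, upper) of the metric over resamples with replacement."""
    rng = random.Random(seed)
    n = len(risks)
    scores = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        score = metric(
            [risks[k] for k in idx],
            [times[k] for k in idx],
            [events[k] for k in idx],
        )
        if score == score:  # skip NaN from resamples with no comparable pair
            scores.append(score)
    if not scores:
        raise ValueError("no resample produced a valid score")
    scores.sort()
    lower = scores[int((alpha / 2) * (len(scores) - 1))]
    upper = scores[int((1 - alpha / 2) * (len(scores) - 1))]
    return sum(scores) / len(scores), lower, upper
```

The same bootstrap_ci wrapper could take any metric with the same three-argument signature, which is how one resampling routine can serve both the overall and the stratified results.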

๐Ÿ“ Output Directory

{path_test_folder}/
  test_risk_prediction_training_id_{id}.log
  c_index_by_density.json
  c_index_by_cancer_type.json
  c_index_by_race.json        # EMBED only

All per-subgroup C-index results are saved as JSON files for downstream analysis.
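
A stratified result file of this kind could be produced roughly as follows. The sketch is hypothetical throughout: the function names, the metric callable, and the JSON layout (one entry per subgroup label) are assumptions, since the actual schema of the files is not specified here.

```python
import json
from collections import defaultdict
from pathlib import Path


def metric_by_subgroup(metric, preds, event_times, event_observed, groups):
    """Apply metric(preds, times, events) separately within each subgroup label."""
    buckets = defaultdict(list)
    for index, label in enumerate(groups):
        buckets[str(label)].append(index)  # JSON object keys must be strings
    return {
        label: metric(
            [preds[k] for k in idx],
            [event_times[k] for k in idx],
            [event_observed[k] for k in idx],
        )
        for label, idx in sorted(buckets.items())
    }


def save_results_json(results, out_dir, filename):
    """Write the per-subgroup results as indented JSON and return the path."""
    out_path = Path(out_dir) / filename
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(json.dumps(results, indent=2))
    return out_path
```

Downstream analysis can then reload a file with json.loads(path.read_text()) and compare subgroups directly.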


📋 Final Results

At the end of evaluation, a structured results dict is printed and written to the log file:

C-index:                Mean + 95% CI
Yearly AUCs:            Years 1–5, Mean + 95% CI
AUC by density:         Per density category, Mean + 95% CI
C-index by density:     Per density category, Mean + 95% CI
AUC by cancer type:     Per cancer type, Mean + 95% CI
C-index by cancer type: Per cancer type, Mean + 95% CI
AUC by race:            Per race group (EMBED only), Mean + 95% CI
C-index by race:        Per race group (EMBED only), Mean + 95% CI