Evaluating a Model
This guide walks you through everything needed to run an evaluation job: prerequisites, configuration, launching evaluation, and understanding what happens at each stage.
Prerequisites
Environment
Ensure all dependencies are installed. The pipeline requires:
- PyTorch
- Hugging Face Accelerate
Data
Evaluation expects:
- A CSV file with a predefined `test` split column
- A root directory containing the image data
The exact CSV schema depends on the dataset (EMBED or CSAW).
Model configuration
Each model has a YAML configuration file under config/models/. For example, LMV-Net uses:
config/models/lmv_net.yaml
These files define model-specific hyperparameters that extend the base CLI arguments. If no YAML is found, the pipeline falls back to CLI defaults.
Registration model
ImgFeatAlign and LMV-Net require a pretrained MammoRegNet registration model. The path is defined in:
config/config.py
- `paths.csaw_path_saved_reg_model` (CSAW)
- `paths.embed_path_saved_reg_model` (EMBED)
Trained model checkpoint
A trained model checkpoint is required. You can use either:
- The best checkpoint: `best_model_risk_prediction_id-{id}.pth`
- The last epoch checkpoint: `model_risk_prediction_training_id_{id}_last_epoch.pth`
Select between them using --best_model True/False.
Running Evaluation
Each model has a dedicated shell script that sets all required arguments:
bash scripts/test_lmv_net.sh
bash scripts/test_imgfeatalign.sh
bash scripts/test_vmra_mar.sh
bash scripts/test_oa_breacr.sh
bash scripts/test_mirai.sh
Note: use accelerate launch (instead of python) to enable multi-GPU evaluation.
CLI Arguments
Required
| Argument | Description |
|---|---|
| `--model` | Model name (e.g. Mirai, OA-BreaCR, VMRA-MaR, ImgFeatAlign, LMV-Net) |
| `--dataset` | Dataset name: EMBED or CSAW |
| `--csv_file` | Path to CSV file containing the test split |
| `--data_root` | Root directory of image data |
| `--path_out_dir` | Directory where the trained model checkpoint is stored |
| `--path_test_folder` | Output directory for evaluation results and logs |
| `--id_training` | Training run ID used to resolve the checkpoint filename |
Key Optional
| Argument | Default | Description |
|---|---|---|
| `--batch_size` | 20 | Batch size per GPU |
| `--num_workers` | 4 | DataLoader worker processes |
| `--best_model` | - | If True, loads the best checkpoint; otherwise loads the last epoch |
| `--seed` | - | Random seed for reproducibility |
Configuration System
Argument loading occurs in two stages:
- CLI arguments are parsed first (including `--model`)
- Model YAML config is loaded from `config/models/<model_name>.yaml`
YAML values are added as CLI defaults and can always be overridden on the command line.
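The two-stage loading described above can be sketched with the standard library's `argparse`. This is a minimal illustration under stated assumptions, not the pipeline's actual code: `build_parser`, `parse_args_two_stage`, and the `load_model_config` callable are hypothetical names, and the callable stands in for reading `config/models/<model_name>.yaml` and returning its contents as a dict (or `None` when no file exists).

```python
import argparse


def build_parser():
    # Base CLI arguments (a small, illustrative subset).
    parser = argparse.ArgumentParser()
    parser.add_argument("--model", required=True)
    parser.add_argument("--batch_size", type=int, default=20)
    parser.add_argument("--num_workers", type=int, default=4)
    return parser


def parse_args_two_stage(argv, load_model_config):
    """Parse argv so that per-model config values act as CLI defaults."""
    parser = build_parser()

    # Stage 1: parse only what the base parser knows, to learn --model.
    known, _unknown = parser.parse_known_args(argv)

    # Stage 2: fetch the model config (hypothetical loader) and register
    # any keys the base parser does not define yet as new options.
    config = load_model_config(known.model) or {}
    for key, value in config.items():
        if key not in vars(known):
            # Assumption: values are int/float/str; booleans would need
            # a custom converter because bool("False") is True.
            arg_type = type(value) if value is not None else str
            parser.add_argument(f"--{key}", type=arg_type, default=value)

    # Config values replace the built-in defaults...
    parser.set_defaults(**config)
    # ...but anything given explicitly on the command line still wins.
    return parser.parse_args(argv)
```

With a config of `{"batch_size": 8}`, calling `parse_args_two_stage(["--model", "LMV-Net"], loader)` yields a batch size of 8, while adding `--batch_size 2` to the argument list yields 2. When the loader returns `None`, the base default of 20 applies.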
Evaluation Pipeline
When evaluation starts, the following steps occur:
1. Argument Parsing & Setup
main_test.py:
- Parses CLI arguments
- Loads YAML config
- Resolves the checkpoint path based on `--id_training` and `--best_model`:
best_model_risk_prediction_id-{id}.pth # if --best_model True
model_risk_prediction_training_id_{id}_last_epoch.pth # otherwise
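The filename rule above can be written as a small helper. This is a sketch, not the code in main_test.py: `str_to_bool` and `resolve_checkpoint_path` are hypothetical names, and it assumes the checkpoint sits directly inside `--path_out_dir`. The boolean converter is included because a flag passed as `--best_model True/False` reaches the parser as a string, and how the pipeline converts it is not stated here.

```python
import argparse
from pathlib import Path


def str_to_bool(value):
    """Convert 'True'/'False' style CLI strings to a real bool."""
    text = str(value).strip().lower()
    if text in ("true", "1", "yes"):
        return True
    if text in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"expected True or False, got {value!r}")


def resolve_checkpoint_path(path_out_dir, id_training, best_model):
    """Build the checkpoint path from the run ID and the best/last switch."""
    if best_model:
        name = f"best_model_risk_prediction_id-{id_training}.pth"
    else:
        name = f"model_risk_prediction_training_id_{id_training}_last_epoch.pth"
    # Assumption: checkpoints are stored at the top level of path_out_dir.
    return Path(path_out_dir) / name
```

A converter like this could be passed as `type=str_to_bool` when the argument is declared, so that the string `"False"` is not treated as truthy.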
2. Accelerator Initialisation
A Hugging Face Accelerator is created, enabling:
- Multi-GPU inference
- Automatic device placement
- Distributed tensor gathering across processes
3. Reproducibility
If --seed is provided, seeds are set for:
- `random`
- `torch`
- `torch.cuda`
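A seeding helper covering those three generators might look like the following. It is an illustrative sketch: `set_seed` is a hypothetical name, and the guarded import is only there so the snippet also runs on a machine without PyTorch installed.

```python
import random


def set_seed(seed):
    """Seed Python's and PyTorch's random number generators."""
    if seed is None:
        return  # --seed not provided: leave the generators untouched
    random.seed(seed)
    try:
        import torch
    except ImportError:
        return  # PyTorch missing: only the Python generator is seeded
    torch.manual_seed(seed)           # CPU generator
    torch.cuda.manual_seed_all(seed)  # all CUDA devices; safe without a GPU
```

Note that seeding alone does not guarantee bit-identical results across runs, since some GPU kernels are non-deterministic.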
4. Data Loading
get_dataset_and_loader() creates a test DataLoader for the test split.
Note: no augmentation is applied during evaluation.
5. Model Loading
Handled by load_model() in evaluate/test_utils.py:
- Model is instantiated via `models/model_factory.py`
- Checkpoint is loaded from disk (`map_location="cpu"`)
- State dict is extracted; checkpoints may be saved as:
  - Raw state dict
  - Dict with a `"model"` or `"state_dict"` key
- `module.` prefixes are stripped (from `DataParallel` wrapping)
- Weights are loaded and the model is set to `eval()` mode
The model and test loader are then wrapped with:
accelerator.prepare()
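The state-dict handling in the loading steps above works on plain dictionaries, so it can be illustrated without PyTorch. The helper names `extract_state_dict` and `strip_module_prefix` are hypothetical, and this is a sketch of the idea rather than the body of `load_model()`.

```python
def extract_state_dict(checkpoint):
    """Return the weights mapping from any of the supported layouts."""
    if isinstance(checkpoint, dict):
        for key in ("model", "state_dict"):
            nested = checkpoint.get(key)
            if isinstance(nested, dict):
                return nested  # wrapped checkpoint
    return checkpoint  # already a raw state dict


def strip_module_prefix(state_dict, prefix="module."):
    """Drop the prefix that DataParallel wrapping adds to parameter names."""
    return {
        (name[len(prefix):] if name.startswith(prefix) else name): tensor
        for name, tensor in state_dict.items()
    }


# In a real run the checkpoint would come from
# torch.load(path, map_location="cpu"), and the cleaned mapping would be
# passed to model.load_state_dict(...) before calling model.eval().
```

For example, `{"state_dict": {"module.fc.weight": w}}` becomes `{"fc.weight": w}` after both helpers are applied.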
6. Inference Loop
Runs under torch.no_grad(). For each batch, the following are collected and gathered across all GPUs:
- `preds`: risk predictions from the primary risk head
- `event_times`: follow-up times
- `event_observed`: event indicators
- `densities`: breast density categories
- `cancer_types`: cancer type categories
- `races`: race IDs (EMBED dataset only)
7. Aggregation & Metric Computation
After inference, the main process:
- Concatenates all gathered tensors
- Computes the censoring distribution from event times and indicators
- Saves raw predictions and metadata to disk via `save_model_results_to_file()`
- Computes all metrics with bootstrapped 95% confidence intervals:
| Metric | Stratification |
|---|---|
| C-index | Overall |
| AUC (years 1–5) | Overall |
| AUC | By breast density |
| C-index | By breast density |
| AUC | By cancer type |
| C-index | By cancer type |
| AUC | By race (EMBED only) |
| C-index | By race (EMBED only) |
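To make "bootstrapped 95% confidence intervals" concrete, here is a minimal percentile-bootstrap sketch around a C-index. It makes several simplifying assumptions that differ from the pipeline: it uses the plain Harrell's C-index with no censoring-distribution weighting, it assumes one scalar risk score per exam, and the function names and the `n_boot` default are hypothetical.

```python
import random


def harrell_c_index(risks, times, events):
    """Fraction of comparable pairs where the earlier event has higher risk."""
    concordant, comparable = 0.0, 0
    for i in range(len(risks)):
        if not events[i]:
            continue  # a pair is anchored on a subject with an observed event
        for j in range(len(risks)):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # ties count as half
    return concordant / comparable if comparable else float("nan")


def bootstrap_ci(metric, risks, times, events, n_boot=200, alpha=0.05, seed=0):
    """Return (mean, lower, upper) of the metric over bootstrap resamples."""
    rng = random.Random(seed)
    n = len(risks)
    scores = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        score = metric(
            [risks[k] for k in idx],
            [times[k] for k in idx],
            [events[k] for k in idx],
        )
        if score == score:  # drop NaN from resamples with no comparable pair
            scores.append(score)
    if not scores:
        raise ValueError("no resample produced a valid score")
    scores.sort()
    lower = scores[int((alpha / 2) * (len(scores) - 1))]
    upper = scores[int((1 - alpha / 2) * (len(scores) - 1))]
    return sum(scores) / len(scores), lower, upper
```

A stratified result (by density, cancer type, or race) would follow the same pattern: filter the three lists to one subgroup, then call `bootstrap_ci` on the filtered lists.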
Output Directory
{path_test_folder}/
test_risk_prediction_training_id_{id}.log
c_index_by_density.json
c_index_by_cancer_type.json
c_index_by_race.json # EMBED only
All per-subgroup C-index results are saved as JSON files for downstream analysis.
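For downstream analysis, the per-subgroup files can be collected with a few lines of standard-library code. The loader below is a sketch with a hypothetical name, `load_subgroup_results`. It makes no assumption about the JSON schema inside each file and simply skips files that are absent, such as `c_index_by_race.json` for CSAW runs.

```python
import json
from pathlib import Path

# Filenames as listed in the output directory layout.
SUBGROUP_FILES = (
    "c_index_by_density.json",
    "c_index_by_cancer_type.json",
    "c_index_by_race.json",  # EMBED only
)


def load_subgroup_results(path_test_folder):
    """Load whichever per-subgroup C-index files exist in the folder."""
    folder = Path(path_test_folder)
    results = {}
    for name in SUBGROUP_FILES:
        file_path = folder / name
        if file_path.is_file():
            with file_path.open(encoding="utf-8") as handle:
                # Key by filename without extension, e.g. "c_index_by_density".
                results[file_path.stem] = json.load(handle)
    return results
```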
Final Results
At the end of evaluation, a structured results dict is printed and written to the log file:
C-index: Mean + 95% CI
Yearly AUCs: Years 1–5, Mean + 95% CI
AUC by density: Per density category Mean + 95% CI
C-index by density: Per density category Mean + 95% CI
AUC by cancer type: Per cancer type Mean + 95% CI
C-index by cancer type: Per cancer type Mean + 95% CI
AUC by race: Per race group (EMBED only) Mean + 95% CI
C-index by race: Per race group (EMBED only) Mean + 95% CI