CSV File Format

To use the models and dataset classes in this repository, datasets must be provided as CSV files with a predefined structure.


๐Ÿ“Œ Required Columns

Column Name Description
patient_id Unique identifier for each patient
exam_id Unique identifier for each exam (one per timepoint)
ImageLaterality L (left) or R (right breast)
view Mammography view (e.g., CC, MLO)
diagnosed_date_year Year of diagnosis (empty if cancer-free)
study_date_year Year the mammogram was acquired
Time_to_Cancer_Years Years until diagnosis (empty if cancer-free)
years_last_followup Years until last follow-up
density Breast density (dataset-specific format)
cancer_type Cancer type (dataset-specific encoding)
race Optional demographic information

๐Ÿงฌ Dataset Curation and Label Generation

EMBED

Positive cases:

  • BI-RADS = 6 (biopsy-proven cancer) (asses column in the original csv file)

  • Pathological severity = 0 or 1 (invasive or non-invasive cancer) (path_severitycolumn in the original csv file)

Negative cases:

  • BI-RADS = 1 or 2 (negative or benign) (asses column in the original csv file)

  • BI-RADS 0 later reclassified as 1 or 2 (asses column in the original csv file)

Cancer Year Assignment (EMBED)

  • Defined as the pathology result date (biopsy confirmation)
  • Corresponds to the column procdate_anon in the original CSV file

Example (EMBED)

exam_year BI-RADS diagnosis_year time_to_cancer
2016 1 โ€” 3
2017 2 โ€” 2
2018 1 โ€” 1
2019 6 2019 0

CSAW-CC

Positive cases:

  • rad_timing = 1 โ†’ screen-detected cancer

  • rad_timing = 2 โ†’ interval cancer

Negative cases:

  • No recorded rad_timing

Note:

Biopsy dates are not available. rad_timing is used as a proxy for cancer timing.


Cancer Year Assignment (CSAW-CC)

  • Data grouped by patient ID
  • The latest exam year per patient is selected
  • Screen-detected โ†’ same year as exam year
  • Interval cancer โ†’ exam year + 1

Special case: If latest exam year = 2016 โ†’ cancer year = 2016


Example (CSAW-CC)

exam_year rad_timing cancer_year time_to_cancer
2013 โ€” โ€” 4
2014 โ€” โ€” 3
2015 โ€” โ€” 2
2016 2 2017 1

โฑ Time-to-Cancer Definition

Time-to-Cancer represents the interval between a screening exam and the corresponding breast cancer diagnosis.


๐Ÿงพ Cancer Type Encoding

The column names listed below correspond to the original dataset CSV files.

Dataset Column Value Meaning
EMBED path_severity 0 Invasive cancer
EMBED path_severity 1 Non-invasive cancer
EMBED path_severity 2 High-risk lesion
EMBED path_severity 3 Borderline lesion
EMBED path_severity 4 Benign findings
EMBED path_severity 5 Negative
EMBED path_severity 6 Non-breast cancer
CSAW-CC x_type 1 In situ (non-invasive)
CSAW-CC x_type 2 Invasive โ‰ค 15 mm
CSAW-CC x_type 3 Invasive > 15 mm

๐Ÿงพ Breast Density

The column names listed below correspond to the original dataset CSV files.

Dataset Column Value Meaning
EMBED tissueden 1 Almost entirely fat (BI-RADS A)
EMBED tissueden 2 Scattered density (BI-RADS B)
EMBED tissueden 3 Heterogeneously dense (BI-RADS C)
EMBED tissueden 4 Extremely dense (BI-RADS D)
EMBED tissueden 5 Normal male
CSAW-CC libra_percentdensity 0โ€“100 Continuous percentage

๐Ÿ“Š Examples

Cancer-Free Patient

patient_id exam_id ImageLaterality view diagnosed_date_year study_date_year Time_to_Cancer_Years years_last_followup density cancer_type race
10093833 64887706 L CC โ€” 2012 โ€” 3 1 โ€” Caucasian or White
10093833 64887706 L MLO โ€” 2012 โ€” 3 1 โ€” Caucasian or White
10093833 64887706 R CC โ€” 2012 โ€” 3 1 โ€” Caucasian or White
10093833 64887706 R MLO โ€” 2012 โ€” 3 1 โ€” Caucasian or White
10093833 41737690 L CC โ€” 2015 โ€” 0 1 โ€” Caucasian or White
10093833 41737690 L MLO โ€” 2015 โ€” 0 1 โ€” Caucasian or White
10093833 41737690 R CC โ€” 2015 โ€” 0 1 โ€” Caucasian or White
10093833 41737690 R MLO โ€” 2015 โ€” 0 1 โ€” Caucasian or White
---

Cancer Patient

patient_id exam_id ImageLaterality view diagnosed_date_year study_date_year Time_to_Cancer_Years years_last_followup density cancer_type race
11513410 11470876 L CC 2020 2016 4 4 3 1 Caucasian or White
11513410 11470876 L MLO 2020 2016 4 4 3 1 Caucasian or White
11513410 11470876 R CC 2020 2016 4 4 3 1 Caucasian or White
11513410 11470876 R MLO 2020 2016 4 4 3 1 Caucasian or White
11513410 17070991 L CC 2020 2017 3 3 3 1 Caucasian or White
11513410 17070991 L MLO 2020 2017 3 3 3 1 Caucasian or White
11513410 17070991 R CC 2020 2017 3 3 3 1 Caucasian or White
11513410 17070991 R MLO 2020 2017 3 3 3 1 Caucasian or White