CSV File Format
To use the models and dataset classes in this repository, datasets must be provided as CSV files with a predefined structure.
๐ Required Columns
| Column Name | Description |
|---|---|
patient_id |
Unique identifier for each patient |
exam_id |
Unique identifier for each exam (one per timepoint) |
ImageLaterality |
L (left) or R (right breast) |
view |
Mammography view (e.g., CC, MLO) |
diagnosed_date_year |
Year of diagnosis (empty if cancer-free) |
study_date_year |
Year the mammogram was acquired |
Time_to_Cancer_Years |
Years until diagnosis (empty if cancer-free) |
years_last_followup |
Years until last follow-up |
density |
Breast density (dataset-specific format) |
cancer_type |
Cancer type (dataset-specific encoding) |
race |
Optional demographic information |
๐งฌ Dataset Curation and Label Generation
EMBED
Positive cases:
-
BI-RADS = 6 (biopsy-proven cancer) (
assescolumn in the original csv file) -
Pathological severity = 0 or 1 (invasive or non-invasive cancer) (
path_severitycolumn in the original csv file)
Negative cases:
-
BI-RADS = 1 or 2 (negative or benign) (
assescolumn in the original csv file) -
BI-RADS 0 later reclassified as 1 or 2 (
assescolumn in the original csv file)
Cancer Year Assignment (EMBED)
- Defined as the pathology result date (biopsy confirmation)
- Corresponds to the column
procdate_anonin the original CSV file
Example (EMBED)
| exam_year | BI-RADS | diagnosis_year | time_to_cancer |
|---|---|---|---|
| 2016 | 1 | โ | 3 |
| 2017 | 2 | โ | 2 |
| 2018 | 1 | โ | 1 |
| 2019 | 6 | 2019 | 0 |
CSAW-CC
Positive cases:
-
rad_timing = 1โ screen-detected cancer -
rad_timing = 2โ interval cancer
Negative cases:
- No recorded
rad_timing
Note:
Biopsy dates are not available. rad_timing is used as a proxy for cancer timing.
Cancer Year Assignment (CSAW-CC)
- Data grouped by patient ID
- The latest exam year per patient is selected
- Screen-detected โ same year as exam year
- Interval cancer โ exam year + 1
Special case: If latest exam year = 2016 โ cancer year = 2016
Example (CSAW-CC)
| exam_year | rad_timing | cancer_year | time_to_cancer |
|---|---|---|---|
| 2013 | โ | โ | 4 |
| 2014 | โ | โ | 3 |
| 2015 | โ | โ | 2 |
| 2016 | 2 | 2017 | 1 |
โฑ Time-to-Cancer Definition
Time-to-Cancer represents the interval between a screening exam and the corresponding breast cancer diagnosis.
๐งพ Cancer Type Encoding
The column names listed below correspond to the original dataset CSV files.
| Dataset | Column | Value | Meaning |
|---|---|---|---|
| EMBED | path_severity |
0 | Invasive cancer |
| EMBED | path_severity |
1 | Non-invasive cancer |
| EMBED | path_severity |
2 | High-risk lesion |
| EMBED | path_severity |
3 | Borderline lesion |
| EMBED | path_severity |
4 | Benign findings |
| EMBED | path_severity |
5 | Negative |
| EMBED | path_severity |
6 | Non-breast cancer |
| CSAW-CC | x_type |
1 | In situ (non-invasive) |
| CSAW-CC | x_type |
2 | Invasive โค 15 mm |
| CSAW-CC | x_type |
3 | Invasive > 15 mm |
๐งพ Breast Density
The column names listed below correspond to the original dataset CSV files.
| Dataset | Column | Value | Meaning |
|---|---|---|---|
| EMBED | tissueden |
1 | Almost entirely fat (BI-RADS A) |
| EMBED | tissueden |
2 | Scattered density (BI-RADS B) |
| EMBED | tissueden |
3 | Heterogeneously dense (BI-RADS C) |
| EMBED | tissueden |
4 | Extremely dense (BI-RADS D) |
| EMBED | tissueden |
5 | Normal male |
| CSAW-CC | libra_percentdensity |
0โ100 | Continuous percentage |
๐ Examples
Cancer-Free Patient
| patient_id | exam_id | ImageLaterality | view | diagnosed_date_year | study_date_year | Time_to_Cancer_Years | years_last_followup | density | cancer_type | race |
|---|---|---|---|---|---|---|---|---|---|---|
| 10093833 | 64887706 | L | CC | โ | 2012 | โ | 3 | 1 | โ | Caucasian or White |
| 10093833 | 64887706 | L | MLO | โ | 2012 | โ | 3 | 1 | โ | Caucasian or White |
| 10093833 | 64887706 | R | CC | โ | 2012 | โ | 3 | 1 | โ | Caucasian or White |
| 10093833 | 64887706 | R | MLO | โ | 2012 | โ | 3 | 1 | โ | Caucasian or White |
| 10093833 | 41737690 | L | CC | โ | 2015 | โ | 0 | 1 | โ | Caucasian or White |
| 10093833 | 41737690 | L | MLO | โ | 2015 | โ | 0 | 1 | โ | Caucasian or White |
| 10093833 | 41737690 | R | CC | โ | 2015 | โ | 0 | 1 | โ | Caucasian or White |
| 10093833 | 41737690 | R | MLO | โ | 2015 | โ | 0 | 1 | โ | Caucasian or White |
| --- |
Cancer Patient
| patient_id | exam_id | ImageLaterality | view | diagnosed_date_year | study_date_year | Time_to_Cancer_Years | years_last_followup | density | cancer_type | race |
|---|---|---|---|---|---|---|---|---|---|---|
| 11513410 | 11470876 | L | CC | 2020 | 2016 | 4 | 4 | 3 | 1 | Caucasian or White |
| 11513410 | 11470876 | L | MLO | 2020 | 2016 | 4 | 4 | 3 | 1 | Caucasian or White |
| 11513410 | 11470876 | R | CC | 2020 | 2016 | 4 | 4 | 3 | 1 | Caucasian or White |
| 11513410 | 11470876 | R | MLO | 2020 | 2016 | 4 | 4 | 3 | 1 | Caucasian or White |
| 11513410 | 17070991 | L | CC | 2020 | 2017 | 3 | 3 | 3 | 1 | Caucasian or White |
| 11513410 | 17070991 | L | MLO | 2020 | 2017 | 3 | 3 | 3 | 1 | Caucasian or White |
| 11513410 | 17070991 | R | CC | 2020 | 2017 | 3 | 3 | 3 | 1 | Caucasian or White |
| 11513410 | 17070991 | R | MLO | 2020 | 2017 | 3 | 3 | 3 | 1 | Caucasian or White |