Introduction
In a large prospective cohort study, predicting the outcome of interest and selecting the variables that drive it are central tasks. Variables are typically selected for their predictive power, as quantified by some measure of prediction performance. That predictive power, however, must be assessed on data that were not used to develop the model. In epidemiological studies, prediction and variable selection are often criticized for lacking proper validation (1,2). To address this issue, we propose using a nested case-control design instead of splitting the cohort.
The Nested Case-Control Design
The nested case-control design includes all cases of interest and, for each case, selects controls from the subjects in the full cohort who were still event-free at that case’s event time (a risk-set matched case-control design). The design aims to produce results similar to those of a full cohort analysis (7,8). Using the nested case-control sample as the training data set, we can develop prediction models and conduct variable selection without losing statistical power. The remaining cohort serves as the validation data set. Because every case is used for training, the validation set contains only non-cases, so validation is partial: it can assess specificity (the proportion of true negative predictions) but not sensitivity. This limitation matters less when predicting uncommon clinical events, where limiting false positive predictions (1 - specificity) is of greater interest.
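The risk-set sampling described above can be sketched in a few lines of Python. This is a minimal illustration, not the sampling procedure used in the study: the function name, the number of controls per case (m), the toy follow-up data, and the strict inequality used to define the risk set are all assumptions made for the example.

```python
import random

def sample_nested_case_control(times, events, m=1, seed=0):
    """Risk-set sampling sketch (hypothetical helper).

    times  -- follow-up time per subject (event or censoring time)
    events -- 1 if the subject had the event at that time, else 0
    Returns a list of (case_index, [control_indices]) matched sets.
    """
    rng = random.Random(seed)
    matched_sets = []
    for i, (t_i, d_i) in enumerate(zip(times, events)):
        if d_i != 1:
            continue  # only cases define a risk set
        # Assumption: a subject is at risk at t_i if their follow-up time
        # strictly exceeds t_i. Later cases may serve as controls here.
        risk_set = [j for j, t_j in enumerate(times) if j != i and t_j > t_i]
        controls = rng.sample(risk_set, min(m, len(risk_set)))
        matched_sets.append((i, controls))
    return matched_sets

# Toy cohort: six subjects, three of whom are cases.
times = [2.0, 5.0, 3.0, 8.0, 6.0, 9.0]
events = [1, 0, 1, 0, 1, 0]

sets = sample_nested_case_control(times, events, m=2)
sampled = {i for case, ctrls in sets for i in (case, *ctrls)}
# Subjects never sampled form the held-out set; none of them are cases.
held_out = [i for i in range(len(times)) if i not in sampled]
```

Because every case enters the sampled set, the held-out subjects are all non-cases, which is why only specificity can be checked on them.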
Variable Selection in Prospective Cohort Studies
Prospective cohort studies often collect extensive data to explore the nature of exposures and their relationships with clinical outcomes. Variable selection is commonly used to identify relevant exposures. Analyzing each variable separately is possible, but it inflates false positive findings and ignores correlations between variables. A standard method for analyzing multiple variables together is stepwise selection through multiple regression models. However, when interactions between exposures are considered, the number of candidate terms grows exponentially, and stepwise selection becomes impractical or performs poorly (3).
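To give a sense of scale, the short sketch below counts the candidate terms (main effects plus interactions up to a given order) for a few numbers of exposures. The function name and the chosen values of p are illustrative assumptions, not figures from the study.

```python
from math import comb

def n_candidate_terms(p, max_order):
    """Main effects plus all interaction terms up to max_order among p exposures."""
    return sum(comb(p, k) for k in range(1, max_order + 1))

for p in (10, 50, 100):
    # Columns: exposures, terms up to pairwise, terms up to three-way.
    print(p, n_candidate_terms(p, 2), n_candidate_terms(p, 3))
```

With 100 exposures there are already 4,950 pairwise interactions and 161,700 three-way interactions, and allowing interactions of every order gives 2^p - 1 candidate terms.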
For high-dimensional variable selection, a nested case-control design is advantageous because it maintains statistical power while enabling external validation. It also provides ways to control for confounders and to interpret the selection through fitted prediction models. In our proposed variable selection strategy, we address missing data by repeating the variable selection process on data sets imputed with a random forest technique and retaining the variables that are selected consistently across repetitions. By comparing the internal specificity (in the training data) with the external specificity (in the remaining cohort), we evaluate the prediction and variable selection directly and determine a valid classification cut-off at which both specificities are equivalent at a desired level. To illustrate our framework, we present an example from a large prospective cohort study.
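The two ideas in this paragraph, keeping variables that recur across imputed data sets and checking a cut-off against internal and external specificity, can be sketched as follows. This is an illustration under stated assumptions rather than the procedure used in the study: the helper names, the 60% selection threshold, the rule for picking the cut-off, and all variable names and scores are hypothetical.

```python
from collections import Counter

def stable_variables(selections, min_fraction=0.5):
    """Keep variables selected in at least min_fraction of the repetitions."""
    counts = Counter(v for sel in selections for v in set(sel))
    needed = min_fraction * len(selections)
    return sorted(v for v, c in counts.items() if c >= needed)

def specificity(scores_noncases, cutoff):
    """Fraction of non-cases classified negative (score below the cut-off)."""
    return sum(s < cutoff for s in scores_noncases) / len(scores_noncases)

def cutoff_for_specificity(scores_noncases, target):
    """Smallest observed score that reaches the target specificity as a cut-off."""
    for c in sorted(set(scores_noncases)):
        if specificity(scores_noncases, c) >= target:
            return c
    return float("inf")  # no observed score suffices: classify everyone negative

# Hypothetical selections from five imputed data sets.
selections = [
    ["age", "bmi", "smoking"],
    ["age", "bmi"],
    ["age", "smoking", "bmi"],
    ["age", "alcohol"],
    ["age", "bmi"],
]
kept = stable_variables(selections, min_fraction=0.6)

# Hypothetical predicted risks for non-cases: controls in the training data
# (internal) and subjects in the remaining cohort (external).
internal = [0.05, 0.10, 0.12, 0.20, 0.25, 0.30, 0.40, 0.55, 0.60, 0.80]
external = [0.02, 0.08, 0.15, 0.22, 0.35, 0.45, 0.50, 0.58, 0.65, 0.90]

cutoff = cutoff_for_specificity(internal, target=0.8)
internal_spec = specificity(internal, cutoff)
external_spec = specificity(external, cutoff)
```

In this toy example the cut-off chosen on the internal scores gives the same specificity on the external scores; a clear gap between the two would suggest that the model or the selection does not generalize at that cut-off.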