Interest in the prediction of individualized treatment effects has
grown in recent years. While the literature on the development of such
models is expanding rapidly, little has been written on the evaluation
of their performance. In this paper, we aim
to facilitate the validation of prediction models for individualized
treatment effects. The estimands of interest are defined based on the
potential outcomes framework, which facilitates a comparison of existing
and novel measures. In particular, we examine existing measures of
discrimination for benefit (variations of the c‐for‐benefit), and
propose model‐based extensions to the treatment effect setting for
discrimination and calibration metrics that have a strong basis in
outcome risk prediction. The main focus is on randomized trial data with
binary endpoints and on models that provide individualized treatment
effect predictions and potential outcome predictions. We use simulated
data to provide insight into the characteristics of the examined
discrimination and calibration statistics, and
further illustrate all methods in a trial of acute ischemic stroke
treatment. The results show that the proposed model‐based statistics had
the best characteristics in terms of bias and accuracy. While resampling
methods adjusted for the optimism of performance estimates in the
development data, they had a high variance across replications that
limited their accuracy. Therefore, individualized treatment effect
models are best validated in independent data. To aid implementation,
the proposed methods have been made available as software in R.