Interpreting .predict() Output from Fitted Scikit-Survival Models
scikit-survival (imported as sksurv) is a Python library for survival analysis built on top of scikit-learn. It provides several models for time-to-event data, and understanding what the .predict() method returns is crucial for drawing meaningful conclusions. This article explains how to interpret .predict() and the related prediction methods across different scikit-survival models.
Common Predictions: Survival Probabilities & Risk Scores
Generally, scikit-survival models predict two main quantities:
Survival Probabilities
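Survival probabilities represent the likelihood of an individual surviving beyond a given time point. In scikit-survival they come from the .predict_survival_function() method, which returns one step function per individual; each function can be evaluated at any time point within the range covered by the training data. A related method, .predict_cumulative_hazard_function(), returns the cumulative hazard in the same form. scikit-survival estimators do not provide a .predict_proba() method.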
Risk Scores
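Risk scores provide a measure of an individual’s risk of experiencing the event of interest (e.g., death, failure). The .predict() method returns one score per individual on an arbitrary scale; for most models, a higher score means a higher risk and therefore a shorter expected survival time. The score is not a probability and is not tied to a particular time point, so it is mainly useful for ranking individuals. For Cox models, it is the linear predictor, i.e. the log hazard ratio relative to the baseline.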
Interpreting .predict() Output Based on Model
Survival Regression Models
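Models like CoxPHSurvivalAnalysis and CoxnetSurvivalAnalysis are typically used to estimate the effect of covariates on survival. They offer two kinds of prediction:
* **Risk Scores:** Using .predict(), you get one risk score for each individual based on their covariate values. For Cox models, this is the linear predictor on the log hazard ratio scale.
* **Survival Probabilities:** Using .predict_survival_function(), you get one survival step function for each individual. Passing return_array=True returns a matrix of survival probabilities instead, with one row per individual and one column per time point stored by the fitted model. CoxnetSurvivalAnalysis supports this method only when it is created with fit_baseline_model=True.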
Example: CoxPHSurvivalAnalysis
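
    from sksurv.datasets import load_veterans_lung_cancer
    from sksurv.linear_model import CoxPHSurvivalAnalysis
    from sksurv.preprocessing import OneHotEncoder

    # Load data: X is a DataFrame of covariates, y a structured array (event, time)
    X, y = load_veterans_lung_cancer()
    # Encode categorical covariates as numeric columns
    Xt = OneHotEncoder().fit_transform(X)

    # Create and fit the model
    model = CoxPHSurvivalAnalysis()
    model.fit(Xt, y)

    # Predict the survival function for the first sample and evaluate it
    time_points = [30, 90, 180]  # days
    surv_fns = model.predict_survival_function(Xt.iloc[:1])
    survival_probs = surv_fns[0](time_points)
    print(survival_probs)

    # Predict risk score for the first sample
    risk_score = model.predict(Xt.iloc[:1])
    print(risk_score)

Output: the first print shows three survival probabilities, one per requested time point, each between 0 and 1 and not increasing over time. The second print shows a one-element array holding the risk score of the first sample. The exact numbers depend on the library version.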
Machine Learning Models
Models like RandomSurvivalForest and GradientBoostingSurvivalAnalysis are often used when prediction calls for a flexible model, potentially with complex interactions between variables. These models provide:
* **Risk Scores:** Using .predict(), you get one risk score per individual. For RandomSurvivalForest, the score is derived from the ensemble cumulative hazard function and can be read as an expected number of events, so in practice it is used to rank individuals. For GradientBoostingSurvivalAnalysis with its default loss, the score is on a log hazard ratio scale, similar to a Cox model.
* **Survival Probabilities:** Using .predict_survival_function(), you get one survival step function per individual, which you evaluate at the time points of interest. There is no .predict_proba() method.
Example: RandomSurvivalForest
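
    from sksurv.datasets import load_veterans_lung_cancer
    from sksurv.ensemble import RandomSurvivalForest
    from sksurv.preprocessing import OneHotEncoder

    # Load data and encode categorical covariates as numeric columns
    X, y = load_veterans_lung_cancer()
    Xt = OneHotEncoder().fit_transform(X)

    # Create and fit the model
    model = RandomSurvivalForest(random_state=0)
    model.fit(Xt, y)

    # Predict the survival probability at day 30 for the first sample
    surv_fns = model.predict_survival_function(Xt.iloc[:1])
    survival_prob = surv_fns[0](30)
    print(survival_prob)

    # Predict risk score for the first sample (higher means higher risk)
    risk_score = model.predict(Xt.iloc[:1])
    print(risk_score)

Output: the first print shows a single survival probability between 0 and 1 for day 30. The second print shows a one-element array holding the risk score of the first sample. The exact numbers depend on the library version.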
Important Considerations
- Model-specific Interpretations: Consult the documentation of each model for the specific interpretation of its output. Some models use their own risk score definitions, and their .predict() methods might have nuances, such as the scale or direction of the returned values.
- Time Dependence: Survival analysis is intrinsically time-dependent. Make sure you are interpreting the output in the context of the time point being considered (e.g., survival probability at time 5 is different from survival probability at time 1).
- Cross-validation: Use techniques like cross-validation to check that the model’s performance generalizes to unseen data.
Conclusion
Understanding the output of .predict() from fitted scikit-survival models is key to drawing meaningful conclusions from survival analysis. Knowing the difference between survival probabilities and risk scores, as well as the model-specific outputs, empowers you to effectively interpret and utilize predictions.