More recently, a simple cumulative risk index was developed to classify which adolescents are at risk of developing persistent substance disorders in adulthood, using risk factors from childhood and adolescence. Another recent study built a model for predicting the quantitative risk of developing cannabis use disorder in adults based on personal risk factors, using statistical and machine learning approaches. A third recent study built models for predicting the risk of developing substance use disorder by thirty years of age, using separate sets of predictors from late childhood to 22 years of age. There is also a substantial literature exploring the factors that lead to co-use of multiple substances. However, to our knowledge, joint modeling of hazardous use of multiple substances has not been considered, especially in the context of risk prediction. More specifically, the previous studies differ from our present work in that they do not provide a quantitative risk measure or score, they model concurrent use rather than hazardous use of multiple substances, and they focus on a selected number of risk factors rather than a comprehensive set of potential risk factors.
As such, there is a need to develop risk prediction models for hazardous use of multiple substances based on the personal risk factors of adolescent users. This study aims to fulfill this need by developing preliminary models for jointly predicting hazardous use of alcohol, cannabis, and tobacco for adolescents who have used all three substances in their lifetime. Joint statistical modeling of multiple outcomes exploits the correlation between them, which can lead to higher power for detecting associations between risk factors and outcomes and can additionally provide insight into shared underlying mechanisms. As our goal is statistical risk prediction rather than hypothesis testing, we consider a set of potential risk factors suggested by the literature. However, the classical multivariate regression approach assumes a common set of predictors for all outcomes, which limits its applicability in our context of risk prediction. This is because, if a variable is predictive of one outcome but not another, model parsimony dictates that the variable be included only in the model for the former and not the latter. Adding unimportant variables to a model adversely affects its ability to predict accurately for new participants who were not included in building the model. Moreover, regularization of the regression coefficients can protect against overfitting, especially when sample sizes are not large.
An overfitted model is sub-optimal for the purpose of predicting for new participants. However, regularization is not available in the classical approach. Therefore, we apply two relatively new statistical and machine learning methods, each of which addresses one of these limitations. Specifically, we utilize multivariate covariance generalized linear models (MCGLM) and penalized multivariate regression with a lasso penalty (multivariate lasso). These methods have not been used to model multiple outcomes in the substance use literature, perhaps because the development of joint risk prediction models has not yet been considered in a formal way.
Supplementary Table 1 summarizes the 18 risk factors. They include participant demographics; general environmental factors; the participants' own substance use; and family substance use. Not all participants used substances other than alcohol, cannabis, and tobacco, and hence for them there was no corresponding age of first use of other substances. Therefore, to include this age variable in the model, a binary indicator of lifetime use of other substances was added to the model together with its interaction with the age of first use. This way, the interaction term had a non-zero value only for the users of other substances.
We used two multivariate statistical modeling frameworks for joint modeling of the three outcome variables: MCGLM and multivariate lasso.
MCGLM: This is a novel generalization of classical multivariate regression that allows modeling of the mean structure, variance function, and within-response covariance structure. MCGLM allows outcome-specific predictors, i.e., the predictors need not be shared across all outcomes. Its predictive accuracy was measured by the root mean square error (RMSE), computed using leave-one-out cross-validation. This measure differs from the ordinary in-sample RMSE: because each participant is predicted by a model fit without that participant, it assesses performance on future unseen data more accurately and protects against overfitting.
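To illustrate how a leave-one-out RMSE differs from the ordinary in-sample RMSE, a minimal Python sketch follows. It is not the MCGLM fit used in this study: as a stand-in it uses a single-predictor ordinary least squares line, and the toy data, the variable names (`age_first_use`, `score`), and the helper functions are hypothetical and invented purely for illustration.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def in_sample_rmse(xs, ys):
    """Ordinary RMSE: the model is fit and evaluated on the same data."""
    a, b = fit_line(xs, ys)
    sq_errors = [(y - (a + b * x)) ** 2 for x, y in zip(xs, ys)]
    return math.sqrt(sum(sq_errors) / len(sq_errors))


def loocv_rmse(xs, ys):
    """Leave-one-out RMSE: each point is predicted by a model fit without it."""
    sq_errors = []
    for i in range(len(xs)):
        # Hold out observation i and refit on the remaining observations.
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        a, b = fit_line(train_x, train_y)
        sq_errors.append((ys[i] - (a + b * xs[i])) ** 2)
    return math.sqrt(sum(sq_errors) / len(sq_errors))


# Hypothetical toy data (NOT from the study): one predictor, one outcome.
age_first_use = [11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0]
score = [9.0, 8.5, 6.0, 6.5, 4.0, 4.5, 2.0]

print("in-sample RMSE:", in_sample_rmse(age_first_use, score))
print("LOOCV RMSE:    ", loocv_rmse(age_first_use, score))
```

For least squares fits such as this one, the leave-one-out value is never smaller than the in-sample value, which is why it serves as the more conservative estimate of predictive accuracy for new participants.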