THEORY OF HIGH-DIMENSIONAL ℓ1-PENALIZED LOGISTIC REGRESSION
Date
2023-08-04
Abstract
In the setting where the sample size n is sufficiently large relative to the number of features p, a classical result is that fitting a logistic model by maximum likelihood produces estimates that are approximately normal, unbiased and efficient. The usual claim is that these approximations are valid if there are about 5-10 observations per unknown parameter. Sur and Candès (2019) show, in the context of logistic regression, that in the modern setting where the sample size and the number of features are large and comparable, this claim is misleading and untrue, and hence inferences based on the output of common software packages can be unreliable. This dissertation considers logistic regression with an ℓ1 penalty and extends the results of Sur and Candès (2019) to the asymptotic regime where the dimension p of the covariates and the sample size n grow to infinity together in such a way that n/p → δ ∈ (0, ∞). There are two major contributions. First, it explicitly characterizes the asymptotic mean squared error (MSE) of the ℓ1-penalized logistic regression estimator. Second, it provides empirical evidence for the existence and location of a phase transition in the accuracy of signal recovery by the logistic lasso estimator in the two-dimensional sparsity-undersampling phase space. The formalism is based on the asymptotic analysis of the generalized approximate message passing (GAMP) algorithm. The findings offer theoretical insights into high-dimensional regression methods. For example, because they give an explicit characterization of the asymptotic MSE, they can be used to tune the regularization parameter. The phase transition result also provides a guide to when the ℓ1-regularized estimator is reliable in the context of logistic regression.
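For orientation, the objects named in the abstract can be written out as follows. This is a sketch in standard notation, not necessarily the notation of the dissertation: the 0/1 coding of the responses y_i, the symbol λ for the regularization parameter, the symbol β for the true coefficient vector, and the 1/p normalization of the error are all assumptions made here for illustration.

```latex
% l1-penalized logistic regression (logistic lasso) estimator,
% assuming responses y_i in {0,1} and covariates x_i in R^p:
\hat{\beta}(\lambda) \in \arg\min_{b \in \mathbb{R}^p}
  \sum_{i=1}^{n} \Big[ \log\big(1 + e^{x_i^\top b}\big) - y_i \, x_i^\top b \Big]
  + \lambda \, \lVert b \rVert_1 .

% Proportional asymptotic regime stated in the abstract:
n, p \to \infty, \qquad n/p \to \delta \in (0, \infty).

% One common normalization of the asymptotic mean squared error
% (the 1/p scaling is an assumption):
\mathrm{MSE}(\lambda, \delta)
  = \lim_{p \to \infty} \frac{1}{p} \,
    \big\lVert \hat{\beta}(\lambda) - \beta \big\rVert_2^2 .
```

Under this reading, tuning the regularization parameter amounts to choosing the λ that minimizes the characterized limit of the MSE for a given δ.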