Statistical Inference for High-Dimensional Regularized Huber Regression
Abstract
With the rapid advancement of technology, the amount of data available for extracting insights and meaningful patterns has grown exponentially, resulting in a significant increase in the dimensionality of datasets. However, high-dimensional data are easily contaminated by outliers or by errors with heavy-tailed distributions, rendering many conventional methods inadequate for analysis. Consequently, there has been growing interest in applying robust methods to high-dimensional data, with regularized Huber regression being a popular choice. Existing robust methods are used primarily for parameter estimation and variable selection, and tools for statistical inference in high dimensions have been lacking. To overcome this challenge, researchers have incorporated techniques such as the lasso into statistical inference. Specifically, they have used the shrinkage penalty as a tool for variable selection and then applied ordinary least squares to the selected variables to construct confidence intervals and p-values. However, the resulting inference is not valid, because it fails to account for the variability introduced by the selection process. Because the generalized lasso problem is one of the most commonly used convex optimization problems, in this dissertation I will focus on developing conditional statistical inferential tools in high dimensions using Huber regression with a generalized lasso penalty as the regularization term (gl-huber). To address this problem, I will follow a framework that characterizes the distribution of a post-selection estimator conditional on the selection process.
Specifically, I will characterize the conditional distribution of the gl-huber post-selection estimator, conditioning on both the variable selection and the outlier detection events. I will first demonstrate that the joint event of variable selection and outlier detection can be represented as a set of affine constraints on the response vector y, that is, as a polyhedron. Using this representation, I will then show that the conditional distribution of a linear combination of the responses is a univariate truncated normal distribution when the random error is normal. When the random error is not normal, I will show that the asymptotic distribution is still truncated normal under certain weak conditions. These results will enable the development of valid post-selection conditional p-values and confidence intervals that account for the variability in the selection process and satisfy the required frequency properties. To further improve the procedure's performance, I will propose incorporating randomized responses. To validate the proposed methods, I will investigate both their theoretical properties and their computational algorithms, and I will demonstrate their practical utility through a range of simulations and real-world examples.
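The polyhedral argument described above can be illustrated with a minimal sketch. It assumes normal errors with known standard deviation sigma and a selection event already written as {A y <= b}. The matrix A, the vector b, the contrast eta, and all function names are hypothetical placeholders for illustration, not the gl-huber construction itself, which would supply its own A and b from the variable selection and outlier detection events.

```python
import math


def norm_cdf(x):
    """Standard normal CDF, computed with the complementary error function."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def truncation_interval(A, b, y, eta):
    """Return (t, v_lo, v_hi): the observed value t = eta'y and the interval
    that {A y <= b} imposes on eta'y when the part of y orthogonal to eta is
    held fixed. Assumes errors are N(0, sigma^2 I)."""
    eta_sq = sum(e * e for e in eta)
    t = sum(e * yi for e, yi in zip(eta, y))        # observed eta'y
    c = [e / eta_sq for e in eta]                   # direction along eta
    z = [yi - ci * t for yi, ci in zip(y, c)]       # component of y independent of eta'y
    v_lo, v_hi = -math.inf, math.inf
    for row, bj in zip(A, b):
        ac = sum(a * ci for a, ci in zip(row, c))
        az = sum(a * zi for a, zi in zip(row, z))
        if ac < 0:                                  # constraint gives a lower bound
            v_lo = max(v_lo, (bj - az) / ac)
        elif ac > 0:                                # constraint gives an upper bound
            v_hi = min(v_hi, (bj - az) / ac)
        # ac == 0: this constraint does not restrict eta'y
    return t, v_lo, v_hi


def truncated_normal_pvalue(t, v_lo, v_hi, sd, mean=0.0):
    """Two-sided p-value for H0: eta'mu = mean, using the normal distribution
    with standard deviation sd truncated to [v_lo, v_hi]."""
    lo = norm_cdf((v_lo - mean) / sd)
    hi = norm_cdf((v_hi - mean) / sd)
    cdf = (norm_cdf((t - mean) / sd) - lo) / (hi - lo)
    return 2.0 * min(cdf, 1.0 - cdf)


# Toy example (hypothetical): the first coordinate was "selected" because
# y[0] >= y[1], which is the affine constraint -y[0] + y[1] <= 0.
A = [[-1.0, 1.0]]
b = [0.0]
y = [2.0, 1.0]
eta = [1.0, 0.0]                                    # test the mean of y[0]
sigma = 1.0

t, v_lo, v_hi = truncation_interval(A, b, y, eta)
sd = sigma * math.sqrt(sum(e * e for e in eta))
p_value = truncated_normal_pvalue(t, v_lo, v_hi, sd)
print(t, v_lo, v_hi, p_value)
```

In this toy case the selection event forces eta'y to be at least y[1], so the reference distribution is a normal truncated from below. An unconditional test would instead use the full normal distribution and ignore that the coordinate was chosen because it was the larger one. The CDF difference in the p-value can lose precision far in the tails, so a careful implementation would need a more stable computation there.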