Towards understanding computer vision systems
Abstract
Despite the significant progress of machine learning and deep neural networks across diverse applications, understanding and interpreting these models remains an open problem, particularly under adversarial conditions and for large-scale, multimodal models. This thesis first examines the biases of standard and adversarially trained convolutional neural networks (CNNs). Cue-conflict experiments show that standard CNNs have a pronounced texture bias, whereas their adversarially trained counterparts rely predominantly on shape cues. A neuron-level analysis with NetDissect further shows a marked increase in the monotonicity of neurons in robust networks, suggesting a fundamental shift in how these networks process information. Building on these insights, we turn to large-scale foundation models and introduce gScoreCAM, a visualization technique that highlights the image regions OpenAI's CLIP focuses on; it combines high performance with computational efficiency, running 8 to 10 times faster than existing methods. Post-hoc explanation methods, however, are limited in reliability and fidelity. We therefore propose the Part-based Image Classifier with an Explainable and Editable Language Bottleneck (PEEB). PEEB moves toward self-explainable AI: it offers a layer of user controllability that enhances trustworthiness and provides intuitive, transparent explanations directly during inference. To support the development of part-based, self-explanatory models, we also introduce the Bird-11K and Dog-160 datasets.
In sum, this thesis characterizes the biases and operating characteristics of adversarially trained CNNs and advances the explainability and interpretability of complex, large-scale models. Together, PEEB and gScoreCAM offer new ways to understand these systems and give users a degree of control and transparency that was previously unavailable.