Transforming the black-box decision-making of AI models into explain-then-answer processes
Date
2025-08-07

Abstract
From climate's slow fever to the sun-bright flash of nuclear fire, humanity already lives beneath swords that could fall without warning. Global warming creeps toward tipping points [1] while stockpiles of fission and fusion weapons wait on hair-trigger alert [2]. These existential risks remind us that civilization's continuance is not guaranteed and that the margin for error is thin. Among these dangers, the rapid ascent of artificial intelligence (AI) agents now looms as perhaps the foremost threat to the very fabric of human civilization. Will the rise of superhuman AIs, systems that surpass human intelligence, add a new chain, one that binds and masters its creators? Foresight vignettes paint unsettling possibilities: AI agents that, when probed, casually choose the "kill all humans" option [3], or scenarios sketched by leading experts in which AIs race beyond oversight while society, dazzled and divided, lags in governance, leaving open a path to domination or extinction [4]. A safe and trustworthy AI model (or agent) should make every decision fully explained and aligned with human values. If we can refactor today's opaque decision-making into explain-then-answer processes, in which every answer is preceded by a traceable rationale, we may reclaim legibility, audit alignment, and give humans a fighting chance to collaborate with, rather than succumb to, arguably mankind's greatest invention (superhuman AI) [5]. My thesis stands in this narrow passage: it transforms AI black-box decision-making into interpretable processes that both experts and laypeople can scrutinize, debug, and ultimately trust. First, I show that AI interpretability need not come at the cost of performance. Second, by re-engineering the inference of state-of-the-art systems, from deep computer-vision networks with millions of parameters to gargantuan billion-parameter language models, I restructure each model to explain first and then answer, giving human users actionable control over AI behaviors. Finally, the thesis closes with a brief, contemporary survey of interpretability research, including my personal takes on mainstream interpretability directions and my proposal for future AI technology.