Assessing Personality through a Chatbot-based Structured Interview: Developing and Testing Two Parallel Predictive Models
Abstract
Recent advancements in artificial intelligence (AI) and natural language processing (NLP) are changing the way organizations assess talent. Although conversational AI is increasingly used to conduct interviews, few studies have comprehensively examined the psychometric properties of personality-based structured interviews delivered by an AI chatbot. The current study developed and evaluated a chatbot-based structured interview for personality assessment. The chatbot presents a series of behavioral and situational questions targeting specific Big Five personality traits in the workplace and mines textual features from participants’ text responses collected during the interview. First, I trained two sets of machine learning models on employees’ (n = 1,043) chatbot interview scripts to predict their self-reported personality scores and interviewer-rated personality scores, respectively. I then applied the trained models to an independent sample of full-time managers (n = 74) and examined whether the two sets of machine-inferred personality scores predicted supervisor-rated task performance, organizational citizenship behavior (OCB), and counterproductive work behavior (CWB). Additionally, I compared how the type of text used in model training (aggregated text vs. trait-specific text) affects the psychometric properties of machine scores. Results indicated that both the choice of ground truth and the text type affected the psychometric properties of machine-inferred personality scores. Specifically, when trained on aggregated text, self-report and interviewer-report models showed good split-half reliability and convergent validity, poor discriminant validity, and virtually no criterion-related validity. Interviewer-report models exhibited overall higher quality of machine scores than did self-report models.
In contrast, when trained on trait-specific text, the predictive models showed mixed evidence of reliability and convergent validity, but substantially higher discriminant validity and low criterion-related validity. In addition, there was reasonable evidence for the cross-sample generalizability of the psychometric properties of machine scores. Potential explanations for model differences, implications, limitations, and future directions are discussed.