
Graph-Based Visual-Semantic Representations for Visual-Data Understanding

Date: 2025-07-15

Author: Kundu, Sanjoy

Abstract

Visual event perception tasks, such as activity recognition, require reasoning over the visual-semantic concepts present in scenes. These scenes often exhibit hierarchical structure, both physically (e.g., objects contained within other objects or forming part of larger structures) and semantically (e.g., causal or contextual relationships among objects). Capturing and reasoning over these complex relationships directly from pixel data or textual descriptions (e.g., labels or captions) can be inefficient and limited in expressiveness. Most existing approaches rely on data-driven models trained in supervised or semi-supervised settings, typically under a closed-world assumption in which only a fixed set of categories or concepts is considered during both training and inference. This restricts a model's ability to generalize to open-world scenarios, where arbitrary and previously unseen combinations of concepts may occur. This dissertation addresses this challenge through the lens of graph-based visual-semantic representations, with the goal of developing frameworks that go beyond fixed taxonomies and closed-world assumptions toward open-world visual understanding.

We begin with static scenes, focusing on generating scene graphs that capture their underlying semantic structure. Using a generative approach, we first construct label-agnostic graph structures that represent potential interactions between objects in a scene. A relation prediction module then labels the sampled edges, yielding spatial scene graphs that encode the spatial relationships among the objects present in the scene. The generated graph structures not only achieve state-of-the-art performance but also substantially improve zero-shot scene graph generation, demonstrating the stronger generalization capability of the proposed approach.

To move beyond static scenes, the latter part of the dissertation extends visual-semantic graph representations to egocentric video data, working toward open-world activity understanding. We first introduce a knowledge-guided learning approach that grounds concepts in a predefined label space and employs an energy-based neuro-symbolic framework to translate them into the target activity space with minimal or no supervision. We then improve this framework by incorporating advances in vision-language models (VLMs) and neuro-symbolic prompting to achieve more robust grounding of object concepts. The enhanced approach leverages an energy-based neuro-symbolic model to infer plausible actions over the grounded concepts and integrates temporal smoothing for action prediction, supported by video foundation models. Finally, we introduce a probabilistic residual search strategy based on jump-diffusion dynamics. This method efficiently explores the semantic label space by balancing prior-guided exploration with likelihood-driven exploitation: it incorporates structured commonsense priors to define a semantically coherent search space, adaptively refines predictions using VLMs, and employs stochastic search to locate high-likelihood activity labels.

We evaluate our approaches on standard egocentric video datasets, including GTEA-Gaze, GTEA-Gaze Plus, EPIC-Kitchens, and Charades-Ego, achieving performance competitive with fully supervised baselines. Together, these contributions offer a unified framework for visual-semantic representation and reasoning in both static and dynamic scenes.
The proposed models not only improve open-world generalization but also enhance the interpretability and trustworthiness of activity understanding systems. This work lays the foundation for building adaptable, explainable, and knowledge-aware visual perception systems capable of operating effectively in complex, real-world environments.
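
The two-stage pipeline for static scenes, which first samples a label-agnostic edge structure and then labels each sampled edge with a spatial relation, can be summarized at a high level. The sketch below is illustrative only: the distance-based edge score, the rule-based relation labeler, and the toy object list are stand-ins for the learned generative and relation-prediction modules described above, not the dissertation's actual models.

```python
# Minimal sketch of the two-stage scene graph pipeline (toy stand-ins, not the real models):
#   Stage 1 samples which object pairs interact (label-agnostic structure).
#   Stage 2 labels each sampled edge with a spatial relation.
import itertools
import random

def edge_score(obj_a, obj_b):
    """Toy structural score: nearby object pairs are more likely to interact."""
    (xa, ya), (xb, yb) = obj_a["center"], obj_b["center"]
    return 1.0 / (1.0 + ((xa - xb) ** 2 + (ya - yb) ** 2) ** 0.5)

def sample_structure(objects, num_edges=3, seed=0):
    """Stage 1: sample a label-agnostic graph structure over detected objects."""
    rng = random.Random(seed)
    pairs = list(itertools.permutations(range(len(objects)), 2))
    weights = [edge_score(objects[i], objects[j]) for i, j in pairs]
    return rng.choices(pairs, weights=weights, k=num_edges)

def predict_relation(subj, obj):
    """Stage 2: label a sampled edge (rule-based stand-in for the relation prediction module)."""
    dx = obj["center"][0] - subj["center"][0]
    dy = obj["center"][1] - subj["center"][1]
    if abs(dx) >= abs(dy):
        return "left of" if dx > 0 else "right of"
    return "below" if dy > 0 else "above"

objects = [  # toy detections with normalized image-plane centers
    {"name": "cup",   "center": (0.2, 0.6)},
    {"name": "table", "center": (0.3, 0.2)},
    {"name": "lamp",  "center": (0.8, 0.9)},
]
for i, j in sample_structure(objects):
    print(objects[i]["name"], predict_relation(objects[i], objects[j]), objects[j]["name"])
```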
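
The knowledge-guided, energy-based inference over grounded concepts can likewise be sketched. In the toy example below, the grounding confidences and the verb-noun compatibility prior are hard-coded assumptions standing in for VLM-based grounding and a commonsense knowledge base; the energy simply combines the two, and the lowest-energy verb-noun pair is returned as the predicted activity.

```python
# Hedged sketch of energy-based activity inference over grounded concepts.
# Grounding scores and the commonsense prior are toy values, not learned components.
import math

# Toy grounding: confidence that each object concept is present in the clip.
grounded = {"knife": 0.9, "tomato": 0.8, "cup": 0.1}

# Toy commonsense prior: compatibility of a verb with a noun (higher = more plausible).
prior = {("cut", "tomato"): 0.9, ("cut", "cup"): 0.05,
         ("pour", "cup"): 0.8, ("pour", "tomato"): 0.1}

def energy(verb, noun):
    """Lower energy = more plausible activity given grounded evidence and the prior."""
    evidence = grounded.get(noun, 0.0)
    compat = prior.get((verb, noun), 0.01)
    return -math.log(max(evidence, 1e-6)) - math.log(compat)

candidates = [("cut", "tomato"), ("cut", "cup"), ("pour", "cup"), ("pour", "tomato")]
best = min(candidates, key=lambda vn: energy(*vn))
print("predicted activity:", best, "energy:", round(energy(*best), 3))
```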
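
Temporal smoothing of per-clip action predictions, used alongside the video foundation models, can be illustrated with a simple sliding-window average; the per-clip scores below are toy values rather than real model outputs, and the window size is an arbitrary assumption.

```python
# Minimal sketch of temporal smoothing over per-clip action scores (toy data).
# Averaging scores over a short window suppresses isolated spurious predictions.
from collections import defaultdict

clip_scores = [
    {"cut tomato": 0.70, "wash pan": 0.20},
    {"cut tomato": 0.60, "wash pan": 0.30},
    {"cut tomato": 0.10, "wash pan": 0.80},  # isolated spike
    {"cut tomato": 0.65, "wash pan": 0.25},
]

def smooth(scores, window=3):
    smoothed = []
    for t in range(len(scores)):
        recent = scores[max(0, t - window + 1): t + 1]
        avg = defaultdict(float)
        for s in recent:
            for label, value in s.items():
                avg[label] += value / len(recent)
        smoothed.append(max(avg, key=avg.get))  # keep the highest averaged label per clip
    return smoothed

print(smooth(clip_scores))
```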
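
Finally, the jump-diffusion-style residual search can be sketched as a stochastic search that alternates prior-guided "jump" proposals (exploration) with local "diffusion" proposals scored by a likelihood (exploitation), accepted through a Metropolis-style test. The verb and noun vocabularies, the prior, and the likelihood table below are toy assumptions; in the dissertation the likelihood would come from a VLM and the prior from structured commonsense knowledge.

```python
# Illustrative jump-diffusion-style search over a verb-noun activity space (toy scores).
import math
import random

VERBS = ["cut", "pour", "wash", "open"]
NOUNS = ["tomato", "cup", "pan"]

prior = {"tomato": 0.5, "cup": 0.3, "pan": 0.2}                 # toy commonsense prior over nouns
toy_likelihood = {("cut", "tomato"): 0.9, ("wash", "tomato"): 0.4,
                  ("pour", "cup"): 0.7, ("wash", "pan"): 0.6}   # stand-in for VLM scoring

def likelihood(verb, noun):
    return toy_likelihood.get((verb, noun), 0.05)

def search(steps=200, jump_prob=0.3, temperature=0.1, seed=0):
    rng = random.Random(seed)
    state = (rng.choice(VERBS), rng.choice(NOUNS))
    best = state
    for _ in range(steps):
        if rng.random() < jump_prob:
            # Jump move: resample the noun from the prior to explore a new region.
            proposal = (state[0], rng.choices(NOUNS, weights=[prior[n] for n in NOUNS])[0])
        else:
            # Diffusion move: small local change, here a random verb swap.
            proposal = (rng.choice(VERBS), state[1])
        # Metropolis-style acceptance based on the likelihood score.
        accept = math.exp((likelihood(*proposal) - likelihood(*state)) / temperature)
        if rng.random() < min(1.0, accept):
            state = proposal
        if likelihood(*state) > likelihood(*best):
            best = state
    return best

print("best activity found:", search())
```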