Beyond Pixels and Words (BPW)

The goal of this project is to improve on the state of the art in computational models of language and perception for interactive artificial agents in a way that goes beyond mere recognition of patterns in perceptual data. The project focuses on the domain of spatial descriptions (such as “bring me the red book on the table beside my bed”), which present many open challenges for computational modelling of language and perception. In particular, it explores new dynamic models of language interpretation and generation that integrate information from several modalities, and new machine learning scenarios in which an agent can improve its acquisition of missing knowledge by directly interacting with a human partner (a situated version of active learning) or by integrating knowledge learned off-line in a different domain (transfer learning).
Why computational modelling of spatial language?
Research shows that computational modelling of spatial language requires integration of different sources of knowledge that affect the semantics of spatial descriptions: identification of objects, arrangements of objects and scene geometry, knowledge about dynamic kinematic routines between objects, and language coordination with dialogue partners.
However, several open questions remain about how multi-modal representations are learned by particular deep learning architectures and what improvements to these architectures can be made. Because situated agents are located within realistic and changing environments, they may encounter new conversational partners, new spaces and new objects, which means that both linguistic and perceptual representations must be learned continuously. What algorithms can be used for such learning, and how successful are they? Learning from the environment is also limited by the number of contexts an agent experiences in its lifetime, yet agents must be minimally useful when they are first deployed. How useful is pre-trained background knowledge from another domain, and how can such knowledge be integrated into the current learning scenario?
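One common way of approaching the last question, which Aim 3 below takes up, is to reuse an encoder pre-trained off-line on a large corpus, freeze it, and fine-tune only a small task-specific head on the scarce in-domain interactive data. The fragment below is a minimal sketch of this idea; the modules, dimensions and data are stand-ins for exposition, not the project’s actual models.

```python
import torch
import torch.nn as nn

# Stand-in for an encoder pre-trained on an offline image-description corpus
pretrained_encoder = nn.Sequential(nn.Linear(512, 256), nn.ReLU())
for p in pretrained_encoder.parameters():
    p.requires_grad = False            # keep the offline knowledge fixed

head = nn.Linear(256, 5)               # small in-domain head, e.g. 5 spatial relations
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# One gradient step on a (tiny) batch of in-domain interactive examples
features = torch.randn(8, 512)         # stand-in for perceptual features from the agent
labels = torch.randint(0, 5, (8,))     # stand-in for relations supplied by the human partner
loss = loss_fn(head(pretrained_encoder(features)), labels)
loss.backward()
optimizer.step()
```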
Our aims
These open questions identify the following specific aims of this research project:
Aim 1 Develop new multi-modal computational models of spatial descriptions that integrate perceptual context, world knowledge and dialogue context using deep learning architectures.
Aim 2 Develop computational methods for interactive learning of spatial language through the agent’s interaction with the world (perception) and with other agents (dialogue), using active learning.
Aim 3 Integrate and test deep learning models, initially trained off-line in a different domain (e.g. on an image description corpus), in an interactive scenario through the use of domain adaptation and transfer learning.
Aim 4 As a proof of concept, implement the models in a (virtual) situated agent.
The novelty of this project is that it treats the multimodal semantic representations underpinning spatial language as continuously adaptable and learnable from linguistic and perceptual contexts using state-of-the-art machine learning approaches.
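As a purely illustrative sketch of what such a multi-modal model might look like (the architecture, feature dimensions and vocabulary below are assumptions made for exposition, not the models to be developed in the project), the fragment scores how well a spatial term such as “on” or “beside” applies to a pair of objects by fusing object appearance features, scene geometry and a word embedding.

```python
import torch
import torch.nn as nn

class SpatialGrounder(nn.Module):
    """Score the applicability of a spatial term to a (target, landmark) object pair."""

    def __init__(self, vocab_size, vis_dim=512, geo_dim=6, emb_dim=64, hidden=256):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, emb_dim)        # spatial term ("on", "beside", ...)
        self.fuse = nn.Sequential(
            nn.Linear(2 * vis_dim + geo_dim + emb_dim, hidden),  # concatenate all modalities
            nn.ReLU(),
            nn.Linear(hidden, 1),                                 # scalar applicability score
        )

    def forward(self, target_vis, landmark_vis, geometry, term_ids):
        # target_vis, landmark_vis: appearance features of the two objects (e.g. CNN crops)
        # geometry: relative offset, distance, overlap, etc. between the objects
        # term_ids: index of the spatial term in a small vocabulary
        x = torch.cat([target_vis, landmark_vis, geometry, self.word_emb(term_ids)], dim=-1)
        return self.fuse(x).squeeze(-1)

# Toy usage with random tensors standing in for real perceptual input
model = SpatialGrounder(vocab_size=20)
score = model(torch.randn(1, 512), torch.randn(1, 512), torch.randn(1, 6), torch.tensor([3]))
print(score.item())
```

In the project such a scorer would additionally be conditioned on world knowledge and the dialogue context (Aims 1 and 2), which the sketch leaves out.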
Work packages
The project runs for three years, during which the following tasks are explored iteratively in stages of increasing complexity of situations and learning.
Task 1 Computational models of spatial descriptions in interaction:
- 1.1 Model of objects using image data with scene geometry;
- 1.2 Model of world knowledge;
- 1.3 Model of attention and spatial perspective taking;
Task 2 Strategies for continuous interactive learning:
- 2.1 Domain evaluation and experiment design: a table with objects, kinds of virtual environments, and potential interactions with them;
- 2.2 Identification of interactive strategies for continuous learning;
- 2.3 Extension of robotic platform(s) and/or virtual environment(s);
Task 3 Adaptation of deep learning to continuous interactive learning:
- 3.1 Exploration of offline corpora of language, images (with depth) and deep learning;
- 3.2 Adaptation of deep learning algorithms for learning with limited data to our interactive scenarios (transfer learning);
- 3.3 Adaptation of interactive deep learning algorithms (reinforcement and active learning) to our interactive scenarios (see the sketch after this list);
Task 4 Evaluation of the system:
- 4.1 Evaluation of interactive strategies with respect to the rate of learning;
- 4.2 Evaluation of the effect of pre-training on learning (transfer learning);
- 4.3 Evaluation of the system in generating and understanding spatial language (architectures, feature representations, strategies, extrinsic performance).
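To make the interactive learning strategy of Tasks 2.2 and 3.3 more concrete, the sketch below shows one simple query policy that could be explored: the agent asks its human partner for a label only when its current model is uncertain about a scene, here measured by the entropy of its predictions against an arbitrary threshold. Everything in the sketch, including the stand-in model and the threshold, is a hypothetical illustration rather than a project result.

```python
import math
import random

def predict_proba(scene):
    # Stand-in for the current grounded model: a distribution over spatial relations
    p = [random.random() for _ in range(4)]
    s = sum(p)
    return [x / s for x in p]

def entropy(p):
    return -sum(x * math.log(x) for x in p if x > 0)

UNCERTAINTY_THRESHOLD = 1.2   # arbitrary; to be tuned experimentally
training_data = []

for scene in range(10):                            # stream of situations the agent encounters
    probs = predict_proba(scene)
    if entropy(probs) > UNCERTAINTY_THRESHOLD:
        label = f"human answer for scene {scene}"  # stand-in for asking the dialogue partner
        training_data.append((scene, label))       # new supervised example to retrain on
    else:
        pass                                       # confident enough: act on own prediction
```

Uncertainty sampling is only one candidate strategy; clarification questions in dialogue and reinforcement signals from task success (Task 3.3) are alternatives to be compared in the evaluation (Task 4.1).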
Researchers
The project involves Simon Dobnik, University of Gothenburg (project lead; computational models of spatial language and interaction), a researcher, and a research programmer (development and system implementation) at the Department of Philosophy, Linguistics and Theory of Science (FLoV), University of Gothenburg, and John Kelleher (spatial language and machine learning) at the School of Computer Science and Statistics, Trinity College Dublin. Parts of the tasks may be explored in thesis projects with master’s students.
The project is funded by VR Project Grant 2023-01552.
Contact
Simon Dobnik (principal investigator)
2025-03-11