29.03.2025

Disclaimer: Throughout this proposal, I use “we” instead of “I” to reflect that this project is intended as a collaborative effort.

Temporal Emergence of Crossmodal Spatial Representations in the Auditory and Visual Dorsal Pathways

Research question

At what time points do neural representations of auditory and visual cues converge within the dorsal stream to form a stable, multimodal spatial code?

Introduction

Studies indicate that auditory processing mirrors visual processing, with both modalities divided into two streams (Romanski, 1999).
This project uses MEG recordings to investigate the temporal dynamics of crossmodal integration in both the auditory and visual dorsal streams. Based on previous research, we hypothesize that crossmodal influences, defined as the impact of one sensory modality on the processing of another, are particularly strong in the auditory cortex and that, over time, both dorsal streams converge to form a stable, multimodal spatial representation.
A detailed understanding of the precise timing of these integration processes will provide insights into the dynamic interplay between auditory and visual cues within the dorsal streams.

Background

Auditory processing is organized into two pathways. The posterior-dorsal (“where”) stream begins in the caudal part of the superior temporal sulcus (STS) and specializes in sound localization, motion in space, and aspects of language, whereas the anterior-ventral (“what”) pathway is more involved in pattern and speech recognition (Long, 2018; see Figure 1). This organization is similar to that of the visual system, where the ventral pathway is dedicated to object recognition and the dorsal pathway to processing motion (Rauschecker, 2000).


Figure 1: Expanded model of dual auditory processing streams in the human brain (Figure 3 in Rauschecker, 2011)

Understanding and comparing the spatial flow and integration within these dorsal streams is crucial for assessing their crossmodal properties.
Evidence shows that the auditory cortex is organized in an “onion-like” structure, which complicates a clear-cut distinction between the dorsal and ventral streams (Millet, 2022).

When Does Crossmodulation Emerge?

Auditory stimuli show little influence on early visual areas in the visual dorsal stream (Hickok, 2007). In contrast, visual signals strongly influence activity in auditory dorsal regions. Recent studies even show that the visual bias in integration can exceed optimal model predictions, indicating a sensory hierarchy favouring vision (Callan, 2015).
In addition, the visual dorsal stream might show earlier crossmodal tuning (in area hMT+/V5), whereas early auditory dorsal stages remain unimodal and only later show signs of crossmodality (Rezk, 2020).
Although these findings seem contradictory, they imply that while crossmodal integration in the auditory cortex occurs later, its influence is stronger. Figure 2 shows that successful crossmodal integration is primarily observed in the right hemisphere within hMT+/V5. In contrast, the auditory dorsal stream exhibits more dominant activation in the left hemisphere, consistent with its unilateral organization compared to the bilateral auditory ventral stream.


Figure 2: Group-level univariate motion selectivity showing the overlap between MTa and hMT+/V5 (Figure 1A in Rezk, 2020)

The strong influence of visual cues on the auditory dorsal stream is illustrated by the “Ventriloquist Illusion,” in which a sound is paired with a spatially discrepant visual stimulus and the visual input biases the perceived sound location, diminishing the spatial tuning of this stream. For example, a voice appears to originate from a puppet's moving mouth rather than from the actual speaker. This suggests that the auditory cortex is multimodal and uses visual input to process sound location.
It could also imply that the visual dorsal stream is more dominant in multimodal spatial localization tasks (Bruns, 2019).

Methods

Study Design

The study will expose participants to artificial stimuli across three conditions:

  • Visual:
    Moving dots on a screen shift from one side to the other (contrast and speed are manipulated to reduce ventral stream engagement).

  • Auditory:
    Spatialized tones using head-related transfer functions to simulate left/right spatial locations.

  • Audiovisual:
    Combined presentations with congruent (matching spatial cues) and incongruent (mismatched cues) conditions.

The study will be conducted using a randomized block design. Participants will perform a spatial localization task, reporting as accurately as possible the direction of the stimulus and where it appeared.
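As an illustration of how such a randomized block schedule could be generated, the following minimal sketch shuffles the order of blocks and balances trials within each block. The function names, the number of blocks, the number of trials per block, and the restriction to left/right directions are assumptions made for this example and are not fixed by the proposal.

```python
import random

# Hypothetical condition labels; the three conditions follow the study design.
CONDITIONS = ("visual", "auditory", "audiovisual")
OPPOSITE = {"left": "right", "right": "left"}


def make_block(condition, trials_per_block, rng):
    """Build one block with balanced left/right (and congruent/incongruent) trials."""
    if trials_per_block % 4 != 0:
        raise ValueError("trials_per_block must be divisible by 4 for balancing")
    trials = []
    for i in range(trials_per_block):
        direction = "left" if i % 2 == 0 else "right"
        congruent = (i // 2) % 2 == 0  # only used in audiovisual blocks
        if condition == "visual":
            visual, auditory = direction, None
        elif condition == "auditory":
            visual, auditory = None, direction
        else:
            visual = direction
            auditory = direction if congruent else OPPOSITE[direction]
        trials.append({"condition": condition, "visual": visual, "auditory": auditory})
    rng.shuffle(trials)  # randomize trial order within the block
    return trials


def make_schedule(n_blocks_per_condition=4, trials_per_block=20, seed=0):
    """Return a list of blocks in randomized order (assumed block counts)."""
    rng = random.Random(seed)
    block_order = list(CONDITIONS) * n_blocks_per_condition
    rng.shuffle(block_order)  # randomize the order of blocks
    return [make_block(c, trials_per_block, rng) for c in block_order]


schedule = make_schedule(n_blocks_per_condition=2, trials_per_block=8, seed=1)
for block in schedule:
    print(block[0]["condition"], len(block))
```

Fixing the seed per participant would make each schedule reproducible while still varying the block order across participants.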

Data Recording

MEG Recording

We will use high-density MEG to capture neural responses with millisecond temporal resolution. This will enable us to map the timing of crossmodal integration precisely and to localize activity in dorsal stream regions. EEG recordings may provide complementary temporal information.
Since the visual dorsal stream is predominantly right-hemispheric and the auditory dorsal stream primarily left-hemispheric, this lateralization will help us differentiate and analyze the two pathways (Hickok, 2007; see Figure 2). Our approach focuses on the temporal dynamics of integration, while previous fMRI studies provide better spatial information about the dorsal streams (Callan, 2015; Rezk, 2020).

Comparing the Data

We will apply Representational Similarity Analysis (RSA) to compare response patterns across modalities. We will compute Representational Dissimilarity Matrices (RDMs) for each condition at every time point from the MEG data.
These RDMs capture the pairwise dissimilarity of the spatial response patterns across stimuli.
By correlating the RDMs from the auditory and visual conditions over time, we can determine when their representations converge and estimate the onset of crossmodal integration in both dorsal pathways.
Then, comparing the audiovisual condition with the unimodal conditions will reveal how congruent or incongruent multimodal input affects the integration process.
This analysis will help us answer questions about the emergence and stability of crossmodal representations in the dorsal streams (Cecere, 2017; Devereux, 2013).
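The time-resolved comparison described above can be sketched as follows. This is a minimal illustration on simulated data, not our final pipeline: the array layout (stimuli × sensors × time points), the choice of correlation distance for the RDMs, the Spearman correlation between RDMs, and all names and parameter values are assumptions made for the example. A real analysis would additionally need preprocessing, cross-validated distance estimates, and appropriate statistics.

```python
import numpy as np


def rdm_timecourse(data):
    """data: (n_stimuli, n_sensors, n_times) -> (n_times, n_pairs).

    Each row is the upper triangle of an RDM (1 - Pearson r between
    the sensor patterns of two stimuli) at one time point.
    """
    n_stimuli, _, n_times = data.shape
    upper = np.triu_indices(n_stimuli, k=1)
    rdms = np.empty((n_times, upper[0].size))
    for t in range(n_times):
        rdms[t] = (1.0 - np.corrcoef(data[:, :, t]))[upper]
    return rdms


def spearman(a, b):
    """Spearman correlation via ranks (no tie correction; fine for continuous data)."""
    rank_a = np.argsort(np.argsort(a))
    rank_b = np.argsort(np.argsort(b))
    return np.corrcoef(rank_a, rank_b)[0, 1]


def crossmodal_similarity(auditory, visual):
    """Correlate auditory and visual RDMs separately at every time point."""
    rdm_a = rdm_timecourse(auditory)
    rdm_v = rdm_timecourse(visual)
    return np.array([spearman(a, v) for a, v in zip(rdm_a, rdm_v)])


def simulate_modality(gain, rng, n_sensors=100, noise_sd=0.5):
    """Toy data: 8 azimuths share a spatial code, scaled over time by `gain`."""
    azimuths = np.deg2rad(np.linspace(-75, 75, 8))
    latent = np.column_stack([np.cos(azimuths), np.sin(azimuths)])
    mixing = rng.standard_normal((2, n_sensors))  # modality-specific sensor weights
    signal = latent @ mixing
    noise = noise_sd * rng.standard_normal((8, n_sensors, gain.size))
    return signal[:, :, None] * gain[None, None, :] + noise


rng = np.random.default_rng(0)
n_times, onset = 60, 30
visual_gain = np.ones(n_times)  # spatial code present throughout
auditory_gain = (np.arange(n_times) >= onset).astype(float)  # appears at `onset`
visual = simulate_modality(visual_gain, rng)
auditory = simulate_modality(auditory_gain, rng)

similarity = crossmodal_similarity(auditory, visual)
print("mean similarity before onset:", similarity[:onset].mean())
print("mean similarity after onset: ", similarity[onset:].mean())
```

In this toy setup, the similarity time course rises once the simulated auditory responses start to carry the same spatial geometry as the visual ones, which is the kind of convergence point we aim to identify in the MEG data.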

Outlook

The auditory dorsal stream could also have a parallel structure, like the ventral stream (Hickok, 2007). Some of these parallel dorsal pathways could terminate in frontal and premotor cortices, which would indicate further processing of the multimodal input by both dorsal streams.
Subsequent studies could build on our findings and develop computational models incorporating the auditory and visual dorsal streams to explore how stable, multimodal representations emerge. This could be done using Recurrent Neural Networks (RNNs), which are well suited to modeling how information is integrated over time. With such models, we could further assess the robustness and dynamics of crossmodal merging, providing deeper insights into how dorsal stream networks integrate spatial information across modalities.
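To make this idea concrete, the sketch below shows the forward pass of a very simple (Elman-style) recurrent network that receives an auditory and a visual input stream and merges them in a shared hidden state. It is untrained and purely illustrative: the class name, layer sizes, random weights, and input encoding are all assumptions for this example, not a proposed model.

```python
import numpy as np


class TwoStreamRNN:
    """Hypothetical minimal RNN merging auditory and visual inputs over time."""

    def __init__(self, n_aud, n_vis, n_hidden, n_out, seed=0):
        rng = np.random.default_rng(seed)
        scale = 1.0 / np.sqrt(n_hidden)
        self.w_aud = rng.standard_normal((n_hidden, n_aud)) * scale
        self.w_vis = rng.standard_normal((n_hidden, n_vis)) * scale
        self.w_rec = rng.standard_normal((n_hidden, n_hidden)) * scale
        self.w_out = rng.standard_normal((n_out, n_hidden)) * scale
        self.n_hidden = n_hidden

    def forward(self, aud, vis):
        """aud: (n_times, n_aud), vis: (n_times, n_vis).

        Returns hidden states (n_times, n_hidden) and outputs (n_times, n_out).
        """
        h = np.zeros(self.n_hidden)
        hidden = []
        for a_t, v_t in zip(aud, vis):
            # The shared hidden state integrates both streams and its own past.
            h = np.tanh(self.w_aud @ a_t + self.w_vis @ v_t + self.w_rec @ h)
            hidden.append(h)
        hidden = np.array(hidden)
        return hidden, hidden @ self.w_out.T


# Toy input: 50 time steps, 4 auditory and 6 visual features (assumed sizes).
rng = np.random.default_rng(1)
aud_input = rng.standard_normal((50, 4))
vis_input = rng.standard_normal((50, 6))
model = TwoStreamRNN(n_aud=4, n_vis=6, n_hidden=16, n_out=2)
hidden_states, outputs = model.forward(aud_input, vis_input)
print(hidden_states.shape, outputs.shape)
```

After training such a model on a localization task, its hidden states at each time step could be submitted to RSA in the same way as the MEG data, allowing a direct comparison between model and neural dynamics.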

Sources

Bruns, P. (2019). The Ventriloquist Illusion as a Tool to Study Multisensory Processing: An Update. Frontiers in Integrative Neuroscience, 13, 51. https://doi.org/10.3389/fnint.2019.00051

Callan, A., Callan, D., & Ando, H. (2015). An fMRI Study of the Ventriloquism Effect. Cerebral Cortex, 25(11), 4248–4258. https://doi.org/10.1093/cercor/bhu306

Cecere, R., Gross, J., Willis, A., & Thut, G. (2017). Being First Matters: Topographical Representational Similarity Analysis of ERP Signals Reveals Separate Networks for Audiovisual Temporal Binding Depending on the Leading Sense. The Journal of Neuroscience, 37(21), 5274–5287. https://doi.org/10.1523/JNEUROSCI.2926-16.2017

Devereux, B. J., Clarke, A., Marouchos, A., & Tyler, L. K. (2013). Representational Similarity Analysis Reveals Commonalities and Differences in the Semantic Processing of Words and Objects. The Journal of Neuroscience, 33(48), 18906–18916. https://doi.org/10.1523/JNEUROSCI.3809-13.2013

Hickok, G., & Poeppel, D. (2007). The cortical organization of speech processing. Nature Reviews Neuroscience, 8(5), 393–402. https://doi.org/10.1038/nrn2113

Long, B., Yu, C.-P., & Konkle, T. (2018). Mid-level visual features underlie the high-level categorical organization of the ventral stream. Proceedings of the National Academy of Sciences, 115(38). https://doi.org/10.1073/pnas.1719616115

Millet, J., Caucheteux, C., Boubenec, Y., Gramfort, A., Dunbar, E., Pallier, C., & King, J. R. (2022). Toward a realistic model of speech processing in the brain with self-supervised learning. Advances in Neural Information Processing Systems, 35, 33428–33443.

Rauschecker, J. P., & Tian, B. (2000). Mechanisms and streams for processing of “what” and “where” in auditory cortex. Proceedings of the National Academy of Sciences, 97(22), 11800–11806. https://doi.org/10.1073/pnas.97.22.11800

Rauschecker, J. P. (2011). An expanded role for the dorsal auditory pathway in sensorimotor control and integration. Hearing Research, 271(1–2), 16–25. https://doi.org/10.1016/j.heares.2010.09.001

Rezk, M., Cattoir, S., Battal, C., Occelli, V., Mattioni, S., & Collignon, O. (2020). Shared Representation of Visual and Auditory Motion Directions in the Human Middle-Temporal Cortex. Current Biology, 30(12), 2289–2299.e8. https://doi.org/10.1016/j.cub.2020.04.039

Romanski, L. M., Tian, B., Fritz, J., Mishkin, M., Goldman-Rakic, P. S., & Rauschecker, J. P. (1999). Dual streams of auditory afferents target multiple domains in the primate prefrontal cortex. Nature Neuroscience, 2(12), 1131–1136. https://doi.org/10.1038/16056

see also

Machine Learning for Cognitive Computational Neuroscience