
There’s a sudden buzz in SLA research – reports of studies on multimodal input abound, and a special issue of Studies in Second Language Acquisition (SSLA) (42, 3, 2020) is a good example. Another is the special issue of The Language Learning Journal (47, 2019). Below is a quick summary of the Introduction to the SSLA special issue. I’ve done little more than pick out bits of the text and strip them of the references, which are, of course, essential in supporting the claims made. If you don’t have access to the journal, get in touch with me for any of the articles you want to read.

MULTIMODAL INPUT
Mayer’s (2014) cognitive theory of multimedia learning states that learning is better when information is processed in spoken as well as written mode, because learners make mental connections between the aural and visual information, provided the two are presented in temporal proximity. Examples in the domain of language learning are:
- storybooks with pictures read aloud,
- audiovisual input,
- subtitled audiovisual input,
- captioned audiovisual input,
- glossed audiovisual input.
What these types of input have in common is the combination of pictorial information (static or dynamic) and verbal input (spoken and/or written). Most of these input types combine not two but three sources of input:
- pictorial information,
- written verbal information in captions, subtitles, or written text, and
- aural verbal input.
It could be argued that language learners might experience cognitive overload when engaging with both pictorial and written information in addition to aural input. However, eye-tracking research has demonstrated that language learners are able to process both pictorial and written verbal information on the condition that they are familiar with the script of the foreign language.
In addition to imagery, there are other advantages inherent in multimodal input, and audiovisual input in particular. Learners need to know fewer words to understand TV programs than to understand books. Webb and Rodgers (2009a, 2009b) put forward knowledge of the 3,000 most frequent word families plus proper nouns as sufficient to reach 95% coverage of the input. Moreover, the lexical coverage needed for TV viewing has recently been found to be lower still, so the lexical demands are not as high as those of reading (which requires knowledge of the 4,000 most frequent word families for adequate comprehension and 8,000 word families for detailed comprehension). Rodgers and Webb (2011) also established that words are repeated more often in TV programs than in reading, especially in related TV programs, which is beneficial for vocabulary learning. Another advantage is the wide availability of audiovisual input via the Internet and streaming platforms, which can easily provide language learners with large amounts of authentic input (Webb, 2015). Finally, language learners are motivated to watch L2 television, as has been well documented in surveys of learners’ engagement with the L2 outside school.

LANGUAGE LEARNING FROM MULTIMODAL INPUT
Previous research into language learning from multimodal input has focused on three main areas: comprehension, vocabulary learning and, to a lesser extent, grammar learning. A consistent finding is that audiovisual input is beneficial for comprehension, in particular when learners have access to captions. Captions assist comprehension by helping learners break the speech stream down into words, thus facilitating both listening and reading comprehension. Crucially, a unique support that multimodal input offers for learners’ comprehension is imagery. Research into audiovisual input has shown that imagery can work as a compensatory mechanism, especially for low-proficiency learners.
The bulk of research into multimodal input has focused on vocabulary learning. A seminal study on the effect of TV viewing on vocabulary learning is Neuman and Koskinen (1992), who were among the first to stress the potential of audiovisual input for vocabulary learning. It was not until 2009 that the field of SLA started to pay more attention to audiovisual input. Two key studies were the corpus studies by Webb and Rodgers (2009a, 2009b), which showed the lexical demands of different types of audiovisual input and argued that, in addition to reading, audiovisual input may be a valuable source of input for language learners. Since then, the field has witnessed a steady increase in the number of studies investigating vocabulary learning from audiovisual input. While most of this research has focused on the efficacy of captions, fewer studies have examined noncaptioned and nonsubtitled audiovisual input. Research has also moved from using short, educational clips to using full-length TV programs. Finally, in addition to studying the effectiveness of multimodal input for vocabulary learning, researchers have started to study how learners process multimodal input (e.g., their looking patterns on captions or pictures) by means of eye-tracking. Taken together, there seems to be robust evidence that language learners can indeed pick up unfamiliar words from multimodal input and that the provision of captions has the potential to increase learning gains.
Research into the potential of multimodal input has been gaining traction, but the number of studies is still limited and mainly confined to vocabulary learning. Now that research into multimodal input is starting to broaden both its focus, taking in other aspects of learning, and its research techniques, the present issue provides an up-to-date account of work in this area, with a view to including innovative work and a range of approaches.

The special issue pursues new avenues in research into multimodal input by focusing on pronunciation, perception and segmentation skills, grammar, multiword units, and comprehension. In addition, it extends previous eye-tracking research by investigating the effects of underresearched pedagogic interventions on learners’ processing of target items, target structures, and text. The studies nicely complement each other in their research methodologies and participant profiles. The special issue comprises six empirical studies and one concluding commentary, which between them cover:
- Different types of input (TV viewing with and without L1 or L2 subtitles, reading-while-listening, reading, listening);
- Different types of captioning (unenhanced, enhanced, no captioning);
- Different components of language learning (single words, formulaic sequences, comprehension, grammar, pronunciation);
- Different mediating learner- and item-related factors (e.g., working memory, prior vocabulary knowledge, frequency of occurrence);
- Different learning conditions (incidental learning, intentional learning, experimental and classroom-based) and time conditions (short video clips, full-length TV programs, extensive viewing);
- Different research tools (eye-tracking, productive and receptive vocabulary tests, comprehension tests).
I should say that I’ve left out most of what the Introduction says about grammar learning, but here’s a short extract:
Research into grammar learning through multimodal input is very scarce. More recent studies involving captions and grammar in longer treatments have provided evidence of benefits for L2 grammar development in adults, especially when captions are textually enhanced. However, results have not been equally positive for all target structures, suggesting the influence of other factors, such as the structure-specific salience of a grammar token.
This sudden surge of interest in multimodal input is obviously, in part at least, a response to the growth of online teaching forced on us by the Covid-19 pandemic. To me, it looks like a very promising development, particularly as a possible answer to the question of how to encourage inductive “chunk” learning.