Bianca Pereira
A researcher and software developer at work, a feminist by ideal, and a dancer and singer by passion.

Multimodality and the change in meaning.

Humans do not make meaning just by using their eyes, just their ears, or just their sense of touch. Instead, we make meaning by aggregating the input we receive from all our senses. In this blog post I will talk about multimodality and meaning making by sharing my experience of attending the Multimodal Tutor workshop at JTELSS2019. For my reflections on other workshops at this Summer School, check my previous post.

In Computer Science, and in Technology-Enhanced Learning in particular, the term modality refers to “a particular form of sensory perception” (a definition from the Google dictionary). For instance, image is one modality, audio is another, and so on. When we talk about multimodality, therefore, we are talking about aggregating multiple types of sensory information. In this workshop we were given a few good examples of how data in one modality can completely change the interpretation generated by data from another modality, and we learned about the work of the workshop organisers in promoting multimodality for learning analytics.

The workshop was organised as a lecture-style presentation with catchy examples and interesting demos. The first presenter was Jan Schneider, from the Open University in the Netherlands, who showed through multiple examples how adding a modality can change the interpretation given to a piece of data. In one of these examples he played an audio file and asked us which instrument was being played. By the sound of it, most people suggested it was a violin; however, when the audio was paired with a video of a person playing the instrument, we could see that it was, in fact, a saw!

If you had your eyes closed, would you say it is not a violin?

In another example, Jan showed how a video sequence from the movie Pirates of the Caribbean could receive completely different interpretations depending on the audio that accompanies it. Having motivated us with how adding a new modality can change our interpretation of data, Jan passed the stage to Daniele di Mitri.

An example of how the audio modality can change the meaning of the video modality.

Daniele di Mitri, also from the Open University in the Netherlands, presented the potential of multimodality for the development of intelligent tutoring systems. He started by introducing how the concept of multimodality applies to computers: whereas humans have a set of natural sensors (e.g. ears, eyes, mouth, skin) and use them to interpret the world around them, computers also deal with data coming from a multitude of sensors (e.g. cameras, microphones, GPS, light sensors) and can use them to capture semantic information (i.e. meaning) about a given context. When computers are able to observe, through their sensors, how humans behave while learning a given task (e.g. the types of mistakes they are making) and make sense of what they perceive, they will become better equipped to support humans in learning effectively.

To make this vision a reality, Daniele argues that, in learning analytics, it is not enough to provide computers with sensory data. Computers also need to reason about elements that are not embedded in the signals themselves, such as beliefs, emotions, cognition, and motivation. In Machine Learning terms, we need to provide computers with annotated training data, i.e. an association between a set of sensory data and interpretations of what that data means in terms of metacognition and so on. With such training data, one could train a machine learning model to predict what the behaviour of a learner demonstrates. One challenge in training machine learning algorithms on multimodal data is that there is not enough annotated multimodal data available.
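To make the idea of annotated multimodal training data concrete, here is a minimal sketch of what one such example might look like. The field names, feature values, and labels are all invented for illustration; the point is that the human annotation supplies an interpretation that is not present in the raw signals:

```python
# Hypothetical sketch: pairing multimodal sensor windows with human annotations.
# All field names, values, and labels below are made up for illustration.

def make_training_example(audio_features, motion_features, annotation):
    """Combine features from two modalities into one labelled example."""
    return {
        "features": audio_features + motion_features,  # simple concatenation (early fusion)
        "label": annotation,
    }

# Two annotated windows: the human annotator supplies the interpretation
# ("correct_posture" / "arms_relaxed"), which the raw signals alone do not carry.
examples = [
    make_training_example([0.2, 0.7], [1.0, 0.1, 0.3], "correct_posture"),
    make_training_example([0.1, 0.9], [0.2, 0.8, 0.5], "arms_relaxed"),
]

print(len(examples[0]["features"]))  # 5 features after fusing both modalities
```

A model trained on enough examples of this shape could then map fused sensor windows to interpretations of learner behaviour, which is exactly where the scarcity of annotated multimodal data bites.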

Daniele's slide on how sensorial data needs to be paired with interpretation.

To support the annotation of multimodal data, Daniele presented his Visual Inspection Tool. This tool was developed to allow humans to synchronise data from multiple modalities and perform manual annotations (i.e. provide interpretations for the data) based on the aggregated view. One example given during the workshop was the use of this tool to annotate data about students learning how to perform CPR (Cardiopulmonary Resuscitation, a life-saving procedure). Current CPR training uses dolls containing multiple sensors, so when the student performs CPR on the doll, a series of data points are collected and used to provide feedback to the learner. However, Daniele points out that this solution is limited, since it does not capture certain body postures the student needs to maintain while performing CPR (e.g. arms locked rather than arms relaxed). By synchronising a video with the sensor data coming from the doll, one can verify whether those body postures are applied, and even identify which movement performed during CPR each data point from the doll refers to. The result is multimodal data annotated by humans and ready for use in machine learning algorithms.
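The core of such synchronisation can be sketched very simply. The following is not the Visual Inspection Tool's actual code, just a hypothetical illustration of the idea: align each video frame with the nearest-in-time sensor reading from the doll, so a human can annotate the synchronised pair. Timestamps, depth values, and the label are invented:

```python
# Hypothetical sketch of synchronising two modalities by timestamp:
# for each video frame, find the nearest compression-depth reading from the doll.

def nearest_reading(frame_time, sensor_readings):
    """sensor_readings: list of (timestamp, value) pairs, in any order."""
    return min(sensor_readings, key=lambda r: abs(r[0] - frame_time))

sensor_readings = [(0.00, 4.8), (0.10, 5.2), (0.20, 4.9)]  # (seconds, depth in cm)
frame_time = 0.12  # timestamp of a video frame showing the student's arms locked

timestamp, depth = nearest_reading(frame_time, sensor_readings)
# A human annotator can now label this synchronised pair:
annotated = {"time": timestamp, "depth_cm": depth, "label": "arms_locked"}
print(annotated)
```

Once the video and sensor streams are aligned like this, the annotation attached to a frame (a posture, say) also applies to the sensor reading it was matched with.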

Daniele's slide presenting the Visual Inspection Tool

To close the workshop, Bibeg Limbu, also from the Open University in the Netherlands, presented additional applications of multimodal data in learning environments: first an application that aims to detect whether a learner has reached a state of flow while performing a given learning activity, and then an application to support students in learning calligraphy. He followed with a discussion of which modalities should be used to provide feedback to a learner: should we use one of the modalities the learner is currently engaged with, or should we provide feedback through a completely different modality? For instance, if someone is learning calligraphy through an app, their visual and motor modalities are already in use during the task. Should the feedback then be visual (in use in the task) or auditory (not currently in use)?

Daniele's slide presenting the Learning Pulse tool

This workshop was interesting to me for two reasons: first, I got to learn a bit about multimodality, and it got me thinking about whether and how I could apply it to my work; second, I got some interesting insights for my PhD by observing how Daniele has translated his PhD studies into multiple publications. Regarding the second point, I will save my comments for when I start my series of posts on my PhD practice. The most I will say now is that I got some ideas on how I could publish my Conceptual Model, and I gained confidence to publish a Literature Review.

Regarding the use of multimodality in my own work, I do not see an immediate application, but it got me thinking: how can text be used in the context of multimodality? An obvious answer that came to mind was: in the detection of irony. Irony is really hard to detect using text alone. My understanding is that we, as humans, detect irony mostly by observing discrepancies between the content of a message and changes in the speaker's voice (e.g. speaking more slowly or with a different pitch) or facial expressions (e.g. rolling the eyes). The problem with text is that it contains only the message, not the cues through which irony is transmitted. To convey irony in text, one needs to make it more or less clear, by explaining the context, that a piece of text is ironic.
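As a toy illustration of this discrepancy idea, here is a hypothetical sketch of multimodal irony detection: flag possible irony when the text's apparent sentiment disagrees with prosodic cues from the audio. The word list, thresholds, and feature names are all invented; a real system would use proper sentiment and speech-analysis models:

```python
# Hypothetical sketch: positive words delivered in a flat, slow voice
# suggest irony. Lexicon and thresholds are invented for illustration.

POSITIVE_WORDS = {"great", "love", "wonderful", "fantastic"}

def text_sentiment(words):
    """Crude lexicon-based sentiment: +1 per positive word."""
    return sum(1 for w in words if w.lower() in POSITIVE_WORDS)

def looks_ironic(words, pitch_variation, speech_rate):
    """Flag irony when positive text meets flat, slow delivery."""
    positive_text = text_sentiment(words) > 0
    flat_delivery = pitch_variation < 0.2 and speech_rate < 0.8
    return positive_text and flat_delivery

# "Oh great, another meeting", said slowly in a monotone:
print(looks_ironic(["oh", "great", "another", "meeting"], 0.1, 0.5))  # True
```

The text modality alone (the positive words) would miss the irony entirely; it is the mismatch with the audio modality that reveals it.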

It is hard to detect irony in text...

So, apart from the detection of irony, I am not sure how else text could enter the multimodality story. Any ideas? Feel free to discuss in the comments section below.
