In 2018, BuzzFeed released a video of Barack Obama (or at least it looked like Barack Obama) calling President Trump a “total and complete dipshit.” The video was what’s known as a deepfake: the product of an advanced form of algorithmic video editing that makes it possible to ventriloquize another person so that they appear to say whatever you want. The Obama video was in fact a meta-warning about deepfakes, made by writer and director Jordan Peele.
In the intervening three years, deepfake technology has only advanced. A TikTok video of Tom Cruise playing golf looks creepily close to the real thing.
Whether it’s propaganda videos on YouTube, political memes on Instagram, or manipulated viral protest photos on Twitter, multimedia posts on the internet have become a powerful influence on public opinion, yet they remain understudied and difficult to tackle.
More and more internet users now consume news on platforms centered on images and videos, such as Instagram, Snapchat, YouTube, and TikTok. Images are more likely than text to stick in our memory, a phenomenon known as the “picture superiority effect.” They may also be more likely to influence real-world actions: researchers have found that emotionally charged positive images, such as smiling faces, are more likely to influence behavior than equivalently emotional words.
Despite all of this, current counter-misinformation research focuses predominantly on text-based information, such as the accuracy of claims in a headline or the body of an article. While a number of visual misinformation studies have appeared recently, they emphasize either exploratory analysis of different misinformation types or contextual assessment of image credibility. Building AI models that proactively detect visual misinformation, however, also requires insights into user reactions and engagement statistics.
Tackling manipulated multimedia, from simple deception to advanced deepfakes
When we think of visual misinformation, the first thing that comes to mind might be deepfakes like the Barack Obama or Tom Cruise videos. Deepfakes have attracted much of the recent media and public attention. However, a considerable share of visual misinformation involves much simpler forms of deception and far less sophisticated technology.
For example, there is the use of “cheapfakes”: mis-captioned images, videos, or edits made with commercial photo- and video-editing tools. Another common technique is to reuse legitimate old images and videos and present them as evidence of recent events. The technologies and approaches for manipulating multimedia are many and varied, and far more accessible than you might think. They range from inpainting (reconstructing missing regions of an image) and image composition (merging several images) to copy-move (copying part of an image from one location to another within the same image), object removal (deleting specific image elements), face swapping, and stylistic image transformations.
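To make concrete how low the barrier is, here is a minimal sketch of a copy-move edit using only Pillow and NumPy; the file name, patch size, and coordinates are arbitrary placeholders, not a description of any real workflow.

```python
# Minimal illustration of a copy-move edit: a patch is copied from one region
# of an image and pasted elsewhere in the same image (for example, to duplicate
# or cover up an object). File names and coordinates are hypothetical.
import numpy as np
from PIL import Image

img = np.array(Image.open("original.jpg"))

size = 100                 # side length of the square patch to move
src_y, src_x = 50, 50      # where the patch is copied from
dst_y, dst_x = 200, 300    # where the patch is pasted

patch = img[src_y:src_y + size, src_x:src_x + size].copy()
img[dst_y:dst_y + size, dst_x:dst_x + size] = patch

Image.fromarray(img).save("copy_move_edit.jpg")
```

Forensic detectors often look for exactly this kind of duplicated region, but compression and small post-edits can make the duplicate much harder to spot.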
Manipulations like these present a wide range of challenges for building reliable AI to counter them: creating labeled data (data with human annotations that teach the AI to reliably discriminate between real and manipulated multimedia), training advanced AI models, adversarially testing those models for robustness, and producing reliable predictions. In addition, the dynamic nature of deepfakes, both in terms of manipulation techniques and the speed with which they can be adapted, means that AI models need to be constantly retrained and redeployed in real-world applications. AI models can be retrained and learn from past mistakes, but this strength also poses a challenge: consistent retraining requires large volumes of multimedia data and cutting-edge computing power.
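As a rough illustration of the training side of this pipeline, the sketch below fine-tunes a standard image classifier to separate real from manipulated images using PyTorch and torchvision. The folder layout, model choice, and hyperparameters are assumptions made for the example, not a description of any production system.

```python
# Illustrative sketch of a binary real-vs-manipulated image classifier.
# Assumes a hypothetical labeled dataset laid out as
#   data/train/real/*.jpg and data/train/manipulated/*.jpg
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

train_set = datasets.ImageFolder("data/train", transform=transform)
train_loader = DataLoader(train_set, batch_size=32, shuffle=True)

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, 2)  # two classes: real vs. manipulated

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

model.train()
for epoch in range(3):
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()

# In practice, a loop like this would be rerun regularly on freshly labeled
# examples as new manipulation techniques appear.
```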
Challenges in tackling memes
In the case of specific multimedia types such as memes, photo-editing and image-manipulation tools have made it significantly easier to create misleading memes, with subtle context added in the form of text overlaid on the images. The popularity of social media and closed social networks has further accelerated this trend, with enormous amounts of memetic content produced using readily accessible digital tools and software.

Automatic identification of misinformative memes at scale is challenging, however, and computationally expensive. It requires cutting-edge computing infrastructure and the implementation of state-of-the-art natural language understanding, computer vision, and multimedia analysis to accurately understand multimodal (i.e., textual and visual) artifacts.
Countering deceptive memes requires AI that can understand content the way humans do: holistically. When we look at a meme, we don’t process the words and the image independently of each other; we understand their combined meaning together. This is extremely challenging for AI that analyzes text and images separately. Advances in natural language understanding and computer vision need to be applied jointly to combine the different modalities and decipher their meaning when they are presented together. Recent technologies like ViLBERT and UNITER, which can further improve the automatic assessment of memetic content to detect harmful misinformation, are a useful step in this direction.
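As a toy illustration of combining modalities, the sketch below fuses features from a text encoder and an image encoder into a single classifier using PyTorch, torchvision, and Hugging Face Transformers. Models such as ViLBERT and UNITER learn much richer cross-modal interactions through joint attention; simple concatenation is shown here only to make the idea of multimodal fusion concrete, and the pretrained model names, dimensions, and example inputs are assumptions of the sketch.

```python
# Toy "late fusion" of text and image features for meme classification.
# Not ViLBERT/UNITER: those use cross-modal attention rather than concatenation.
import torch
import torch.nn as nn
from torchvision import models
from transformers import AutoTokenizer, AutoModel

class MemeClassifier(nn.Module):
    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.text_encoder = AutoModel.from_pretrained("bert-base-uncased")
        vision = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
        vision.fc = nn.Identity()                  # keep the 512-d image features
        self.image_encoder = vision
        self.classifier = nn.Linear(768 + 512, num_classes)

    def forward(self, input_ids, attention_mask, image):
        text_feat = self.text_encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state[:, 0]                  # [CLS] token embedding, 768-d
        img_feat = self.image_encoder(image)       # 512-d image embedding
        return self.classifier(torch.cat([text_feat, img_feat], dim=-1))

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = MemeClassifier()
tokens = tokenizer(["example meme caption"], return_tensors="pt")
dummy_image = torch.randn(1, 3, 224, 224)          # stand-in for a preprocessed meme image
logits = model(tokens["input_ids"], tokens["attention_mask"], dummy_image)
```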
The scarcity of verified multimedia
Apart from the aforementioned challenges, effective solutions for visual misinformation detection require a huge amount of labeled data that captures the annotations of content moderators, journalists, and information verification experts. This data can then be leveraged to train AI models that reliably scrutinize the credibility and veracity of visual content and curb the spread of deceptive content. However, creating such data is a mammoth task: it requires crowdsourcing strategies, reliable digital annotation tools, and data management capabilities.
In addition, the dynamic nature of multimedia content requires repeated cycles of tagging and annotation to constantly provide new examples for retraining the AI models. Performing this at scale is not an easy task; issues such as annotator bias and other inevitable inconsistencies must be balanced against the need to provide solutions that match the scale of the problem. However, advances in areas of machine learning such as few-shot learning and self-supervised learning have the potential to alleviate this challenge somewhat by reducing the amount of labeled data required. These techniques can also teach the AI more general knowledge about visual content, helping it make better, more accurate predictions.
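To give a flavor of how few-shot learning reduces the labeling burden, here is a toy sketch in the style of prototypical networks: a handful of labeled examples per class is averaged into a prototype, and new items are assigned to the nearest prototype. The encoder, class setup, and data shapes are placeholders; in practice the encoder would typically come from self-supervised pretraining on unlabeled multimedia.

```python
# Toy nearest-prototype (prototypical-network-style) few-shot classification
# with dummy data. The encoder is a placeholder for a pretrained embedding model.
import torch
import torch.nn as nn

embed = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 128))  # placeholder encoder

# 5 labeled examples for each of 2 hypothetical classes (e.g. credible vs. misleading).
support_images = torch.randn(2, 5, 3, 64, 64)
query_images = torch.randn(8, 3, 64, 64)

with torch.no_grad():
    prototypes = torch.stack(
        [embed(support_images[c]).mean(dim=0) for c in range(2)]
    )                                             # one 128-d prototype per class
    queries = embed(query_images)                 # 8 x 128 query embeddings
    distances = torch.cdist(queries, prototypes)  # 8 x 2 distance matrix
    predictions = distances.argmin(dim=1)         # nearest-prototype labels
```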
The visual misinformation landscape is rapidly evolving, and the effects have become alarmingly clear in recent months. The latest advances in AI offer great potential to tackle this problem but are challenged by the diversity of visual content types that carry misinformation and by the scarcity of labeled data. However, there are promising opportunities in combining emerging technological solutions with the knowledge of experts such as fact checkers and content moderators. Doing so will bolster the creation of reliable countermeasures to curb the spread of harmful misinformation and restore information integrity at large.
Dr. Anil Bandhakavi is Head of Data Science at Logically.