
topicnews · October 17, 2024

Cheap AI “video scraping” can now extract data from any screen recording

Video scraping is just one of many new tricks made possible now that the latest large language models (LLMs), such as Google’s Gemini and GPT-4o, are actually “multimodal” models that accept audio, video, images, and text as input. These models translate every multimedia input into tokens (chunks of data), which they use to predict which tokens should come next in a sequence.
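To make the idea of next-token prediction concrete, here is a deliberately tiny sketch. It is not how Gemini or GPT-4o work internally; it simply counts which token tends to follow which and predicts the most frequent follower. The token names and the helper functions `train_bigram` and `predict_next` are made up for illustration.

```python
from collections import Counter, defaultdict


def train_bigram(tokens):
    """Count, for each token, which tokens have followed it."""
    follows = defaultdict(Counter)
    for current, nxt in zip(tokens, tokens[1:]):
        follows[current][nxt] += 1
    return follows


def predict_next(follows, token):
    """Return the most frequent follower of `token`, or None if it was never followed."""
    if token not in follows:
        return None
    return follows[token].most_common(1)[0][0]


# Hypothetical token stream: real models use learned numeric tokens,
# not readable labels like these.
tokens = ["frame", "text", "frame", "text", "frame", "audio"]
model = train_bigram(tokens)

print(predict_next(model, "frame"))  # "text" followed "frame" twice, "audio" once
print(predict_next(model, "audio"))  # "audio" was never followed by anything
```

A production model replaces the frequency table with a neural network conditioned on the whole preceding sequence, but the interface is the same in spirit: tokens in, a guess at the next token out.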

A term like “Token Prediction Model” (TPM) may be more accurate than “LLM” these days for AI models with multimodal inputs and outputs, but no general alternative term has really caught on yet. Whatever you call it, an AI model that can process video input has interesting implications, both good and potentially bad.

Remove entry barriers

Willison is far from the first to feed video into AI models to produce interesting results (more on that below, and a 2015 article already used the term “video scraping”), but as soon as Gemini launched its video input feature, he started experimenting with it in earnest.

In February, Willison demonstrated another early use of AI video scraping on his blog: he recorded a seven-second video of the books on his bookshelves, then had Gemini 1.5 Pro extract all the book titles it saw in the video and put them into a structured, organized list.
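The final step of a workflow like this, turning a model's free-text answer into structured records, can be sketched as follows. This is an illustrative assumption rather than Willison's actual code: it presumes the model returned one title per line, possibly with bullet characters and with repeats where the same book was seen more than once. The function `parse_titles` and the placeholder titles are hypothetical.

```python
import json


def parse_titles(model_output: str) -> list[dict]:
    """Turn a one-title-per-line text response into de-duplicated records."""
    records = []
    seen = set()
    for line in model_output.splitlines():
        # Drop surrounding whitespace and any leading bullet characters.
        title = line.strip().lstrip("-*• ").strip()
        if not title:
            continue
        key = title.casefold()  # compare titles case-insensitively
        if key in seen:
            continue
        seen.add(key)
        records.append({"position": len(records) + 1, "title": title})
    return records


# Placeholder text standing in for a model response (not real output).
raw = """
- Placeholder Title One
- Placeholder Title Two
- placeholder title one
"""

records = parse_titles(raw)
print(json.dumps(records, indent=2))
```

Once the titles are in a list of records like this, they can be saved as JSON or CSV and loaded into whatever analysis tool is at hand.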

Transforming unstructured data into structured data is important to Willison because he is also a data journalist. Willison has previously developed tools for data journalists, such as the Datasette project, which allows anyone to publish data as an interactive website.

To the frustration of every data journalist, some data sources resist scraping (capturing data for analysis) because of how the data is formatted, stored, or presented. In these cases, Willison is excited about the potential of AI video scraping because it bypasses these traditional barriers to data extraction.