2.3 Automated visual analyses

Last modified by Matti Nelimarkka on 2025/04/01 17:57

Multimodal LLMs

Various open-source multimodal LLMs can be run locally with Ollama.
More experimental video LLMs are available through the transformers library, for example LLaVA-NeXT-Video.
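
As a minimal sketch of the local Ollama route, the snippet below sends one image to Ollama's REST endpoint and reads back the text answer. It assumes an Ollama server is running on its default port and that a multimodal model has already been pulled; the model name `llava` and the helper names `build_payload` and `describe_image` are illustrative choices, not something prescribed by this page.

```python
import base64
import json
import urllib.request

# Default address of a locally running Ollama server (assumption: default port).
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(image_bytes: bytes, prompt: str, model: str = "llava") -> dict:
    """Build the JSON body for a single, non-streaming image prompt."""
    return {
        "model": model,  # assumption: this model has been pulled locally
        "prompt": prompt,
        # Ollama expects images as base64-encoded strings.
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
    }


def describe_image(path: str, prompt: str = "Describe this image.") -> str:
    """Send one image file to the local server and return the model's answer."""
    with open(path, "rb") as f:
        payload = build_payload(f.read(), prompt)
    request = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]
```

Calling `describe_image("photo.jpg")` (a hypothetical file name) would return the model's free-text description. The same prompt can be reused across a whole image collection to keep the outputs comparable.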

Machine learning tools

Whisper for transcribing audio to text.
CLIP for image classification.
BLIP for generating image descriptions.
OpenCV for running YOLO object detection, scene detection, etc.
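
To illustrate how CLIP can be used for classification without training, the sketch below shows only the scoring step: an image embedding is compared with the embeddings of candidate text labels by cosine similarity, and the similarities are turned into probabilities with a softmax. The embeddings here are made-up toy vectors; in practice they would come from a CLIP model. The function names and the scale value of 100 are assumptions for illustration.

```python
import math


def normalise(vector):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def zero_shot_probabilities(image_embedding, label_embeddings, logit_scale=100.0):
    """Return a probability per candidate label for one image embedding."""
    image = normalise(image_embedding)
    logits = {}
    for label, embedding in label_embeddings.items():
        text = normalise(embedding)
        # Scaled cosine similarity between the image and the label text.
        logits[label] = logit_scale * sum(a * b for a, b in zip(image, text))
    # Softmax, subtracting the maximum for numerical stability.
    top = max(logits.values())
    exps = {label: math.exp(value - top) for label, value in logits.items()}
    total = sum(exps.values())
    return {label: value / total for label, value in exps.items()}


# Toy three-dimensional embeddings (hypothetical values, not real CLIP output).
toy_image = [0.9, 0.1, 0.0]
toy_labels = {
    "a photo of a street": [1.0, 0.0, 0.0],
    "a photo of a forest": [0.0, 1.0, 0.0],
}
probabilities = zero_shot_probabilities(toy_image, toy_labels)
best_label = max(probabilities, key=probabilities.get)
```

Because the candidate labels are ordinary text prompts, the category scheme can be changed without retraining, which is the main appeal of this approach for exploratory coding.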

Image recognition systems

Many commercial image recognition systems label image content automatically. Services include Google Vision, Azure AI Vision, and Amazon Rekognition, alongside the open-source library OpenCV. These systems take an image as input and output a set of labels for the content recognised in it.

However, image recognition systems differ in the labels they provide. The table below lists the labels returned by three services in the example given by Berg & Nelimarkka (2023) for the image above. The services clearly differ both in the number of labels and in their content. To address this, Berg & Nelimarkka developed a tool that evaluates label quality using the Cross-service Label Agreement Score (COSLAB); a desktop application is also available.

Label	Google	Azure	AWS
apartment	✓
architecture			✓
asphalt	✓
backpack			✓
bag			✓
building	✓	✓	✓
campus	✓		✓
car			✓
city	✓	✓	✓
cityscape			✓
clothing			✓
cloud	✓	✓
commercial building	✓
condo			✓
condominium	✓
downtown	✓	✓	✓
driveway	✓
evening	✓
facade	✓
footwear			✓
freeway			✓
grass	✓	✓	✓
headquarters	✓
high rise			✓
highway			✓
home	✓
house	✓
housing			✓
intersection			✓
landscape	✓
lane	✓
leisure	✓
metropolis			✓
mixed-use	✓
nature			✓
neighborhood			✓
office building			✓
outdoor		✓
outdoors			✓
park		✓	✓
parking	✓
path			✓
person			✓
plant	✓	✓	✓
public space		✓
recreation	✓
road	✓	✓	✓
road surface	✓	✓
shadow	✓
shoe			✓
sidewalk	✓	✓	✓
sky	✓	✓
street	✓	✓	✓
street light	✓
suburb	✓		✓
tar	✓
tarmac			✓
thoroughfare	✓	✓
tower block	✓
transportation			✓
tree	✓	✓	✓
urban			✓
urban design	✓
vegetation			✓
vehicle			✓
walking			✓
walkway	✓
n	37	15	39
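
The disagreement in the table can be quantified in a simple way. The sketch below takes a small subset of the labels above and computes which labels all three services share, plus a pairwise Jaccard overlap. This is plain set overlap for illustration only; it is not the COSLAB score, and the function name `jaccard` is our own.

```python
from itertools import combinations

# A subset of the labels from the table above, per service.
labels = {
    "Google": {"apartment", "building", "campus", "cloud"},
    "Azure": {"building", "cloud", "outdoor", "park"},
    "AWS": {"building", "campus", "car", "park"},
}


def jaccard(a: set, b: set) -> float:
    """Share of labels two services have in common, out of all labels either gave."""
    return len(a & b) / len(a | b)


# Labels that every service returned.
shared_by_all = set.intersection(*labels.values())

# Pairwise overlap between services.
pairwise = {
    (first, second): jaccard(labels[first], labels[second])
    for first, second in combinations(labels, 2)
}

for (first, second), score in pairwise.items():
    print(f"{first} vs {second}: {score:.2f}")
```

Exact string matching treats near-synonyms such as "outdoor" and "outdoors" or "condo" and "condominium" as different labels, so this kind of overlap understates agreement in meaning.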

Building an image classifier

It is possible to build an image classifier by manually labelling a set of images and then training a neural network (usually by fine-tuning a pretrained model) to replicate those labels. For details, see materials such as

Webb Williams, N., Casas, A., & Wilkerson, J. D. (2020). Images as Data for Social Science Research: An Introduction to Convolutional Neural Nets for Image Classification. Cambridge: Cambridge University Press.
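
Whatever model is used, it is worth checking how well it replicates the manual labels on images it was not trained on. The sketch below computes per-class precision and recall from two parallel lists; the class names and the example lists are hypothetical, and the function name `per_class_scores` is our own.

```python
from collections import Counter


def per_class_scores(manual, predicted):
    """Compare predicted labels with manual labels, class by class."""
    true_pos, false_pos, false_neg = Counter(), Counter(), Counter()
    for truth, guess in zip(manual, predicted):
        if truth == guess:
            true_pos[truth] += 1
        else:
            false_pos[guess] += 1  # the model claimed this class wrongly
            false_neg[truth] += 1  # the model missed this class
    scores = {}
    for label in set(manual) | set(predicted):
        claimed = true_pos[label] + false_pos[label]
        actual = true_pos[label] + false_neg[label]
        scores[label] = {
            "precision": true_pos[label] / claimed if claimed else 0.0,
            "recall": true_pos[label] / actual if actual else 0.0,
        }
    return scores


# Hypothetical held-out images: manual coding versus classifier output.
manual = ["protest", "protest", "portrait", "portrait", "portrait"]
predicted = ["protest", "portrait", "portrait", "portrait", "protest"]
scores = per_class_scores(manual, predicted)
```

Reporting scores per class rather than a single accuracy figure makes it visible when a rare category is systematically missed.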
