2.3 Automated visual analyses
Multimodal LLMs
- Run various multimodal open-source LLMs locally with Ollama.
- More experimental video LLMs are available through the transformers library, for example LLaVA-NeXT-Video.
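As a minimal sketch of the first option, the code below sends one image to a locally running Ollama server over its REST API. It assumes Ollama is listening on its default port (11434) and that a multimodal model has already been pulled; the model name `llava` and the helper names `build_payload` and `describe_image` are illustrative choices, not part of the original material.

```python
import base64
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(image_bytes: bytes, prompt: str, model: str = "llava") -> dict:
    """Build the JSON body for a single, non-streaming request with one image."""
    return {
        "model": model,  # assumption: any multimodal model pulled into Ollama
        "prompt": prompt,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,  # ask for one complete JSON response
    }


def describe_image(path: str, prompt: str = "Describe this image.", model: str = "llava") -> str:
    """Send the image at `path` to the local server and return the model's text."""
    with open(path, "rb") as f:
        payload = build_payload(f.read(), prompt, model)
    request = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]
```

Calling `describe_image("photo.jpg")` would then return the model's free-text description, provided the server is running and the file exists.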
Machine learning tools
- Whisper for audio-to-text transcription.
- CLIP for image classification.
- BLIP for image description generation.
- OpenCV for running YOLO object detection, scene detection, etc.
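To illustrate one of these tools, here is a hedged sketch of zero-shot image classification with CLIP through the Hugging Face `transformers` pipeline. The checkpoint `openai/clip-vit-base-patch32`, the helper name `classify_image`, and the optional `classifier` argument are assumptions made for this example; the argument exists only so that the heavy model is loaded when it is actually needed.

```python
def classify_image(image_path, candidate_labels, classifier=None):
    """Return a {label: score} mapping for one image.

    If no classifier is passed in, a CLIP zero-shot pipeline is loaded.
    This requires the `transformers` package and downloads the model
    on first use.
    """
    if classifier is None:
        from transformers import pipeline  # imported lazily: heavy dependency

        classifier = pipeline(
            "zero-shot-image-classification",
            model="openai/clip-vit-base-patch32",  # assumption: one common CLIP checkpoint
        )
    results = classifier(image_path, candidate_labels=candidate_labels)
    return {item["label"]: item["score"] for item in results}


def best_label(scores):
    """Pick the label with the highest score."""
    return max(scores, key=scores.get)
```

A call such as `best_label(classify_image("photo.jpg", ["street", "forest", "beach"]))` would return whichever of the candidate labels CLIP scores highest. The candidate labels are chosen by the analyst, so the result depends on how they are worded.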
Image recognition systems
There are many commercially available image recognition systems which automatically label image content. Services include Google Vision, Azure AI Vision, and Amazon Rekognition; OpenCV, an open-source library, can be used for similar tasks. These systems take an image as input and output a set of labels describing the content recognised in it.
However, image recognition systems differ in the labels they provide. The table below shows the labels returned by three services in the example from Berg & Nelimarkka (2023) when they analyse the image above. There are clear differences in both the number of labels and their content. To address these challenges, they developed a tool that evaluates the quality of labels using the Cross-service Label Agreement Score (COSLAB).
| Label | Google | Azure | AWS |
|---|---|---|---|
| apartment | ✓ | | |
| architecture | | | ✓ |
| asphalt | ✓ | | |
| backpack | | | ✓ |
| bag | | | ✓ |
| building | ✓ | ✓ | ✓ |
| campus | ✓ | | ✓ |
| car | | | ✓ |
| city | ✓ | ✓ | ✓ |
| cityscape | | | ✓ |
| clothing | | | ✓ |
| cloud | ✓ | ✓ | |
| commercial building | ✓ | | |
| condo | | | ✓ |
| condominium | ✓ | | |
| downtown | ✓ | ✓ | ✓ |
| driveway | ✓ | | |
| evening | ✓ | | |
| facade | ✓ | | |
| footwear | | | ✓ |
| freeway | | | ✓ |
| grass | ✓ | ✓ | ✓ |
| headquarters | ✓ | | |
| high rise | | | ✓ |
| highway | | | ✓ |
| home | ✓ | | |
| house | ✓ | | |
| housing | | | ✓ |
| intersection | | | ✓ |
| landscape | ✓ | | |
| lane | ✓ | | |
| leisure | ✓ | | |
| metropolis | | | ✓ |
| mixed-use | ✓ | | |
| nature | | | ✓ |
| neighborhood | | ✓ | |
| office building | | ✓ | |
| outdoor | | ✓ | |
| outdoors | | | ✓ |
| park | | ✓ | ✓ |
| parking | ✓ | | |
| path | | | ✓ |
| person | | | ✓ |
| plant | ✓ | ✓ | ✓ |
| public space | | ✓ | |
| recreation | ✓ | | |
| road | ✓ | ✓ | ✓ |
| road surface | ✓ | ✓ | |
| shadow | ✓ | | |
| shoe | | | ✓ |
| sidewalk | ✓ | ✓ | ✓ |
| sky | ✓ | ✓ | |
| street | ✓ | ✓ | ✓ |
| street light | ✓ | | |
| suburb | ✓ | | ✓ |
| tar | ✓ | | |
| tarmac | | | ✓ |
| thoroughfare | ✓ | ✓ | |
| tower block | ✓ | | |
| transportation | | ✓ | |
| tree | ✓ | ✓ | ✓ |
| urban | | | ✓ |
| urban design | ✓ | | |
| vegetation | | | ✓ |
| vehicle | | | ✓ |
| walking | | | ✓ |
| walkway | ✓ | | |
| n | 37 | 15 | 39 |
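The kind of disagreement visible in the table can be quantified in a rough way with exact-match set overlap. The sketch below computes pairwise Jaccard similarity between label sets. This is a deliberately simple stand-in and not the COSLAB score itself; the service names and label sets are toy values made up for illustration.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Share of labels two services have in common (0 = disjoint, 1 = identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def pairwise_overlap(labels_by_service: dict) -> dict:
    """Jaccard similarity for every pair of services, keyed by (name, name)."""
    return {
        (s1, s2): jaccard(labels_by_service[s1], labels_by_service[s2])
        for s1, s2 in combinations(sorted(labels_by_service), 2)
    }


# Illustrative toy label sets, not the actual output of any service.
labels = {
    "service_a": {"building", "city", "tree", "asphalt"},
    "service_b": {"building", "city", "tree", "outdoor"},
    "service_c": {"building", "city", "tree", "car", "person"},
}

overlap = pairwise_overlap(labels)
shared_by_all = set.intersection(*labels.values())

for pair, score in overlap.items():
    print(pair, round(score, 2))
print("Labels returned by every service:", sorted(shared_by_all))
```

Exact matching treats near-synonyms such as "condo" and "condominium" as different labels, so it understates agreement whenever services use different vocabularies for the same content.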
Building an image classifier
It is possible to build an image classifier by manually classifying different kinds of images and then training a neural network model (usually through fine-tuning) to replicate those classifications. For details, see materials such as