Understanding How MLLMs Describe Artworks

Explore the maps

Pick a painting, then click any highlighted phrase in the caption to see the image region the model drew on. For concrete objects and named figures you can overlay the SAM 3 segmentation and an Otsu-thresholded version of the map to compare.

Click a coloured phrase above ↑

TAM heatmap Otsu threshold SAM 3 mask Opacity

Abstract

Multimodal Large Language Models (MLLMs) describe artworks with remarkable fluency, yet the visual reasoning behind their outputs remains opaque. When an MLLM names a style, identifies a subject, or recognizes an iconographic symbol, does it ground each claim in the relevant region of the canvas, draw on an undifferentiated visual signal, or rely primarily on textual priors? We study this using the Token Activation Map (TAM), which produces, for each generated token, a heatmap isolating the visual evidence specific to that token from prior-context interference. Applying TAM to a curated set of paintings spanning multiple periods and genres, we analyze grounding patterns across five semantically distinct token categories: common visual objects, style descriptors, metadata, iconographic tokens, and affective expressions. We find that visual grounding varies substantially with token semantics. We further show that MLLMs attempt to identify artworks and artists, achieving higher accuracy in artist attribution than in title prediction, where hallucinations are more frequent. Finally, we compare TAM with SAM 3 open-vocabulary segmentation.

How it works

A four-stage pipeline turns a painting into a set of semantically typed activation maps.

1
Describe

We draw the 1,000 most-viewed paintings from WikiArt and prompt Qwen2-VL-2B-Instruct with “Describe the content and style of this image.” The model autoregressively writes a caption, token by token.
2
Token Activation Maps

For every generated token, TAM recovers a heatmap of the image regions it draws on — subtracting interference from earlier context tokens so the map isolates that token's own evidence.
3
Classify spans

A text LLM labels caption spans into five categories — CVO ICON STYLE AFFECT META — concrete objects, named figures, painterly style, mood, and metadata.
4
Aggregate & analyze

Per-token maps are averaged into one map per span. We measure how localized each map is, grade title/artist predictions against ground truth, and compare maps to SAM 3 segmentations.

Grounding follows meaning

Concrete objects (CVO) and named figures (ICON) localize tightly; style (STYLE) and affect (AFFECT) spread across the whole canvas.

Artist > title

The model names the artist correctly far more often than the title (≈ 82% vs. 28%). Wrong predictions lean more on text than on the image.

Maps as a free detector

Repurposed as an open-vocabulary detector, TAM gives coarse but semantically plausible regions — complementary to a dedicated segmenter like SAM 3.

Explore the maps

Abstract

How it works

Describe

Token Activation Maps

Classify spans

Aggregate & analyze

Grounding follows meaning

Artist > title

Maps as a free detector