Explainable AI · Multimodal LLMs · Art Understanding

Understanding How MLLMs Describe Artworks
Using Token Activation Maps

Nicola Fanelli¹, Pasquale De Marinis¹, Raffaele Scaringi¹, Eva Cetinic², Gennaro Vessio¹, Giovanna Castellano¹

¹University of Bari Aldo Moro, Italy  ·  ²University of Zurich, Switzerland

TL;DR

When a multimodal LLM describes a painting, where on the canvas is each word actually grounded? We trace every generated token back to the image with Token Activation Maps. Concrete subjects and named figures localize to a region; style and emotion spread diffusely across the whole canvas; metadata sits in between. Models name the artist far more reliably than the title (≈ 82% vs. 28%), and the maps hint at why. Explore it yourself below.

Explore the maps

Pick a painting, then click any highlighted phrase in the caption to see the image region the model drew on. For concrete objects and named figures you can overlay the SAM 3 segmentation and an Otsu-thresholded version of the map to compare.

 
Abstract

Multimodal Large Language Models (MLLMs) describe artworks with remarkable fluency, yet the visual reasoning behind their outputs remains opaque. When an MLLM names a style, identifies a subject, or recognizes an iconographic symbol, does it ground each claim in the relevant region of the canvas, draw on an undifferentiated visual signal, or rely primarily on textual priors? We study this using the Token Activation Map (TAM), which produces, for each generated token, a heatmap isolating the visual evidence specific to that token from prior-context interference. Applying TAM to a curated set of paintings spanning multiple periods and genres, we analyze grounding patterns across five semantically distinct token categories: common visual objects, style descriptors, metadata, iconographic tokens, and affective expressions. We find that visual grounding varies substantially with token semantics. We further show that MLLMs attempt to identify artworks and artists, achieving higher accuracy in artist attribution than in title prediction, where hallucinations are more frequent. Finally, we compare TAM with SAM 3 open-vocabulary segmentation.

How it works

A four-stage pipeline turns a painting into a set of semantically typed activation maps.

  1. Describe

    We draw the 1,000 most-viewed paintings from WikiArt and prompt Qwen2-VL-2B-Instruct with “Describe the content and style of this image.” The model writes a caption autoregressively, token by token.

  2. Token Activation Maps

    For every generated token, TAM recovers a heatmap of the image regions the token draws on. It subtracts interference from earlier context tokens, so the map isolates that token's own evidence.

  3. Classify spans

    A text LLM assigns each caption span to one of five categories: CVO (common visual objects), ICON (iconographic tokens such as named figures), STYLE (painterly style), AFFECT (mood), and META (metadata).

  4. Aggregate & analyze

    Per-token maps are averaged into one map per span. We measure how localized each map is, grade title/artist predictions against ground truth, and compare maps to SAM 3 segmentations.
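The span-level averaging in the last stage can be illustrated with a minimal sketch. It assumes each per-token map is a small grid of activation values with identical shape; the function name `average_span_map` and the 2x2 example values are hypothetical and are not taken from the released code.

```python
from typing import List

Heatmap = List[List[float]]  # rows of activation values, one per image patch


def average_span_map(token_maps: List[Heatmap]) -> Heatmap:
    """Average equally sized per-token heatmaps element-wise into one span map."""
    if not token_maps:
        raise ValueError("a span needs at least one token map")
    rows, cols = len(token_maps[0]), len(token_maps[0][0])
    for m in token_maps:
        if len(m) != rows or any(len(r) != cols for r in m):
            raise ValueError("all token maps must share the same shape")
    n = len(token_maps)
    return [
        [sum(m[r][c] for m in token_maps) / n for c in range(cols)]
        for r in range(rows)
    ]


# Hypothetical 2x2 maps for the two tokens of a span such as "red dress".
red = [[0.0, 0.2], [0.8, 1.0]]
dress = [[0.0, 0.0], [0.6, 1.0]]
span_map = average_span_map([red, dress])
```

A plain mean gives every token in the span equal weight; whether the original pipeline weights tokens differently is not stated here.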

Grounding follows meaning

Concrete objects (CVO) and named figures (ICON) localize tightly; style (STYLE) and affect (AFFECT) spread across the whole canvas.
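One way to put a number on "localized" versus "diffuse" is sketched below. The page does not say which localization measure the study uses, so the normalized-entropy score here is an assumption chosen for illustration, and `localization_score` is a hypothetical name.

```python
import math
from typing import List


def localization_score(heatmap: List[List[float]]) -> float:
    """Return 1 - normalized Shannon entropy of the map.

    A score near 1 means the activation mass sits in few cells;
    a score of 0 means it is spread uniformly over the grid.
    """
    values = [max(v, 0.0) for row in heatmap for v in row]  # clip negatives
    total = sum(values)
    if total == 0.0 or len(values) < 2:
        return 0.0  # empty or single-cell map: treat as not localized
    entropy = 0.0
    for v in values:
        if v > 0.0:
            p = v / total
            entropy -= p * math.log(p)
    return 1.0 - entropy / math.log(len(values))


# Hypothetical maps: one peaked (object-like), one flat (style-like).
peaked = [[0.0, 0.0], [0.0, 1.0]]
diffuse = [[0.25, 0.25], [0.25, 0.25]]
peaked_score = localization_score(peaked)
diffuse_score = localization_score(diffuse)
```

Under this measure a CVO or ICON span would be expected to score higher than a STYLE or AFFECT span, matching the qualitative pattern described above.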

Artist > title

The model names the artist correctly far more often than the title (≈ 82% vs. 28%). Wrong predictions lean more on text than on the image.
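Grading predictions against ground truth can be sketched as a normalized string comparison. This is an assumption: the page does not describe the grading procedure, and the example pairs below are invented for illustration and are not results from the study.

```python
import unicodedata
from typing import List, Tuple


def normalize(text: str) -> str:
    """Lowercase, strip accents and punctuation, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = "".join(ch if ch.isalnum() else " " for ch in no_accents.lower())
    return " ".join(cleaned.split())


def accuracy(pairs: List[Tuple[str, str]]) -> float:
    """Fraction of (prediction, ground truth) pairs equal after normalization."""
    if not pairs:
        raise ValueError("no predictions to grade")
    hits = sum(normalize(pred) == normalize(gold) for pred, gold in pairs)
    return hits / len(pairs)


# Hypothetical (prediction, ground truth) pairs for artist attribution.
artist_pairs = [
    ("Vincent van Gogh", "vincent van gogh"),
    ("Claude Monet", "Claude Monet"),
    ("Edouard Manet", "Édouard Manet"),
    ("Pablo Picasso", "Georges Braque"),
]
artist_acc = accuracy(artist_pairs)
```

Exact matching after normalization is strict: titles often have translations and variant spellings, so a real grader would probably need fuzzier matching or human or LLM judgment.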

Maps as a free detector

Repurposed as an open-vocabulary detector, TAM gives coarse but semantically plausible regions — complementary to a dedicated segmenter like SAM 3.
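A comparison of this kind can be sketched by binarizing a map with Otsu's method and scoring its overlap with a reference mask using intersection over union (IoU). The sketch assumes map values in [0, 1]; the reference mask is a hand-written stand-in for a segmenter output and no SAM 3 API is called. IoU is an assumed overlap measure, not necessarily the one used in the study.

```python
from typing import List


def otsu_threshold(values: List[float], bins: int = 256) -> float:
    """Return the cut in [0, 1] that maximizes between-class variance."""
    hist = [0] * bins
    for v in values:
        clipped = max(0.0, min(1.0, v))
        hist[min(int(clipped * bins), bins - 1)] += 1
    total = len(values)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_var, best_bin = -1.0, 0
    w0, sum0 = 0, 0
    for t in range(bins):
        w0 += hist[t]            # background: bins 0..t
        sum0 += t * hist[t]
        w1 = total - w0          # foreground: bins t+1..bins-1
        if w0 == 0 or w1 == 0:
            continue
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        var_between = w0 * w1 * (mean0 - mean1) ** 2
        if var_between > best_var:
            best_var, best_bin = var_between, t
    return (best_bin + 1) / bins  # upper edge of the last background bin


def binarize(heatmap: List[List[float]]) -> List[List[bool]]:
    """Mark cells at or above the Otsu threshold as foreground."""
    cut = otsu_threshold([v for row in heatmap for v in row])
    return [[v >= cut for v in row] for row in heatmap]


def iou(mask_a: List[List[bool]], mask_b: List[List[bool]]) -> float:
    """Intersection over union of two equally shaped boolean masks."""
    cells = [(a, b) for ra, rb in zip(mask_a, mask_b) for a, b in zip(ra, rb)]
    union = sum(1 for a, b in cells if a or b)
    inter = sum(1 for a, b in cells if a and b)
    return inter / union if union else 0.0


# Hypothetical 3x3 activation map and reference segmentation mask.
tam_map = [
    [0.05, 0.10, 0.05],
    [0.10, 0.90, 0.85],
    [0.05, 0.80, 0.95],
]
reference_mask = [
    [False, False, False],
    [False, True, True],
    [False, False, True],
]
tam_mask = binarize(tam_map)
overlap = iou(tam_mask, reference_mask)
```

Because activation maps are computed at patch resolution, the thresholded mask tends to be coarser than a pixel-level segmentation, which is consistent with the "coarse but semantically plausible" description above.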