Match text descriptions to regions in images

Takes images paired with captions and pinpoints the image region each caption phrase refers to. Useful for labeling training data or for accessibility workflows.

Best for: Engineers building computer-vision datasets or improving image-understanding models.


Skill file


---
name: tao-generate-image-grounding
description: "Two-step image grounding pipeline: extracts referring expressions from (image, caption) pairs and grounds them
  to pixel-space bounding boxes via a VLM. Use when the user wants to ground captions to bboxes, generate phrase-grounded
  annotations, auto-label images for grounding, or run the image_grounding pipeline. Triggers include 'image grounding',
  'phrase grounding', 'ground captions', 'auto-label image grounding', 'image_grounding'."
license: Apache-2.0
compatibility: Requires docker + nvidia-container-toolkit + at least one VLM endpoint (Gemini API key or OpenAI-compatible).
metadata:
  author: NVIDIA Corporation
  version: "0.1.0"
allowed-tools: Read Bash Write
tags:
  - image
  - grounding
  - bounding-boxes
  - auto-label
  - vlm
  - 2d-grounding
---

# Image Grounding Pipeline

Turn `(image, caption)` pairs into per-image grounded annotations: cleaned captions, referring expressions with character spans, and pixel-space bounding boxes for each expression. A single VLM (Gemini or any OpenAI-compatible endpoint) handles both steps.

## Purpose

Generate phrase-grounded training data for referring-expression and grounding models. The VLM acts as a "teacher" annotator: Step 0 extracts referring expressions from the caption while looking at the image; Step 1 returns one bbox set per expression for each image.

## Pipeline Architecture

```
Step 0: Expression extraction  → VLM cleans caption, extracts referring expressions + char spans
Step 1: Phrase grounding       → VLM returns pixel bboxes + scores per expression
```

Steps are individually selectable via `workflow.steps`. Each step writes a per-sample checkpoint to `step_<N>_*/.ckpt/<sample_id>.json` and skips already-processed records on re-run. Set `workflow.force_reprocess: true` to ignore checkpoints and reprocess from scratch.
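
The checkpoint layout above makes it straightforward to see which samples a re-run would skip. A minimal sketch, assuming the checkpoint filename stem equals the sample ID; the helper name `checkpointed_ids` is hypothetical and not part of the pipeline:

```python
from pathlib import Path


def checkpointed_ids(results_dir: str, step: int) -> set[str]:
    """Return sample IDs that already have a checkpoint for the given step."""
    ids = set()
    # Mirrors the documented layout: step_<N>_*/.ckpt/<sample_id>.json
    for ckpt in Path(results_dir).glob(f"step_{step}_*/.ckpt/*.json"):
        ids.add(ckpt.stem)  # assumption: file stem is the sample ID
    return ids
```

Comparing this set with the IDs in the input JSONL gives a rough estimate of how much work remains before deciding whether `workflow.force_reprocess` is needed.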

## Instructions

### Initial setup

When a user wants to run this pipeline, walk through these steps:

1. **Input JSONL**: Ask for the JSONL path. Each line must be one object like `{"image_path": "...", "caption": "..."}`. `image_path` can be absolute or relative.
2. **Image root**: If any `image_path` values are relative, set `data.image_root` to the directory they should resolve from.
3. **API access**: Ask the user which VLM endpoint they want to use. Present these five options and act on the choice:
   1. **Gemini** — set `vlm.backend: "gemini"`; require `GOOGLE_API_KEY` (env var or `vlm.gemini.api_key`).
   2. **NIM** (e.g. `https://inference-api.nvidia.com/v1`) — set `vlm.backend: "openai"`; collect `base_url`, `model_name`, and `api_key`.
   3. **TAO inference microservice** (self-hosted, OpenAI-compatible). Confirm whether the server is already running:
      - **Running** — collect `base_url`, `model_name`, and (optionally) `api_key`; set `vlm.backend: "openai"`.
      - **Not running** — guide the user through the `skills/applications/tao-run-inference-service` skill, which stands up a local TAO inference microservice with an OpenAI-compatible API. Before promising a specific model, check `skills/applications/tao-run-inference-service/references/service.yaml` for `valid_network_arch_config_basenames`. Once the server is up, collect `base_url`, `model_name`, and (optionally) `api_key`; set `vlm.backend: "openai"`.
   4. **vLLM** (self-hosted, OpenAI-compatible). Confirm whether the server is already running:
      - **Running** — collect `base_url`, `model_name`, and (optionally) `api_key`; set `vlm.backend: "openai"`.
      - **Not running** — follow [references/vllm_server.md](references/vllm_server.md) to install and launch a vLLM server, then collect `base_url`, `model_name`, and (optionally) `api_key`; set `vlm.backend: "openai"`.
   5. **Custom** (any other OpenAI-compatible endpoint) — set `vlm.backend: "openai"`; collect `base_url`, `model_name`, and (optionally) `api_key`.

   If the user has no endpoint and does not want to set one up, stop and help resolve API access first.
4. **Workflow steps**: Choose one of:
   - Full pipeline: `["0", "1"]`
   - Expression extraction only: `["0"]`
   - Grounding only: `["1"]`, which requires existing step-0 output at `results_dir/step_0_expression_extraction/annotations.jsonl`
5. **Resume vs fresh run**: By default, the workflow reuses checkpoints and skips completed records. To reprocess everything, set `image_grounding.workflow.force_reprocess=true`.

### Running the pipeline

The pipeline runs inside the TAO Toolkit container via the `auto_label` CLI:

```bash
auto_label generate -e /path/to/spec.yaml \
    results_dir=/results \
    image_grounding.data.input_jsonl=/data/captions.jsonl \
    image_grounding.data.image_root=/data/images \
    image_grounding.vlm.gemini.api_key=$GOOGLE_API_KEY
```

To generate a default spec, run `auto_label default_specs results_dir=/results module_name=auto_label`, then set `autolabel_type: "image_grounding"` in it. All fields support Hydra dot-notation overrides on the command line.

See [references/configuration.md](references/configuration.md) for the full YAML structure, all parameters, model/endpoint setup, and error patterns.
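
When the command is assembled from a script, the overrides can be built as a list. The sketch below uses only the override keys shown in this document; the function name `build_generate_command` is hypothetical. It leaves the API key out of the argument list on the assumption that `GOOGLE_API_KEY` is already exported in the environment.

```python
import shlex


def build_generate_command(spec_path, results_dir, input_jsonl,
                           image_root="", force_reprocess=False):
    """Assemble the `auto_label generate` call with Hydra-style overrides."""
    cmd = [
        "auto_label", "generate", "-e", spec_path,
        f"results_dir={results_dir}",
        f"image_grounding.data.input_jsonl={input_jsonl}",
    ]
    if image_root:
        # Only needed when image_path values in the JSONL are relative.
        cmd.append(f"image_grounding.data.image_root={image_root}")
    if force_reprocess:
        # Ignore per-sample checkpoints and start over.
        cmd.append("image_grounding.workflow.force_reprocess=true")
    return cmd


if __name__ == "__main__":
    cmd = build_generate_command(
        "/path/to/spec.yaml", "/results", "/data/captions.jsonl",
        image_root="/data/images",
    )
    # Print a shell-safe string for review; pass `cmd` to subprocess.run to execute.
    print(shlex.join(cmd))
```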

### Recommended pilot workflow

1. Run both steps on 5-10 images.
2. Inspect `step_0_expression_extraction/annotations.jsonl` — are `cleaned_caption` and `expressions[]` accurate? Are the right noun phrases captured?
3. Inspect `step_1_grounding/annotations.jsonl` — do the bboxes in `expressions[].instances[]` look right? Are confidence scores reasonable?
4. If quality is insufficient, switch the VLM to a stronger model (e.g. `gemini-2.5-pro`) or raise `media_resolution`/`max_output_tokens`, then re-run with `image_grounding.workflow.force_reprocess=true`.
5. Scale to the full dataset once satisfied.
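
One way to produce the pilot set for step 1 is to sample a few lines from the full input JSONL into a separate file and point `data.input_jsonl` at it. A minimal sketch; the helper name `write_pilot_subset` and the fixed seed are illustrative choices, not part of the pipeline:

```python
import random


def write_pilot_subset(input_jsonl, output_jsonl, n=10, seed=0):
    """Copy up to `n` randomly chosen records into a smaller JSONL file."""
    with open(input_jsonl, encoding="utf-8") as f:
        lines = [line for line in f if line.strip()]  # skip blank lines
    rng = random.Random(seed)  # fixed seed keeps the pilot set reproducible
    subset = rng.sample(lines, min(n, len(lines)))
    with open(output_jsonl, "w", encoding="utf-8") as f:
        for line in subset:
            f.write(line if line.endswith("\n") else line + "\n")
    return len(subset)
```

Writing the pilot results to a separate `results_dir` keeps its checkpoints from mixing with the full run.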

## Configuration

Key configuration fields (full reference in [references/configuration.md](references/configuration.md)):

| Field | Default | Description |
|-------|---------|-------------|
| `workflow.steps` | `["0","1"]` | Which pipeline steps to execute (`"0"` = expressions, `"1"` = grounding) |
| `workflow.max_workers` | `4` | Parallel threads per step (watch API rate limits) |
| `workflow.force_reprocess` | `false` | Ignore per-sample checkpoints and reprocess from scratch |
| `vlm.backend` | `"gemini"` | `"gemini"` or `"openai"` (OpenAI-compatible endpoint) |
| `data.input_jsonl` | required | Path to input JSONL with `image_path` + `caption` per line |
| `data.image_root` | `""` | Optional prefix for resolving relative `image_path` entries |

## Inputs

A single JSONL file at `data.input_jsonl`. One JSON object per line:

| Field | Required | Description |
|-------|----------|-------------|
| `image_path` | yes | Absolute path, or relative path resolved against `data.image_root` |
| `caption` | yes | Free-text caption for the image |
| `image_id` | no | Stable identifier; auto-derived from the filename if missing |
| `width`, `height` | no | Image dimensions in pixels, used for bbox clamping; default to `1920×1080` if missing |
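
Checking the input file before a run can catch malformed lines early. The sketch below follows the field rules in the table; the function name `load_records` is hypothetical, and using the filename stem as the fallback `image_id` is an assumption about how the pipeline derives it.

```python
import json
from pathlib import Path

DEFAULT_SIZE = (1920, 1080)  # fallback dimensions stated in the table above


def load_records(input_jsonl, image_root=""):
    """Parse the input JSONL, check required fields, and fill optional ones."""
    records = []
    with open(input_jsonl, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue
            rec = json.loads(line)
            for key in ("image_path", "caption"):
                if not rec.get(key):
                    raise ValueError(f"line {lineno}: missing required field {key!r}")
            path = Path(rec["image_path"])
            if not path.is_absolute():
                path = Path(image_root) / path  # resolve against data.image_root
            rec["image_path"] = str(path)
            # Assumption: the filename stem stands in for a missing image_id.
            rec.setdefault("image_id", path.stem)
            rec.setdefault("width", DEFAULT_SIZE[0])
            rec.setdefault("height", DEFAULT_SIZE[1])
            records.append(rec)
    return records
```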

## Outputs

All outputs go to `results_dir/`:

- `step_0_expression_extraction/annotations.jsonl` — per-record output enriched with `cleaned_caption` and `expressions[]` (each with `text`, `expression_id`, `char_span`, `noun_chunk`, empty `instances[]`).
- `step_1_grounding/annotations.jsonl` — same records with `expressions[].instances[]` filled in (each instance has `bbox: [x1,y1,x2,y2]` in pixel space, `score` in `[0.0, 1.0]`, and `bbox_id`).
- `results_dir/annotations.jsonl` — copy of the last step's output for convenience.
- `step_<N>_*/.ckpt/<sample_id>.json` — per-sample checkpoints used for resume.
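
For a quick review of the grounding output, the nested `expressions[].instances[]` structure can be flattened into one row per box. A minimal sketch using the field names listed above; the helper name `boxes_above` is hypothetical, and the presence of `image_id` in output records is an assumption (it is read with `.get` so a missing value becomes `None`).

```python
import json


def boxes_above(annotations_jsonl, min_score=0.0):
    """Flatten step-1 output into (image_id, text, bbox, score) rows."""
    rows = []
    with open(annotations_jsonl, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            rec = json.loads(line)
            for expr in rec.get("expressions", []):
                for inst in expr.get("instances", []):
                    score = inst.get("score", 0.0)
                    if score >= min_score:
                        # bbox is [x1, y1, x2, y2] in pixel space
                        rows.append((rec.get("image_id"), expr.get("text"),
                                     inst["bbox"], score))
    return rows
```

Raising `min_score` is a simple way to see how many boxes survive a stricter confidence cut during the pilot inspection.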

## Prerequisites

- **Container**: `nvcr.io/nvidia/tao/tao-toolkit:6.26.3-pyt`
- **API access**: At least one VLM endpoint (Gemini API key or OpenAI-compatible endpoint capable of image input)

Source

Creator's repository · nvidia/skills

License: Apache-2.0
