vtex-io-observability-and-ops

Apply when making VTEX IO services easier to observe, troubleshoot, and operate in production. Covers metrics, structured logging, failure visibility, rate-limit awareness, and production readiness checks for backend apps. Use for integration monitoring, error diagnosis, or improving the operational quality of VTEX IO services before or after release.

---
name: vtex-io-observability-and-ops
description: "Apply when making VTEX IO services easier to observe, troubleshoot, and operate in production. Covers metrics, structured logging, failure visibility, rate-limit awareness, and production readiness checks for backend apps. Use for integration monitoring, error diagnosis, or improving the operational quality of VTEX IO services before or after release."
---

# Observability & Operational Readiness

## When this skill applies

Use this skill when a VTEX IO service needs better production visibility, troubleshooting behavior, or operational safety.

- Adding metrics to important client calls or flows
- Improving logs for routes, workers, or integrations
- Surfacing failures clearly for operations and support
- Reviewing whether a service is ready for production
- Monitoring rate-limit-sensitive integrations

Do not use this skill for:
- app policy declaration
- trust-boundary modeling
- frontend analytics or browser monitoring
- route contract design by itself

## Decision rules

- Log enough structured context to debug failures, but do not log secrets or sensitive payloads.
- Use `ctx.vtex.logger` with appropriate log levels such as `info`, `warn`, and `error` instead of `console.log`, so logs are properly collected and searchable in the VTEX logging stack.
- Treat `ctx.vtex.logger` as the native platform logging mechanism. If a partner needs to forward logs to its own logging system, prefer doing that through a dedicated integration app or client instead of replacing the VTEX logger pattern inside every service.
- Use client-level metrics on important downstream calls so integration behavior is visible below the handler layer.
- Choose metric names that reflect the integration and operation, such as `partner-get-order` or `partner-sync-catalog`, so counts, latency, and error rates can be tracked over time.
- Make failures observable at the point where they happen. Do not swallow errors silently in routes, events, or workers.
- For rate-limit-sensitive APIs, combine short timeouts, backoff-aware retries, and caching of frequent reads to reduce burst pressure and avoid hitting hard limits.
- Review whether expensive or fragile flows expose enough operational signals before releasing them.
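
The rate-limit rule above can be sketched as a small retry helper. This is a minimal illustration under stated assumptions, not a VTEX platform API: `retryWithBackoff`, `backoffDelayMs`, `isRetryable`, and the `RetryableError` shape are hypothetical names, and a real service would read the status and any `Retry-After` value from its HTTP client's error object.

```typescript
// Hypothetical error shape: real HTTP clients expose status and headers differently.
interface RetryableError {
  status?: number
  retryAfterSeconds?: number
}

// Exponential backoff with a cap; an explicit Retry-After value wins when present.
function backoffDelayMs(
  attempt: number,
  retryAfterSeconds?: number,
  baseMs = 200,
  maxMs = 5000
): number {
  if (retryAfterSeconds !== undefined && retryAfterSeconds >= 0) {
    return Math.min(retryAfterSeconds * 1000, maxMs)
  }
  return Math.min(baseMs * Math.pow(2, attempt), maxMs)
}

// Retry only on 429 and 5xx responses; anything else is rethrown immediately.
function isRetryable(error: RetryableError): boolean {
  const status = error.status ?? 0
  return status === 429 || status >= 500
}

async function retryWithBackoff<T>(
  operation: () => Promise<T>,
  maxAttempts = 3,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms))
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await operation()
    } catch (error) {
      const err = (error ?? {}) as RetryableError
      const isLastAttempt = attempt >= maxAttempts - 1
      if (isLastAttempt || !isRetryable(err)) {
        throw error // the caller is expected to log and surface the failure
      }
      await sleep(backoffDelayMs(attempt, err.retryAfterSeconds))
    }
  }
}
```

Keeping `maxAttempts` small and the delay capped limits how long a handler can stay blocked, while the final `throw` keeps the failure visible to the caller.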

## Hard constraints

### Constraint: Important failures must be visible in logs, metrics, or durable state

Routes, event handlers, and workers MUST NOT hide important failures from operators.

**Why this matters**

If failures disappear silently, the service becomes impossible to diagnose under real traffic and retries.

**Detection**

If an error is caught and ignored without logging, metric emission, or explicit failure state, STOP and surface the failure.

**Correct**

```typescript
try {
  await ctx.clients.partnerApi.sendOrder(orderId)
} catch (error) {
  ctx.vtex.logger.error({
    message: 'Failed to send order to partner',
    orderId,
    account: ctx.vtex.account,
    routeId: ctx.vtex.route?.id,
  })
  throw error
}
```

**Wrong**

```typescript
try {
  await ctx.clients.partnerApi.sendOrder(orderId)
} catch (_) {
  return
}
```
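
Alongside domain context, a compact summary of the caught error usually shortens diagnosis. The following is a minimal sketch: `summarizeError` and `ErrorSummary` are hypothetical names, and the `response.status` lookup assumes an axios-style error shape that may not match every client.

```typescript
// Hypothetical helper: reduce an unknown thrown value to a few safe, loggable fields.
interface ErrorSummary {
  name: string
  message: string
  status?: number
}

function summarizeError(error: unknown): ErrorSummary {
  if (error instanceof Error) {
    // Assumption: HTTP errors may carry an axios-style `response.status`.
    const withResponse = error as Error & { response?: { status?: number } }
    const status = withResponse.response?.status
    return {
      name: error.name,
      message: error.message,
      ...(typeof status === 'number' ? { status } : {}),
    }
  }
  // Non-Error values (strings, plain objects) are still reported, never dropped.
  return { name: 'NonErrorThrown', message: String(error) }
}
```

The result can be passed as one field of the structured log, for example `error: summarizeError(error)`, instead of logging the raw error object or the full downstream response.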

### Constraint: Metrics should be attached to important integration calls

Client calls that are operationally important SHOULD include `metric` so request behavior can be tracked consistently.

**Why this matters**

Without metrics, integration failures and latency patterns are much harder to isolate from generic route behavior.

**Detection**

If a key downstream integration call has no `metric` and operations depend on it, STOP and add a meaningful metric name.

**Correct**

```typescript
return this.http.get(`/orders/${id}`, {
  metric: 'partner-get-order',
})
```

**Wrong**

```typescript
return this.http.get(`/orders/${id}`)
```
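
For context, here is a sketch of where the `metric` option sits inside a client class. To stay self-contained it declares a minimal `HttpLike` interface as a stand-in for the HTTP client that a VTEX IO client exposes as `this.http`; `HttpLike`, `RequestOptions`, `PartnerClient`, `PartnerOrder`, and the `/catalog` path are hypothetical and exist only for illustration.

```typescript
// Stand-in for the client's HTTP layer (assumption: only `metric` is modeled here).
interface RequestOptions {
  metric?: string
}

interface HttpLike {
  get<T>(url: string, options?: RequestOptions): Promise<T>
}

interface PartnerOrder {
  id: string
  status: string
}

// Hypothetical partner client: each operation carries its own metric name,
// so counts, latency, and errors can be tracked per operation.
class PartnerClient {
  private readonly http: HttpLike

  constructor(http: HttpLike) {
    this.http = http
  }

  getOrder(id: string): Promise<PartnerOrder> {
    return this.http.get<PartnerOrder>(`/orders/${encodeURIComponent(id)}`, {
      metric: 'partner-get-order',
    })
  }

  getCatalogPage(page: number): Promise<unknown> {
    return this.http.get<unknown>(`/catalog?page=${page}`, {
      metric: 'partner-sync-catalog',
    })
  }
}
```

Naming the metric per operation, rather than once per client, keeps a slow catalog sync from hiding inside the numbers for order lookups.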

### Constraint: Logs must stay useful without leaking sensitive data

Logs MUST contain enough context to debug production behavior, but MUST NOT include secrets, tokens, or unnecessarily sensitive payloads.

**Why this matters**

Operational logs are only valuable if they are safe to retain and inspect. Logging sensitive data creates security risk without making diagnosis any more reliable.

**Detection**

If a log line includes tokens, auth headers, raw personal payloads, or entire downstream responses, STOP and sanitize the log.

**Correct**

```typescript
ctx.vtex.logger.info({
  message: 'Partner sync started',
  orderId,
  account: ctx.vtex.account,
})
```

**Wrong**

```typescript
ctx.vtex.logger.info({
  message: 'Partner sync started',
  body: ctx.request.body,
  auth: ctx.request.header.authorization,
})
```
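
When some payload context does need to reach the logs, one option is to pass it through a sanitizer first. This is a minimal sketch, assuming a simple deny-list approach: `redactForLog`, `isSensitiveKey`, and the key list are hypothetical and would need to be adapted to the fields each integration actually handles.

```typescript
// Hypothetical deny-list; extend it to match the fields your integrations use.
const SENSITIVE_KEYS = ['authorization', 'cookie', 'password', 'token', 'secret', 'apikey']

function isSensitiveKey(key: string): boolean {
  const normalized = key.toLowerCase().replace(/[-_]/g, '')
  return SENSITIVE_KEYS.some((sensitive) => normalized.indexOf(sensitive) !== -1)
}

// Returns a copy that is safer to log: values under sensitive keys are masked,
// and nesting is capped so large payloads are not dumped wholesale.
function redactForLog(value: unknown, depth = 0): unknown {
  if (value === null || typeof value !== 'object') {
    return value
  }
  if (depth >= 3) {
    return '[truncated]'
  }
  if (Array.isArray(value)) {
    return value.map((item) => redactForLog(item, depth + 1))
  }
  const source = value as Record<string, unknown>
  const result: Record<string, unknown> = {}
  for (const key of Object.keys(source)) {
    result[key] = isSensitiveKey(key) ? '[redacted]' : redactForLog(source[key], depth + 1)
  }
  return result
}
```

A deny-list only catches the keys it knows about, so logging a short list of explicitly chosen fields, as in the "Correct" example above, remains the safer default.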

## Preferred pattern

Operationally healthy VTEX IO services should:

- emit metrics for important client calls so counts, latency, and error rates are visible
- log failures with enough structured context such as domain IDs, account, and `routeId`
- avoid silent error swallowing
- sanitize sensitive data before logging
- review retries, caching, and throughput with rate-limit behavior in mind

Use observability to shorten diagnosis time, not just to create more logs.
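
As an illustration of the caching item in the list above, frequent reads can be served from a short-lived in-memory cache so bursts do not all reach a rate-limited API. This is a hand-rolled sketch, not the platform's caching mechanism: `TtlCache` and its methods are hypothetical, and a real service may prefer the caching options offered by its HTTP client.

```typescript
// Minimal in-memory TTL cache (illustrative only; no size limit or eviction policy).
class TtlCache<V> {
  private readonly entries = new Map<string, { value: V; expiresAt: number }>()
  private readonly ttlMs: number
  private readonly now: () => number

  // `now` is injectable so expiry can be tested without real timers.
  constructor(ttlMs: number, now: () => number = Date.now) {
    this.ttlMs = ttlMs
    this.now = now
  }

  get(key: string): V | undefined {
    const hit = this.entries.get(key)
    if (!hit) {
      return undefined
    }
    if (hit.expiresAt <= this.now()) {
      this.entries.delete(key) // drop stale entries on read
      return undefined
    }
    return hit.value
  }

  set(key: string, value: V): void {
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs })
  }

  // Serve from cache when fresh; otherwise call the loader and store the result.
  async getOrLoad(key: string, load: () => Promise<V>): Promise<V> {
    const cached = this.get(key)
    if (cached !== undefined) {
      return cached
    }
    const value = await load()
    this.set(key, value)
    return value
  }
}
```

A loader failure propagates out of `getOrLoad` without being cached, so the error stays visible to the caller's logging and metrics.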

## Common failure modes

- Catching and ignoring errors in async flows.
- Logging too little context to diagnose production incidents.
- Logging too much sensitive data.
- Omitting metrics from important integration calls.
- Treating rate-limit failures as isolated bugs instead of operational signals.

## Review checklist

- [ ] Are important failures visible to operators?
- [ ] Do key integrations emit useful metrics?
- [ ] Are logs structured and safe?
- [ ] Are retries, caching, and rate-limit behavior considered together?
- [ ] Would someone on call be able to diagnose this flow from the available signals?

## Reference

- [Using Node Clients](https://developers.vtex.com/docs/guides/using-node-clients) - Client usage patterns relevant to metrics and retries
- [Best practices for avoiding rate-limit errors](https://developers.vtex.com/docs/guides/best-practices-for-avoiding-rate-limit-errors) - Operational guidance for stable integrations

Source

Creator's repository · vtex/skills
