paper2code-arxiv-implementation

Agent skill to convert any arxiv paper into a citation-anchored, working Python implementation with ambiguity auditing

---
name: paper2code-arxiv-implementation
description: Agent skill to convert any arxiv paper into a citation-anchored, working Python implementation with ambiguity auditing
triggers:
  - "implement this arxiv paper"
  - "turn this paper into code"
  - "convert paper to implementation"
  - "generate code from arxiv"
  - "implement the paper at this URL"
  - "reproduce this ML paper"
  - "paper2code"
  - "create implementation from research paper"
---

# paper2code — Arxiv Paper to Working Implementation

> Skill by [ara.so](https://ara.so) — Daily 2026 Skills collection.

paper2code is a Claude Code agent skill that converts any arxiv paper URL into a citation-anchored Python implementation. Every code decision references the exact paper section and equation it implements, and all gaps/ambiguities are explicitly flagged rather than silently filled in.

---

## Install

```bash
npx skills add PrathamLearnsToCode/paper2code/skills/paper2code
```

During install you'll choose:
- **Agents**: which coding agents get the skill (e.g., Claude Code)
- **Scope**: Global (recommended) or project-level
- **Method**: Symlink (recommended) or copy

Then launch your agent:

```bash
claude
```

---

## Core Commands

### Basic usage

```
/paper2code https://arxiv.org/abs/1706.03762
```

### With framework override

```
/paper2code https://arxiv.org/abs/2006.11239 --framework jax
/paper2code https://arxiv.org/abs/2006.11239 --framework pytorch   # default
/paper2code https://arxiv.org/abs/2006.11239 --framework tensorflow
```

### With mode flag

```
/paper2code 1706.03762 --mode minimal       # architecture only (default)
/paper2code 1706.03762 --mode full          # includes training loop + data pipeline
/paper2code 1706.03762 --mode educational   # extra comments + pedagogical notebook
```

### Bare arxiv ID (no URL required)

```
/paper2code 1706.03762
/paper2code 2106.09685
```
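
Both input forms point at the same paper. As a rough illustration of how a URL and a bare ID can be reduced to one canonical identifier, here is a minimal sketch. `normalize_arxiv_id` is a hypothetical helper, not part of the skill, and it assumes new-style IDs (`YYMM.NNNNN`) only; old-style IDs such as `hep-th/9901001` are not handled.

```python
import re

# Hypothetical helper for illustration; paper2code's own parsing may differ.
# Accepts a bare ID, an abs/ URL, or a pdf/ URL, with an optional version suffix.
_ARXIV_REF = re.compile(
    r"(?:https?://arxiv\.org/(?:abs|pdf)/)?(\d{4}\.\d{4,5})(?:v\d+)?(?:\.pdf)?"
)


def normalize_arxiv_id(ref: str) -> str:
    """Return the bare arxiv ID, dropping any URL prefix and version suffix."""
    match = _ARXIV_REF.fullmatch(ref.strip())
    if match is None:
        raise ValueError(f"not a recognizable arxiv reference: {ref!r}")
    return match.group(1)


print(normalize_arxiv_id("https://arxiv.org/abs/1706.03762"))
print(normalize_arxiv_id("2106.09685v2"))
```

Rejecting unrecognized input with an exception, rather than guessing, mirrors the skill's stance on not filling gaps silently.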

---

## Output Structure

Every run produces a directory named after the paper slug:

```
attention_is_all_you_need/
├── README.md                  # Paper summary + quick-start
├── REPRODUCTION_NOTES.md      # Ambiguity audit, unspecified choices, known deviations
├── requirements.txt           # Pinned dependencies
├── src/
│   ├── model.py               # Architecture — every layer cited to paper section
│   ├── loss.py                # Loss functions with equation references
│   ├── data.py                # Dataset skeleton with preprocessing TODOs
│   ├── train.py               # Training loop (full/educational mode)
│   ├── evaluate.py            # Metric computation
│   └── utils.py               # Shared utilities
├── configs/
│   └── base.yaml              # All hyperparams — each cited or flagged [UNSPECIFIED]
└── notebooks/
    └── walkthrough.ipynb      # Paper section → code → shape checks
```
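
After a run, it can help to confirm that the tree above was actually produced. The sketch below is one way to do that. The slug rule in `paper_slug` is an assumption inferred from the `attention_is_all_you_need/` example, and `EXPECTED_FILES` leaves out `train.py` and the notebook because the mode flags suggest they may depend on the mode.

```python
import re
from pathlib import Path

# Relative paths taken from the tree above (mode-dependent files omitted).
EXPECTED_FILES = [
    "README.md",
    "REPRODUCTION_NOTES.md",
    "requirements.txt",
    "src/model.py",
    "src/loss.py",
    "src/data.py",
    "src/evaluate.py",
    "src/utils.py",
    "configs/base.yaml",
]


def paper_slug(title: str) -> str:
    """Assumed slug rule: lowercase, runs of non-alphanumerics become one underscore."""
    return re.sub(r"[^a-z0-9]+", "_", title.lower()).strip("_")


def missing_files(output_dir: Path) -> list[str]:
    """Return the expected relative paths that do not exist under output_dir."""
    return [rel for rel in EXPECTED_FILES if not (output_dir / rel).is_file()]


slug = paper_slug("Attention Is All You Need")
print(slug, missing_files(Path(slug)))
```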

---

## Citation Anchoring Convention

The core value of paper2code is traceability. Every non-trivial decision is tagged:

| Tag | Meaning |
|-----|---------|
| `§X.Y` | Directly specified in section X.Y |
| `§X.Y, Eq. N` | Implements equation N from section X.Y |
| `[UNSPECIFIED]` | Paper doesn't state this — choice made with alternatives listed |
| `[PARTIALLY_SPECIFIED]` | Paper mentions it but is ambiguous — quote included |
| `[ASSUMPTION]` | Reasonable inference — reasoning explained |
| `[FROM_OFFICIAL_CODE]` | Taken from authors' official implementation |
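
Because the tags follow a fixed vocabulary, they can be tallied mechanically. The following is a minimal sketch of such a tally, not a tool shipped with the skill. `count_tags` is a hypothetical name, and the parser is deliberately naive: it only looks at `#` comments, so tags inside docstrings are ignored and a `#` inside a string literal would be miscounted.

```python
import re
from collections import Counter

# Bracketed tags from the table above, plus section anchors such as "§3.2.1".
BRACKET_TAG = re.compile(
    r"\[(UNSPECIFIED|PARTIALLY_SPECIFIED|ASSUMPTION|FROM_OFFICIAL_CODE)\]"
)
SECTION_TAG = re.compile(r"§\d+(?:\.\d+)*")


def count_tags(source: str) -> Counter:
    """Count citation tags that appear in '#' comments of a Python source string."""
    counts = Counter()
    for line in source.splitlines():
        _, sep, comment = line.partition("#")
        if not sep:
            continue  # no comment on this line
        counts.update(BRACKET_TAG.findall(comment))
        if SECTION_TAG.search(comment):
            counts["CITED"] += 1  # at most one per line, however many anchors it has
    return counts


sample = "d_k = 64  # §3.2.2\neps = 1e-5  # [UNSPECIFIED] not stated\n"
print(count_tags(sample))
```

A high `UNSPECIFIED` count relative to `CITED` is a hint that the paper leaves a lot open and that `REPRODUCTION_NOTES.md` deserves a careful read.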

### Example — model.py with citation anchors

```python
import torch
import torch.nn as nn
import math


class MultiHeadAttention(nn.Module):
    """§3.2.2 — Multi-Head Attention

    Implements the (unnumbered) equation in §3.2.2:
    MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O
    where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)
    """

    def __init__(self, d_model: int, num_heads: int, dropout: float = 0.1):
        super().__init__()
        # §3.1 — d_model = 512; §3.2.2 — h = 8 (base model, Table 3)
        assert d_model % num_heads == 0
        self.d_k = d_model // num_heads  # §3.2.2 — d_k = d_v = d_model / h = 64
        self.num_heads = num_heads

        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)  # §3.2.2 — W^O projection

        # [UNSPECIFIED] Dropout rate for attention weights not stated in §3.2
        # Using 0.1 matching the model-wide dropout (§5.4, Table 3)
        self.dropout = nn.Dropout(dropout)

    def forward(self, q, k, v, mask=None):
        batch_size = q.size(0)

        # §3.2.2 — project into h heads
        Q = self.W_q(q).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        K = self.W_k(k).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        V = self.W_v(v).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)

        # §3.2.1, Eq. 1 — Attention(Q,K,V) = softmax(QK^T / sqrt(d_k)) V
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)

        if mask is not None:
            # §3.2.3 — decoder masks future positions with -inf before softmax
            scores = scores.masked_fill(mask == 0, float('-inf'))

        attn_weights = torch.softmax(scores, dim=-1)
        attn_weights = self.dropout(attn_weights)

        out = torch.matmul(attn_weights, V)  # (batch, heads, seq, d_k)
        out = out.transpose(1, 2).contiguous().view(batch_size, -1, self.num_heads * self.d_k)
        return self.W_o(out)  # §3.2.2 — W^O output projection


class TransformerBlock(nn.Module):
    """§3.1 — Encoder/Decoder layer structure"""

    def __init__(self, d_model: int, num_heads: int, d_ff: int, dropout: float = 0.1):
        super().__init__()
        self.attention = MultiHeadAttention(d_model, num_heads, dropout)

        # §3.1 — paper specifies post-norm: LayerNorm(x + Sublayer(x)), as in Figure 1
        # [ASSUMPTION] This example uses pre-norm instead, for training stability;
        # a deviation like this must be recorded in REPRODUCTION_NOTES.md
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

        # §3.3 — FFN(x) = max(0, xW_1 + b_1)W_2 + b_2
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),  # §3.3 — "ReLU activation"
            nn.Dropout(dropout),  # [UNSPECIFIED] dropout inside the FFN is not stated in §3.3
            nn.Linear(d_ff, d_model),
        )
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # §3.1 — residual connection around each sub-layer (norm applied before the sub-layer here)
        h = self.norm1(x)  # normalize once and reuse for Q, K, V
        attn_out = self.attention(h, h, h, mask)
        x = x + self.dropout(attn_out)
        x = x + self.dropout(self.ff(self.norm2(x)))
        return x
```

### Example — configs/base.yaml with citations

```yaml
# base.yaml — All hyperparameters for attention_is_all_you_need
# Each value is either cited from the paper or flagged [UNSPECIFIED]

model:
  d_model: 512          # §3.1, Table 3 — "d_model = 512"
  num_heads: 8          # §3.2.2, Table 3 — "h = 8"
  d_ff: 2048            # §3.3, Table 3 — "d_ff = 2048"
  num_encoder_layers: 6 # §3.1, Table 3 — "N = 6"
  num_decoder_layers: 6 # §3.1, Table 3 — "N = 6"
  dropout: 0.1          # §5.4, Table 3 — "P_drop = 0.1"
  max_seq_len: 512      # [UNSPECIFIED] not stated; using 512 (common default)
                        # Alternatives: 256, 1024

training:
  batch_size: 25000     # §5.1 — "approximately 25000 source tokens and 25000 target tokens" per batch (tokens, not sentences)
  optimizer: adam       # §5.3 — "Adam optimizer"
  beta1: 0.9            # §5.3 — "β1 = 0.9"
  beta2: 0.98           # §5.3 — "β2 = 0.98"
  epsilon: 1.0e-9       # §5.3 — "ε = 10^-9"
  warmup_steps: 4000    # §5.3 — "warmup_steps = 4000"
  label_smoothing: 0.1  # §5.4 — "ε_ls = 0.1"
```
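
The rule that every hyperparameter is either cited or flagged can be checked with a few lines of text processing. The sketch below is an illustration under stated assumptions, not part of the skill: `untagged_keys` is a hypothetical name, it treats the file as plain text rather than parsing YAML, and it assumes one `key: value  # comment` pair per line with no `#` inside values.

```python
import re

# A value line looks like "  key: value"; section headers such as "model:"
# have nothing after the colon and are skipped.
VALUE_LINE = re.compile(r"^\s*[\w-]+:\s*\S")
TAGGED = re.compile(
    r"§\d|\[(?:UNSPECIFIED|PARTIALLY_SPECIFIED|ASSUMPTION|FROM_OFFICIAL_CODE)\]"
)


def untagged_keys(yaml_text: str) -> list[str]:
    """Return keys whose line carries a value but no citation or flag in its comment."""
    missing = []
    for line in yaml_text.splitlines():
        body, _, comment = line.partition("#")
        if not VALUE_LINE.match(body):
            continue  # blank line, pure comment, or section header
        if not TAGGED.search(comment):
            missing.append(body.split(":", 1)[0].strip())
    return missing


config = """\
model:
  d_model: 512      # §3.1, Table 3
  max_seq_len: 512  # [UNSPECIFIED] not stated
  seed: 42
"""
print(untagged_keys(config))
```

Here `seed` is the only key reported, since it has a value but neither a section anchor nor a bracketed flag.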

### Example — REPRODUCTION_NOTES.md structure

```markdown
# Reproduction Notes — Attention Is All You Need

## Ambiguity Audit

### SPECIFIED (high confidence)
| Choice | Value | Source |
|--------|-------|--------|
| d_model | 512 | §3.1, Table 3 |
| num_heads | 8 | §3.2.2, Table 3 |
| optimizer | Adam β1=0.9, β2=0.98 | §5.3 |

### PARTIALLY_SPECIFIED (judgment call made)
| Choice | Our Decision | Paper Quote | Alternatives |
|--------|-------------|-------------|--------------|
| Norm position | pre-norm | "LayerNorm(x + Sublayer(x))" (§3.1) and Figure 1 both describe post-norm; pre-norm is a deliberate deviation for stability | post-norm |

### UNSPECIFIED (our defaults)
| Choice | Our Default | Rationale | Alternatives |
|--------|-------------|-----------|--------------|
| LayerNorm epsilon | 1e-5 | PyTorch `nn.LayerNorm` default, as used in model.py | 1e-6, 1e-8 |
| max_seq_len | 512 | common for WMT | 256, 1024 |

## Known Deviations
- data.py provides skeleton only; WMT14 preprocessing not implemented
- No beam search decoding (§6.1 reports beam size 4; not implemented)
```
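
The audit layout above is regular enough to generate from structured records. As a sketch of that idea only, the code below renders one table per status. The `Decision` fields and the three-column layout are illustrative assumptions and do not reflect paper2code's internal schema or its exact column set.

```python
from dataclasses import dataclass


@dataclass
class Decision:
    """One implementation choice; field names are illustrative only."""
    choice: str
    value: str
    status: str  # "SPECIFIED", "PARTIALLY_SPECIFIED", or "UNSPECIFIED"
    note: str    # source citation, paper quote, or rationale


def render_audit(decisions: list[Decision]) -> str:
    """Render decisions as one markdown table per status, in a fixed order."""
    lines = ["## Ambiguity Audit"]
    for status in ("SPECIFIED", "PARTIALLY_SPECIFIED", "UNSPECIFIED"):
        rows = [d for d in decisions if d.status == status]
        if not rows:
            continue  # skip empty sections entirely
        lines += ["", f"### {status}", "| Choice | Value | Note |", "|--------|-------|------|"]
        lines += [f"| {d.choice} | {d.value} | {d.note} |" for d in rows]
    return "\n".join(lines)


print(render_audit([
    Decision("d_model", "512", "SPECIFIED", "§3.1, Table 3"),
    Decision("max_seq_len", "512", "UNSPECIFIED", "common default"),
]))
```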

---

## What paper2code Will NOT Do

Understanding limits prevents wasted debugging time:

- **Won't guarantee correctness** — matches what the paper describes; if the paper is wrong, the code is wrong
- **Won't invent details silently** — gaps are always flagged `[UNSPECIFIED]`, with the chosen default and its alternatives listed
- **Won't download datasets** — `data.py` gives a `Dataset` skeleton with instructions
- **Won't set up training infrastructure** — no distributed training, no experiment tracking
- **Won't implement baselines** — only the paper's core contribution
- **Won't reimplement standard components** — imports them or notes the dependency

---

## Common Patterns

### Pattern 1 — Implement a new architecture paper

```
/paper2code https://arxiv.org/abs/2010.11929 --mode minimal
```

Focus: `src/model.py` will contain the full architecture. Review `REPRODUCTION_NOTES.md` to understand every ambiguous choice before running.

### Pattern 2 — Reproduce a training method

```
/paper2code https://arxiv.org/abs/2006.11239 --mode full --framework pytorch
```

Focus: `src/train.py` will contain the full training loop. `configs/base.yaml` will list every hyperparameter with paper citations.

### Pattern 3 — Educational deep-dive

```
/paper2code 1706.03762 --mode educational
```

Focus: `notebooks/walkthrough.ipynb` walks through each paper section, shows corresponding code, and runs CPU-safe shape checks.

### Pattern 4 — Quick architecture prototype

```
/paper2code 2010.11929  # ViT
```

Then inspect and run:

```bash
cd vision_transformer/
pip install -r requirements.txt
python -c "
from src.model import VisionTransformer
import torch
model = VisionTransformer()  # toy config
x = torch.randn(2, 3, 224, 224)
print(model(x).shape)
"
```

---

## Troubleshooting

### Skill not triggering
- Confirm install completed: `npx skills list` should show `paper2code-arxiv-implementation`
- Use the explicit trigger: `/paper2code <url>`
- Try bare arxiv ID format: `/paper2code 1706.03762`

### Generated code has import errors
- Run `pip install -r requirements.txt` first
- Check `REPRODUCTION_NOTES.md` for noted dependencies
- Standard components (e.g., HuggingFace transformers) are imported, not reimplemented — install them separately

### "Paper not found" or fetch errors
- Confirm the arxiv ID exists: `https://arxiv.org/abs/<ID>`
- Try the full URL instead of bare ID
- Some very new papers (hours old) may not be indexed yet

### Silent assumptions in generated code
- This should not happen by design — if you find one, it's a bug
- Check `REPRODUCTION_NOTES.md` first; the assumption may be documented there
- Report via the repo issues if a gap was genuinely filled silently

### Framework-specific issues
- Default framework is PyTorch — omitting `--framework` gives PyTorch output
- JAX output requires `jax`, `flax`, `optax` — listed in `requirements.txt`
- TensorFlow output requires `tensorflow>=2.x`
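
Before running generated code, a quick check of which framework packages are importable can save a failed run. This is a minimal sketch, not something the skill provides. The JAX module names come from the list above, while `torch` and `tensorflow` are the usual import names and are assumed here; `missing_modules` is a hypothetical helper.

```python
from importlib.util import find_spec

# Import names per --framework value (see the assumptions in the text above).
FRAMEWORK_MODULES = {
    "pytorch": ["torch"],
    "jax": ["jax", "flax", "optax"],
    "tensorflow": ["tensorflow"],
}


def missing_modules(framework: str) -> list[str]:
    """Return the modules for `framework` that cannot be found in this environment."""
    try:
        modules = FRAMEWORK_MODULES[framework]
    except KeyError:
        raise ValueError(f"unknown framework: {framework!r}") from None
    # find_spec returns None for a top-level module that is not installed.
    return [name for name in modules if find_spec(name) is None]


for fw in FRAMEWORK_MODULES:
    print(fw, "missing:", missing_modules(fw))
```

An empty list means every module for that framework was found; it does not verify versions, so `requirements.txt` remains the source of truth.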

---

## Contributing

### Add a worked example
1. Run: `/paper2code https://arxiv.org/abs/XXXX.XXXXX`
2. Save output to `skills/paper2code/worked/{paper_slug}/`
3. Write `review.md` evaluating correctness, flagged ambiguities, and any mistakes
4. Submit PR

### Improve guardrails
Add patterns where the skill makes silent assumptions to `guardrails/`.

### Add domain knowledge
If papers in your subfield rely on a shared set of components, add a knowledge file to `knowledge/` (e.g., `knowledge/graph_neural_networks.md`).

---

## Resources

- **Repo**: https://github.com/PrathamLearnsToCode/paper2code
- **Worked examples**: `skills/paper2code/worked/` in the repo
- **Issues**: https://github.com/PrathamLearnsToCode/paper2code/issues
- **License**: MIT
