Local document and PDF parsing with spatial text and bounding boxes. Use for extracting text from PDFs, DOCX, Office files, and images; OCR on scans; layout-preserved JSON for RAG; batch-ingesting paper folders; or page screenshots for multimodal agents — even when the user does not name liteparse. Prefer over MarkItDown when you need bboxes, fast local parsing, or PNG page renders; prefer over the pdf skill for merge/split/forms.
---
name: liteparse
description: Local document and PDF parsing with spatial text and bounding boxes. Use for extracting text from PDFs, DOCX, Office files, and images; OCR on scans; layout-preserved JSON for RAG; batch-ingesting paper folders; or page screenshots for multimodal agents — even when the user does not name liteparse. Prefer over MarkItDown when you need bboxes, fast local parsing, or PNG page renders; prefer over the pdf skill for merge/split/forms.
license: Apache-2.0
allowed-tools: Read Write Edit Bash
compatibility: Python 3.10+. Optional LibreOffice (Office formats) and ImageMagick (images). Bundled Tesseract for OCR. All processing is local — no cloud API required.
metadata:
version: "1.0"
skill-author: K-Dense Inc.
---
# LiteParse — Local Document Parsing
## Overview
LiteParse is a fast, open-source document parser (Rust core, Python/Node bindings) focused on **local, layout-aware text extraction** with bounding boxes. It does not produce Markdown and does not call cloud LLMs. Outputs are **plain text** (layout-preserved) or **structured JSON** with per-page `text_items` (position, font metadata, optional confidence).
**Version note:** Examples target **liteparse 2.0.0** (PyPI, May 2026). The upstream V1 branch is legacy; this skill documents **V2 / main** only.
For parser selection vs MarkItDown, the `pdf` skill, or LlamaParse, see `references/choosing_a_parser.md`.
## When to Use This Skill
Use LiteParse when you need:
- **Fast local parsing** of PDFs or converted Office/image files without cloud dependencies
- **Spatial text** with bounding boxes for layout-aware RAG, citation grounding, or figure/table region logic
- **OCR** on scanned PDFs or images (bundled Tesseract, or a user-run HTTP OCR server)
- **Page screenshots** (PNG) for multimodal agents that must see charts, figures, or handwriting
- **Batch ingestion** of literature folders, supplementary PDFs, or protocol libraries
- **Page subsets** or **password-protected** PDFs
## When Not to Use
| Task | Use instead |
|------|-------------|
| Markdown for LLM ingestion (EPUB, audio, YouTube, HTML) | `markitdown` skill |
| Merge/split PDFs, forms, watermarks, rotation | `pdf` skill |
| Dense tables, handwriting, production cloud pipelines | [LlamaParse](https://docs.cloud.llamaindex.ai/llamaparse/overview) (cloud; sign up separately) |
## Installation
```bash
uv pip install "liteparse==2.0.0"
```
This installs the Python bindings and the **`lit`** CLI. Verify:
```bash
lit --help
python -c "import liteparse; print(liteparse.__version__)"
```
**Optional system tools** (for non-PDF inputs):
- **LibreOffice** — Word, Excel, PowerPoint, OpenDocument, CSV/TSV
- **ImageMagick** — PNG, JPEG, TIFF, WebP, SVG, etc.
Install commands are in `references/ocr_and_formats.md`.
**Node.js / TypeScript** (optional): `npm i @llamaindex/liteparse` — see `references/api_reference.md`.
---
## Quick Start
### Python
```python
from liteparse import LiteParse
parser = LiteParse(quiet=True)
result = parser.parse("paper.pdf")
print(result.text)
for page in result.pages:
print(f"Page {page.page_num}: {len(page.text_items)} items")
```
### CLI
```bash
# Layout-preserved text (default)
lit parse paper.pdf
# Structured JSON with bounding boxes
lit parse paper.pdf --format json -o paper.json
# Disable OCR on text-native PDFs (faster)
lit parse paper.pdf --no-ocr
```
---
## Core Workflows
### 1. Parse to layout-preserved text
Best for quick full-document text or feeding chunkers that do not need coordinates.
```python
parser = LiteParse(ocr_enabled=True, quiet=True)
result = parser.parse("document.pdf")
full_text = result.text
```
```bash
lit parse document.pdf -o output.txt
```
### 2. Parse to structured JSON (bounding boxes)
Use when building layout-aware RAG, highlighting source regions, or joining text with screenshots.
```python
import json
from liteparse import LiteParse
parser = LiteParse(output_format="json", quiet=True)
result = parser.parse("document.pdf")
# Programmatic access
for page in result.pages:
for item in page.text_items:
bbox = (item.x, item.y, item.width, item.height)
# item.text, item.confidence, item.font_name, item.font_size
```
```bash
lit parse document.pdf --format json -o document.json
```
JSON field layout: `references/output_formats.md`.
### 3. Parse specific pages
```python
parser = LiteParse(target_pages="1-5,10,15-20", quiet=True)
result = parser.parse("long_paper.pdf")
```
```bash
lit parse long_paper.pdf --target-pages "1-5,10"
```
### 4. Parse from bytes or stdin
Useful for uploads, S3 downloads, or piping remote PDFs.
```python
with open("document.pdf", "rb") as f:
result = parser.parse(f.read())
```
```bash
curl -sL https://example.com/report.pdf | lit parse -
```
### 5. Page screenshots for multimodal agents
Screenshots capture visual content that text extraction alone misses (figures, complex tables, handwriting).
```python
from pathlib import Path
parser = LiteParse(dpi=150, quiet=True)
shots = parser.screenshot("document.pdf", page_numbers=[1, 2, 3])
out = Path("screenshots")
out.mkdir(exist_ok=True)
for s in shots:
(out / f"page_{s.page_num}.png").write_bytes(s.image_bytes)
```
```bash
lit screenshot document.pdf --target-pages "1,3,5" -o ./screenshots
lit screenshot document.pdf --dpi 300 -o ./screenshots
```
Combine **JSON parse + screenshots** when an agent needs both coordinates and pixels for the same pages.
### 6. Batch-parse a directory
For large corpora, prefer the CLI (parallel OCR workers) or the bundled script.
```bash
lit batch-parse ./papers ./parsed --format json --recursive
lit batch-parse ./papers ./parsed --extension .pdf --no-ocr
```
```bash
python scripts/batch_parse_dir.py ./papers ./parsed --format json --recursive
```
See `scripts/batch_parse_dir.py` for a Python batch wrapper without network calls.
### 7. OCR configuration
OCR is **on by default**. Tesseract is bundled; no extra install for basic English OCR.
```python
parser = LiteParse(
ocr_enabled=True,
ocr_language="eng", # Tesseract codes: fra, deu, etc.
num_workers=4, # parallel OCR (default: CPU cores - 1)
dpi=150, # higher DPI → better OCR, slower
)
```
```bash
lit parse scan.pdf --ocr-language fra
lit parse scan.pdf --no-ocr
lit parse scan.pdf --ocr-server-url http://localhost:8080/ocr
```
**Offline / air-gapped:** set `TESSDATA_PREFIX` to a directory of `.traineddata` files, or pass `--tessdata-path`. Details: `references/ocr_and_formats.md`.
### 8. Encrypted PDFs
```python
parser = LiteParse(password="secret", quiet=True)
result = parser.parse("protected.pdf")
```
```bash
lit parse protected.pdf --password secret
```
### 9. Search text items by phrase
Merge adjacent items and return combined bounding boxes for a phrase (e.g. section titles).
```python
from liteparse import search_items
page = result.get_page(1)
matches = search_items(page.text_items, "Materials and Methods", case_sensitive=False)
```
---
## Multi-Format Inputs
| Category | Extensions (examples) | Requirement |
|----------|----------------------|-------------|
| PDF | `.pdf` | Native |
| Office | `.docx`, `.xlsx`, `.pptx`, `.doc`, `.odt`, … | LibreOffice |
| Images | `.png`, `.jpg`, `.tiff`, `.webp`, `.svg`, … | ImageMagick |
Files are converted to PDF internally, then parsed. If conversion tools are missing, parsing fails with an actionable error — install the dependency and retry.
---
## Performance Tips
- **`--no-ocr`** on born-digital PDFs — largest speedup
- **`target_pages`** — parse only methods/supplement sections
- **`num_workers`** — scale OCR across CPU cores
- **`max_pages`** — cap very large files (default 1000)
- **`lit batch-parse`** — directory-scale jobs with `--recursive` and `--extension`
- Lower **`dpi`** (e.g. 100) when OCR quality is already sufficient
---
## Reference Files
| File | Read when |
|------|-----------|
| `references/choosing_a_parser.md` | Unsure whether to use LiteParse, MarkItDown, pdf, or LlamaParse |
| `references/api_reference.md` | Python/TypeScript API, types, `search_items` |
| `references/cli_reference.md` | Full `lit` command flags |
| `references/output_formats.md` | JSON schema, bboxes, confidence scores |
| `references/ocr_and_formats.md` | Tesseract, HTTP OCR, LibreOffice, ImageMagick |
---
## Troubleshooting
| Issue | Fix |
|-------|-----|
| Office file fails | Install LibreOffice; ensure `soffice` is on PATH (Windows: add LibreOffice `program` dir) |
| Image fails | Install ImageMagick; verify `convert` or `magick` works |
| OCR poor quality | Increase `--dpi`; try `--ocr-language`; or HTTP OCR server |
| OCR slow | `--no-ocr` if not needed; reduce pages; increase `num_workers` |
| Air-gapped OCR | `export TESSDATA_PREFIX=/path/to/tessdata` or `--tessdata-path` |
| `ParseError` on bytes | Ensure input is valid PDF bytes (Office bytes need a file path + conversion) |
---
## Resources
- **GitHub**: https://github.com/run-llama/liteparse
- **Docs**: https://developers.llamaindex.ai/liteparse/
- **PyPI**: https://pypi.org/project/liteparse/2.0.0/
- **npm**: https://www.npmjs.com/package/@llamaindex/liteparse
- **OCR API spec**: https://github.com/run-llama/liteparse/blob/main/OCR_API_SPEC.md
Creator's repository · k-dense-ai/scientific-agent-skills
License: Apache-2.0