v0.1.0 — Apache 2.0

Index. Search. Detect.

Video intelligence toolkit for AI agents. Video Memory — ingest, search, and ask questions with RAG citations. Sentinel — detect falls, queues, and crowd events in CCTV footage.

~/project

$ av ingest meeting.mp4

{"status": "complete", "video_id": "a1b2c3", "duration_sec": 3600, "artifacts_count": 847}

$ av search "what was discussed about pricing"

{"results": [{"rank": 1, "score": 0.87, "timestamp": "00:24:15", "text": "..."}]}

$ av ask "what were the key decisions?"

{"answer": "Three key decisions were made...", "citations": [{"timestamp": "00:24:15"}]}

Give your agent video memory

Paste this to your AI agent — Claude, GPT, Cursor, or any agent with tool use.

Read https://agentic.video/skill.md and follow the instructions to set up video memory.

Paste to your agent

Copy the instruction above and paste it into any AI agent chat.

Agent sets up av

Your agent installs pixelml-av, configures a provider, and ingests video.

Search and ask

Start querying your videos. Your agent can search, ask questions, and get citations.

Need managed infrastructure? hello@pixelml.com

How it works

Ingest

ffmpeg extracts audio. Whisper transcribes. Embeddings are generated. Everything lands in a single SQLite file.
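As an illustration of the first ingest step, here is a minimal sketch of an ffmpeg invocation that pulls a mono 16 kHz WAV track out of a video. The flags are standard ffmpeg options, but the exact settings av uses are not documented here, so the sample rate, channel count, and the `ffmpeg_audio_cmd` helper are assumptions.

```python
from pathlib import Path


def ffmpeg_audio_cmd(video: str, out_dir: str = ".") -> list[str]:
    """Build an ffmpeg command that extracts mono 16 kHz WAV audio.

    Hypothetical helper: av's real flags may differ.
    """
    out = Path(out_dir) / (Path(video).stem + ".wav")
    return [
        "ffmpeg",
        "-y",            # overwrite the output file if it already exists
        "-i", video,     # input video
        "-vn",           # drop the video stream, keep audio only
        "-ac", "1",      # downmix to mono
        "-ar", "16000",  # resample to 16 kHz
        str(out),
    ]


cmd = ffmpeg_audio_cmd("meeting.mp4")
print(" ".join(cmd))
```

Pass the list to `subprocess.run(cmd, check=True)` to run it; the resulting WAV file is what a transcription model would consume.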

Search

FTS5 full-text search is the primary path. Results are reranked by cosine similarity when embeddings are available. Fast, local, no network needed.
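The two-stage idea can be sketched with Python's built-in sqlite3 module: an FTS5 keyword match first, then an optional cosine rerank of the candidates. The table name `segments`, the toy 2-D vectors, and keeping vectors in a dict are all assumptions made for illustration; av's actual schema is not shown on this page.

```python
import math
import sqlite3


def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def search(conn, query, query_vec=None, vectors=None, limit=10):
    # Stage 1: FTS5 keyword match, best BM25 rank first.
    rows = conn.execute(
        "SELECT rowid, text FROM segments WHERE segments MATCH ? "
        "ORDER BY rank LIMIT ?",
        (query, limit),
    ).fetchall()
    # Stage 2 (optional): rerank the candidates by cosine similarity.
    if query_vec is not None and vectors:
        rows.sort(
            key=lambda r: cosine(query_vec, vectors.get(r[0], [])),
            reverse=True,
        )
    return rows


conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE segments USING fts5(text)")
conn.executemany(
    "INSERT INTO segments(rowid, text) VALUES (?, ?)",
    [
        (1, "we agreed on the pricing tier"),
        (2, "pricing came up again near the end"),
        (3, "lunch plans for friday"),
    ],
)
vectors = {1: [1.0, 0.0], 2: [0.0, 1.0]}  # toy embeddings

keyword_only = search(conn, "pricing")
reranked = search(conn, "pricing", query_vec=[0.0, 1.0], vectors=vectors)
```

Without a query vector the function returns plain FTS5 results, which mirrors the claim that search still works when no embeddings exist.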

Ask

RAG Q&A over your indexed videos. Get answers with timestamped citations pointing back to the source.

architecture

video file / URL
       │
       ▼
┌─────────────────────────────────┐
│  av ingest                      │
│  ├─ ffmpeg → audio → Whisper    │
│  ├─ ffmpeg → frames → Vision    │
│  ├─ Embeddings (batch)          │
│  └─ SQLite (FTS5 + vectors)     │
└─────────────────────────────────┘
       │
       ▼
┌─────────────────────────────────┐
│  av search / av ask             │
│  ├─ FTS5 full-text match        │
│  ├─ Cosine reranking            │
│  └─ RAG Q&A with citations      │
└─────────────────────────────────┘

Sentinel — Surveillance Intelligence

Detect events using temporal reasoning over VLM observations. Built on 107 experiments across 21 vision models.

FALL

Position tracking — standing→lying transition across frames (F1=0.944)

LONG_QUEUE

Temporal persistence — queue detected in 3+ consecutive chunks (90s)

CROWD_GATHERING

Density + growth — sustained crowd or rapid person count increase

WHEELCHAIR_COMPLIANCE

Service timing — wheelchair user unattended > threshold
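Two of the rules above are simple enough to sketch. The representation is an assumption: per-frame position labels as strings and one boolean per chunk for "queue seen". With 3 chunks covering 90s, each chunk would be about 30s long. The function names are hypothetical and not part of av.

```python
def detect_fall(positions):
    """Return the index of the first standing -> lying transition, or None."""
    for i in range(1, len(positions)):
        if positions[i - 1] == "standing" and positions[i] == "lying":
            return i
    return None


def detect_long_queue(queue_flags, min_chunks=3):
    """True if a queue is seen in at least `min_chunks` consecutive chunks."""
    run = 0
    for flag in queue_flags:
        run = run + 1 if flag else 0  # any gap resets the streak
        if run >= min_chunks:
            return True
    return False


fall_at = detect_fall(["standing", "standing", "lying", "lying"])
queue_alert = detect_long_queue([True, True, False, True, True, True])
```

Requiring consecutive chunks means a single noisy observation does not raise LONG_QUEUE, while a transition check keeps FALL from firing on someone who was lying down from the first frame.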

sentinel

# Cloud (quick start — Gemini free tier)

$ export AV_API_KEY=your-gemini-key

$ av sentinel video.mp4

# Local (free, private — runs on your Mac/GPU)

$ ollama pull mistral-small3.2

$ av sentinel video.mp4 --provider ollama

# Specific alerts

$ av sentinel video.mp4 --alerts FALL,LONG_QUEUE

Provider	Setup	Cost	Speed
Gemini	`export AV_API_KEY=key`	Free tier available	~5s/chunk
OpenRouter	`export OPENROUTER_API_KEY=key`	$0.04-0.14/1M tokens	~10s/chunk
Ollama (local)	`ollama pull mistral-small3.2`	Free	~25s/chunk
OpenAI	`export AV_API_KEY=key`	$$$	~5s/chunk

Auto-detection: if no provider is specified, av tries Gemini → OpenRouter → Ollama → OpenAI, in that order.
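The fallback order amounts to "first available provider wins". A minimal sketch follows; how av decides that a provider is available (environment variables, a running Ollama server, and so on) is not described here, so `available` is a hypothetical set supplied by the caller.

```python
# Priority order as stated in the auto-detection note.
PROVIDER_ORDER = ["gemini", "openrouter", "ollama", "openai"]


def pick_provider(available, explicit=None):
    """Return the explicit provider if given, else the first available one."""
    if explicit:
        return explicit
    for name in PROVIDER_ORDER:
        if name in available:
            return name
    return None  # nothing usable was found


chosen = pick_provider({"openai", "ollama"})
```

Here `chosen` is `"ollama"`, since Ollama precedes OpenAI in the list.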

Built for agents

JSON to stdout

Structured output on stdout, progress on stderr. Agents parse one, humans read the other.
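A sketch of how an agent might consume this split: capture stdout as JSON and let stderr flow through to the terminal. The `run_av` and `top_hit` helpers are hypothetical wrappers; the sample payload is the search output shown earlier on this page.

```python
import json
import subprocess


def run_av(*args):
    """Run `av` and parse its stdout as JSON.

    stderr is not captured, so progress messages still reach the terminal.
    """
    proc = subprocess.run(
        ["av", *args], stdout=subprocess.PIPE, text=True, check=True
    )
    return json.loads(proc.stdout)


def top_hit(payload):
    """Return the best-ranked search result, or None if there are none."""
    results = payload.get("results", [])
    return min(results, key=lambda r: r["rank"]) if results else None


# Sample output copied from the search example above.
sample = json.loads(
    '{"results": [{"rank": 1, "score": 0.87, '
    '"timestamp": "00:24:15", "text": "..."}]}'
)
best = top_hit(sample)
```

With av installed, `top_hit(run_av("search", "pricing discussion"))` would do the same against a real index.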

Single SQLite file

No Postgres, no Redis, no database server to run. One file at ~/.config/av/av.db.

FTS5 primary search

Full-text search works without embeddings, so search still works offline. Cosine reranking is optional.

Provider-agnostic

OpenAI, Anthropic, Gemini. Switch providers with av config setup. One interface.

Best-effort pipeline

If a stage fails (auth, model access), the pipeline continues and warns. No hard crashes.
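The best-effort behavior can be sketched as a loop that catches each stage's failure, warns on stderr, and moves on. The stage names and the `run_pipeline` helper are illustrative assumptions, not av internals.

```python
import sys


def run_pipeline(stages, state):
    """Run (name, fn) stages in order; on failure, warn and continue."""
    warnings = []
    for name, fn in stages:
        try:
            state = fn(state)
        except Exception as exc:
            msg = f"{name} skipped: {exc}"
            print(f"warning: {msg}", file=sys.stderr)  # stdout stays clean
            warnings.append(msg)
    return state, warnings


def transcribe(s):
    return {**s, "transcript": "hello"}


def caption(s):
    # Simulates a provider that rejects the request.
    raise PermissionError("model access denied")


def embed(s):
    return {**s, "embedded": True}


state, warnings = run_pipeline(
    [("transcribe", transcribe), ("captions", caption), ("embed", embed)],
    {},
)
```

The failed captions stage leaves the earlier transcript intact and the later embedding stage still runs, so a partial index is produced instead of an aborted ingest.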

YouTube URL support

Pass a YouTube URL to av ingest. It uses yt-dlp under the hood to download the video, then indexes it like a local file.

Command reference

config

# Interactive setup wizard

$ av config setup

# Show current config

$ av config show

{"provider": "openai", "transcribe_model": "whisper-1"}

ingest

# Ingest a video file

$ av ingest video.mp4

# With frame captions

$ av ingest video.mp4 --captions

# YouTube URL

$ av ingest "https://youtu.be/..."

search

$ av search "pricing discussion"

{
  "results": [{
    "rank": 1,
    "score": 0.87,
    "timestamp": "00:24:15",
    "text": "We agreed on the $49/mo tier..."
  }]
}

ask

$ av ask "what were the key decisions?"

{
  "answer": "Three key decisions...",
  "citations": [{
    "timestamp": "00:24:15",
    "score": 0.91
  }],
  "confidence": 0.85
}

Command	Description
`av config setup`	Interactive provider setup wizard
`av config show`	Show current configuration
`av ingest <path>`	Ingest video file(s) into the index
`av search <query>`	Full-text + semantic search
`av ask <question>`	RAG Q&A with citations
`av sentinel <path>`	Detect events (FALL, LONG_QUEUE, CROWD_GATHERING, WHEELCHAIR_COMPLIANCE)
`av list`	List all indexed videos
`av info <video_id>`	Detailed video metadata
`av transcript <id>`	Output transcript (VTT/SRT/text)
`av export`	Export as JSONL/VTT/SRT
`av open <id> --at <sec>`	Open video at timestamp
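Search and ask return timestamps as `HH:MM:SS`, while `av open` takes `--at <sec>`. A small sketch of bridging the two, assuming `--at` accepts whole seconds; the helper names are hypothetical and the video ID is the sample one from the ingest output above.

```python
def timestamp_to_seconds(ts):
    """Convert an HH:MM:SS timestamp to a whole number of seconds."""
    h, m, s = (int(part) for part in ts.split(":"))
    return h * 3600 + m * 60 + s


def open_command(video_id, timestamp):
    """Build the argument list for `av open <id> --at <sec>`."""
    return ["av", "open", video_id, "--at", str(timestamp_to_seconds(timestamp))]


cmd = open_command("a1b2c3", "00:24:15")
print(" ".join(cmd))
```

For the citation at 00:24:15 this builds `av open a1b2c3 --at 1455`.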

Provider compatibility

Switch providers with av config setup. The pipeline adapts automatically.

Provider	Transcription	Vision / Chat	Embeddings
OpenAI (OAuth)	`whisper-1`	`gpt-4-1`	`text-embedding-3-small`
OpenAI (API key)	`whisper-1`	`gpt-4-1`	`text-embedding-3-small`
Anthropic	—	`claude-sonnet-4-5`	—
Gemini	—	`gemini-2.5-flash`	`text-embedding-004`

When a capability is unavailable, the pipeline skips that stage and warns. Set AV_OPENAI_API_KEY to enable transcription as a fallback when the configured provider is not OpenAI.
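The table and the fallback rule can be read as a small lookup: a missing model means the stage is skipped, unless the OpenAI fallback key covers transcription. This is a sketch of that reading only; the model names are copied from the table, and `plan_stages` and its parameter are hypothetical.

```python
# Capability matrix transcribed from the table above (None = unavailable).
CAPABILITIES = {
    "openai": {
        "transcription": "whisper-1",
        "vision": "gpt-4-1",
        "embeddings": "text-embedding-3-small",
    },
    "anthropic": {
        "transcription": None,
        "vision": "claude-sonnet-4-5",
        "embeddings": None,
    },
    "gemini": {
        "transcription": None,
        "vision": "gemini-2.5-flash",
        "embeddings": "text-embedding-004",
    },
}


def plan_stages(provider, has_openai_fallback_key=False):
    """Return (models per stage, list of stages that would be skipped)."""
    caps = dict(CAPABILITIES[provider])
    # Assumed reading of the AV_OPENAI_API_KEY fallback: transcription only.
    if caps["transcription"] is None and has_openai_fallback_key:
        caps["transcription"] = "whisper-1"
    skipped = [stage for stage, model in caps.items() if model is None]
    return caps, skipped


_, skipped_plain = plan_stages("anthropic")
_, skipped_with_key = plan_stages("anthropic", has_openai_fallback_key=True)
```

With Anthropic alone, transcription and embeddings are skipped; adding the fallback key restores transcription, which in turn makes full-text search over the transcript possible.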

Get started

Three commands to searchable video.

quickstart

# Install

$ pip install pixelml-av

# Configure your provider

$ av config setup

# Index a video

$ av ingest meeting.mp4

# Search it

$ av search "action items"

Requires Python 3.11+ and FFmpeg