Video intelligence toolkit for AI agents. Video Memory — ingest, search, and ask questions with RAG citations. Sentinel — detect falls, queues, and crowd events in CCTV footage.
Paste this to your AI agent — Claude, GPT, Cursor, or any agent with tool use.
Read https://agentic.video/skill.md and follow the instructions to set up video memory.
Copy the instruction above and paste it into any AI agent chat.
Your agent installs pixelml-av, configures a provider, and ingests video.
Start querying your videos. Your agent can search, ask questions, and get citations.
ffmpeg extracts audio. Whisper transcribes. Embeddings are generated. Everything lands in a single SQLite file.
FTS5 full-text search as primary. Cosine similarity reranking when embeddings are available. Fast, local, no network needed.
RAG Q&A over your indexed videos. Get answers with timestamped citations pointing back to the source.
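The search behaviour described above (FTS5 match first, cosine reranking only when embeddings exist) can be sketched roughly as follows. This is a minimal illustration, not av's actual implementation: the `segments` table, its columns, and the `vectors` mapping keyed by rowid are hypothetical, and it assumes the local SQLite build includes FTS5.

```python
import math
import sqlite3


def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either one is empty or all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(conn, query, query_vec=None, vectors=None, limit=10):
    """FTS5 match first; rerank by cosine similarity only when embeddings exist."""
    rows = conn.execute(
        "SELECT rowid, video_id, start_sec, text FROM segments "
        "WHERE segments MATCH ? ORDER BY rank LIMIT ?",
        (query, limit),
    ).fetchall()
    if query_vec is None or not vectors:
        return rows  # keyword-only path: no embeddings, no network
    return sorted(
        rows,
        key=lambda r: cosine(query_vec, vectors.get(r[0], [])),
        reverse=True,
    )


# Hypothetical schema: one row per transcript segment.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE VIRTUAL TABLE segments "
    "USING fts5(video_id UNINDEXED, start_sec UNINDEXED, text)"
)
conn.executemany(
    "INSERT INTO segments (video_id, start_sec, text) VALUES (?, ?, ?)",
    [
        ("v1", 12.0, "the queue at checkout keeps growing"),
        ("v1", 95.5, "a short queue forms near the entrance"),
        ("v2", 3.0, "nobody is in the lobby"),
    ],
)

vectors = {1: [0.1, 0.9], 2: [0.9, 0.1]}  # toy 2-d embeddings keyed by rowid
keyword_hits = search(conn, "queue")
reranked = search(conn, "queue", query_vec=[1.0, 0.0], vectors=vectors)
```

Each returned row carries `video_id` and `start_sec`, which is the kind of information a timestamped citation would need.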
video file / URL
│
▼
┌─────────────────────────────────┐
│ av ingest │
│ ├─ ffmpeg → audio → Whisper │
│ ├─ ffmpeg → frames → Vision │
│ ├─ Embeddings (batch) │
│ └─ SQLite (FTS5 + vectors) │
└─────────────────────────────────┘
│
▼
┌─────────────────────────────────┐
│ av search / av ask │
│ ├─ FTS5 full-text match │
│ ├─ Cosine reranking │
│ └─ RAG Q&A with citations │
└─────────────────────────────────┘
Detect events using temporal reasoning over VLM observations. Built on 107 experiments across 21 vision models.
Position tracking — standing→lying transition across frames (F1=0.944)
Temporal persistence — queue detected in 3+ consecutive chunks (90s)
Density + growth — sustained crowd or rapid person count increase
Service timing — wheelchair user unattended > threshold
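Two of the rules above, temporal persistence and the standing→lying transition, can be illustrated with a short sketch. It assumes each chunk or frame has already been reduced to a boolean or a posture label by a vision model; the function names, the posture labels, and the 30-second chunk length (inferred from "3+ consecutive chunks (90s)") are assumptions, not av's actual code.

```python
from typing import Optional, Sequence

CHUNK_SECONDS = 30  # assumption: 3 chunks = 90 s implies 30 s per chunk


def longest_run(flags: Sequence[bool]) -> int:
    """Length of the longest run of consecutive True values."""
    best = current = 0
    for flag in flags:
        current = current + 1 if flag else 0
        best = max(best, current)
    return best


def long_queue(queue_seen: Sequence[bool], min_chunks: int = 3) -> bool:
    """Temporal persistence: fire only if a queue is seen min_chunks times in a row."""
    return longest_run(queue_seen) >= min_chunks


def fall_frame(postures: Sequence[str]) -> Optional[int]:
    """Position tracking: index of the first frame where standing is followed by lying."""
    for i in range(1, len(postures)):
        if postures[i - 1] == "standing" and postures[i] == "lying":
            return i
    return None


persistent = long_queue([True, True, False, True, True, True])
flickering = long_queue([True, False, True, False, True])
fall_at = fall_frame(["standing", "standing", "lying", "lying"])
no_fall = fall_frame(["lying", "lying"])
```

Requiring a run of consecutive detections is what suppresses one-off false positives from a single chunk.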
| Provider | Setup | Cost | Speed |
|---|---|---|---|
| Gemini | export AV_API_KEY=key | Free tier available | ~5s/chunk |
| OpenRouter | export OPENROUTER_API_KEY=key | $0.04-0.14/1M tokens | ~10s/chunk |
| Ollama (local) | ollama pull mistral-small3.2 | Free | ~25s/chunk |
| OpenAI | export AV_API_KEY=key | $$$ | ~5s/chunk |
Auto-detection: if no provider is specified, av tries Gemini → OpenRouter → Ollama → OpenAI, in that order.
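The fallback order above amounts to a "first available wins" chain, sketched below. The probes are placeholders: the table lists `AV_API_KEY` for both Gemini and OpenAI and the text does not say how av tells them apart, so the key-prefix check here is purely an illustrative guess, as are the function names.

```python
import os
import shutil
from typing import Callable, List, Mapping, Optional, Tuple

Probe = Tuple[str, Callable[[], bool]]


def detect_provider(probes: List[Probe]) -> Optional[str]:
    """Return the name of the first provider whose probe succeeds, else None."""
    for name, is_available in probes:
        if is_available():
            return name
    return None


def default_probes(env: Mapping[str, str], ollama_installed: bool) -> List[Probe]:
    """Hypothetical probes in the documented order; not av's real checks."""
    key = env.get("AV_API_KEY", "")
    return [
        ("gemini", lambda: key.startswith("AIza")),  # guess: Google-style key prefix
        ("openrouter", lambda: bool(env.get("OPENROUTER_API_KEY"))),
        ("ollama", lambda: ollama_installed),
        ("openai", lambda: bool(key)),  # any other AV_API_KEY falls through to here
    ]


# Probe the real environment; the result depends on the machine.
detected = detect_provider(
    default_probes(os.environ, shutil.which("ollama") is not None)
)
```

Keeping the probes as an ordered list makes the priority explicit and easy to change.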
Structured output on stdout, progress on stderr. Agents parse one, humans read the other.
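A caller can rely on this split by capturing the two streams separately. The sketch below uses a stand-in child process rather than av itself, and it assumes the structured output is JSON, which the text does not state.

```python
import json
import subprocess
import sys

# Stand-in for a CLI that follows the convention: progress on stderr, data on stdout.
CHILD = r"""
import json, sys
print("transcribing 1/2...", file=sys.stderr)
print("transcribing 2/2...", file=sys.stderr)
print(json.dumps({"video_id": "demo", "segments": 2}))
"""

proc = subprocess.run(
    [sys.executable, "-c", CHILD],
    capture_output=True,
    text=True,
    check=True,
)

result = json.loads(proc.stdout)     # machine-readable payload only
progress = proc.stderr.splitlines()  # human-facing progress lines
```

Because progress never reaches stdout, the payload can be parsed without filtering out log lines first.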
No Postgres, no Redis, no external dependencies. One file at ~/.config/av/av.db.
Full-text search works without embeddings. Cosine reranking is optional — works offline.
OpenAI, Anthropic, Gemini. Switch providers with av config setup. One interface.
If a stage fails (auth, model access), the pipeline continues and warns. No hard crashes.
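The "continue and warn" behaviour can be sketched as a loop that turns stage failures into warnings. The stage names and the `run_pipeline` helper are hypothetical; this shows the pattern, not av's internals.

```python
import sys
from typing import Callable, Dict, List, Tuple


def run_pipeline(
    stages: List[Tuple[str, Callable[[], object]]]
) -> Tuple[Dict[str, object], List[str]]:
    """Run every stage; a failing stage is recorded as a warning, not a crash."""
    results: Dict[str, object] = {}
    warnings: List[str] = []
    for name, stage in stages:
        try:
            results[name] = stage()
        except Exception as exc:  # deliberately broad: any failure becomes a warning
            message = f"{name} skipped: {exc}"
            warnings.append(message)
            print(f"warning: {message}", file=sys.stderr)
    return results, warnings


def failing_caption() -> str:
    # Simulates an auth or model-access error in one stage.
    raise RuntimeError("model access denied")


results, warnings = run_pipeline(
    [
        ("transcribe", lambda: "hello world"),
        ("caption", failing_caption),
        ("embed", lambda: [0.1, 0.2]),
    ]
)
```

Here the caption stage fails, yet transcription and embedding results are still returned alongside the warning.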
Pass a YouTube URL to av ingest. Uses yt-dlp under the hood to download and index.
| Command | Description |
|---|---|
| av config setup | Interactive provider setup wizard |
| av config show | Show current configuration |
| av ingest <path> | Ingest video file(s) into the index |
| av search <query> | Full-text + semantic search |
| av ask <question> | RAG Q&A with citations |
| av sentinel <path> | Detect events (FALL, LONG_QUEUE, CROWD, WHEELCHAIR) |
| av list | List all indexed videos |
| av info <video_id> | Detailed video metadata |
| av transcript <id> | Output transcript (VTT/SRT/text) |
| av export | Export as JSONL/VTT/SRT |
| av open <id> --at <sec> | Open video at timestamp |
Switch providers with av config setup. The pipeline adapts automatically.
| Provider | Transcription | Vision / Chat | Embeddings |
|---|---|---|---|
| OpenAI (OAuth) | whisper-1 | gpt-4-1 | text-embedding-3-small |
| OpenAI (API key) | whisper-1 | gpt-4-1 | text-embedding-3-small |
| Anthropic | — | claude-sonnet-4-5 | — |
| Gemini | — | gemini-2.5-flash | text-embedding-004 |
When a capability is unavailable, the pipeline skips that stage and warns. Use AV_OPENAI_API_KEY as a transcription fallback for non-OpenAI providers.