Capturing sources
You capture a source by pointing Khiip at its URL:

khiipd capture https://example.com/some/article

Khiip routes the URL to the right extractor, which emits a Pydantic-typed payload, renders canonical Markdown into your vault, and preserves the raw Source-tier bytes. If you enable it, Khiip also submits the URL to the Wayback Machine as a witness (see below).
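The routing step can be pictured as a lookup from the URL's host to a source name. The sketch below is illustrative only: `HOST_TO_SOURCE`, `route`, and the source names are hypothetical and are not taken from Khiip's code, which may route on very different signals.

```python
from urllib.parse import urlparse

# Hypothetical host-to-source table; not Khiip's real registry.
HOST_TO_SOURCE = {
    "x.com": "x",
    "twitter.com": "x",
    "reddit.com": "reddit",
    "wikipedia.org": "wikipedia",
    "youtube.com": "youtube",
    "youtu.be": "youtube",
}


def route(url: str) -> str:
    """Return the (assumed) source name that should handle this URL."""
    parsed = urlparse(url)
    # A .pdf path wins over the host, since PDFs can live on any site.
    if parsed.path.lower().endswith(".pdf"):
        return "pdf"
    host = (parsed.hostname or "").lower()
    parts = host.split(".")
    # Try the host itself, then each parent domain
    # (en.wikipedia.org -> wikipedia.org -> org).
    for i in range(len(parts)):
        candidate = ".".join(parts[i:])
        if candidate in HOST_TO_SOURCE:
            return HOST_TO_SOURCE[candidate]
    # Anything unrecognized falls through to the generic web extractor.
    return "web"
```

Under these assumptions, `route("https://example.com/some/article")` falls through to the generic web path, while a Wikipedia subdomain resolves through its parent domain.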
Sources today
| Source | What it captures |
|---|---|
| X | Full quote-retweet (QRT) chains, X-Article body (block-structured), embedded media, engagement metrics, community notes. Works anonymously via fxtwitter. |
| Reddit | Post + recursive comment tree (deep “load more” branches followed credential-free) + galleries (distinct full-res images) + crosspost + removed-status preservation. Credential-free by default (old.reddit HTML); an optional Reddit app adds rate headroom + gallery dimensions/captions (see Installation). |
| Wikipedia | Structured article via the MediaWiki action API (sections + page image + canonical URL) → REST summary (fallback); references + infobox best-effort. |
| Generic web | Article body via trafilatura (primary) → readability (fallback) → OG/JSON-LD enrichment. |
| YouTube | Metadata + transcripts via yt-dlp → oEmbed + transcript-api → Data API v3 (the optional API key widens the chain). |
| PDF | Text + structure via markitdown → pdfplumber (fallback). |
Instagram, TikTok, Threads, and Bluesky are on the roadmap.
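Several rows in the table describe a primary → fallback chain. A minimal sketch of that pattern follows, assuming each extractor is a plain callable that either returns text, returns nothing, or raises. `first_success`, `primary`, and `fallback` are hypothetical stand-ins, not Khiip functions and not the real trafilatura or readability APIs.

```python
from typing import Callable, Optional, Sequence

Extractor = Callable[[str], Optional[str]]


def first_success(html: str, chain: Sequence[tuple[str, Extractor]]) -> tuple[str, str]:
    """Try each extractor in order; return (winner name, extracted text)."""
    errors = []
    for name, extractor in chain:
        try:
            result = extractor(html)
        except Exception as exc:
            # A failing extractor is recorded, then the next one gets a turn.
            errors.append(f"{name}: {exc}")
            continue
        if result:
            return name, result
        errors.append(f"{name}: empty result")
    raise RuntimeError("all extractors failed: " + "; ".join(errors))


# Stand-in extractors for illustration only.
def primary(html: str) -> Optional[str]:
    raise ValueError("could not parse")


def fallback(html: str) -> Optional[str]:
    return html.strip()


winner, text = first_success("  body  ", [("primary", primary), ("fallback", fallback)])
```

Returning the winner's name alongside the text makes it possible to record which link in the chain produced a capture.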
What lands per capture
- Canonical Markdown with YAML frontmatter under ~/khiip-vault/captures/<source>/
- A typed payload (TweetPayload, RedditPayload, WebPayload, WikiPayload, YouTubePayload, PDFPayload); see Typed payloads
- Raw Source-tier bytes preserved under your configured data_root, as insurance against upstream rot
- A Wayback witness (opt-in; off by default). archive.org’s anonymous Save-Page-Now is rate-limited and unreliable, so it’s off unless you set [archive] wayback_enabled = true. When on, it’s best-effort: the result lands in archive_urls and failures are quiet (no callout). Reliable archiving needs your own archive.org credentials (a BYO-credentials tier is planned).
Media
Media fetching walks a registry: HttpxFetcher (photos) → optionally YtDlpFetcher
(video; opt-in via [media] download_videos = true) → GalleryDlFetcher
(wide-coverage fallback). Video preservation is opt-in and off by default.
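The registry order described above depends on a single flag. A small sketch of how that order could be assembled, using the fetcher names from the text as plain strings: `build_registry` is a hypothetical helper, and the real registry presumably holds fetcher objects, not names.

```python
def build_registry(download_videos: bool = False) -> list[str]:
    """Return fetcher names in the order they would be tried."""
    registry = ["HttpxFetcher"]  # photos, always first
    if download_videos:
        # Video fetching is opt-in, so this entry is absent by default.
        registry.append("YtDlpFetcher")
    registry.append("GalleryDlFetcher")  # wide-coverage fallback, always last
    return registry
```

With the default, the video fetcher never enters the walk, so no video is downloaded by that step unless the flag is set.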
Partial success
If extraction succeeds but media or Wayback fails, the capture still lands — each sub-system reports its own status independently. See Failure handling (P-δ).
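Independent status reporting can be sketched as one status entry per sub-system, with only extraction deciding whether the capture lands. Everything here is an assumption for illustration: `CaptureReport`, `run_capture`, the step names, and the `"ok"` / `"failed: ..."` vocabulary are hypothetical and do not describe Khiip's actual report format.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class CaptureReport:
    # One entry per sub-system, e.g. {"extraction": "ok", "media": "failed: ..."}.
    statuses: dict[str, str] = field(default_factory=dict)

    @property
    def landed(self) -> bool:
        # Only extraction decides whether the capture lands.
        return self.statuses.get("extraction") == "ok"


def run_capture(steps: dict[str, Callable[[], None]]) -> CaptureReport:
    """Run every step, recording each outcome without aborting the rest."""
    report = CaptureReport()
    for name, step in steps.items():
        try:
            step()
            report.statuses[name] = "ok"
        except Exception as exc:
            report.statuses[name] = f"failed: {exc}"
    return report


def failing_media() -> None:
    raise TimeoutError("media host timed out")


report = run_capture({
    "extraction": lambda: None,
    "media": failing_media,
    "wayback": lambda: None,
})
```

In this run the media step fails, yet `report.landed` is still true because extraction succeeded.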