
Capturing sources

You capture a source by pointing Khiip at its URL:

khiipd capture https://example.com/some/article

Khiip routes the URL to the right extractor, which emits a Pydantic-typed payload, renders canonical Markdown into your vault, and preserves the raw Source-tier bytes. If you enable it, Khiip also submits the URL to the Wayback Machine as a witness (see below).
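The routing step can be pictured as a lookup from URL to extractor name. The sketch below is illustrative only: the `HOST_ROUTES` table, the `route` function, and the extractor labels are hypothetical and are not taken from Khiip's code, which may route on different signals.

```python
from urllib.parse import urlparse

# Hypothetical host-to-extractor table (an assumption, not Khiip's internals).
HOST_ROUTES = {
    "x.com": "x",
    "twitter.com": "x",
    "reddit.com": "reddit",
    "old.reddit.com": "reddit",
    "youtube.com": "youtube",
    "youtu.be": "youtube",
}


def route(url: str) -> str:
    """Pick an extractor name for a URL, falling back to generic web."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower().removeprefix("www.")
    if host in HOST_ROUTES:
        return HOST_ROUTES[host]
    if host.endswith("wikipedia.org"):
        return "wikipedia"
    if parsed.path.lower().endswith(".pdf"):
        return "pdf"
    # Anything unrecognised goes to the generic web extractor.
    return "web"


print(route("https://example.com/some/article"))
```

Under these assumptions, an unrecognised host such as example.com falls through to the generic web extractor.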

Sources today

| Source | What it captures |
|---|---|
| X | Full QRT chains, X-Article body (block-structured), embedded media, engagement metrics, community notes. Works anonymously via fxtwitter. |
| Reddit | Post + recursive comment tree (deep “load more” branches followed credential-free) + galleries (distinct full-res images) + crosspost + removed-status preservation. Credential-free by default (old.reddit HTML); an optional Reddit app adds rate headroom + gallery dimensions/captions (see Installation). |
| Wikipedia | Structured article via the MediaWiki action API (sections + page image + canonical URL) → REST summary (fallback); references + infobox best-effort. |
| Generic web | Article body via trafilatura (primary) → readability (fallback) → OG/JSON-LD enrichment. |
| YouTube | Metadata + transcripts via yt-dlp → oEmbed + transcript-api → Data API v3 (the optional API key widens the chain). |
| PDF | Text + structure via markitdown → pdfplumber (fallback). |
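The "primary → fallback" arrows in the table describe chains in which each extractor is tried in order until one yields usable content. A minimal sketch of that pattern follows. The `first_success` helper and the stand-in extractors are hypothetical; real extractors such as trafilatura or pdfplumber would take their place, and Khiip's actual chain logic may differ.

```python
from typing import Callable, Optional

# An extractor takes raw input and returns text, or None when it finds nothing.
Extractor = Callable[[str], Optional[str]]


def first_success(raw: str, chain: list[tuple[str, Extractor]]) -> tuple[str, str]:
    """Try each extractor in order; return (name, text) from the first that yields text."""
    errors = []
    for name, extract in chain:
        try:
            text = extract(raw)
        except Exception as exc:  # a failing extractor should not abort the chain
            errors.append(f"{name}: {exc}")
            continue
        if text and text.strip():
            return name, text
        errors.append(f"{name}: empty result")
    raise RuntimeError("all extractors failed: " + "; ".join(errors))


# Stand-in extractors for illustration only.
def primary(raw: str) -> Optional[str]:
    return None  # pretend the primary extractor found no article body


def fallback(raw: str) -> Optional[str]:
    return raw.replace("<p>", "").replace("</p>", "")


print(first_success("<p>hello</p>", [("primary", primary), ("fallback", fallback)]))
```

Returning the name of the extractor that succeeded makes it possible to record which link in the chain produced the content.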

Instagram, TikTok, Threads, and Bluesky are on the roadmap.

What lands per capture

  • Canonical Markdown with YAML frontmatter under ~/khiip-vault/captures/<source>/
  • A typed payload (TweetPayload, RedditPayload, WebPayload, WikiPayload, YouTubePayload, PDFPayload) — see Typed payloads
  • Raw Source-tier bytes preserved under your configured data_root, as insurance against upstream rot
  • A Wayback witness (opt-in; off by default) — archive.org’s anonymous Save-Page-Now is rate-limited and unreliable, so it’s off unless you set [archive] wayback_enabled = true. When on, it’s best-effort: the result lands in archive_urls and failures are quiet (no callout). Reliable archiving needs your own archive.org credentials (a BYO-credentials tier is planned).

Media

Media fetching walks a registry: HttpxFetcher (photos) → optionally YtDlpFetcher (video; opt-in via [media] download_videos = true) → GalleryDlFetcher (wide-coverage fallback). Video preservation is opt-in and off by default.
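One way to picture the registry walk is as an ordered list of fetcher names chosen per media item. In the sketch below, the fetcher names and the `download_videos` flag come from the text above; the `pick_fetchers` function, the `"photo"`/`"video"` kinds, and the rule that video is skipped entirely when the flag is off are assumptions made for illustration.

```python
def pick_fetchers(kind: str, download_videos: bool = False) -> list[str]:
    """Return the fetchers to try, in order, for one media item."""
    if kind == "video" and not download_videos:
        # Assumption: with video preservation off, no fetcher is tried for video.
        return []
    order = []
    if kind == "photo":
        order.append("HttpxFetcher")
    if kind == "video":
        order.append("YtDlpFetcher")
    # The wide-coverage fallback always comes last.
    order.append("GalleryDlFetcher")
    return order


print(pick_fetchers("photo"))
print(pick_fetchers("video"))
print(pick_fetchers("video", download_videos=True))
```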

Partial success

If extraction succeeds but media or Wayback fails, the capture still lands — each sub-system reports its own status independently. See Failure handling (P-δ).
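Independent status reporting could be modelled roughly as follows. This is a sketch under stated assumptions: `Status`, `CaptureResult`, and the `landed` property are hypothetical names, not Khiip's actual types.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    OK = "ok"
    FAILED = "failed"
    SKIPPED = "skipped"


@dataclass
class CaptureResult:
    """Hypothetical per-capture record: each sub-system carries its own status."""
    extraction: Status
    media: Status = Status.SKIPPED
    wayback: Status = Status.SKIPPED

    @property
    def landed(self) -> bool:
        # Only extraction decides whether the capture lands.
        return self.extraction is Status.OK


result = CaptureResult(Status.OK, media=Status.FAILED, wayback=Status.FAILED)
print(result.landed)
```

Keeping the three statuses separate means a media or Wayback failure stays visible without blocking the capture itself.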