  • tfearn, 2 hours ago
    I've been building data infrastructure for 25+ years across Goldman, Bridgewater, and Freddie Mac. The same problem exists everywhere: wiring up a new external data source takes days. You have to figure out what datasets the source even exposes, write the ingestion code, build a pipeline, wire up scheduling, and test it — before a single byte of data lands anywhere useful.

    The data discovery process that we just added to our platform (open-source) collapses that into one session.

    You describe what you want in plain English ("company earnings", "option chain data", "SEC EDGAR company filings"). The AI identifies the source, enumerates every dataset it exposes — grouped by category with parameters and auth requirements — and you select what you want. From there, Discovery generates Python tap scripts in parallel, runs them immediately as a test, self-heals on failure (up to 3 attempts), creates the pipelines, and optionally schedules everything. The whole thing drops into a Data Catalog that groups related taps and pipelines together.
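    The test-then-self-heal loop can be sketched roughly like this. This is a minimal sketch, not the platform's actual code: `run_with_self_heal` and the `repair` callback are illustrative names, standing in for whatever Discovery's AI repair step really does.

```python
import subprocess
import sys

MAX_ATTEMPTS = 3  # per the post: self-heal retries a failing tap up to 3 times

def run_with_self_heal(script_path, repair, attempts=MAX_ATTEMPTS):
    """Run a generated tap script as a test; on failure, hand the error
    to a repair callback (e.g. the model patching the script) and retry."""
    for _ in range(attempts):
        result = subprocess.run([sys.executable, script_path],
                                capture_output=True, text=True)
        if result.returncode == 0:
            return True   # test run passed; the pipeline can be created
        repair(script_path, result.stderr)  # let the repair step rewrite the tap
    return False          # gave up after the attempt budget
```

    The point of running the script immediately is that failures surface (and get patched) before any pipeline or schedule is created around a broken tap.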

    The artifact isn't a one-time wizard output — it creates real, editable tap scripts and pipeline configs you can modify afterward. Parameters can be sourced from a table you already have ingested, a file upload, or an AI-generated list ("give me S&P 500 tickers"). Date tokens like `{{TODAY}}` are substituted at runtime so daily snapshots just work.
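    Runtime token substitution of that sort takes only a few lines. A sketch under assumptions: only `{{TODAY}}` is confirmed by the post; the regex, the token table, and the unknown-token passthrough behavior are illustrative.

```python
import re
from datetime import date

# Hypothetical token table; the post only mentions {{TODAY}}.
TOKENS = {
    "TODAY": lambda: date.today().isoformat(),
}

def substitute_tokens(template):
    """Replace {{TOKEN}} placeholders at runtime so a scheduled daily
    snapshot rolls its date forward without editing the tap script."""
    return re.sub(
        r"\{\{(\w+)\}\}",
        lambda m: TOKENS[m.group(1)]() if m.group(1) in TOKENS else m.group(0),
        template,
    )

url = substitute_tokens("https://api.example.com/prices?date={{TODAY}}")
```

    Substituting at run time rather than at generation time is what makes the same stored tap config valid every day.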

    The same flow is also exposed as an MCP tool (`discover_source`), so external AI agents can drive Discovery programmatically — ask "what datasets are in polygon?" and get back the same structured dataset catalog the wizard uses.
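    Under MCP, an agent invokes a tool with a JSON-RPC `tools/call` request. A hedged sketch of what such a call might look like — the tool name `discover_source` comes from the post, but the argument shape is an assumption:

```python
import json

# Hypothetical MCP tools/call request from an external agent.
# "tools/call" is the standard MCP method; the "arguments" keys are guessed.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "discover_source",
        "arguments": {"query": "what datasets are in polygon?"},
    },
}
payload = json.dumps(request)  # sent to the MCP server over its transport
```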

    Destinations: MongoDB, PostgreSQL, Kafka, MinIO, pgvector, Qdrant, Milvus, Chroma, Weaviate, ActiveMQ, REST endpoints.

    Full demo walkthrough: https://datris.ai/videos/data-discovery-ingestion-consumptio...
    Docs: https://docs.datris.ai/discovery
    OSS (AGPL): https://github.com/datris/datris-platform-oss