103 points by xx123122 7 hours ago | 9 comments
  • esafak 16 minutes ago
    I'd have titled the submission 'Data Engineering for LLMs...' as it is focused on that.
  • joshuaissac 5 hours ago
    • xx123122 8 minutes ago
      Hi HN, OP here!

      I'm currently a Master's student at USTC (University of Science and Technology of China). I've been diving deep into Data Engineering, especially in the context of Large Language Models (LLMs).

      The Problem: I found that learning resources for modern data engineering are often fragmented, scattered across hundreds of Medium articles and disjointed tutorials. It's hard to piece everything together into a coherent system.

      The Solution: I decided to open-source my learning notes and build them into a structured book. My goal is to help developers get up to speed faster.

      Key Features:

      LLM-Centric: Focuses on data pipelines specifically designed for LLM training and RAG systems.

      Scenario-Based: Instead of just listing tools, I compare different methods/architectures based on specific business scenarios (e.g., "When to use Vector DB vs. Keyword Search").

      Hands-on Projects: Includes full code for real-world implementations, not just "Hello World" examples.

      This is a work in progress, and I'm treating it as "Book-as-Code". I would love to hear your feedback on the roadmap or any "anti-patterns" I might have included!

      Check it out:

      Online: https://datascale-ai.github.io/data_engineering_book/

      GitHub: https://github.com/datascale-ai/data_engineering_book

    • dang 5 hours ago
      Oh thanks! I've switched the top URL to that now. Submitted URL was https://github.com/datascale-ai/data_engineering_book.

      I hope xx123122 won't mind my mentioning that they emailed us about this post, which originally got caught in a spam filter. I invited them to post a comment giving the background to the project but they probably haven't seen my reply yet. Hopefully soon, given that the post struck a chord!

      Edit: they did, and I've moved that post to the toptext.

  • guillem_lefait 3 hours ago
    The figures in the different chapters are in English (that's not the case for the image in README_en.md).
  • dvrp 4 hours ago
    If you are interested in (2026-)internet-scale data engineering challenges (e.g. processing 10s to 100s of petabytes) and pre-training/mid-training/post-training scale challenges, please send me an email at d+data@krea.ai!
  • rafavargascom 5 hours ago
    Thank you

    How is it possible that a Chinese publication gets to the top of HN?