Show HN: Data Engineering Book – An open source, community-driven guide (github.com)

103 points by xx123122 7 hours ago | 9 comments

esafak 16 minutes ago
I'd have titled the submission 'Data Engineering for LLMs...' as it is focused on that.
joshuaissac 5 hours ago
English version: https://github.com/datascale-ai/data_engineering_book/blob/m...
- xx123122 8 minutes ago
  Hi HN, OP here!
  I'm currently a Master's student at USTC (University of Science and Technology of China). I've been diving deep into Data Engineering, especially in the context of Large Language Models (LLMs).
  The Problem: I found that learning resources for modern data engineering are often fragmented, scattered across hundreds of Medium articles and disjointed tutorials. It's hard to piece everything together into a coherent system.
  The Solution: I decided to open-source my learning notes and build them into a structured book. My goal is to help developers fast-track their learning.
  Key Features:
  LLM-Centric: Focuses on data pipelines specifically designed for LLM training and RAG systems.
  Scenario-Based: Instead of just listing tools, I compare different methods/architectures based on specific business scenarios (e.g., "When to use Vector DB vs. Keyword Search").
  Hands-on Projects: Includes full code for real-world implementations, not just "Hello World" examples.
  This is a work in progress, and I'm treating it as "Book-as-Code". I would love to hear your feedback on the roadmap or any "anti-patterns" I might have included!
  Check it out:
  Online: https://datascale-ai.github.io/data_engineering_book/
  GitHub: https://github.com/datascale-ai/data_engineering_book
- dang 5 hours ago
  Oh thanks! I've switched the top URL to that now. Submitted URL was https://github.com/datascale-ai/data_engineering_book.
  I hope xx123122 won't mind my mentioning that they emailed us about this post, which originally got caught in a spam filter. I invited them to post a comment giving the background to the project but they probably haven't seen my reply yet. Hopefully soon, given that the post struck a chord!
  Edit: they did, and I've moved that post to the toptext.
guillem_lefait 3 hours ago
The figures in the different chapters are in English (that's not the case for the image in README_en.md).
dvrp 4 hours ago
If you are interested in (2026-)internet-scale data engineering challenges (e.g. processing tens to hundreds of petabytes) and pre-training/mid-training/post-training scale challenges, please send me an email at d+data@krea.ai!
rafavargascom 5 hours ago
Thank you
How is it possible that a Chinese publication gets to the top of HN?
- rafavargascom 5 hours ago
  Never mind.