2 points by pplonski86, 7 hours ago | 1 comment
  • pplonski86, 7 hours ago
    We built a benchmark to evaluate LLMs on real data analysis workflows. Instead of single prompts, each task is a sequence of prompts (steps), similar to how a human data analyst works in practice. Each run is saved as a full Python notebook, including prompts, code, and outputs. We evaluated runs across task completion, code correctness, output quality, reasoning, and reliability. Each workflow is executed multiple times and scored automatically.
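    As a rough sketch of how such a harness could work (this is my reading of the description, not the benchmark's actual code; `run_workflow`, `StepResult`, and the scoring dimensions are hypothetical names), each workflow is a list of prompts run in order, each step's code and output are recorded, and a run's score is averaged over steps, dimensions, and repeated runs:

    ```python
    from dataclasses import dataclass
    from statistics import mean

    @dataclass
    class StepResult:
        """One step of a workflow: the prompt, generated code, and its output."""
        prompt: str
        code: str
        output: str
        scores: dict  # hypothetical dimensions, e.g. completion, correctness, quality

    def run_workflow(steps, model, n_runs=3):
        """Execute a multi-step workflow n_runs times and average the scores."""
        run_scores = []
        for _ in range(n_runs):
            results = [model(prompt) for prompt in steps]  # one StepResult per prompt
            # score a run as the mean over all steps and all scoring dimensions
            per_step = [mean(r.scores.values()) for r in results]
            run_scores.append(mean(per_step))
        return mean(run_scores)

    # Stub standing in for a real LLM call that writes and runs notebook cells
    def stub_model(prompt):
        return StepResult(prompt, "df.describe()", "<summary table>",
                          {"completion": 10, "correctness": 9, "quality": 9})

    steps = [
        "Load the CSV into a DataFrame",
        "Summarize missing values",
        "Plot the target distribution",
    ]
    score = run_workflow(steps, stub_model)
    ```

    The real harness presumably also persists the notebook for each run; the stub only shows the execute-then-score loop.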

    Modern LLMs perform very well on individual steps. The benchmark currently includes 23 workflows covering different data analysis tasks (EDA, ML, NLP, statistics, ...). The top three models across the 23 workflows: gpt-oss:120b scored 9.87/10, followed by gpt-5.4 at 9.65/10 and glm-5.1 at 9.48/10, which is very high in my opinion. The results show that modern LLMs perform very well on data analysis tasks. All feedback is welcome! I uploaded all notebooks for each model: https://github.com/pplonski/ai-for-data-analysis
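    The leaderboard numbers above are presumably each model's score averaged over the 23 workflows. A minimal sketch of that aggregation step (the model names and per-workflow scores here are made up for illustration):

    ```python
    from statistics import mean

    # Hypothetical per-workflow scores (0-10) for three models; the real
    # benchmark averages over 23 workflows, not 3.
    results = {
        "model-a": [9.9, 9.8, 9.9],
        "model-b": [9.7, 9.6, 9.6],
        "model-c": [9.5, 9.4, 9.5],
    }

    # Rank models by their mean score across workflows, best first
    leaderboard = sorted(
        ((name, round(mean(scores), 2)) for name, scores in results.items()),
        key=lambda item: item[1],
        reverse=True,
    )
    ```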