The three high score patterns are interesting as well. "Conceptual Inquiry" actually takes less time than the other two yet doesn't improve the score, which is quite surprising to me.
The models are too good now. One thing I've noticed recently is that I've stopped dreaming about tough problems, be it code or math. The greatest feeling in the world is pounding your head against a problem for a couple of days and waking up the next morning with the solution sketched out in your mind.
I don't think the solution is to be going full natty with things, but to work more alongside the code in an editor, rather than doing things in CLI.
The amount of context switching in my day-to-day work has become insane. There's this culture of “everyone should be able to do everything” (within reason, sure), but in practice it means a data scientist is expected to touch infra code if needed.
Underneath it all is an unspoken assumption that people will just lean on LLMs to make this work.
I also used to get great pleasure from banging my head against a problem and then the sudden revelation.
But that takes time. I was valuable when there was no other option. Now? Why would someone wait when an answer is just a prompt away?
They can give a plausible architecture, but most of the time it's not usable if you're starting from scratch.
When you design the system, you're an architect, not a coder, so I see no difference between handing the design to agents or to other developers; you've done the heavy lifting.
From that perspective, I find LLMs quite useful for learning. But instead of coding, I find myself in long back-and-forth sessions asking questions, requesting examples, sequence diagrams, etc., to visualise the final product.
It is a pattern-matching problem, and that seems to me to be something AI is, or will be, particularly good at.
Maybe it won’t be the perfect architecture, or the most efficient implementation. But that doesn’t seem to have stopped many companies before.
My first thought was that I can abstract what I wrote yesterday, which was a variation of what I built over the previous week. My second thought was a physiological response of fear that today is going to be a hard hyper focus day full of frustration, and that the coding agents that built this will not be able to build a modular, clean abstraction. That was followed by weighing whether it is better to have multiple one off solutions, or to manually create the abstraction myself.
I agree with you 100 percent that the poor performance of models like GPT-4 introduced some kind of regularization into the human-in-the-loop coding process.
Nonetheless, we live in a world of competition, and the people who develop techniques that give them an edge will succeed. There is a video about the evolution of technique in the high jump: the Western Roll, the Straddle Technique, and finally the Fosbury Flop. Using coding agents will be like this too.
I am working with 150 GB of time series data. There are certain pain points that need to be mitigated. For example, a different LLM has to be coerced into analyzing the data with a completely different approach in order to validate the results. That means each iteration is 4x faster, but it needs to be done twice, so the net gain is only 2x. I burned $400 in tokens in January. This cannot be good for the environment.
Timezone handling always has to be validated manually. Every exploration of the data is effectively a train/test split. Here is the thing that hurts the most: the AI coding agents always show the top test results, not the test results of the top train results. Rather than tell me a model has no significant results, they will hide that and only present the winning outliers, which is misleading and, as the OP research suggests, very dangerous.
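To make that selection bias concrete, here is a minimal sketch (illustrative names and random numbers, not the commenter's pipeline): the honest number is the test score of the candidate that won on train/validation, not the best test score across all candidates.

```python
import random

random.seed(0)

# Hypothetical candidate models: each gets a validation score and a test score.
# In a real pipeline these would come from fits on a proper time series split;
# here they are pure noise, to show how cherry-picking inflates results.
candidates = [
    {"name": f"model_{i}",
     "val_score": random.gauss(0.5, 0.05),
     "test_score": random.gauss(0.5, 0.05)}
    for i in range(20)
]

# Misleading report: the best *test* score across all candidates.
cherry_picked = max(candidates, key=lambda m: m["test_score"])

# Honest report: pick the winner on validation, then report *its* test score.
chosen = max(candidates, key=lambda m: m["val_score"])

print("cherry-picked test score:", round(cherry_picked["test_score"], 3))
print("honest test score:      ", round(chosen["test_score"], 3))
```

With 20 candidates of pure noise, the cherry-picked number comes out higher essentially every run, which is exactly the misleading effect described above.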
A lot of people are going to get burned before the techniques to mitigate this are developed.
Overfitting has always been a problem when working with data. Just because the barrier to entry for time series work is much lower does not mean that people developing the skill, whether using old-school tools like ARIMA manually or having AI do the work, escape the problem of overfitting. The models will always show the happy, successful-looking results.
Just like calculators are used when teaching higher math at the secondary level so that basic arithmetic does not slow the process of learning math skills, AI will be used in teaching too. What we are doing is confusing techniques that have not been developed yet with not being able to acquire skills. I rack and challenge my brain every day solving these problems. As millions of other software engineers do the same, the patterns will emerge and later become the skills taught in schools.
For hobby projects though, it's awesome. It just really struggles to do things right in the big codebase at work.
And then you find out someone else had already solved it. So you might as well use Google 2.0, a.k.a. ChatGPT.
> "We collect self-reported familiarity with AI coding tools, but we do not actually measure differences in prompting techniques."
Many people drive cars without being able to explain how cars work. Or use devices like that. Or interact with people whose thinking they can't explain. Society works like that; it is functional and does not depend on full understanding. We need to develop the functional part, not the full-understanding part. We can write C without knowing the machine code.
You can often recognize a wrong note without being able to play the piece, spot a logical fallacy without being able to construct the valid argument yourself, catch a translation error with much less fluency than producing the translation would require. We need discriminative competence, not generative competence.
For years I maintained a library for formatting dates and numbers (prices, ints, IDs, phones). It was a pile of regex, but I maintained hundreds of test cases for each type of parsing. And as new edge cases appeared, I added them to my tests and iterated to keep the score high. I don't fully understand my own library; it emerged by scar accumulation. I mean, yes, I can explain any line, but why these regexes in this order is a data-dependent explanation I don't have anymore. All my edits run in a loop with the tests, and my PRs are sent only when the score is good.
Correctness was never grounded in understanding the implementation. Correctness was grounded in the test suite.
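That workflow is easy to picture as a regression-test loop. A minimal sketch of the idea (a hypothetical phone formatter and made-up cases, not the actual library):

```python
import re
import pytest

def format_us_phone(raw: str) -> str:
    """Normalise a US phone number to (XXX) XXX-XXXX; raise on anything else."""
    digits = re.sub(r"\D", "", raw)           # strip everything but digits
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]                    # drop a leading country code
    if len(digits) != 10:
        raise ValueError(f"not a US phone number: {raw!r}")
    return f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"

# Every new edge case becomes one more row here; the suite, not the
# implementation, is the spec the PR has to satisfy.
@pytest.mark.parametrize("raw, expected", [
    ("415-555-2671", "(415) 555-2671"),
    ("(415) 555 2671", "(415) 555-2671"),
    ("+1 415 555 2671", "(415) 555-2671"),
    ("1.415.555.2671", "(415) 555-2671"),
])
def test_format_us_phone(raw, expected):
    assert format_us_phone(raw) == expected
```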
But fundamentally, all cars behave the same way all the time. Imagine running a courier company where sometimes the vehicles take a random left turn.
> Or interact with people whose thinking they can't explain
Sure, but they trust those service providers because they are reliable. And the reason that they are reliable is that the service providers can explain their own thinking to themselves. Otherwise their business would be chaos and nobody would trust them.
How you approached your library was practical given the use case. But can you imagine writing a compiler like this? Or writing an industrial automation system? Not only would it be unreliable but it would be extremely slow. It's much faster to deal with something that has a consistent model that attempts to distill the essence of the problem, rather than patching hack upon hack in response to failed test after failed test.
I think being a programmer is closer to being an aircraft pilot than a car driver.
But isn't it the correction of those errors that is valuable to society and gets us a job?
People can tell when they have found a bug or give a description of what they want from a piece of software, yet it requires skill to fix the bugs and to build the software. Though LLMs can speed up the process, expert human judgment is still required.
If you know that you need O(n) "contains" checks and O(1) retrieval of items, for a given order of magnitude, it feels like you've got all the pieces of the puzzle needed to keep the LLM on the straight and narrow, even if you didn't know off the top of your head that you should choose ArrayList.
Or if you know that string manipulation might be memory-intensive and so you write automated tests around it for your order of magnitude, it probably doesn't really matter if you didn't know to choose StringBuilder.
That feels different to e.g. not knowing the difference between an array list and linked list (or the concept of time/space complexity) in the first place.
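A rough sketch of what that kind of check can look like, if it helps (Python here since the thread mixes languages; the parent's examples are Java's ArrayList and StringBuilder): knowing the complexity you need lets you time the operations at your order of magnitude, whatever the data structure is called.

```python
import timeit

N = 100_000                       # the order of magnitude you actually care about
as_list = list(range(N))
as_set = set(as_list)
missing = -1                      # worst case for a linear scan

# Membership check: O(n) in a list, O(1) on average in a hash-based structure.
list_time = timeit.timeit(lambda: missing in as_list, number=100)
set_time = timeit.timeit(lambda: missing in as_set, number=100)
print(f"list contains: {list_time:.4f}s   set contains: {set_time:.6f}s")

# Retrieval by index: O(1) in a list (like ArrayList.get), which may be all you need.
index_time = timeit.timeit(lambda: as_list[N // 2], number=100)
print(f"list retrieval by index: {index_time:.6f}s")
```

If the LLM's choice blows past your budget at that size, the check flags it without you ever naming the right class.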
Tests only cover cases you already know to look for. In my experience, many important edge cases are discovered by reading the implementation and noticing hidden assumptions or unintended interactions.
When something goes wrong, understanding why almost always requires looking at the code, and that understanding is what informs better tests.
Instead, learning concepts with AI and then using HI (human intelligence) and AI to solve the problem at hand, going through code line by line and writing tests, is a better approach in terms of productivity, correctness, efficiency, and skill.
I can only think of LLMs as fast typists with some domain knowledge.
Like typists of government/legal documents who know how to format documents but cannot practice law. Likewise, LLMs are code typists who can write good/decent/bad code but cannot practice software engineering - we need, and will need, a human for that.
> AI assistance produces significant productivity gains across professional domains, particularly for novice workers. Yet how this assistance affects the development of skills required to effectively supervise AI remains unclear. Novice workers who rely heavily on AI to complete unfamiliar tasks may compromise their own skill acquisition in the process. We conduct randomized experiments to study how developers gained mastery of a new asynchronous programming library with and without the assistance of AI. We find that AI use impairs conceptual understanding, code reading, and debugging abilities, without delivering significant efficiency gains on average. Participants who fully delegated coding tasks showed some productivity improvements, but at the cost of learning the library. We identify six distinct AI interaction patterns, three of which involve cognitive engagement and preserve learning outcomes even when participants receive AI assistance. Our findings suggest that AI-enhanced productivity is not a shortcut to competence and AI assistance should be carefully adopted into workflows to preserve skill formation -- particularly in safety-critical domains.
> AI assistance produces significant productivity gains across professional domains, particularly for novice workers.
> We find that AI use impairs conceptual understanding, code reading, and debugging abilities, without delivering significant efficiency gains on average.
Are the two sentences talking about non-overlapping domains? Is there an important distinction between productivity and efficiency gains? Does one focus on novice users and one on experienced ones? Admittedly I have not read the paper yet; it might be clearer than the abstract.
The research question is: "Although the use of AI tools may improve productivity for these engineers, would they also inhibit skill formation? More specifically, does an AI-assisted task completion workflow prevent engineers from gaining in-depth knowledge about the tools used to complete these tasks?" This hopefully makes the distinction more clear.
So you can say "this product helps novice workers complete tasks more efficiently, regardless of domain" while also saying "unfortunately, they remain stupid." The introductory lit review/context setting cites prior studies to establish "OK, coders complete tasks efficiently with this product." But then they say, "our study finds that they can't answer questions." They have to say "earlier studies find that there were productivity gains" in order to say "do these gains extend to other skills? Maybe not!"
I learned a lot more in a short amount of time than I would've stumbling around on my own.
AFAIK it's been known for a long time that the most effective way of learning a new skill is to get private tutoring from an expert.
But that's what "impairs learning" means.
> Novice workers who rely heavily on AI to complete unfamiliar tasks may compromise their own skill acquisition in the process. We conduct randomized experiments to study how developers gained mastery of a new asynchronous programming library with and without the assistance of AI. We find that AI use impairs conceptual understanding, code reading, and debugging abilities, without delivering significant efficiency gains on average.
The library in question was Python trio and the model they used was GPT-4o.
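For anyone who hasn't used it, trio is a structured-concurrency library built around nurseries; something like the following is the flavour of code participants had to learn (a generic example, not a task from the study):

```python
import trio

async def fetch(name: str, delay: float) -> None:
    # Stand-in for real work such as an HTTP request.
    await trio.sleep(delay)
    print(f"{name} finished after {delay}s")

async def main() -> None:
    # A nursery owns its child tasks: the block only exits once all of them
    # finish, and an exception in one child cancels its siblings.
    async with trio.open_nursery() as nursery:
        nursery.start_soon(fetch, "first", 0.2)
        nursery.start_soon(fetch, "second", 0.1)

trio.run(main)
```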
You can also read human code and empathise with what the author was thinking while writing it.
AI code is not for humans; it is just a stream of tokens that does something. You need to build skills to empirically verify that it does what you think it does, but it is pointless to "reason" about it.
Previous title: "Anthropic: AI Coding shows no productivity gains; impairs skill development"
The previous title oversimplified the claim to "all" developers. I found the previous title meaningful while submitting this post because most of the false AI claims that "software engineering is finished" have mostly affected junior `inexperienced` engineers. But I think `junior inexperienced` was implicit, which many people didn't pick up on.
The paper makes a more nuanced claim that AI Coding speeds up work for inexperienced developers, leading to some productivity gains at the cost of actual skill development.
Yes, we can use it 10,000 times to refine our recipes, but did we learn from it? I am doubtful about that, given that even running the same prompt 10 times will give different answers in 8 out of 10 responses.
But I am very confident that I can learn by iterating and printing designs on a 3D printer.
For example, I wanted to add a rate limiter to an API call with proper HTTP codes, etc. I asked the AI (in IntelliJ it used to be Claude by default, but they've since switched to Gemini as the default) to generate one for me. The first version was not good, so I asked it to do it again with some changes.
What would take me a couple of hours or more took less than 10 minutes.
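For context, the task in question is small and well-trodden; a rough token-bucket sketch of such a rate limiter (framework-agnostic Python, illustrative only, not the code the AI actually produced):

```python
import time

class TokenBucket:
    """Allow up to `rate` requests per `per` seconds; a rejection maps to HTTP 429."""

    def __init__(self, rate: int, per: float) -> None:
        self.capacity = rate
        self.tokens = float(rate)
        self.refill_per_second = rate / per
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to the time elapsed since the last call.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_per_second)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False   # caller responds 429 Too Many Requests, ideally with Retry-After

limiter = TokenBucket(rate=5, per=1.0)
for i in range(8):
    status = 200 if limiter.allow() else 429
    print(f"request {i}: HTTP {status}")
```

Wiring the `allow()` result to a proper 429 response (and a sensible Retry-After header) is the part that still benefits from a quick human review.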
I’m starting to believe that people who think AI-generated code is garbage actually don’t know how to code.
I hit about 10 years of coding experience right before AI hit the scene, which I guess makes me lucky. I know, with high confidence, what I want my code to look like, and I make the AI do it. And it does it damn well and damn fast.
I think I sit at a unique point for leveraging AI best. Too junior and you create "working monsters." Meanwhile, Engineering Managers and Directors treat it like a human, but it's not AGI yet.
I use a web UI to chat with AI and do research, and even then I sometimes have to give up and accept that it won't provide the best solution, one that I know exists and am just too lazy to flesh out on my own. And so to the official docs I go.
But the coding tools, I'm sorry, but they constantly disappoint me. Especially the agents. In fact the agents fucking scare me. Thank god Copilot prompts me before running a terminal command. The other day I asked it about a Cypress test function and the agent asked if it could run some completely unrelated gibberish Python code in my terminal. That's just one of many weird things it's done.
My colleagues vibe-code things because they don't have experience in the tech we use on our project; it gets passed to me to review with "I hope you understand this." Our manager doesn't care because he's all in on AI and just wants the project to meet deadlines, because he's scared for his job, and each level up the org chart from him it's the same. If this is what software development is now, then I need to find another career, because it's pathetic, boring, and stressful for anyone with integrity.
How AI assistance impacts the formation of coding skills
[1] Plug: this is a video about the Patreon community I founded to do exactly that. Just want to make sure you're aware that's the pitch before you go ahead and watch.
This study is so bad: the sample size is n = 52, and in some conclusions it goes down to n = 2.