Is the poor performance because the LLMs are not being used for iterative refinement?
In some cases, even for the agent examples, I just have to assume that the AI hit some issue applying its tooling and was forced to run in text mode throughout. Unfortunately, the viewer is missing so much context about the assignment, the process, and the expected and resulting output that you can only really guess at what's going on from the most outwardly bewildering (to the OP) behaviour.