This kind of thing beats me. Why should a "Large Language Model" be expected to act as a calculator? Clue one is in the name. Clue two is understanding that it is based on statistics, so it is not the deterministic tool you need.
Did exactly that for the actual filing: Python, as mentioned in the post. The 23 numbers were a probe, not the goal. I wanted to understand how it works.
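For anyone wondering what the deterministic route looks like, here is a minimal sketch. It is not the actual filing script: the `total` helper and the amounts are placeholders made up for illustration, and using `Decimal` rather than floats is an assumption about what suits money-like figures.

```python
from decimal import Decimal


def total(amounts):
    """Sum decimal strings exactly; the same input always gives the same result."""
    # Decimal avoids binary floating-point rounding (0.1 + 0.2 != 0.3 with floats).
    return sum((Decimal(a) for a in amounts), Decimal("0"))


# Placeholder values, NOT the 23 numbers from the post.
amounts = ["0.10", "0.20", "19.99"]
print(total(amounts))
```

Run it a thousand times and the answer never changes, which is the property a statistical text generator does not promise.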