2 points by nicola_alessi 5 hours ago | 1 comment
  • guerython 5 hours ago
    Your theory matches what we've seen in production. "Save for later" is an off-policy action unless you make it a completion precondition.

    The pattern that improved compliance for us was turning memory into a required finalizer step: task is only considered done after (1) artifact output and (2) a structured observation write with a fixed schema (decision, evidence, failure, next-check). If the second step is missing, a checker agent rejects completion and asks for retry.
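    Roughly, the gate looks like this (a simplified sketch; the function and field names here are illustrative, not our actual checker):

```python
# Fixed observation schema from the comment above: decision, evidence,
# failure, next-check. "next_check" as an identifier is my rendering.
REQUIRED_FIELDS = ("decision", "evidence", "failure", "next_check")

def check_completion(artifact, observation):
    """Hypothetical checker step: a task counts as done only if both the
    artifact output and a schema-complete observation write exist.
    Returns (ok, message); on failure the message is the retry prompt."""
    if not artifact:
        return False, "missing artifact output"
    if observation is None:
        return False, "missing observation write; retry with a structured observation"
    missing = [f for f in REQUIRED_FIELDS if not observation.get(f)]
    if missing:
        return False, f"observation incomplete: missing {missing}; retry"
    return True, "done"
```

    The checker agent runs this (or a semantic version of it) and feeds the failure message back to the worker as the retry instruction.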

    Prompting alone stayed flaky. Gating plus an explicit verifier moved behavior from "optional hygiene" to "part of done". Passive extraction is still valuable, but as a safety net instead of the primary path.

  • nicola_alessi 5 hours ago
      Gating completion on the observation write is smart — you're turning the model's task-completion drive against itself. Have you run into the problem where forced observations degrade to "completed task successfully, no issues noted" though? Technically passes the schema, zero actual information.

      That's what killed the approach for me. I spent weeks tuning schemas and rejection criteria and the models just got better at producing plausible-sounding observations that said nothing. Passive extraction ended up more reliable — watch the AST diffs, infer what the agent learned from what it actually changed, skip the self-report entirely.
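      The passive approach can start as simply as diffing function definitions across the agent's edits. A rough sketch using Python's `ast` module (illustrative, not my actual pipeline, which also looks at call sites and imports):

```python
import ast

def changed_functions(before_src, after_src):
    """Compare two versions of a module and report which top-level and
    nested functions were added, removed, or modified. ast.dump() omits
    line numbers by default, so pure reformatting-by-position is ignored."""
    def index(src):
        tree = ast.parse(src)
        return {n.name: ast.dump(n)
                for n in ast.walk(tree)
                if isinstance(n, ast.FunctionDef)}
    old, new = index(before_src), index(after_src)
    report = {}
    for name, dumped in new.items():
        if name not in old:
            report[name] = "added"
        elif old[name] != dumped:
            report[name] = "modified"
    for name in old:
        if name not in new:
            report[name] = "removed"
    return report
```

      From a stream of these diffs you can infer what the agent actually learned to change, with no self-report in the loop to game.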

      Curious what your checker validates against. If it's structural completeness of the fields you'll hit the gaming problem fast. If it's semantic quality... how? That's basically asking another model to judge whether an observation is useful, which is its own rabbit hole.
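      For concreteness, this is the kind of cheap structural-quality heuristic I mean — easy to write, and exactly the kind of bar a model learns to route around (the stock-phrase list and thresholds below are made up for illustration):

```python
import re

# Phrases that signal a content-free self-report. Any real list would
# need constant maintenance as the model adapts to it.
STOCK_PHRASES = ("no issues", "successfully", "as expected", "works fine")

def looks_informative(observation, min_len=20):
    """Heuristic filter: reject schema-complete observations whose fields
    are short, boilerplate, or contain no concrete token (digit, path
    separator, or call parens). Passing this says nothing about whether
    the observation is actually useful."""
    for value in observation.values():
        text = str(value).strip().lower()
        if len(text) < min_len:
            return False
        if any(p in text for p in STOCK_PHRASES):
            return False
        if not re.search(r"[\d/(]", text):  # demand at least one specific
            return False
    return True
```

      A model that wants to pass this just emits longer boilerplate with a file path in it, which is the gaming problem in one line.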