v1: Basic Five-Gate protocol — rejection rate, self-correction behavior
v2: Confidence decay model — C(t) = C0 × e^(-λ × (β+1)/(α+1) × t)
v3: Phase 5 enhancements — entity tagging, search-driven retrieval, hard/soft signal distinction
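The v2 decay model can be sketched in a few lines of Python. This is a minimal illustration, not the repo's implementation; the roles of α and β (here taken as counts of confirming vs. contradicting signals) and the parameter values are assumptions for the example.

```python
import math

def confidence(c0: float, t: float, lam_base: float,
               alpha: int, beta: int) -> float:
    """Confidence decay: C(t) = C0 * exp(-λ_base * (β+1)/(α+1) * t).

    Assumption for this sketch: alpha counts confirming signals,
    beta counts contradicting signals. More confirmations slow
    decay; more contradictions accelerate it.
    """
    rate = lam_base * (beta + 1) / (alpha + 1)
    return c0 * math.exp(-rate * t)

# At t = 0 the entry keeps its initial confidence.
print(confidence(0.9, t=0, lam_base=0.05, alpha=0, beta=0))

# A well-confirmed entry (alpha=5) decays slowly...
print(confidence(0.9, t=10, lam_base=0.05, alpha=5, beta=0))

# ...while a contradicted entry (beta=5) decays fast.
print(confidence(0.9, t=10, lam_base=0.05, alpha=0, beta=5))
```

Keeping this in a Python tool layer (rather than having the LLM do the arithmetic in-prompt) is exactly the "LLM judges, Python computes" split described below.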
v3 passed 6/6 verification points. The highlight was T3.3: a user claimed incorrect enum values, and Gate 2 correctly rejected them because they contradicted SQL-verified data already in the knowledge base. The system defended its own knowledge integrity against incorrect human input.

v3 also found a critical bug in the inject command — it silently wrote to the wrong path when given a relative --target. Fixed, patched, and verified.

What the computation layer looks like now:
The confidence math no longer lives in SKILL.md prompts (which caused LLM calculation errors). It has been moved to a Python tool layer:

C(t) = C0 × e^(-λ_base × (β+1)/(α+1) × t)

143 pytest cases passing. LLM judges, Python computes.

What this is not:
This isn't a finished product. The knowledge base for my test domain has 8 entries after 17 tasks — deliberately sparse. The design philosophy is that a mature Skill that stops growing is healthy, not stalled. Convergence is the goal.

What I'm still iterating on:
λ calibration across domains requires a second experiment (pending data ethics clearance on production data), the α/β upper bound question is open, and protocol compliance still depends on LLM discipline with no mechanical enforcement.

Repo: github.com/191341025/Self-Evolving-Skill

Still building. Feedback welcome.