1 point by kingstonTime 3 hours ago | 1 comment
  • kingstonTime 3 hours ago
    Most teams treat skills, MDC rules and system prompts as write-once artifacts, refined by vibes. The post looks at two practical approaches to actually measuring whether they work: deterministic rubric testing and paired comparisons borrowed from RLHF.
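
    A minimal sketch of what those two checks could look like (the run_with_context and judge_prefers helpers and the rubric checks below are hypothetical placeholders, not from the post):

      import re

      def run_with_context(prompt, context):
          # Placeholder: call your model with the candidate skill/rule prepended.
          raise NotImplementedError

      # Deterministic rubric: fixed prompts, fixed pass/fail checks on the output.
      RUBRIC = [
          ("uses the repo's error type", lambda out: "AppError" in out),
          ("no apologies", lambda out: not re.search(r"\b(sorry|apologi[sz]e)\b", out, re.I)),
      ]

      def rubric_score(prompt, context):
          out = run_with_context(prompt, context)
          return sum(check(out) for _, check in RUBRIC) / len(RUBRIC)

      # Paired comparison: a judge picks between with/without the rule; report win rate.
      def win_rate(prompts, context, judge_prefers):
          wins = 0
          for p in prompts:
              a = run_with_context(p, context)   # with the candidate rule
              b = run_with_context(p, "")        # baseline, no rule
              wins += judge_prefers(p, a, b)     # 1 if a is preferred, else 0
          return wins / len(prompts)

    The rubric side gives you repeatable regressions; the win rate is the RLHF-style signal for whether the rule helps at all.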

    It also covers token cost as a forcing function for justifying which pieces of context stay.
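
    For the cost side, even a rough per-rule check makes the trade-off explicit. This uses tiktoken's cl100k_base encoding as an example tokenizer, and the budget number is made up:

      import tiktoken

      enc = tiktoken.get_encoding("cl100k_base")  # example tokenizer; pick the one matching your model

      def token_cost(rule_text, budget=500):
          # Count the tokens a rule adds to every request and flag it if it blows the budget.
          n = len(enc.encode(rule_text))
          return n, n <= budget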

    Curious whether others have built eval pipelines for their prompt context.