1 point by gpvos 7 hours ago | 1 comment
  • noemit 7 hours ago
    The underlying data looks sparse. If there are only a few questions per "category" of bullshit, the results can easily be gamed to favor one model over another.