A second path on MedBounty, alongside submitting a bounty: pre-banked clinical questions on which the latest thinking models from OpenAI, Anthropic, and Google gave meaningfully different answers. You decide which answer is best, rate each one, and write a short rubric for the right answer.
Each question is run through three frontier thinking models. An LLM judge scores their disagreement on four axes: management, safety, factual accuracy, and coverage. Only the hardest divergences are surfaced.
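As a rough illustration of that filtering step, here is a minimal Python sketch. Everything in it beyond the four axis names is an assumption made for the example: the `JudgedQuestion` type, the 0.0 to 1.0 score scale, averaging the axes into one number, and the `top_k` cutoff are hypothetical and not MedBounty's actual pipeline.

```python
from dataclasses import dataclass

# The four axes named in the text; the short key names are illustrative.
AXES = ("management", "safety", "factual", "coverage")


@dataclass
class JudgedQuestion:
    question_id: str
    # Judge's disagreement score per axis. The 0.0-1.0 scale is an assumption.
    disagreement: dict[str, float]

    def overall(self) -> float:
        # Assumed aggregation: plain mean across the four axes.
        return sum(self.disagreement[axis] for axis in AXES) / len(AXES)


def surface_hardest(questions: list[JudgedQuestion], top_k: int) -> list[JudgedQuestion]:
    """Return the top_k questions with the highest overall disagreement."""
    return sorted(questions, key=lambda q: q.overall(), reverse=True)[:top_k]


# Made-up scores for three banked questions.
bank = [
    JudgedQuestion("q1", {"management": 0.9, "safety": 0.8, "factual": 0.4, "coverage": 0.5}),
    JudgedQuestion("q2", {"management": 0.1, "safety": 0.2, "factual": 0.1, "coverage": 0.0}),
    JudgedQuestion("q3", {"management": 0.6, "safety": 0.9, "factual": 0.7, "coverage": 0.6}),
]
hardest = surface_hardest(bank, top_k=2)
print([q.question_id for q in hardest])  # ['q3', 'q1']
```

A real judge might weight safety disagreement more heavily than coverage, or use a threshold rather than a fixed count; the mean and `top_k` here are only the simplest choices that show the idea.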
Pick which model's answer is strongest, rate each on a 1–5 scale, and surface the gaps the others missed. Your judgment becomes the labeled ground truth.
List 2+ requirements a correct answer must include and 1+ negative scoring items that disqualify a wrong one. Add sources and a short reasoning trace. That's the bounty.
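One way to picture what a complete submission contains is the Python sketch below. The `Submission` type, its field names, the model labels, and the `validate` helper are all hypothetical, and treating sources and the reasoning trace as mandatory is an assumption. Only the constraints themselves (one pick, a 1–5 rating per model, 2+ requirements, 1+ negative items) come from the text above.

```python
from dataclasses import dataclass, field

# Hypothetical labels for the three model answers being compared.
MODELS = ("openai", "anthropic", "google")


@dataclass
class Submission:
    best_model: str                  # which answer the reviewer judged strongest
    ratings: dict[str, int]          # 1-5 rating per model
    requirements: list[str]          # things a correct answer must include
    negative_items: list[str]        # things that disqualify a wrong answer
    sources: list[str] = field(default_factory=list)
    reasoning: str = ""


def validate(sub: Submission) -> list[str]:
    """Return a list of problems; an empty list means the submission is complete."""
    problems = []
    if sub.best_model not in MODELS:
        problems.append("best_model must be one of the compared models")
    for model in MODELS:
        rating = sub.ratings.get(model)
        if rating is None or not 1 <= rating <= 5:
            problems.append(f"rating for {model} must be an integer from 1 to 5")
    if len(sub.requirements) < 2:
        problems.append("list at least 2 requirements")
    if len(sub.negative_items) < 1:
        problems.append("list at least 1 negative scoring item")
    # Assumption: sources and a reasoning trace are required, not optional.
    if not sub.sources:
        problems.append("add at least one source")
    if not sub.reasoning.strip():
        problems.append("add a short reasoning trace")
    return problems


# Placeholder content only; no real clinical claims are encoded here.
complete = Submission(
    best_model="anthropic",
    ratings={"openai": 3, "anthropic": 5, "google": 4},
    requirements=["requirement A", "requirement B"],
    negative_items=["disqualifying item A"],
    sources=["source placeholder"],
    reasoning="Short explanation of why the chosen answer is strongest.",
)
incomplete = Submission(
    best_model="anthropic",
    ratings={"openai": 3, "anthropic": 6},
    requirements=["requirement A"],
    negative_items=[],
)
print(validate(complete))         # []
print(len(validate(incomplete)))  # 6
```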