cross-posted from: https://programming.dev/post/36289727

Comments

Our findings reveal a robustness gap for LLMs in medical reasoning, demonstrating that evaluating these systems requires looking beyond standard accuracy metrics to assess their true reasoning capabilities. When forced to reason beyond familiar answer patterns, all models show declines in accuracy, challenging claims of artificial intelligence’s readiness for autonomous clinical deployment.

A system dropping from 80% to 42% accuracy when confronted with a pattern disruption would be unreliable in clinical settings, where novel presentations are common. The results indicate that these systems are more brittle than their benchmark scores suggest.
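For anyone who wants to try this kind of check themselves, here is a minimal sketch of the perturbation the excerpt describes: replace the correct option with "None of the other answers" and re-measure accuracy. The item format and the `ask_model` callable are hypothetical stand-ins, not the study's actual evaluation harness.

```python
# Minimal sketch of a pattern-disruption robustness check, assuming a
# multiple-choice item format and an ask_model(question, options)
# callable that returns the index of the model's chosen option.
# Both are assumptions for illustration, not the study's code.

def perturb(item):
    """Replace the correct option's text with 'None of the other answers'.

    The correct index is unchanged: a model that actually reasons should
    still pick it, while one matching familiar answer patterns may not.
    """
    options = list(item["options"])
    options[item["answer_idx"]] = "None of the other answers"
    return {**item, "options": options}

def accuracy(items, ask_model):
    """Fraction of items where the model picks the correct option index."""
    hits = sum(
        1 for item in items
        if ask_model(item["question"], item["options"]) == item["answer_idx"]
    )
    return hits / len(items)

def robustness_gap(items, ask_model):
    """Accuracy on original items, on perturbed items, and the drop."""
    baseline = accuracy(items, ask_model)
    perturbed = accuracy([perturb(i) for i in items], ask_model)
    return baseline, perturbed, baseline - perturbed
```

With an 80% baseline and a 42% perturbed score, `robustness_gap` would report the 38-point drop the excerpt is talking about.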

  • njm1314@lemmy.world · 8 months ago

    Well, it doesn’t matter what they are designed for; what matters is what they are being marketed for. That’s what you have to test.

      • njm1314@lemmy.world · 8 months ago

        Clearly you don’t do both, because in the previous comment you were complaining about people judging them based on what they are marketed as. You can’t have it both ways.