Each score is the average over three representative tasks in the corresponding core category: Science (GPQA-Diamond, MMLU-Pro, MMLU); Math (AIME25, AMC23, MATH500); General (HellaSwag, BBH, ARC-C); Code (HumanEval+, MBPP+, LiveCodeBench); Instruction-following (IFEval, Alpaca, MT-Bench).
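
The aggregation described here can be sketched in a few lines of Python. This is a minimal illustration, not the authors' evaluation code: it assumes an unweighted arithmetic mean and assumes all task scores are already on a common scale (for example, percent accuracy). The names `CATEGORY_TASKS` and `category_scores` are hypothetical, and the source does not say how per-task scores are normalized.

```python
from statistics import mean

# Category-to-task mapping taken from the text above.
CATEGORY_TASKS = {
    "Science": ["GPQA-Diamond", "MMLU-Pro", "MMLU"],
    "Math": ["AIME25", "AMC23", "MATH500"],
    "General": ["HellaSwag", "BBH", "ARC-C"],
    "Code": ["HumanEval+", "MBPP+", "LiveCodeBench"],
    "Instruction-following": ["IFEval", "Alpaca", "MT-Bench"],
}


def category_scores(task_scores):
    """Return the unweighted mean of the three task scores in each category.

    Assumption: every task score is on the same scale, so a plain
    arithmetic mean is meaningful.
    """
    result = {}
    for category, tasks in CATEGORY_TASKS.items():
        missing = [t for t in tasks if t not in task_scores]
        if missing:
            raise KeyError(f"missing scores for {category}: {missing}")
        result[category] = mean(task_scores[t] for t in tasks)
    return result
```

Calling `category_scores` with a dictionary that maps each of the fifteen task names to a number returns one averaged score per category; a missing task raises `KeyError` rather than silently averaging over fewer than three tasks.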