Skywork’s Second-Gen Reward Models Set New Bar for RLHF—Even at Just 600M Parameters
In the relentless race to fine-tune AI using human feedback, Skywork just made a major move. On July 4, 2025, the AI research team unveiled Skywork-Reward-V2, a second-generation open-source reward model series that’s already topping charts across key industry benchmarks.
Spanning eight models—from a lean 600 million parameters to a still-svelte 8 billion—the V2 lineup is built on Qwen3 and Llama 3 backbones. But don’t let the model size fool you: in benchmark after benchmark, these compact contenders outperformed or matched the giants, including the previous generation’s 27B models and even closed-weight systems.
The Secret Weapon: 40 Million Carefully-Curated Preferences
What’s fueling this performance leap? Not new model architecture or training tricks—but better data. Skywork’s in-house dataset, Skywork-SynPref-40M, contains a staggering 40 million human preference pairs, 26 million of which were vetted through a rigorous two-stage, human-machine collaborative process.
In phase one, human annotators constructed a “gold-standard” subset that was thoroughly verified for correctness, freedom from bias, and task clarity. That small seed then drove massive scale-up via LLM-assisted expansion, with models generating “silver-standard” data based on patterns learned from the human-labeled samples.
Phase two went full throttle: using trained reward models to filter or regenerate data, the team screened tens of millions of samples automatically. The result is what might be the most robust open-source preference dataset to date—balancing depth, breadth, and quality.
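For readers who want a mental model of that loop, here is a minimal Python sketch of the general human-machine curation pattern. Every name in it (the judge, the reward model’s `score` call, the margin threshold) is a hypothetical placeholder for illustration, not Skywork’s actual pipeline:

```python
# Hypothetical sketch of a two-phase human-machine preference curation loop.
# None of these helpers are Skywork's APIs; they stand in for the general pattern.

def curate_preferences(gold_seed, raw_pairs, reward_model, llm_judge, threshold=0.8):
    """Phase 1: expand a human-verified gold seed into silver data with an LLM judge.
    Phase 2: screen the expanded pool automatically with a trained reward model."""
    # Phase 1: LLM-assisted expansion, guided by the human-labeled gold examples.
    silver = []
    for pair in raw_pairs:
        verdict = llm_judge.label(pair, examples=gold_seed)  # few-shot guidance from gold data
        if verdict.is_consistent:
            silver.append(verdict.as_preference_pair())

    # Phase 2: reward-model screening; keep pairs the model separates confidently,
    # flag the rest for regeneration or another human pass.
    kept, flagged = [], []
    for pair in silver:
        margin = reward_model.score(pair.chosen) - reward_model.score(pair.rejected)
        (kept if margin > threshold else flagged).append(pair)
    return kept, flagged
```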
And yes, it’s built for scale. Tests showed that just 290,000 handpicked samples (1.8% of the data) were enough to train an 8B model that beat 70B-class reward models on key benchmarks. Data quality > raw quantity.
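Under the hood, reward models of this kind are typically trained with a pairwise Bradley–Terry objective: score the chosen and rejected responses and push the chosen one higher. The snippet below is a generic PyTorch sketch of that loss, not Skywork’s exact training code; `reward_model` is assumed to return one scalar score per sequence:

```python
import torch
import torch.nn.functional as F

def pairwise_reward_loss(score_chosen: torch.Tensor, score_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry style loss: maximize the margin between chosen and rejected scores."""
    return -F.logsigmoid(score_chosen - score_rejected).mean()

# Usage sketch (reward_model returns one scalar per sequence):
# s_c = reward_model(chosen_input_ids, attention_mask=chosen_mask)
# s_r = reward_model(rejected_input_ids, attention_mask=rejected_mask)
# loss = pairwise_reward_loss(s_c, s_r)
# loss.backward()
```

With an objective this simple, the quality of the preference pairs does most of the work, which is exactly the point of the 290,000-sample result above.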
Why This Matters: Benchmark-Topping and Transfer-Ready
Let’s talk results.
On seven leading RLHF and reward model benchmarks—including RewardBench v1/v2, PPE Preference & Correctness, RM-Bench, and JudgeBench—Skywork-Reward-V2 models consistently claimed top scores. Even the smallest model, Skywork-Reward-V2-Qwen3-0.6B, performed neck-and-neck with the top-tier Skywork-Reward-Gemma-2-27B-v0.2 from last year.
The largest V2 model (Skywork-Reward-V2-Llama-3.1-8B) swept all categories, delivering SOTA performance not just in preference alignment but also in truthfulness, instruction following, bias resistance, and Best-of-N ranking.
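Best-of-N ranking, the last of those tasks, is easy to picture in code: generate N candidate responses, score each with the reward model, and keep the top scorer. A minimal sketch, where `score` is a placeholder for whatever scoring interface the reward model exposes:

```python
def best_of_n(prompt: str, candidates: list[str], score) -> str:
    """Return the candidate the reward model scores highest for this prompt.
    `score(prompt, response)` is a placeholder scoring call, not a specific API."""
    return max(candidates, key=lambda response: score(prompt, response))

# Toy example with a stand-in scorer that simply prefers longer answers:
# best = best_of_n("Explain RLHF.",
#                  ["Short.", "A longer, more detailed answer."],
#                  score=lambda p, r: len(r))
```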
This is especially notable because current open reward models often suffer from overfitting, task-specific tuning, or poor generalization. Skywork’s models show up as all-rounders—an increasingly rare trait in a fragmented landscape of single-task reward specialists.
A New Milestone for Open RLHF Infrastructure
In the broader AI ecosystem, reward models are moving from the periphery to the core. Initially designed as scoring tools for Reinforcement Learning from Human Feedback (RLHF), they’re now pivotal in RLVR (reinforcement learning with verifiable rewards), agent-based systems, and safety-critical AI alignment pipelines.
Skywork’s vision aligns with this trajectory. Rather than treating reward models as simple feedback or evaluation modules, the team positions them as strategic navigators—guiding intelligent systems toward outcomes that reflect complex human goals, preferences, and safety criteria.
In other words: today’s reward models are tomorrow’s value alignment engines.
With over 750,000 cumulative downloads of its first-generation models since launch in late 2024, Skywork is already a key player in the open-source RLHF ecosystem. This V2 release only cements its growing influence.
Looking ahead, the team plans to expand research beyond reward model scaling—exploring alternative training objectives, architecture tweaks, and long-range preference modeling. Given how fast this space is moving, don’t be surprised if the next release challenges even closed-weight giants.