
DeepSeek-R1 | RL-Enhanced Reasoning LLM

DeepSeek-AI has launched its first-generation reasoning models, DeepSeek-R1 and DeepSeek-R1-Zero. By leveraging Reinforcement Learning (RL), cold-start data, and distillation techniques, these models significantly enhance the reasoning capabilities of large language models (LLMs) and have achieved outstanding performance across multiple reasoning benchmarks.

LLMs have made remarkable progress in natural language processing (NLP), excelling in understanding, generation, and reasoning tasks. However, several challenges remain. Developing robust reasoning capabilities often requires extensive supervised fine-tuning, limiting scalability and generalization. Issues like low readability and the trade-off between computational efficiency and reasoning complexity persist.

To address these challenges, DeepSeek-AI has introduced DeepSeek-R1, which integrates RL to enhance reasoning. This work introduces two models: DeepSeek-R1-Zero, trained purely through RL without any supervised fine-tuning, and DeepSeek-R1, which builds on RL with cold-start data and multi-stage training.

These models combine advanced RL techniques with structured training methodologies to overcome existing limitations, offering scalability and usability.


Figure 1 | Benchmark performance of DeepSeek-R1.

Reinforcement Learning for Reasoning Tasks

DeepSeek-R1-Zero

This model relies on RL without using supervised data. Leveraging Group Relative Policy Optimization (GRPO), it optimizes reasoning by evaluating multiple outputs, leading to significant improvements in benchmark performance. For example, its AIME 2024 pass@1 score increased from 15.6% to 71.0% during training.
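The key property of GRPO is that it estimates advantages from a group of sampled responses rather than from a learned value (critic) model. A minimal sketch of that group-relative normalization, with illustrative names and toy rewards (not DeepSeek's actual implementation):

```python
# Sketch of GRPO's group-relative advantage: each response's reward is
# normalized against the mean and standard deviation of its own group,
# removing the need for a separate critic model.
from statistics import mean, stdev

def group_relative_advantages(rewards):
    """Return (r - mean) / std for each reward in a sampled group."""
    mu = mean(rewards)
    sigma = stdev(rewards)
    if sigma == 0:
        # All responses scored the same: no learning signal.
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]

# Example: 4 sampled responses to one prompt, rule-based 0/1 rewards
advs = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct responses receive positive advantages and incorrect ones negative advantages, and the advantages of each group sum to zero, which keeps the policy update centered.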

Multi-Stage Training in DeepSeek-R1

DeepSeek-R1 incorporates cold-start data—thousands of carefully curated chain-of-thought (CoT) examples—to fine-tune the base model before reasoning-focused RL. Language consistency rewards applied during RL keep outputs coherent and user-friendly.
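A language consistency reward can be as simple as the fraction of the CoT written in the target language. The sketch below is a hypothetical illustration of that idea; the word-level check (`str.isascii` as a stand-in for "English") and the tokenization are assumptions, not the paper's actual metric:

```python
# Hypothetical language-consistency reward: fraction of whitespace-
# separated tokens that pass a target-language test. Using isascii()
# as a crude proxy for "English token" purely for illustration.
def language_consistency_reward(cot_text, is_target_word=str.isascii):
    words = cot_text.split()
    if not words:
        return 0.0
    target = sum(1 for w in words if is_target_word(w))
    return target / len(words)

# A mixed-language CoT: 4 of the 6 tokens are ASCII
r = language_consistency_reward("the answer is 42 因为 推理")
```

In training, a reward like this would be combined with the accuracy reward so the model is discouraged from mixing languages mid-reasoning.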

Model Distillation for Efficiency

To address computational constraints, DeepSeek-AI distilled six smaller models (ranging from 1.5B to 70B parameters) from DeepSeek-R1 using Qwen and Llama architectures. These distilled models maintain strong reasoning capabilities. For instance, the 14B distilled model achieved a 69.7% pass@1 score on AIME 2024, outperforming some larger models.
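The distillation here is supervised fine-tuning on reasoning traces generated by DeepSeek-R1, i.e. a standard next-token cross-entropy loss on the teacher's outputs. A toy sketch of that objective, with a made-up three-token vocabulary and hand-picked probabilities (assumptions for illustration only):

```python
# Minimal sketch of distillation-as-SFT: the smaller student model is
# trained to maximize the likelihood of token sequences written by the
# teacher (DeepSeek-R1), via average negative log-likelihood.
import math

def sft_loss(student_probs, teacher_tokens):
    """Average NLL of the teacher's tokens under the student's
    per-step output distributions."""
    nll = 0.0
    for step_probs, tok in zip(student_probs, teacher_tokens):
        nll -= math.log(step_probs[tok])
    return nll / len(teacher_tokens)

# One teacher-written trace of 3 tokens over a toy vocab {a, b, c}
probs = [
    {"a": 0.7, "b": 0.2, "c": 0.1},
    {"a": 0.1, "b": 0.8, "c": 0.1},
    {"a": 0.25, "b": 0.25, "c": 0.5},
]
loss = sft_loss(probs, ["a", "b", "c"])
```

Minimizing this loss pushes the student's distributions toward the teacher's reasoning behavior without running RL on the smaller model.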


Performance Highlights

Reasoning Benchmarks

Coding and STEM Tasks

General Capabilities

Distilled Model Highlights


Figure 2 | AIME accuracy of DeepSeek-R1-Zero during training. For each question, we sample 16 responses and calculate the overall average accuracy to ensure a stable evaluation.

A Significant Step in LLM Reasoning

DeepSeek-AI’s DeepSeek-R1 and DeepSeek-R1-Zero represent a significant advancement in enhancing the reasoning capabilities of LLMs. By utilizing RL, cold-start data, and distillation techniques, these models address critical limitations while promoting accessibility through open-source availability under the MIT license.

The API (model=deepseek-reasoner) further enhances usability for developers and researchers. Looking ahead, DeepSeek-AI aims to improve the models' general capabilities, reduce language mixing, and strengthen performance on broader tasks such as software engineering.
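The DeepSeek API follows an OpenAI-compatible chat-completions format. The sketch below only builds the request body; the endpoint URL and field names reflect DeepSeek's public documentation but should be treated as assumptions and verified against the current docs before use:

```python
# Hedged sketch of a request to the DeepSeek reasoning model.
# The payload schema (model, messages) follows the OpenAI-compatible
# chat-completions format that DeepSeek's API documents.
import json

payload = {
    "model": "deepseek-reasoner",
    "messages": [
        {"role": "user", "content": "Which is larger, 9.11 or 9.8?"}
    ],
}
body = json.dumps(payload)

# To send (assumed endpoint, check current docs):
#   POST https://api.deepseek.com/chat/completions
#   headers: {"Authorization": "Bearer <API_KEY>",
#             "Content-Type": "application/json"}
```

The response contains the model's final answer alongside its reasoning content, which is what distinguishes deepseek-reasoner from a standard chat model.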

These efforts aim to establish DeepSeek-R1 as a powerful solution for reasoning-focused AI applications.

By adopting thoughtful training paradigms, DeepSeek-R1 demonstrates how AI can evolve to tackle increasingly complex challenges.


Try it now: deepseek-r1