I just finished reading the DeepSeek R1 paper, and its impact is profound. While I recommend that everyone read it, I suspect few actually will. So today I'll explain three highlights from the paper in plain language, hoping to help more people grasp why it matters so much.
## Highlight 1: Goodbye "Practice Drills" - Pure "Real Combat" Can Create Reasoning Masters!
Don't we all "do practice problems" when we learn? We typically work through lots of exercises to reinforce knowledge and sharpen problem-solving skills. Training AI models used to follow a similar pattern: first "feed" the AI lots of "practice problems" (supervised data) so it learns knowledge and language, then run "special training" (fine-tuning) to strengthen specific skills.
This "practice + special training" mode seemed to become the "standard operation" in the AI field.
However, the DeepSeek-AI team took an unconventional path. They wondered: Could AI skip the "practice phase" and improve reasoning abilities directly through "real combat practice" (reinforcement learning)?
They created a model called DeepSeek-R1-Zero, whose most impressive feature is that it completely skipped "practice" and went straight to the "battlefield": applying reinforcement learning (RL) directly to the base model (DeepSeek-V3-Base).
What's this like? It's like training a basketball player by putting them directly on the court instead of first having them memorize various basketball tactics and techniques - letting them learn, explore, and improve through actual games!
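For the technically curious: the RL algorithm behind this is GRPO (Group Relative Policy Optimization). Instead of training a separate "judge" (critic) model, GRPO samples a group of answers for each problem and scores each answer by how much better or worse it is than the group average. Here's a minimal sketch of that advantage computation; the function and variable names are my own, not the paper's code:

```python
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantage: normalize each sampled answer's reward
    against the mean and std of its own group (no critic model needed)."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mean) / std for r in rewards]

# Example: four answers sampled for one problem, scored by a rule-based reward.
rewards = [1.0, 0.0, 1.0, 0.0]  # 1.0 = correct, 0.0 = wrong
print(group_relative_advantages(rewards))  # [1.0, -1.0, 1.0, -1.0]
```

Answers that beat the group average get a positive advantage and are reinforced; answers below it are discouraged. That's the entire "coach" on this battlefield.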
And guess what? This seemingly "primitive" training method actually produced an AI model with exceptional reasoning abilities! DeepSeek-R1-Zero showed impressive performance in various reasoning ability tests and even demonstrated some unexpected "superpowers":
- **Self-Verification Skills**: The model checks its own answers after solving problems and corrects itself if it finds mistakes! Just like a top student double-checking their work during an exam!
- **Reflection Skills**: The model can "reflect" on its thinking process, analyzing what it did well and what needs improvement.
- **Long Chain-of-Thought (Long CoT)**: The model can generate very detailed solution steps, showing its thinking process step by step.
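What's remarkable is that nobody taught the model to do any of this. The "referee" in this real-combat training is not a neural reward model but a simple rule-based score, which the paper describes as an accuracy reward (is the final answer correct?) plus a format reward (is the reasoning wrapped in `<think>` tags?). A minimal sketch of that idea, with illustrative scoring values and helper names of my own:

```python
import re

def format_reward(completion: str) -> float:
    """Reward outputs that follow the paper's template: reasoning inside
    <think>...</think>, final answer inside <answer>...</answer>."""
    pattern = r"<think>.+?</think>\s*<answer>.+?</answer>"
    return 1.0 if re.search(pattern, completion, re.DOTALL) else 0.0

def accuracy_reward(completion: str, gold_answer: str) -> float:
    """For math-style problems, check the tagged final answer against a
    reference; real answer matching is more careful than exact string match."""
    match = re.search(r"<answer>(.+?)</answer>", completion, re.DOTALL)
    return 1.0 if match and match.group(1).strip() == gold_answer else 0.0

def total_reward(completion: str, gold_answer: str) -> float:
    return accuracy_reward(completion, gold_answer) + format_reward(completion)
```

Self-verification, reflection, and long chains of thought were never rewarded directly; they emerged because careful reasoning led to more correct answers.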
More impressively, DeepSeek-R1-Zero developed these reasoning abilities purely through reinforcement learning, without any help from "practice" data. It's like proving that even without traditional practice, the right method can create a martial arts master!
DeepSeek-R1-Zero's success is a bombshell for AI research! It proved for the first time that AI's reasoning abilities can truly be "sparked" through reinforcement learning, without rigid "practice." This opens up new possibilities - training AI can be this "free-spirited"!
## Highlight 2: "Cold Start" + Multi-Stage Training Creates a Stronger Reasoning "Engine" - DeepSeek-R1
Although DeepSeek-R1-Zero was already impressive, the DeepSeek-AI team wasn't satisfied. They wanted to go further and build an even more powerful reasoning engine! They found that R1-Zero still had some minor flaws in practical applications, such as:
- "Unclear Problem-Solving Processes": Sometimes the model's reasoning processes were too "jumpy" and not intuitive enough.
- "Language Confusion": The model might mix Chinese and English when handling complex problems.
To solve these issues and push reasoning abilities even further, the DeepSeek-AI team introduced DeepSeek-R1. R1 is a comprehensive upgrade over R1-Zero, and the secret lies in "cold start data" and "multi-stage training."
"Cold start data" is like giving the model a "preview," letting it gain an initial understanding of human reasoning methods. Researchers collected high-quality reasoning data to "warm up" the base model, helping it grasp the reasoning style expected by humans.
After "warming up," DeepSeek-R1 entered the "main game" of multi-stage reinforcement learning training. This training process was like "leveling up," gradually improving the model's reasoning abilities:
1. **Reasoning-oriented RL**: Training focused on enhancing the model's capabilities in mathematics, coding, logical reasoning, and other core tasks.
2. **Comprehensive Ability Development**: Using rejection sampling on the RL model's outputs to build new high-quality "practice problems," combined with supervised data from other domains (writing, factual QA, and so on) for another round of fine-tuning.
3. **User Experience Optimization**: A second phase of reinforcement learning covering all scenarios, further aligning the model with human preferences for helpfulness and harmlessness.
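Putting the stages together, the whole recipe reads roughly like the sketch below. Every function here is a stub standing in for an entire training phase, not a real API:

```python
# Placeholder stages: each of these is a full training run in reality.
def sft(model, data): return model                       # supervised fine-tuning
def reinforcement_learning(model, reward): return model  # GRPO-style RL
def rejection_sample(model): return []                   # keep only good outputs

def train_deepseek_r1(base_model, cold_start_data, general_sft_data):
    """Schematic of the multi-stage recipe described in the paper."""
    # Stage 0: cold start, i.e., SFT on curated long-CoT examples.
    model = sft(base_model, cold_start_data)
    # Stage 1: reasoning-oriented RL (math, code, logic) with rule-based rewards.
    model = reinforcement_learning(model, reward="rule_based_reasoning")
    # Stage 2: harvest good answers from the RL model (rejection sampling),
    # mix with general-purpose supervised data, and fine-tune again from base.
    new_data = rejection_sample(model) + general_sft_data
    model = sft(base_model, new_data)
    # Stage 3: a second RL phase covering all scenarios and user preferences.
    model = reinforcement_learning(model, reward="reasoning_plus_preferences")
    return model
```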
Through this combination of "cold start data" + "multi-stage training," DeepSeek-R1 not only solved R1-Zero's minor issues but achieved a "rocket-like" improvement in reasoning abilities. Experimental results show that DeepSeek-R1's performance in various reasoning tasks can now compete with OpenAI's top o1-1217 model!
## Highlight 3: Democratizing Reasoning Abilities - Small Models Can Have Big Wisdom!
Large language models are powerful, but with hundreds of billions of parameters they are like "giants": impossible to run on an ordinary computer and out of reach for regular users. How can reasoning abilities be made available to everyone? The DeepSeek-AI team found a clever solution: knowledge distillation!
Knowledge distillation essentially compresses the knowledge and abilities of a "large model teacher" into "small model students." The DeepSeek-AI team used the "super scholar" DeepSeek-R1 as the teacher to train a series of "mini scholars": small models based on open-source Qwen and Llama checkpoints, in 1.5B, 7B, 8B, 14B, 32B, and 70B sizes.
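One detail worth knowing: "distillation" here is not the classic logit-matching kind. The paper simply has the teacher (DeepSeek-R1) generate roughly 800k training samples, then runs plain supervised fine-tuning on the students with those samples. A hedged sketch of the idea, with illustrative names:

```python
def distill(teacher_generate, student_sft, prompts):
    """Distillation as described in the paper: let the strong teacher write
    reasoning traces, then fine-tune the student on them with ordinary SFT."""
    distill_data = []
    for prompt in prompts:
        reasoning_trace = teacher_generate(prompt)  # e.g., a DeepSeek-R1 output
        distill_data.append((prompt, reasoning_trace))
    return student_sft(distill_data)  # standard supervised fine-tuning
```

The surprising finding is that this simple recipe transfers much of the teacher's reasoning ability, with no reinforcement learning needed on the small models themselves.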
More excitingly, these "mini scholars" exceeded expectations, outperforming other open-source models of similar size and even competing with some larger closed-source models! For example:
- DeepSeek-R1-Distill-Qwen-7B outperformed QwQ-32B-Preview in AIME 2024 testing!
- DeepSeek-R1-Distill-Qwen-32B achieved excellent results comparable to OpenAI's o1-mini model!
Most importantly, the DeepSeek-AI team open-sourced DeepSeek-R1-Zero, DeepSeek-R1, and all six "mini scholar" models! This means ordinary people can use these powerful AI models at no cost - truly a "work of conscience"!
## Summary and Future Outlook
The emergence of DeepSeek-R1 shows us more possibilities for improving AI reasoning abilities. It not only proves the potential of pure reinforcement learning but also points to new directions for building more powerful, practical, and accessible AI models.
In conclusion, DeepSeek-R1's arrival is an important milestone in AI development history, showing us the dawn of AI "thinking" and filling us with anticipation for future AI!
*Author's Note: This article was written by Gemini 2.0 Flash Thinking Experimental (01-21). I wish R1 itself could have written it, which would have been more interesting, but unfortunately R1 can't write like this yet. Google's new model is truly excellent.*