bytedance_history

第7章：推荐系统架构演进

从简单的协同过滤到千亿特征的深度学习系统

章节概述

字节跳动的推荐系统是其技术核心竞争力的集中体现。从2012年今日头条上线的简单推荐算法，到支撑抖音、TikTok等产品的全球化推荐平台，这个系统经历了多次重大架构演进。本章将深入剖析这一演进过程，包括技术选型、架构设计、算法迭代以及工程优化等方面。

7.1 早期架构：从零开始的推荐系统 (2012-2014)

7.1.1 初代系统设计

2012年8月，今日头条上线时的推荐系统相对简单，但已经体现了其”无编辑”的核心理念。

┌─────────────────────────────────────────────────────────────┐
│                    今日头条 v1.0 架构                         │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│   [客户端] ──HTTP──> [Nginx]                                 │
│                         │                                    │
│                         ↓                                    │
│                    [Django App]                              │
│                         │                                    │
│            ┌────────────┼────────────┐                       │
│            ↓            ↓            ↓                       │
│      [爬虫系统]    [推荐服务]    [用户服务]                    │
│            │            │            │                       │
│            ↓            ↓            ↓                       │
│      [MongoDB]      [Redis]      [MySQL]                     │
│                                                              │
│   技术栈：Python 2.7, Django 1.4, MongoDB 2.4                │
│   日活用户：10万                                              │
│   文章库：100万                                               │
│                                                              │
└─────────────────────────────────────────────────────────────┘

核心设计决策：

技术选型
- Python作为主开发语言：快速迭代，丰富的机器学习库
- MongoDB存储文章内容：灵活的Schema，适合新闻内容
- Redis缓存热点数据：减轻数据库压力

推荐算法v1.0

# 简化的推荐逻辑
def recommend(user_id, count=10):
    # 获取用户历史行为
    user_history = get_user_history(user_id)
       
    # 基于内容的协同过滤
    similar_items = []
    for item in user_history:
        similar = find_similar_items(item, method='tfidf')
        similar_items.extend(similar)
       
    # 去重和排序
    candidates = deduplicate(similar_items)
    scores = calculate_scores(candidates, user_history)
       
    return sorted(candidates, key=lambda x: scores[x])[:count]

7.1.2 第一次重构：应对千万级用户 (2013-2014)

随着用户量激增，原有架构面临巨大压力。王长虎带领团队进行了第一次大规模重构。

架构升级要点：

问题	解决方案	效果
单机Python性能瓶颈	引入Go语言重写核心服务	QPS提升10倍
推荐延迟高	预计算+在线计算分离	P99延迟从2s降至200ms
特征存储压力大	引入HBase分布式存储	支持亿级用户画像
模型更新慢	实现增量学习框架	模型更新从天级到小时级

┌──────────────────────────────────────────────────────────────┐
│                    推荐系统 v2.0 架构                          │
├──────────────────────────────────────────────────────────────┤
│                                                               │
│   用户请求                                                     │
│      ↓                                                        │
│   [接入层]                                                     │
│      ↓                                                        │
│   [路由层] ←─── [A/B Test Platform]                           │
│      ↓                                                        │
│   ┌─────────────────────────────┐                            │
│   │      在线推荐服务（Go）        │                            │
│   ├─────────────────────────────┤                            │
│   │ • 召回模块（Recall）          │ ←── [离线特征HBase]         │
│   │ • 排序模块（Ranking）         │ ←── [实时特征Redis]        │
│   │ • 重排模块（Re-ranking）      │ ←── [模型服务TensorFlow]   │
│   └─────────────────────────────┘                            │
│                                                               │
│   [离线计算平台（Hadoop/Spark）]                               │
│   • 用户画像计算                                              │
│   • 内容理解（NLP/CV）                                         │
│   • 模型训练（LR/GBDT）                                        │
│                                                               │
└──────────────────────────────────────────────────────────────┘

7.2 特征工程与模型迭代

7.2.1 特征体系建设

字节的推荐系统成功很大程度上得益于其完善的特征工程体系。

三层特征架构：

┌─────────────────────────────────────────────────────────────────┐
│                        特征工程体系                               │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  第一层：用户特征（User Features）                                 │
│  ├─ 基础属性：年龄、性别、地域、设备                               │
│  ├─ 行为序列：点击、停留、分享、评论                               │
│  ├─ 兴趣标签：长期兴趣、短期兴趣、实时兴趣                         │
│  └─ 社交关系：关注、粉丝、互动网络                                │
│                                                                  │
│  第二层：内容特征（Item Features）                                │
│  ├─ 文本特征：标题、正文、标签、主题                               │
│  ├─ 图像特征：CNN提取的视觉特征                                   │
│  ├─ 视频特征：封面、时长、音频特征                                │
│  └─ 统计特征：CTR、完播率、互动率                                 │
│                                                                  │
│  第三层：上下文特征（Context Features）                           │
│  ├─ 时间特征：小时、工作日/周末、节假日                           │
│  ├─ 场景特征：网络环境、APP版本、入口来源                         │
│  ├─ 地理特征：GPS、城市、POI                                    │
│  └─ 会话特征：session长度、刷新次数                              │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

7.2.2 特征处理流水线

实时特征计算架构：

# 特征处理流水线示例
class FeaturePipeline:
    def __init__(self):
        self.processors = [
            NumericNormalizer(),      # 数值归一化
            CategoricalEncoder(),      # 类别编码
            SequenceEncoder(),         # 序列编码
            CrossFeatureGenerator(),   # 交叉特征
            FeatureSelector()          # 特征选择
        ]
    
    def process(self, raw_features):
        features = raw_features
        for processor in self.processors:
            features = processor.transform(features)
        return features

7.2.3 模型演进路线

时期	模型	特点	核心创新
2012-2013	协同过滤 + LR	简单高效	用户-物品矩阵分解
2014-2015	GBDT + LR	特征组合能力强	自动特征交叉
2016-2017	Wide & Deep	记忆与泛化结合	端到端训练
2018-2019	DIN/DIEN	注意力机制	用户兴趣动态建模
2020-2021	MMOE	多任务学习	点击率与时长联合优化
2022-2024	Transformer	序列建模能力	长期兴趣理解

7.3 实时推荐系统设计

7.3.1 系统架构

┌──────────────────────────────────────────────────────────────────┐
│                      实时推荐系统架构                              │
├──────────────────────────────────────────────────────────────────┤
│                                                                   │
│   请求入口                                                         │
│      ↓                                                           │
│   [API Gateway] ─── 限流/鉴权/路由                                │
│      ↓                                                           │
│   [Ranking Service]                                              │
│      │                                                           │
│      ├──→ [召回服务集群]                                          │
│      │     ├─ 协同过滤召回（CF Recall）                           │
│      │     ├─ 内容召回（Content Recall）                          │
│      │     ├─ 热门召回（Trending Recall）                         │
│      │     ├─ 向量召回（Vector Recall）                           │
│      │     └─ 图召回（Graph Recall）                              │
│      │                                                           │
│      ├──→ [特征服务]                                             │
│      │     ├─ 用户特征服务（User Feature Service）                 │
│      │     ├─ 物品特征服务（Item Feature Service）                 │
│      │     └─ 实时特征服务（Realtime Feature Service）            │
│      │                                                           │
│      ├──→ [预测服务]                                             │
│      │     ├─ CTR预测（Click-Through Rate）                       │
│      │     ├─ CVR预测（Conversion Rate）                          │
│      │     └─ 时长预测（Duration Prediction）                     │
│      │                                                           │
│      └──→ [业务规则]                                             │
│            ├─ 多样性控制（Diversity Control）                      │
│            ├─ 时效性保证（Freshness Guarantee）                   │
│            └─ 负反馈过滤（Negative Feedback Filter）              │
│                                                                   │
│   响应结果                                                         │
│                                                                   │
└──────────────────────────────────────────────────────────────────┘

7.3.2 召回策略

多路召回融合：

// 召回服务核心逻辑
type RecallService struct {
    strategies []RecallStrategy
}

func (rs *RecallService) Recall(ctx context.Context, req *RecallRequest) ([]*Item, error) {
    var wg sync.WaitGroup
    resultChan := make(chan []*Item, len(rs.strategies))
    
    // 并行执行多路召回
    for _, strategy := range rs.strategies {
        wg.Add(1)
        go func(s RecallStrategy) {
            defer wg.Done()
            items, err := s.Recall(ctx, req)
            if err == nil {
                resultChan <- items
            }
        }(strategy)
    }
    
    // 等待所有召回完成
    go func() {
        wg.Wait()
        close(resultChan)
    }()
    
    // 合并召回结果
    mergedItems := rs.mergeResults(resultChan)
    return mergedItems, nil
}

7.3.3 排序模型服务

模型服务架构：

┌────────────────────────────────────────────────────┐
│              模型服务架构                            │
├────────────────────────────────────────────────────┤
│                                                     │
│   [模型管理平台]                                     │
│      ↓                                             │
│   模型训练 ──→ 模型验证 ──→ 灰度发布 ──→ 全量上线      │
│                                                     │
│   [在线预测服务]                                     │
│   ├─ TensorFlow Serving (深度模型)                  │
│   ├─ XGBoost Server (树模型)                        │
│   └─ Custom Server (自定义模型)                      │
│                                                     │
│   [模型更新机制]                                     │
│   ├─ 热更新：参数级别，秒级生效                       │
│   ├─ 增量更新：小时级，在线学习                       │
│   └─ 全量更新：天级，离线重训练                       │
│                                                     │
└────────────────────────────────────────────────────┤

7.4 多目标优化与业务平衡

7.4.1 多目标建模

字节的推荐系统需要同时优化多个业务目标：

┌─────────────────────────────────────────────────────────┐
│                   多目标优化框架                          │
├─────────────────────────────────────────────────────────┤
│                                                          │
│   业务目标                     权重      优化方向         │
│   ├─ 点击率（CTR）             0.3        ↑             │
│   ├─ 停留时长（Duration）       0.25       ↑             │
│   ├─ 完播率（Completion）       0.2        ↑             │
│   ├─ 互动率（Engagement）       0.15       ↑             │
│   ├─ 负反馈率（Negative）       0.1        ↓             │
│   └─ 多样性（Diversity）        -          平衡          │
│                                                          │
│   融合公式：                                              │
│   Score = w₁·P(click) + w₂·P(duration) + w₃·P(complete)  │
│          + w₄·P(engage) - w₅·P(negative) + λ·diversity   │
│                                                          │
└─────────────────────────────────────────────────────────┘

7.4.2 MMOE架构实现

# 多目标学习模型简化实现
class MMOEModel(tf.keras.Model):
    def __init__(self, num_experts=8, num_tasks=4):
        super().__init__()
        self.num_experts = num_experts
        self.num_tasks = num_tasks
        
        # 专家网络
        self.experts = [
            self.build_expert() for _ in range(num_experts)
        ]
        
        # 门控网络
        self.gates = [
            self.build_gate() for _ in range(num_tasks)
        ]
        
        # 任务塔
        self.towers = [
            self.build_tower() for _ in range(num_tasks)
        ]
    
    def call(self, inputs):
        # 专家输出
        expert_outputs = [expert(inputs) for expert in self.experts]
        expert_outputs = tf.stack(expert_outputs, axis=1)
        
        task_outputs = []
        for i in range(self.num_tasks):
            # 门控机制
            gate_output = self.gates[i](inputs)
            gate_output = tf.nn.softmax(gate_output, axis=1)
            
            # 加权融合专家输出
            weighted_expert = tf.reduce_sum(
                expert_outputs * tf.expand_dims(gate_output, -1), 
                axis=1
            )
            
            # 任务特定输出
            task_output = self.towers[i](weighted_expert)
            task_outputs.append(task_output)
        
        return task_outputs

7.5 技术挑战与解决方案

7.5.1 规模化挑战

数据规模增长：

指标	2014年	2018年	2024年
日活用户	1000万	2亿	10亿+
日均请求	1亿	100亿	1000亿+
特征维度	10万	1000万	10亿+
模型参数	100万	10亿	1000亿+
训练数据	1TB	100TB	10PB+

解决方案架构：

┌──────────────────────────────────────────────────────────┐
│                 大规模推荐系统架构                         │
├──────────────────────────────────────────────────────────┤
│                                                           │
│  [分布式训练平台]                                          │
│  ├─ Parameter Server架构                                 │
│  ├─ Ring-AllReduce优化                                   │
│  ├─ 混合精度训练                                         │
│  └─ 流水线并行                                           │
│                                                           │
│  [特征存储优化]                                           │
│  ├─ 分级存储：SSD + HDD + 对象存储                        │
│  ├─ 特征压缩：量化 + 稀疏化                               │
│  ├─ 缓存策略：LRU + 预取                                 │
│  └─ 分片策略：一致性哈希                                  │
│                                                           │
│  [在线服务优化]                                           │
│  ├─ 模型压缩：剪枝 + 量化 + 蒸馏                          │
│  ├─ 推理加速：TensorRT + ONNX                            │
│  ├─ 缓存优化：多级缓存架构                                │
│  └─ 负载均衡：动态路由 + 熔断降级                         │
│                                                           │
└──────────────────────────────────────────────────────────┘

7.5.2 实时性挑战

延迟优化技术栈：

// 预计算优化
type PrecomputeService struct {
    cache *Cache
    predictor *Predictor
}

func (ps *PrecomputeService) GetRecommendations(userID string) ([]*Item, error) {
    // 1. 检查预计算缓存
    if cached, exists := ps.cache.Get(userID); exists {
        return cached.([]*Item), nil
    }
    
    // 2. 实时计算兜底
    items, err := ps.predictor.Predict(userID)
    if err != nil {
        return nil, err
    }
    
    // 3. 异步更新缓存
    go ps.cache.SetAsync(userID, items, 5*time.Minute)
    
    return items, nil
}

// 并行计算优化
func ParallelScore(items []*Item, scorer Scorer) []*ScoredItem {
    numWorkers := runtime.NumCPU()
    jobs := make(chan *Item, len(items))
    results := make(chan *ScoredItem, len(items))
    
    // 启动工作协程
    for w := 0; w < numWorkers; w++ {
        go func() {
            for item := range jobs {
                score := scorer.Score(item)
                results <- &ScoredItem{Item: item, Score: score}
            }
        }()
    }
    
    // 分发任务
    for _, item := range items {
        jobs <- item
    }
    close(jobs)
    
    // 收集结果
    scoredItems := make([]*ScoredItem, 0, len(items))
    for i := 0; i < len(items); i++ {
        scoredItems = append(scoredItems, <-results)
    }
    
    return scoredItems
}

7.5.3 冷启动问题

冷启动解决方案矩阵：

场景	策略	实现方式
新用户	探索策略	ε-greedy、Thompson Sampling、UCB
新内容	流量倾斜	新内容池、boost机制
新类目	迁移学习	相似类目特征迁移
新场景	Meta学习	MAML、Prototypical Networks

# 新用户冷启动策略
class ColdStartStrategy:
    def __init__(self):
        self.explore_rate = 0.3
        self.popular_pool_size = 1000
        
    def recommend_for_new_user(self, user_profile):
        recommendations = []
        
        # 1. 基于注册信息的初始化
        if user_profile.has_demographic():
            similar_users = self.find_similar_users(user_profile)
            recommendations.extend(
                self.get_popular_among_similar(similar_users)
            )
        
        # 2. 探索性推荐
        if random.random() < self.explore_rate:
            explore_items = self.get_diverse_items()
            recommendations.extend(explore_items)
        
        # 3. 全局热门兜底
        popular_items = self.get_global_popular(self.popular_pool_size)
        recommendations.extend(popular_items)
        
        return self.diversify(recommendations)