Traditional navigation systems, such as SLAM-based planners, deliver excellent geometric accuracy and predictability in structured environments, but they are executors that cannot understand natural language: an instruction like "bring me that cup of coffee on the table before it gets cold," full of common sense and context, is beyond them. At the other extreme, pure end-to-end Vision-Language-Action (VLA) models map directly from pixels and language to actions and show striking generalization, but their black-box nature, high computational cost, and lack of millisecond-level response guarantees make them fragile and unreliable in dynamic, safety-critical physical settings. This chapter examines a VLA Big Brain - Cerebellum hybrid navigation architecture, a pragmatic and powerful paradigm designed to combine the strengths of both: perception-language-planning tasks that need world knowledge and long-horizon reasoning go to the "big brain," while fast, deterministic, safety-critical reflexes and stabilizing control go to the "cerebellum."

After studying this chapter, you will be able to:
The core philosophy of the big-brain/cerebellum architecture is the decoupling of computation and time. It acknowledges that different tasks place vastly different demands on compute resources and response time, and uses layering to optimize the system's overall efficiency and safety.
The big brain consumes multi-modal observations (List[Image]), the user instruction (String), cerebellum feedback (Dict[status, message]), retrieved context (List[Contextual_Info]), and map data (Map_Data); it produces a structured subgoal plan in JSON:
{
"plan_id": "plan-123",
"overall_goal": "fetch the apple from the kitchen",
"subgoals": [
{
"id": 0,
"skill": "NAVIGATE_TO",
"params": {"target_description": "the kitchen door"},
"dependencies": []
},
{
"id": 1,
"skill": "PASS_THROUGH",
"params": {"door_id": "kitchen_door_01"},
"dependencies": [0]
},
// ... more subgoals
]
}
The NAVIGATE_TO skill can be implemented by a local planner such as model predictive control (MPC) or the dynamic window approach (DWA). AVOID_OBSTACLE is a continuously running collision-detection module based on depth or optical flow. PASS_THROUGH may be a carefully designed state machine combining localization, alignment, and speed-control steps. The cerebellum's input is a subgoal of the form Dict[skill, params]; its output is a velocity command cmd_vel = {vx, vy, vω}.

Data Flow and Interaction Mechanism
+------------------------------------+    Subgoal API (JSON)          +-------------------------------+
| Big Brain (VLA) - Cognitive Core   |  (e.g., NAVIGATE_TO(kitchen))  | Cerebellum                    |
| - Multi-modal Perception           |------------------------------->| - High-freq State Estimation  |
| - Language Grounding               |                                | - Local Planner (DWA/MPC)     |
| - Chain-of-Thought Planning        |        Feedback Channel        | - Collision Avoidance         |
|   (Slow, ~1Hz, High Latency)       |<-------------------------------| - Motor Control Interface     |
+-----------------+------------------+    (JSON: {status, reason})    +---------------+---------------+
                  ^                                                                   |
                  | Memory Bus (Vector DB)                                            | High Freq Action
                  v                                                                   | (~50Hz)
+-----------------+------------------+  (Success, Failure, Blocked)                   |
| Memory System (Long/Short Term)    |<-----------------------------------------------+
| - Scene Graph, Vector Embeddings   |                                                |
+------------------------------------+                                                v
                  ^                                                             +-----------+
                  | Multi-modal Observations (Images, Text, Map, State)         | Actuators |
                  +-----------------------------------------------------------+-----------+
Rule-of-Thumb: The separation of timescales between the big brain and the cerebellum is the key to this architecture's success. The brain's output must be temporally persistent (a subgoal lasts seconds to minutes), while the cerebellum's decisions are instantaneous. This separation lets the expensive inference cost be amortized while keeping the system responsive in real time.
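To make this split concrete, here is a minimal Python sketch of a single cerebellum tick. It assumes a hypothetical `cerebellum_step` that consumes a {skill, params} subgoal and emits a cmd_vel-style command; the clearance threshold, the proportional gain, and all names are illustrative assumptions, not part of any specification:

```python
import math
from dataclasses import dataclass

@dataclass
class CmdVel:
    vx: float = 0.0
    vy: float = 0.0
    w: float = 0.0  # yaw rate (vω)

def cerebellum_step(subgoal: dict, robot_pose, min_obstacle_dist: float) -> CmdVel:
    """One ~50 Hz tick: safety check first, then skill-specific control."""
    # [HIGHEST PRIORITY] reflexive stop, independent of the current skill
    if min_obstacle_dist < 0.3:            # clearance threshold in metres (assumed)
        return CmdVel()                    # zero command = safe stop
    if subgoal["skill"] == "NAVIGATE_TO":
        gx, gy = subgoal["params"]["target"]
        x, y, yaw = robot_pose
        heading = math.atan2(gy - y, gx - x) - yaw
        # crude proportional controller standing in for DWA/MPC
        return CmdVel(vx=0.5, w=max(-1.0, min(1.0, 2.0 * heading)))
    return CmdVel()  # unknown skill: command nothing

cmd = cerebellum_step(
    {"skill": "NAVIGATE_TO", "params": {"target": (2.0, 0.0)}},
    robot_pose=(0.0, 0.0, 0.0),
    min_obstacle_dist=1.5,
)
```

Note how the safety check runs before any skill logic: the reflexive stop never waits on the slow planning path.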
This is the big brain's core function: turning vague human language into steps a machine can execute precisely. The process breaks down into three stages:
In the grounding stage, semantic targets named in the instruction are resolved to concrete map coordinates (x, y). An example planning prompt:

Prompt: You are a helpful robot assistant. Your available skills are [NAVIGATE_TO, PICK, PLACE, ...]. Given the user request: "Please take the empty can from the coffee table and throw it in the trash bin under the sink." Decompose this into a sequence of executable subgoals in JSON format.
... [CoT] ... I need to first go to the coffee table, then pick up the can, then go to the sink, then place the can in the trash bin.
... [JSON Output] ...
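The brain's JSON output should be treated as untrusted input. Below is a hedged sketch of parsing and sanity-checking it, assuming a skill whitelist and the subgoal schema shown earlier; the raw string and all names are illustrative:

```python
import json

# Skills the cerebellum is assumed to implement (illustrative set)
AVAILABLE_SKILLS = {"NAVIGATE_TO", "PICK", "PLACE", "PASS_THROUGH", "WAIT", "SEARCH_FOR"}

def parse_plan(raw: str) -> list[dict]:
    """Parse the LLM's JSON plan and reject structurally invalid subgoals."""
    plan = json.loads(raw)
    subgoals = plan["subgoals"]
    for sg in subgoals:
        if sg["skill"] not in AVAILABLE_SKILLS:
            raise ValueError(f"unknown skill: {sg['skill']}")
        # dependencies must point at earlier subgoal ids only
        if any(dep >= sg["id"] for dep in sg.get("dependencies", [])):
            raise ValueError(f"subgoal {sg['id']} has a forward dependency")
    return subgoals

raw = """{"plan_id": "plan-123", "subgoals": [
  {"id": 0, "skill": "NAVIGATE_TO", "params": {"target_description": "coffee table"}, "dependencies": []},
  {"id": 1, "skill": "PICK", "params": {"object_id": "can_01"}, "dependencies": [0]}
]}"""
subgoals = parse_plan(raw)
```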
The NAVIGATE_TO skill needs a target parameter, which may come from the grounding stage (e.g., the coordinates of the coffee_table) or from commonsense reasoning (a trash_bin is usually near the sink). The PICK skill needs the 3D position and pose of the object to be manipulated, which the vision system must estimate precisely.

Subgoal Skill API Design

The cerebellum's robustness depends heavily on how this API is designed. A good API should be:
| Skill | Parameters | Feedback |
|---|---|---|
| NAVIGATE_TO | target: Union[Pose, SemanticID, Region] | SUCCESS, FAILURE_UNREACHABLE, FAILURE_BLOCKED |
| PICK | object_id: String, grasp_pose: Pose | SUCCESS, FAILURE_NOT_FOUND, FAILURE_GRASP_FAILED |
| WAIT | duration: float or condition: Event | SUCCESS_TIME_ELAPSED, SUCCESS_CONDITION_MET |
| SEARCH_FOR | object_desc: String, search_area: Region | SUCCESS_FOUND(object_id, pose), FAILURE_NOT_FOUND |
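The table above can be encoded as typed messages so that malformed subgoals fail at construction time rather than at execution time. A minimal Python sketch; the dataclass and enum names, and the stub executor, are assumptions for illustration:

```python
from dataclasses import dataclass
from enum import Enum, auto

class Feedback(Enum):
    """Feedback codes mirroring the skill table."""
    SUCCESS = auto()
    FAILURE_UNREACHABLE = auto()
    FAILURE_BLOCKED = auto()
    FAILURE_NOT_FOUND = auto()
    FAILURE_GRASP_FAILED = auto()

@dataclass
class NavigateTo:
    target: object            # Pose | SemanticID | Region
    skill: str = "NAVIGATE_TO"

@dataclass
class Pick:
    object_id: str
    grasp_pose: tuple
    skill: str = "PICK"

def execute(subgoal) -> Feedback:
    """Stub executor: a real cerebellum would run the skill's state machine."""
    if isinstance(subgoal, NavigateTo) and subgoal.target is None:
        return Feedback.FAILURE_UNREACHABLE
    return Feedback.SUCCESS

fb = execute(NavigateTo(target=("kitchen",)))
```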
A brain without memory is myopic and cannot carry out complex tasks that require long-term context. The memory system is the brain's external "hard drive," providing a persistent understanding of the world.
Long-term memory is centered on a scene graph: nodes represent objects, rooms, and landmarks, while edges encode spatial and semantic relations (in_room, on_top_of, connected_to). Each node carries rich attributes such as 3D position, size, color, semantic category, and state (is_open, is_full). The graph is updated continuously and asynchronously by a background mapping-and-perception module (e.g., the OCC method of Chapter 9 plus 3D object detection). Short-term memory records the full interaction history as (timestamp, speaker, utterance/action, result) tuples; this is essential for coreference resolution (the "it" in "bring it over") and for understanding multi-turn instructions. Episodic memory stores failures, e.g., [NAVIGATE_TO(back_door), FAILURE_BLOCKED, Reason: "large box detected"]. When the brain later generates a similar plan, it can retrieve these experiences to avoid known mistakes.

Rule-of-Thumb (1B vs 10B): A 10B model can handle longer, more "raw" context, so you can concatenate several retrieved memory passages directly into the prompt. A 1B model has a limited context window and is more sensitive to noise; it is better to first summarize and distill the retrieval results with a small model (e.g., T5) into a highly condensed context before feeding it to the brain.
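A minimal sketch of the two memory structures just described, assuming a dictionary-based scene graph and a list-based episodic log; helper names like `room_of` and `known_failures` are illustrative:

```python
# Scene graph: attributed nodes plus (source, relation, target) edges
scene_graph = {
    "nodes": {
        "kitchen":      {"type": "room"},
        "coffee_table": {"type": "furniture", "pos": (2.1, 0.4, 0.0)},
        "can_01":       {"type": "object", "state": {"is_empty": True}},
    },
    "edges": [
        ("coffee_table", "in_room", "kitchen"),
        ("can_01", "on_top_of", "coffee_table"),
    ],
}

# Episodic log of failures the brain can consult before re-planning
episodic_log = [
    {"subgoal": "NAVIGATE_TO(back_door)", "result": "FAILURE_BLOCKED",
     "reason": "large box detected"},
]

def room_of(obj):
    """Follow on_top_of / in_room edges to find which room an object is in."""
    cur = obj
    for _ in range(len(scene_graph["edges"]) + 1):   # bounded walk, no cycles
        for src, rel, dst in scene_graph["edges"]:
            if src == cur:
                if rel == "in_room":
                    return dst
                cur = dst
                break
        else:
            return None
    return None

def known_failures(subgoal: str):
    """Retrieve past failures for an identical subgoal."""
    return [e for e in episodic_log if e["subgoal"] == subgoal]
```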
While empowering the VLA, we must also "cage" it with strict rules to prevent harmful or dangerous behavior. That system is the decision guardrails.
The plan validator runs static checks before execution: Is the NAVIGATE_TO goal inside the map's traversable area? Is the PICK object within the manipulator's reachable workspace? Before OPEN(door), is the robot actually AT(door)? The runtime safety layer acts reflexively: if a child suddenly runs into the path during a NAVIGATE_TO task, the robot brakes immediately, without waiting for the brain's permission. The cerebellum then reports a FAILURE_BLOCKED status, and on receiving it the brain is responsible for re-planning.

+----------------------------+
| VLA Brain Generates Plan P |
+--------------+--------------+
|
v
+------------------------------------+-------------------------------------+
| Plan Validator: Is P physically, logically, and ethically sound? | --(No)--> Reject Plan, request re-plan from Brain with reason.
+------------------------------------+-------------------------------------+
| (Yes)
v
+------------------------------------+-------------------------------------+
| Confidence Check: Is confidence(P) > threshold? | --(No)--> Enter Clarification Mode, ask user for more info.
+------------------------------------+-------------------------------------+
| (Yes)
v
+--------------+--------------+
| Dispatch Subgoal g_i to Cerebellum |
+--------------+--------------+
|
+------------------------------------v-------------------------------------+
| Cerebellum Execution Loop (~50Hz): |
| 1. Read current subgoal g_i. |
| 2. **[HIGHEST PRIORITY]** Run safety-critical checks (e.g., collision). | --(IMMINENT DANGER)--> IMMEDIATE SAFE STOP & report failure.
| 3. Compute motor commands to progress towards g_i. |
| 4. If g_i completed, report SUCCESS. If stuck, report FAILURE. |
+--------------------------------------------------------------------------+
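The validation-and-dispatch pipeline in the diagram above can be sketched as follows, with a toy traversability set standing in for the real costmap and an assumed confidence threshold; all names are illustrative:

```python
TRAVERSABLE = {(0, 0), (1, 0), (2, 0)}   # toy free-space map

def validate(plan: list) -> tuple:
    """Plan Validator: reject subgoals that are physically unsound."""
    for sg in plan:
        if sg["skill"] == "NAVIGATE_TO" and tuple(sg["params"]["target"]) not in TRAVERSABLE:
            return False, f"subgoal {sg['id']}: target not in traversable area"
    return True, "ok"

def dispatch(plan: list, confidence: float, threshold: float = 0.7) -> str:
    """Validate, gate on confidence, then hand the first subgoal to the cerebellum."""
    ok, reason = validate(plan)
    if not ok:
        return f"REJECTED: {reason}"            # ask the brain to re-plan
    if confidence < threshold:
        return "CLARIFY: ask user for more information"
    return f"DISPATCH subgoal {plan[0]['id']}"

plan = [{"id": 0, "skill": "NAVIGATE_TO", "params": {"target": (2, 0)}}]
result = dispatch(plan, confidence=0.9)
```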
The big-brain/cerebellum architecture is no castle in the air; it builds directly on the mature robotics stack. The VLA plays the role of the "semantic decision layer," while the traditional/OCC modules form the "geometric and physical layer" it depends on to survive.
The geometric layer exposes query interfaces such as is_reachable(point), get_room_polygon("kitchen"), and find_free_space_near(object). The NAVIGATE_TO skill is internally just a standard local planner (DWA/TEB/MPC): when the cerebellum receives a NAVIGATE_TO({x,y,θ}) command, it passes the goal pose to that planner, which generates smooth, collision-free cmd_vel commands.

Hybrid stack architecture:
+------------------------------------------------------+
| Layer 4: Cognitive & Semantic (VLA Big Brain) | <-- "Why?" (e.g., "User wants coffee")
| - Understands language, reasons about goals. |
| - Outputs symbolic plan: [NAVIGATE_TO(kitchen),..] |
+--------------------------+---------------------------+
| Subgoal API
v
+--------------------------+---------------------------+
| Layer 3: Task Execution (Cerebellum) | <-- "What?" (e.g., Execute NAVIGATE_TO)
| - Manages state machines for skills. |
| - Calls global/local planners. |
+--------------------------+---------------------------+
| Goal Pose / Path
v
+--------------------------+---------------------------+
| Layer 2: Path & Motion Planning (Traditional) | <-- "How?" (e.g., A*, DWA)
| - Computes collision-free trajectories. |
| - Operates on costmaps. |
+--------------------------+---------------------------+
| Map Data / Costmap
v
+--------------------------+---------------------------+
| Layer 1: Geometric World Model (SLAM/OCC) | <-- "Where?" (e.g., "I am here in the map")
| - Builds and maintains map from sensor data. |
+--------------------------+---------------------------+
| Raw Sensor Data (RGB-D, IMU)
v
+------------------------------------------------------+
| Layer 0: Hardware & Sensors |
+------------------------------------------------------+
Rule-of-Thumb: When designing the hybrid stack, interfaces are the key. Layers should communicate through well-defined, versioned APIs (e.g., Protobuf, ROS Msgs). This lets you upgrade or replace any single layer independently (say, swapping in a new VLA model for the brain, or a better planner in the cerebellum) without restructuring the whole system.
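As a lightweight stand-in for Protobuf or ROS messages, the idea of a versioned layer boundary can be sketched in Python; the schema_version field is the load-bearing part, and everything else (names, the "1." compatibility rule) is an illustrative assumption:

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class SubgoalMsg:
    """Versioned message crossing the layer-3 / layer-4 boundary."""
    schema_version: str
    skill: str
    params: dict

def encode(msg: SubgoalMsg) -> str:
    return json.dumps(asdict(msg))

def decode(raw: str) -> SubgoalMsg:
    d = json.loads(raw)
    # Explicit version check lets either side evolve without silent breakage
    if not d.get("schema_version", "").startswith("1."):
        raise ValueError(f"unsupported schema: {d.get('schema_version')}")
    return SubgoalMsg(**d)

wire = encode(SubgoalMsg("1.0", "NAVIGATE_TO", {"target_description": "kitchen"}))
msg = decode(wire)
```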
This chapter dissected the VLA big-brain/cerebellum hybrid navigation architecture, a leading paradigm that, at the current state of the art, combines strong semantic intelligence with safety and reliability in the physical world.
API granularity mismatch: A subgoal API that is too abstract (e.g., TIDY_ROOM) cannot be executed by the cerebellum; one that is too concrete (e.g., MOVE_WHEEL(rad, speed)) traps the brain in micromanagement and destroys its value. Mitigation: Design a skill API that is "goal-oriented" rather than "procedure-oriented." Each skill should correspond to an atomic, reliable, independently verifiable robot capability. Iterate on the API design, starting from the core NAVIGATE, PICK, PLACE and expanding gradually.
World-state staleness: the environment may change between planning and execution, invalidating the plan's assumptions. Mitigation: re-verify critical preconditions with fresh perception immediately before acting (e.g., before a PICK, use the camera to confirm the object is still there).
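A sketch of this re-verification guard, assuming a stub `perceive_now()` in place of a real detection pass; all names are illustrative:

```python
def perceive_now() -> set:
    """Stub for a fresh camera-based detection pass."""
    return {"can_01"}          # object ids currently visible

def precheck_pick(object_id: str) -> str:
    """Check PICK preconditions against fresh perception, not stale memory."""
    if object_id not in perceive_now():
        return "FAILURE_NOT_FOUND"   # reported to the brain, triggering re-planning
    return "PRECONDITIONS_OK"
```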