This chapter is the blueprint for the entire system design. Unlike a traditional voice assistant built on HTTP request/response, a cabin system based on the OpenAI Realtime API is a full-duplex, event-driven, deeply edge-cloud-coordinated distributed system.
In this chapter we decompose the system into three physical and logical planes: the Vehicle Edge (perception and execution), the Connectivity Layer (realtime links), and the Cloud Cognitive Layer (reasoning and orchestration). We must solve not only "how to converse", but also how to keep the connection alive under 4G/5G network jitter, how to handle the high throughput of multimodal data, and how to ensure that the cloud LLM's "hallucinations" can never threaten driving safety.
Learning objectives:
We adhere to the following core design principles:
[ User / Cabin Environment ]
^ | |
(Audio)| |(Img) |(Touch)
v v v
+-----------------------------------------------------------------------------------+
| Vehicle Edge (Client Zone) - Trust Level: High (ASIL-B/QM) |
+-----------------------------------------------------------------------------------+
| 1. Perception Layer (Sensors) |
| +-------------+ +------------------+ +-------------------+ |
| | Mic Array | | Cameras (DMS/7V) | | Screen/IVI System | |
| +-[DSP/AEC]---+ +--[FrameSampler]--+ +-[AccessService]---+ |
| | | | |
| v v v |
| 2. Realtime Client Adapter (Mediator) |
| +-----------------------------------------------------------------------+ |
| | Session Manager (Auth, Reconnect, State Sync) | |
| | +-----------------------+ +-------------------------------------+ | |
| | | Audio Stream (WebRTC) | | Control/Event Channel (WebSocket) | | |
| | +-----------------------+ +-------------------------------------+ | |
| +-----------------------------------------------------------------------+ |
| ^ |
| | (Local Fallback & Feedback) |
| 3. Execution Layer (Actuators) v |
| +---------------------+ +---------------------+ |
| | Vehicle Control GW | <--> | Local TTS/Player | |
| | (Safety Check/CAN) | | (Buffer Mgmt) | |
| +---------------------+ +---------------------+ |
+-------------+-----------------------+---------------------------------------------+
| (RTP Media) | (JSON Events / Tool Calls)
| |
[ Secure Network Tunnel (TLS 1.3 / mTLS) ]
| |
+-------------+-----------------------+---------------------------------------------+
| Cloud / Edge (Server Zone) - Trust Level: Zero Trust |
+-----------------------------------------------------------------------------------+
| 4. Conversation Layer (OpenAI Realtime API) |
| +-------------------------------------------------------------+ |
| | Model Inference (GPT-4o class) | |
| | [VAD] -> [STT] -> [Reasoning] -> [TTS] | |
| | ^ | (Function Call Intent) | |
| | | (Context) v | |
| +--+----------------------------------------------------------+ |
| | Events |
| v |
| 5. Orchestration Layer (Agents SDK Runtime) |
| +-------------------------------------------------------------+ |
| | Main Orchestrator (Router) | |
| | +-------------+ +-------------+ +-------------+ | |
| | | CarCtrl Agt | | KnowledgeAg | | GUI Auto Agt| | |
| | +-------------+ +-------------+ +-------------+ | |
| +--------+----------------+-------------------+---------------+ |
| | | | |
| [Vehicle] [Vector DB] [App API/Map] |
| [State DB] (RAG) (Services) |
+-----------------------------------------------------------------------------------+
The layers in the diagram map onto the following components:
- Realtime Client Adapter: the core service residing on the head unit (IVI, Android/QNX).
- OpenAI Realtime API: the system's "interaction surface". Tune the server VAD threshold so that wind noise does not falsely trigger barge-in; when the model decides to act, the API emits tool_calls events.
- Agents SDK Runtime: the brain of the business logic, typically deployed in the OEM's private cloud or a serverless environment. It receives tool_calls and dispatches them to downstream Agents by function name.
- Vehicle Control Gateway: the last line of defense before any command reaches the vehicle.
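The Vehicle Control Gateway's role as the last line of defense can be illustrated with a minimal, deterministic safety check. This is a sketch only: the rule table `SPEED_LIMITED_ACTIONS` and its thresholds are illustrative assumptions, not values from any real vehicle platform.

```python
# Maximum speed (km/h) at which each actuator command is still permitted.
# These thresholds are hypothetical examples.
SPEED_LIMITED_ACTIONS = {
    "open_door": 0.0,      # only when stationary
    "fold_mirror": 30.0,
    "open_window": 120.0,
}

def check_command(action: str, vehicle_state: dict) -> tuple[bool, str]:
    """Return (allowed, reason) for an actuator command given current vehicle state."""
    limit = SPEED_LIMITED_ACTIONS.get(action)
    speed = vehicle_state.get("speed", 0.0)
    if limit is not None and speed > limit:
        return False, f"{action} rejected at {speed} km/h (limit {limit} km/h)"
    return True, "ok"
```

The key design choice is that this check reads only deterministic vehicle state and never model output, so a hallucinated tool call cannot talk its way past it.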
Clear data definitions are the key to decoupling.
Besides the conversation history, the Context also carries a vehicle environment snapshot, refreshed on every conversation turn:
{
"session_id": "sess_abc123",
"vehicle_context": {
"speed": 85.0,
"gear": "D",
"passengers": ["driver", "rear_right"],
"location": {"lat": 31.2, "lng": 121.4},
"screen_activity": "com.tesla.nav"
}
}
Rule of Thumb: do not stuff every vehicle signal into the Context; include only the signals strongly relevant to interaction decisions (e.g. speed, passengers, the current app).
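The rule of thumb above can be enforced with a simple whitelist projection when the snapshot is built each turn. The field names follow the JSON example; the whitelist itself is an assumption you would tune per deployment.

```python
# Only these signals are considered interaction-relevant (assumed whitelist).
CONTEXT_WHITELIST = ("speed", "gear", "passengers", "location", "screen_activity")

def build_vehicle_context(raw_signals: dict) -> dict:
    """Project the full vehicle signal bus down to the interaction-relevant subset."""
    return {k: raw_signals[k] for k in CONTEXT_WHITELIST if k in raw_signals}
```

Anything not on the whitelist (tire pressure, battery cell temperatures, ...) never reaches the model, which keeps the prompt small and avoids leaking irrelevant telemetry.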
The structure emitted by the Realtime API vs. the structure returned by the Agents SDK:
Request (From Realtime API):
{
"type": "function",
"name": "adjust_climate",
"arguments": "{ \"temperature\": 22, \"zone\": \"driver\" }"
}
Response (From Agents SDK):
{
"tool_call_id": "call_xyz",
"status": "success",
"output": "Driver-zone temperature set to 22°C. The outside temperature is low; consider turning on seat heating.",
"visual_feedback": { "widget_id": "climate_toast", "duration": 3000 }
}
Note: the response carries not only the text result but may also include visual_feedback, used to pop a widget on the IVI screen.
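The translation between the two structures can be sketched as a small dispatcher on the Agents SDK side. The handler registry and the `adjust_climate` handler below are hypothetical stand-ins for real downstream Agents; note that `arguments` arrives as a JSON-encoded string, not a dict.

```python
import json

def adjust_climate(temperature: int, zone: str) -> dict:
    # A real implementation would route this through the Vehicle Control Gateway.
    return {
        "output": f"{zone} temperature set to {temperature}°C.",
        "visual_feedback": {"widget_id": "climate_toast", "duration": 3000},
    }

# Function name -> handler; populated per deployment (assumed registry).
HANDLERS = {"adjust_climate": adjust_climate}

def handle_tool_call(call: dict, call_id: str) -> dict:
    """Turn a Realtime API tool call into the response shape shown above."""
    args = json.loads(call["arguments"])   # arguments arrive as a JSON string
    result = HANDLERS[call["name"]](**args)
    return {"tool_call_id": call_id, "status": "success", **result}
```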
User Mic/VAD RealtimeAPI AgentsSDK VehicleGW
| | | | |
|(TTS playing "The weather is...") | |
| | | | |
|"Too loud!"| | | |
|---------->| [Detect] | | |
| |--Truncate->| | |
| |--AudioOpus>| | |
| | | [CancelTTS] | |
| | | [Infer] | |
| | |<ToolCall> | |
| | |--set_vol--->| |
| | | | [Check] |
| | | |--Cmd------>|
| | | | |[Set Vol]
| | | |<--Ack------|
| | |<--Result----| |
| | | [Gen Resp] | |
|<--Audio ("OK")---------| | |
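The client side of the barge-in sequence above can be sketched as follows. Here `ws` is any object with a `send(str)` method and `player` any local audio player exposing `played_ms()` and `flush()`; both interfaces are hypothetical. The event types `response.cancel` and `conversation.item.truncate` are Realtime API events.

```python
import json

def on_user_speech_started(ws, player, current_item_id: str) -> None:
    """Handle barge-in: stop generation, truncate history, flush local audio."""
    # 1. Stop the in-flight response generation on the server.
    ws.send(json.dumps({"type": "response.cancel"}))
    # 2. Truncate the assistant item to what the user actually heard, so the
    #    server-side conversation history matches reality.
    ws.send(json.dumps({
        "type": "conversation.item.truncate",
        "item_id": current_item_id,
        "content_index": 0,
        "audio_end_ms": player.played_ms(),
    }))
    # 3. Drop any audio still buffered locally.
    player.flush()
```

Truncating to `played_ms()` matters: without it the model believes the user heard the whole sentence, and follow-up turns become incoherent.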
To meet high-availability requirements:
This chapter has constructed a deeply integrated edge-cloud hybrid architecture.
It binds the probabilistic capabilities of AI generation to the deterministic requirements of vehicle control through explicit interface contracts.
Debugging tips:
- Inspect the Server VAD event logs. If VAD triggers frequently while the AI is speaking, the AEC is misbehaving.
- Use the conversation.item.truncate event to handle barge-in precisely, rather than relying on the audio stream alone.
- Watch for an open_window tool call still executing when close_window follows immediately; the vehicle hardware may not tolerate such a rapid reversal.