Steering Text-to-Speech for More Dynamic Audio Generation

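Our traditional TTS API offers no way to control how the generated audio is spoken. For example, when you convert a passage of text to audio, you cannot give any specific instructions about how that audio should be generated.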

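With audio chat completions, you can give specific instructions before the audio is generated. This lets you tell the API to speak at different speeds, in different tones, and with different accents. With appropriate instructions, the resulting voices can be more dynamic, natural, and contextually appropriate.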

Traditional TTS

Traditional TTS lets you specify a voice, but not the tone, accent, or any other contextual audio parameters.

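import os

from openai import OpenAI

# The client reads the API key from the OPENAI_API_KEY environment variable.
client = OpenAI()

# Make sure the output directory used throughout this example exists.
os.makedirs("./sounds", exist_ok=True)

tts_text = """
Once upon a time, Leo the lion cub woke up to the smell of pancakes and scrambled eggs.
His tummy rumbled with excitement as he raced to the kitchen. Mama Lion had made a breakfast feast!
Leo gobbled up his pancakes, sipped his orange juice, and munched on some juicy berries.
"""

# Only the model, the voice, and the input text can be chosen here.
speech_file_path = "./sounds/default_tts.mp3"
response = client.audio.speech.create(
    model="tts-1-hd",
    voice="alloy",
    input=tts_text,
)

response.write_to_file(speech_file_path)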

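Chat Completions TTS

With chat completions, you can give specific instructions before the audio is generated. In the following example, we generate speech with a British accent, aimed at children in a learning setting. This is particularly useful for educational applications where the assistant's voice matters to the learning experience.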

import base64

speech_file_path = "./sounds/chat_completions_tts.mp3"
completion = client.chat.completions.create(
    model="gpt-4o-audio-preview",
    modalities=["text", "audio"],
    audio={"voice": "alloy", "format": "mp3"},
    messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant that can generate audio from text. Speak in a British accent and enunciate like you're talking to a child.",
        },
        {
            "role": "user",
            "content": tts_text,
        }
    ],
)

# The audio comes back base64-encoded; decode it before writing it to disk.
mp3_bytes = base64.b64decode(completion.choices[0].message.audio.data)
with open(speech_file_path, "wb") as f:
    f.write(mp3_bytes)

speech_file_path = "./sounds/chat_completions_tts_fast.mp3"
completion = client.chat.completions.create(
    model="gpt-4o-audio-preview",
    modalities=["text", "audio"],
    audio={"voice": "alloy", "format": "mp3"},
    messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant that can generate audio from text. Speak in a British accent and speak really fast.",
        },
        {
            "role": "user",
            "content": tts_text,
        }
    ],
)

mp3_bytes = base64.b64decode(completion.choices[0].message.audio.data)
with open(speech_file_path, "wb") as f:
    f.write(mp3_bytes)
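
The two requests above differ only in their system prompt, so the shared parts can be factored out. The following is a minimal sketch, not part of the OpenAI SDK: the helper names build_audio_request and save_b64_audio are hypothetical, and the dictionary returned by the first helper is meant to be unpacked into client.chat.completions.create.

```python
import base64
from pathlib import Path

BASE_PROMPT = "You are a helpful assistant that can generate audio from text."


def build_audio_request(text, style, voice="alloy", fmt="mp3"):
    """Return keyword arguments for client.chat.completions.create.

    `style` is the steering instruction appended to the base system prompt,
    e.g. "Speak in a British accent and speak really fast."
    """
    return {
        "model": "gpt-4o-audio-preview",
        "modalities": ["text", "audio"],
        "audio": {"voice": voice, "format": fmt},
        "messages": [
            {"role": "system", "content": f"{BASE_PROMPT} {style}"},
            {"role": "user", "content": text},
        ],
    }


def save_b64_audio(b64_data, path):
    """Decode base64 audio data and write it to `path`.

    Parent directories are created if they do not exist yet.
    """
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(base64.b64decode(b64_data))
    return path
```

Under these assumptions, the fast variant becomes a call to client.chat.completions.create(**build_audio_request(tts_text, "Speak in a British accent and speak really fast.")), followed by save_b64_audio(completion.choices[0].message.audio.data, "./sounds/chat_completions_tts_fast.mp3").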

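Chat Completions Multilingual TTS

We can also generate audio in the accents of other languages. In the following example, we generate audio in Spanish with a Uruguayan accent. The text is first translated by a text model, and the translation is then passed to the audio model.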

completion = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "system",
            "content": "You are an expert translator. Translate any text given into Spanish like you are from Uruguay.",
        },
        {
            "role": "user",
            "content": tts_text,
        }
    ],
)
translated_text = completion.choices[0].message.content
print(translated_text)

speech_file_path = "./sounds/chat_completions_tts_es_uy.mp3"
completion = client.chat.completions.create(
    model="gpt-4o-audio-preview",
    modalities=["text", "audio"],
    audio={"voice": "alloy", "format": "mp3"},
    messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant that can generate audio from text. Speak any text that you receive in a Uruguayan spanish accent and more slowly.",
        },
        {
            "role": "user",
            "content": translated_text,
        }
    ],
)

mp3_bytes = base64.b64decode(completion.choices[0].message.audio.data)
with open(speech_file_path, "wb") as f:
    f.write(mp3_bytes)

Output of print(translated_text) from the translation step:

Había una vez un leoncito llamado Leo que se despertó con el aroma de panqueques y huevos revueltos. Su pancita gruñía de emoción mientras corría hacia la cocina. ¡Mamá León había preparado un festín de desayuno! Leo devoró sus panqueques, sorbió su jugo de naranja y mordisqueó algunas bayas jugosas.
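
The translate-then-speak flow above can be wrapped in a single function. This is a sketch under stated assumptions: translate_then_speak is a hypothetical helper, and its client argument is assumed to be an openai.OpenAI() instance (or any object exposing the same chat.completions.create interface, which also makes the function easy to exercise with a stub).

```python
import base64


def translate_then_speak(client, text, translate_prompt, speak_prompt,
                         voice="alloy", fmt="mp3"):
    """Translate `text` with a text model, then voice it with an audio model.

    Returns a tuple (translated_text, audio_bytes).
    """
    # Step 1: plain text completion that performs the translation.
    translation = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": translate_prompt},
            {"role": "user", "content": text},
        ],
    )
    translated_text = translation.choices[0].message.content

    # Step 2: audio completion that speaks the translated text,
    # steered by the accent and pacing instructions in `speak_prompt`.
    spoken = client.chat.completions.create(
        model="gpt-4o-audio-preview",
        modalities=["text", "audio"],
        audio={"voice": voice, "format": fmt},
        messages=[
            {"role": "system", "content": speak_prompt},
            {"role": "user", "content": translated_text},
        ],
    )
    # The audio payload is base64-encoded in the response.
    audio_bytes = base64.b64decode(spoken.choices[0].message.audio.data)
    return translated_text, audio_bytes
```

Keeping the two prompts as parameters makes it possible to change the target language and the accent independently of each other.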

Conclusion

The ability to steer the voice of the generated audio opens up many possibilities for richer audio experiences. Use cases include:

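  • Enhanced expressiveness: steerable TTS can adjust tone, pitch, speed, and emotion, so that a voice can convey different moods (for example, excitement, calm, or urgency).
  • Language learning and education: steerable TTS can mimic accents, intonation, and pronunciation, which benefits language learners and educational applications where accurate intonation and emphasis are critical.
  • Contextual voices: steerable TTS can adapt the voice to the context of the content, such as a formal tone for professional documents or a friendly, conversational style for social interactions. This helps create more natural conversations in virtual assistants and chatbots.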
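
One way to apply the contextual-voice idea is to keep a small table of steering instructions and choose one per context. The sketch below is illustrative only: the preset names and most of the prompt wording are hypothetical, and in practice the prompts would need to be tuned by listening to the results.

```python
BASE_PROMPT = "You are a helpful assistant that can generate audio from text."

# Hypothetical presets; only the first instruction is taken from the example
# earlier in this article, the others are illustrative.
STYLE_PRESETS = {
    "children_lesson": "Speak in a British accent and enunciate like you're talking to a child.",
    "formal_document": "Speak in a formal, measured tone suitable for a professional document.",
    "friendly_chat": "Speak in a friendly, conversational style.",
    "urgent_alert": "Speak quickly and with urgency.",
}


def system_prompt_for(context, default="friendly_chat"):
    """Build the system prompt for `context`, falling back to `default`
    when the context has no preset."""
    style = STYLE_PRESETS.get(context, STYLE_PRESETS[default])
    return f"{BASE_PROMPT} {style}"
```

The returned string can be used as the content of the system message in the audio chat completion requests shown earlier.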