Faster_Whisper Deployment: A Quick-Start Tutorial

alex · Filed under: AIGC, AI Speech

2024-03-05 · about 2,947 characters · estimated reading time: 13 minutes

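1 Advantages of Faster_Whisper

Faster_Whisper is an efficient reimplementation of OpenAI's Whisper model. Its main characteristics are:

Efficiency: it runs on CTranslate2, a fast inference engine designed for Transformer models. This speeds up speech recognition and also reduces memory usage.
Stability: the core advantage of Faster-Whisper is that it keeps the accuracy of the original model while substantially increasing processing speed.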
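Usability: compute and memory consumption are lowered through inference-side optimizations in CTranslate2, such as weight quantization (for example INT8) and the removal of redundant computation, rather than by retraining a smaller model. This further improves running speed.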

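Typical use cases include speech recognition, speech transcription, and large-scale processing of speech data. Its computational efficiency and memory optimizations make Faster_Whisper a common choice when large volumes of audio have to be processed.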

2 Deployment

2.1 Requirements

Python 3.8 or greater

Unlike openai-whisper, FFmpeg does not need to be installed on the system. Audio is decoded with the Python library PyAV, which bundles the FFmpeg libraries in its package.

2.1.1 GPU

GPU execution requires the following NVIDIA libraries to be installed:

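There are several ways to install these libraries. The official NVIDIA documentation describes the recommended approach; the faster-whisper README also suggests alternative installation methods.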

2.2 Installation

The module can be installed from PyPI: pip install faster-whisper

Other installation methods:

Install the master branch: pip install --force-reinstall "faster-whisper @ https://github.com/guillaumekln/faster-whisper/archive/refs/heads/master.tar.gz"
Install a specific commit: pip install --force-reinstall "faster-whisper @ https://github.com/guillaumekln/faster-whisper/archive/a4f1cc8f11433e454c3934442b5e1a4ed5e865c3.tar.gz"

2.3 Faster-whisper

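Import the faster_whisper package and specify a model size.

The model can run on GPU with FP16 or INT8, or on CPU with INT8.

Call the transcribe method of the WhisperModel class to transcribe speech, then print the detected language and its probability.

Note that segments is a generator, so transcription only starts when you iterate over it. To run the transcription to completion, collect the segments into a list or iterate over them with a for loop.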
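The device and precision combinations listed in this section can be captured in a small lookup. The sketch below is only an illustration: the helper name choose_compute_type and its prefer_int8 flag are hypothetical and not part of faster-whisper; only the returned strings ("float16", "int8_float16", "int8") are the compute_type values used elsewhere in this tutorial.

```python
# Hypothetical helper (not part of faster-whisper): maps a device name to one
# of the compute_type strings used in this tutorial.
def choose_compute_type(device: str, prefer_int8: bool = False) -> str:
    """Return a compute_type string for the given device."""
    if device == "cuda":
        # GPU: FP16 by default, or INT8 via the "int8_float16" type.
        return "int8_float16" if prefer_int8 else "float16"
    if device == "cpu":
        # CPU: INT8 is the option described for CPU execution.
        return "int8"
    raise ValueError(f"unsupported device: {device!r}")


for dev, int8 in [("cuda", False), ("cuda", True), ("cpu", False)]:
    print(dev, int8, "->", choose_compute_type(dev, int8))
```

The returned string would then be passed as the compute_type argument of WhisperModel, alongside the matching device.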

2.4 Faster-distil-whisper

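For usage of faster-distil-whisper, see #533.

You can choose the model size, and running on a CUDA device with float16 is supported. Note, however, that empirically condition_on_previous_text=True degrades the performance of faster-distil-whisper on longer audio.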

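2.5 Timestamps

Word-level timestamps are also available.

2.6 VAD filter

The library integrates the Silero VAD model to filter out parts of the audio that contain no speech. The default behavior is conservative and only removes silence longer than 2 seconds. VAD parameters can be customized through the vad_parameters dictionary argument.

2.7 Logging

The logging level of the library can be configured. For more model and transcription options, see the implementation of the WhisperModel class.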

2.7.1 Faster-whisper

from faster_whisper import WhisperModel

model_size = "large-v3"

# Run on GPU with FP16
model = WhisperModel(model_size, device="cuda", compute_type="float16")

# or run on GPU with INT8
# model = WhisperModel(model_size, device="cuda", compute_type="int8_float16")
# or run on CPU with INT8
# model = WhisperModel(model_size, device="cpu", compute_type="int8")

segments, info = model.transcribe("audio.mp3", beam_size=5)

print("Detected language '%s' with probability %f" % (info.language, info.language_probability))

for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))

Warning: segments is a generator so the transcription only starts when you iterate over it. The transcription can be run to completion by gathering the segments in a list or a for loop:

segments, _ = model.transcribe("audio.mp3")
segments = list(segments)  # The transcription will actually run here.
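The lazy behavior described in the warning is ordinary Python generator semantics, so it can be demonstrated without a model. In the sketch below, fake_transcribe is a hypothetical stand-in for model.transcribe (it is not faster-whisper code); it records when the "expensive" step actually runs.

```python
from typing import Iterator, List

calls: List[str] = []  # records when "work" actually happens


def fake_transcribe(chunks: List[str]) -> Iterator[str]:
    """Stand-in for model.transcribe: yields one result per chunk, lazily."""
    for chunk in chunks:
        calls.append(chunk)  # the expensive step would run here
        yield chunk.upper()


segments = fake_transcribe(["a", "b", "c"])
work_before_iteration = len(calls)  # nothing has run yet

results = list(segments)            # iteration drives the work to completion
work_after_iteration = len(calls)

leftover = list(segments)           # a generator can only be consumed once

print(work_before_iteration, work_after_iteration, results, leftover)
```

Two consequences follow for real code: timing the transcribe call alone measures almost nothing, and a second loop over the same segments object yields no items, so keep the list if the results are needed more than once.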

2.7.2 Faster-distil-whisper

For usage of faster-distil-whisper, please refer to: #533

model_size = "distil-large-v2"
# model_size = "distil-medium.en"
model = WhisperModel(model_size, device="cuda", compute_type="float16")
segments, info = model.transcribe("audio.mp3", beam_size=5, 
    language="en", max_new_tokens=128, condition_on_previous_text=False)

NOTE: Empirically, condition_on_previous_text=True will degrade the performance of faster-distil-whisper for long audio. Degradation on the first chunk was observed with initial_prompt too.

2.7.3 Word-level timestamps

segments, _ = model.transcribe("audio.mp3", word_timestamps=True)

for segment in segments:
    for word in segment.words:
        print("[%.2fs -> %.2fs] %s" % (word.start, word.end, word.word))
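The start and end values printed above are floats in seconds, which makes them easy to turn into subtitle timestamps. As a possible follow-up, the sketch below formats segment-level timestamps as SRT text. The helper names srt_timestamp and to_srt are hypothetical, and the demo objects are stand-ins that only mimic the .start, .end and .text attributes used in this tutorial, so the example runs without a model or an audio file.

```python
from types import SimpleNamespace
from typing import Iterable


def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments: Iterable) -> str:
    """Build SRT text from objects exposing .start, .end and .text."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        time_line = f"{srt_timestamp(seg.start)} --> {srt_timestamp(seg.end)}"
        blocks.append(f"{index}\n{time_line}\n{seg.text.strip()}\n")
    return "\n".join(blocks)  # blank line between consecutive cues


# Stand-in segments (assumption: same attribute names as the real segments).
demo = [
    SimpleNamespace(start=0.0, end=1.5, text=" Hello"),
    SimpleNamespace(start=1.5, end=3.25, text=" world"),
]
srt_text = to_srt(demo)
print(srt_text)
```

With a real model, the generator returned by model.transcribe could be passed to to_srt directly, which would also drive the transcription to completion.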

2.7.4 VAD filter

The library integrates the Silero VAD model to filter out parts of the audio without speech:

segments, _ = model.transcribe("audio.mp3", vad_filter=True)

The default behavior is conservative and only removes silence longer than 2 seconds. See the available VAD parameters and default values in the source code. They can be customized with the dictionary argument vad_parameters:

segments, _ = model.transcribe(
    "audio.mp3",
    vad_filter=True,
    vad_parameters=dict(min_silence_duration_ms=500),
)

2.7.5 Logging

The library logging level can be configured like this:

import logging

logging.basicConfig()
logging.getLogger("faster_whisper").setLevel(logging.DEBUG)

2.7.6 Going further

See more model and transcription options in the WhisperModel class implementation.

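2.8 Community integrations

Below is a non-exhaustive list of open-source projects that use faster-whisper. Feel free to add your project to the list!

WhisperX is an award-winning Python library that offers speaker diarization and accurate word-level timestamps using wav2vec2 alignment.
whisper-ctranslate2 is a command-line client based on faster-whisper and compatible with the original client from openai/whisper.
whisper-diarize is a speaker diarization tool based on faster-whisper and NVIDIA NeMo.
whisper-standalone-win provides standalone command-line executables of faster-whisper for Windows, Linux and macOS.
asr-sd-pipeline provides a scalable, modular, end-to-end multi-speaker speech-to-text solution implemented with AzureML pipelines.
Open-Lyrics is a Python library that transcribes voice files with faster-whisper and uses OpenAI-GPT to translate or polish the resulting text into an .lrc file in the desired language.
wscribe is a flexible transcript generation tool that supports faster-whisper. It can export word-level transcripts, which can then be edited with wscribe-editor.
aTrain is a graphical user interface implementation of faster-whisper developed at the BANDAS-Center at the University of Graz, for transcription and diarization on Windows (Windows Store app) and Linux.
Whisper-Streaming implements a real-time mode for offline Whisper-like speech-to-text models, with faster-whisper as the most recommended back-end. It implements a streaming policy with self-adaptive latency based on the actual source complexity, and demonstrates the state of the art.
WhisperLive is a nearly-live implementation of OpenAI's Whisper that uses faster-whisper as the back-end to transcribe audio in real time.
Faster-Whisper-Transcriber is a simple but reliable voice transcriber with a user-friendly interface.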

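2.9 Model conversion

When a model is loaded by its size, for example WhisperModel("large-v3"), the corresponding CTranslate2 model is downloaded automatically from the Hugging Face Hub.

A script is also provided to convert any Whisper model that is compatible with the Transformers library. This can be an original OpenAI model or a user fine-tuned model.

For example, the commands below convert the original "large-v3" Whisper model and save the weights in FP16: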

pip install "transformers[torch]>=4.23"

ct2-transformers-converter --model openai/whisper-large-v3 --output_dir whisper-large-v3-ct2 \
    --copy_files tokenizer.json preprocessor_config.json --quantization float16

The option --model accepts a model name on the Hub or a path to a model directory.
If the option --copy_files tokenizer.json is not used, the tokenizer configuration is automatically downloaded when the model is loaded later.

Models can also be converted from the code. See the conversion API.
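As a rough illustration of converting from code, the sketch below mirrors the command line shown in this section using the TransformersConverter class from ctranslate2.converters. The wrapper name convert_whisper and its default arguments are assumptions chosen to match the example command; running it requires ctranslate2 and transformers to be installed and downloads the model weights.

```python
def convert_whisper(model_name: str = "openai/whisper-large-v3",
                    output_dir: str = "whisper-large-v3-ct2") -> str:
    """Convert a Transformers Whisper checkpoint to the CTranslate2 format."""
    # Imported lazily so this module loads even without ctranslate2 installed.
    from ctranslate2.converters import TransformersConverter

    converter = TransformersConverter(
        model_name,  # a model name on the Hub or a path to a model directory
        copy_files=["tokenizer.json", "preprocessor_config.json"],
    )
    # quantization="float16" mirrors --quantization float16 on the command line.
    return converter.convert(output_dir, quantization="float16", force=False)
```

Calling convert_whisper() should produce the same whisper-large-v3-ct2 directory as the command-line converter, which can then be loaded as described in the next section.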

2.10 Loading a converted model

Load the model directly from a local directory:

import faster_whisper

model = faster_whisper.WhisperModel("whisper-large-v3-ct2")

Upload your model to the Hugging Face Hub and load it from its name:

model = faster_whisper.WhisperModel("username/whisper-large-v3-ct2")

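2.11 Comparing performance against other implementations

If you compare performance against other Whisper implementations, make sure the comparison runs under similar settings. In particular:

Verify that the same transcription options are used, especially the same beam size. For example, in openai/whisper, model.transcribe uses a default beam size of 1, whereas the default beam size here is 5.
When running on CPU, make sure the same number of threads is set. Many frameworks read the environment variable OMP_NUM_THREADS, which can be set when running your script: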

OMP_NUM_THREADS=4 python3 my_script.py
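A comparison along these lines could be scripted as follows. This is a minimal sketch, not a rigorous benchmark: the function names benchmark and real_time_factor are hypothetical, and the sketch assumes that WhisperModel accepts a cpu_threads argument and that the info object returned by transcribe exposes the audio length as info.duration. The beam size and thread count are passed explicitly so that both sides of a comparison can use the same values.

```python
import time


def real_time_factor(elapsed_s: float, audio_s: float) -> float:
    """Processing time divided by audio duration; lower is faster."""
    if audio_s <= 0:
        raise ValueError("audio duration must be positive")
    return elapsed_s / audio_s


def benchmark(audio_path: str, model_size: str = "small",
              beam_size: int = 5, cpu_threads: int = 4) -> float:
    """Time one full CPU transcription with explicit beam size and threads."""
    from faster_whisper import WhisperModel  # lazy import: needs faster-whisper

    model = WhisperModel(model_size, device="cpu", compute_type="int8",
                         cpu_threads=cpu_threads)
    start = time.perf_counter()
    segments, info = model.transcribe(audio_path, beam_size=beam_size)
    for _ in segments:  # consume the generator so the work actually runs
        pass
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, info.duration)
```

Model loading is kept outside the timed region on purpose; whether to include it depends on what the other implementation's measurement covers.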

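3 Drawbacks of Faster_Whisper

Faster_Whisper is a very effective speech transcription library, but like any technology it has limitations that users may run into. Possible drawbacks include:

Audio quality: if the recording quality is poor (for example background noise, or a speaker too far from the microphone), transcription quality may suffer.
Computational cost: although Faster_Whisper is optimized, its speed on CPU can still be limited, especially for long audio. This can be a problem for users without a GPU.
Language coverage: Faster_Whisper may not support every language equally well, and accuracy can drop for some languages.
Manual iteration: segments is a generator, so transcription only starts when it is iterated. This can be confusing for users who are not familiar with Python.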

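4 Projects similar to Faster_Whisper

funNLP is a project that collects resources such as Chinese and English sensitive-word lists, language detection, phone number region and carrier lookup for Chinese and foreign numbers, gender inference from names, extraction of mobile numbers, ID card numbers and email addresses, Chinese and Japanese name databases, a Chinese abbreviation database, a character-decomposition dictionary, word sentiment values, stop words, and reactionary-word lists. It is maintained by fighting41love and has attracted a lot of attention, with 62,145 watchers and 13,907 forks.

Coqui TTS is a deep learning toolkit for text-to-speech, battle-tested in research and production. It is a powerful open-source speech synthesis library that uses deep learning to produce a variety of natural-sounding voices.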
