From 7498bd7388afdde8d5e6f8a4cb6aeb8be8ac60fa Mon Sep 17 00:00:00 2001
From: shixian.shi <shixian.shi@alibaba-inc.com>
Date: 星期五, 08 三月 2024 11:37:46 +0800
Subject: [PATCH] update code
---
README.md | 62 +++++++++++++++++++------------
1 files changed, 38 insertions(+), 24 deletions(-)
diff --git a/README.md b/README.md
index 18d02c3..d34249d 100644
--- a/README.md
+++ b/README.md
@@ -27,6 +27,9 @@
<a name="whats-new"></a>
## What's new:
+- 2024/03/05锛欰dded the Qwen-Audio and Qwen-Audio-Chat large-scale audio-text multimodal models, which have topped multiple audio domain leaderboards. These models support speech dialogue, [usage](examples/industrial_data_pretraining/qwen_audio).
+- 2024/03/05锛欰dded support for the Whisper-large-v3 model, a multitasking model that can perform multilingual speech recognition, speech translation, and language identification. It can be downloaded from the[modelscope](examples/industrial_data_pretraining/whisper/demo.py), and [openai](examples/industrial_data_pretraining/whisper/demo_from_openai.py).
+- 2024/03/05: Offline File Transcription Service 4.4, Offline File Transcription Service of English 1.5锛孯eal-time Transcription Service 1.9 released锛宒ocker image supports ARM64 platform, update modelscope锛�([docs](runtime/readme.md))
- 2024/01/30锛歠unasr-1.0 has been released ([docs](https://github.com/alibaba-damo-academy/FunASR/discussions/1319))
- 2024/01/30锛歟motion recognition models are new supported. [model link](https://www.modelscope.cn/models/iic/emotion2vec_base_finetuned/summary), modified from [repo](https://github.com/ddlBoJack/emotion2vec).
- 2024/01/25: Offline File Transcription Service 4.2, Offline File Transcription Service of English 1.3 released锛宱ptimized the VAD (Voice Activity Detection) data processing method, significantly reducing peak memory usage, memory leak optimization; Real-time Transcription Service 1.7 released锛宱ptimizatized the client-side锛�([docs](runtime/readme.md))
@@ -66,19 +69,23 @@
## Model Zoo
FunASR has open-sourced a large number of pre-trained models on industrial data. You are free to use, copy, modify, and share FunASR models under the [Model License Agreement](./MODEL_LICENSE). Below are some representative models, for more models please refer to the [Model Zoo]().
-(Note: 馃 represents the Huggingface model zoo link, 猸� represents the ModelScope model zoo link)
+(Note: 猸� represents the ModelScope model zoo, 馃 represents the Huggingface model zoo, 馃崁 represents the OpenAI model zoo)
-| Model Name | Task Details | Training Data | Parameters |
-|:------------------------------------------------------------------------------------------------------------------------------------------------------------------:|:--------------------------------------------------:|:--------------------------------:|:----------:|
-| paraformer-zh <br> ([猸怾(https://www.modelscope.cn/models/damo/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary) [馃]() ) | speech recognition, with timestamps, non-streaming | 60000 hours, Mandarin | 220M |
-| <nobr>paraformer-zh-online <br> ( [猸怾(https://modelscope.cn/models/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-online/summary) [馃]() )</nobr> | speech recognition, streaming | 60000 hours, Mandarin | 220M |
-| paraformer-en <br> ( [猸怾(https://www.modelscope.cn/models/damo/speech_paraformer-large-vad-punc_asr_nat-en-16k-common-vocab10020/summary) [馃]() ) | speech recognition, with timestamps, non-streaming | 50000 hours, English | 220M |
-| conformer-en <br> ( [猸怾(https://modelscope.cn/models/damo/speech_conformer_asr-en-16k-vocab4199-pytorch/summary) [馃]() ) | speech recognition, non-streaming | 50000 hours, English | 220M |
-| ct-punc <br> ( [猸怾(https://modelscope.cn/models/damo/punc_ct-transformer_cn-en-common-vocab471067-large/summary) [馃]() ) | punctuation restoration | 100M, Mandarin and English | 1.1G |
-| fsmn-vad <br> ( [猸怾(https://modelscope.cn/models/damo/speech_fsmn_vad_zh-cn-16k-common-pytorch/summary) [馃]() ) | voice activity detection | 5000 hours, Mandarin and English | 0.4M |
-| fa-zh <br> ( [猸怾(https://modelscope.cn/models/damo/speech_timestamp_prediction-v1-16k-offline/summary) [馃]() ) | timestamp prediction | 5000 hours, Mandarin | 38M |
-| cam++ <br> ( [猸怾(https://modelscope.cn/models/iic/speech_campplus_sv_zh-cn_16k-common/summary) [馃]() ) | speaker verification/diarization | 5000 hours | 7.2M |
+| Model Name | Task Details | Training Data | Parameters |
+|:--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|:-----------------------------------------------------:|:--------------------------------:|:----------:|
+| paraformer-zh <br> ([猸怾(https://www.modelscope.cn/models/damo/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary) [馃](https://huggingface.co/funasr/paraformer-tp) ) | speech recognition, with timestamps, non-streaming | 60000 hours, Mandarin | 220M |
+| <nobr>paraformer-zh-streaming <br> ( [猸怾(https://modelscope.cn/models/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-online/summary) [馃](https://huggingface.co/funasr/paraformer-zh-streaming) )</nobr> | speech recognition, streaming | 60000 hours, Mandarin | 220M |
+| paraformer-en <br> ( [猸怾(https://www.modelscope.cn/models/damo/speech_paraformer-large-vad-punc_asr_nat-en-16k-common-vocab10020/summary) [馃](https://huggingface.co/funasr/paraformer-en) ) | speech recognition, without timestamps, non-streaming | 50000 hours, English | 220M |
+| conformer-en <br> ( [猸怾(https://modelscope.cn/models/damo/speech_conformer_asr-en-16k-vocab4199-pytorch/summary) [馃](https://huggingface.co/funasr/conformer-en) ) | speech recognition, non-streaming | 50000 hours, English | 220M |
+| ct-punc <br> ( [猸怾(https://modelscope.cn/models/damo/punc_ct-transformer_cn-en-common-vocab471067-large/summary) [馃](https://huggingface.co/funasr/ct-punc) ) | punctuation restoration | 100M, Mandarin and English | 1.1G |
+| fsmn-vad <br> ( [猸怾(https://modelscope.cn/models/damo/speech_fsmn_vad_zh-cn-16k-common-pytorch/summary) [馃](https://huggingface.co/funasr/fsmn-vad) ) | voice activity detection | 5000 hours, Mandarin and English | 0.4M |
+| fa-zh <br> ( [猸怾(https://modelscope.cn/models/damo/speech_timestamp_prediction-v1-16k-offline/summary) [馃](https://huggingface.co/funasr/fa-zh) ) | timestamp prediction | 5000 hours, Mandarin | 38M |
+| cam++ <br> ( [猸怾(https://modelscope.cn/models/iic/speech_campplus_sv_zh-cn_16k-common/summary) [馃](https://huggingface.co/funasr/campplus) ) | speaker verification/diarization | 5000 hours | 7.2M |
+| Whisper-large-v2 <br> ([猸怾(https://www.modelscope.cn/models/iic/speech_whisper-large_asr_multilingual/summary) [馃崁](https://github.com/openai/whisper) ) | speech recognition, with timestamps, non-streaming | multilingual | 1.5G |
+| Whisper-large-v3 <br> ([猸怾(https://www.modelscope.cn/models/iic/Whisper-large-v3/summary) [馃崁](https://github.com/openai/whisper) ) | speech recognition, with timestamps, non-streaming | multilingual | 1.5G |
+| Qwen-Audio <br> ([猸怾(examples/industrial_data_pretraining/qwen_audio/demo.py) [馃](https://huggingface.co/Qwen/Qwen-Audio) ) | audio-text multimodal models (pretraining) | multilingual | 8B |
+| Qwen-Audio-Chat <br> ([猸怾(examples/industrial_data_pretraining/qwen_audio/demo_chat.py) [馃](https://huggingface.co/Qwen/Qwen-Audio-Chat) ) | audio-text multimodal models (chat) | multilingual | 8B |
@@ -95,7 +102,7 @@
### Command-line usage
```shell
-funasr +model=paraformer-zh +vad_model="fsmn-vad" +punc_model="ct-punc" +input=asr_example_zh.wav
+funasr ++model=paraformer-zh ++vad_model="fsmn-vad" ++punc_model="ct-punc" ++input=asr_example_zh.wav
```
Notes: Support recognition of single audio file, as well as file list in Kaldi-style wav.scp format: `wav_id wav_pat`
@@ -105,17 +112,15 @@
from funasr import AutoModel
# paraformer-zh is a multi-functional asr model
# use vad, punc, spk or not as you need
-model = AutoModel(model="paraformer-zh", model_revision="v2.0.4",
- vad_model="fsmn-vad", vad_model_revision="v2.0.4",
- punc_model="ct-punc-c", punc_model_revision="v2.0.4",
- # spk_model="cam++", spk_model_revision="v2.0.2",
+model = AutoModel(model="paraformer-zh", vad_model="fsmn-vad", punc_model="ct-punc-c",
+ # spk_model="cam++",
)
res = model.generate(input=f"{model.model_path}/example/asr_example.wav",
batch_size_s=300,
hotword='榄旀惌')
print(res)
```
-Note: `model_hub`: represents the model repository, `ms` stands for selecting ModelScope download, `hf` stands for selecting Huggingface download.
+Note: `hub`: represents the model repository, `ms` stands for selecting ModelScope download, `hf` stands for selecting Huggingface download.
### Speech Recognition (Streaming)
```python
@@ -125,7 +130,7 @@
encoder_chunk_look_back = 4 #number of chunks to lookback for encoder self-attention
decoder_chunk_look_back = 1 #number of encoder chunks to lookback for decoder cross-attention
-model = AutoModel(model="paraformer-zh-streaming", model_revision="v2.0.4")
+model = AutoModel(model="paraformer-zh-streaming")
import soundfile
import os
@@ -144,21 +149,23 @@
```
Note: `chunk_size` is the configuration for streaming latency.` [0,10,5]` indicates that the real-time display granularity is `10*60=600ms`, and the lookahead information is `5*60=300ms`. Each inference input is `600ms` (sample points are `16000*0.6=960`), and the output is the corresponding text. For the last speech segment input, `is_final=True` needs to be set to force the output of the last word.
-### Voice Activity Detection (streaming)
+### Voice Activity Detection (Non-Streaming)
```python
from funasr import AutoModel
-model = AutoModel(model="fsmn-vad", model_revision="v2.0.4")
+model = AutoModel(model="fsmn-vad")
wav_file = f"{model.model_path}/example/asr_example.wav"
res = model.generate(input=wav_file)
print(res)
```
-### Voice Activity Detection (Non-streaming)
+Note: The output format of the VAD model is: `[[beg1, end1], [beg2, end2], ..., [begN, endN]]`, where `begN/endN` indicates the starting/ending point of the `N-th` valid audio segment, measured in milliseconds.
+
+### Voice Activity Detection (Streaming)
```python
from funasr import AutoModel
chunk_size = 200 # ms
-model = AutoModel(model="fsmn-vad", model_revision="v2.0.4")
+model = AutoModel(model="fsmn-vad")
import soundfile
@@ -175,11 +182,18 @@
if len(res[0]["value"]):
print(res)
```
+Note: The output format for the streaming VAD model can be one of four scenarios:
+- `[[beg1, end1], [beg2, end2], .., [begN, endN]]`锛歍he same as the offline VAD output result mentioned above.
+- `[[beg, -1]]`锛欼ndicates that only a starting point has been detected.
+- `[[-1, end]]`锛欼ndicates that only an ending point has been detected.
+- `[]`锛欼ndicates that neither a starting point nor an ending point has been detected.
+
+The output is measured in milliseconds and represents the absolute time from the starting point.
### Punctuation Restoration
```python
from funasr import AutoModel
-model = AutoModel(model="ct-punc", model_revision="v2.0.4")
+model = AutoModel(model="ct-punc")
res = model.generate(input="閭d粖澶╃殑浼氬氨鍒拌繖閲屽惂 happy new year 鏄庡勾瑙�")
print(res)
```
@@ -187,7 +201,7 @@
```python
from funasr import AutoModel
-model = AutoModel(model="fa-zh", model_revision="v2.0.4")
+model = AutoModel(model="fa-zh")
wav_file = f"{model.model_path}/example/asr_example.wav"
text_file = f"{model.model_path}/example/text.txt"
res = model.generate(input=(wav_file, text_file), data_type=("sound", "text"))
--
Gitblit v1.9.1