From 8a08405b668e06c4670b4c13f6793e193f21a21d Mon Sep 17 00:00:00 2001
From: Yabin Li <wucong.lyb@alibaba-inc.com>
Date: Mon, 08 May 2023 11:43:08 +0800
Subject: [PATCH] Merge branch 'main' into dev_apis

---
 egs_modelscope/speaker_diarization/TEMPLATE/README.md |  163 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 163 insertions(+), 0 deletions(-)

diff --git a/egs_modelscope/speaker_diarization/TEMPLATE/README.md b/egs_modelscope/speaker_diarization/TEMPLATE/README.md
new file mode 100644
index 0000000..ba179ed
--- /dev/null
+++ b/egs_modelscope/speaker_diarization/TEMPLATE/README.md
@@ -0,0 +1,163 @@
+# Speaker Diarization
+
+> **Note**: 
+> The modelscope pipeline supports all the models in the 
+[model zoo](https://alibaba-damo-academy.github.io/FunASR/en/model_zoo/modelscope_models.html#pretrained-models-on-modelscope) 
+for inference and fine-tuning. Here we take the xvector_sv model as an example to demonstrate the usage.
+
+## Inference with pipeline
+### Quick start
+```python
+from modelscope.pipelines import pipeline
+from modelscope.utils.constant import Tasks
+
+# initialize pipeline
+inference_diar_pipeline = pipeline(
+    mode="sond_demo",
+    num_workers=0,
+    task=Tasks.speaker_diarization,
+    diar_model_config="sond.yaml",
+    model='damo/speech_diarization_sond-zh-cn-alimeeting-16k-n16k4-pytorch',
+    model_revision="v1.0.5",
+    sv_model="damo/speech_xvector_sv-zh-cn-cnceleb-16k-spk3465-pytorch",
+    sv_model_revision="v1.2.2",
+)
+
+# input: a list of audio files; the first item is the speech recording to detect speakers in,
+# and the following wav files are used to extract speaker embeddings.
+audio_list = [
+    "https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_data/speaker_diarization/record.wav",
+    "https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_data/speaker_diarization/spk1.wav",
+    "https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_data/speaker_diarization/spk2.wav",
+    "https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_data/speaker_diarization/spk3.wav",
+    "https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_data/speaker_diarization/spk4.wav",
+]
+
+results = inference_diar_pipeline(audio_in=audio_list)
+print(results)
+```
+
+### API reference
+#### Define pipeline
+- `task`: `Tasks.speaker_diarization`
+- `model`: a model name from the [model zoo](https://alibaba-damo-academy.github.io/FunASR/en/model_zoo/modelscope_models.html#pretrained-models-on-modelscope), or a model path on local disk
+- `ngpu`: `1` (default), decode on GPU; if `ngpu=0`, decode on CPU
+- `output_dir`: `None` (default), the output path of the results, if set
+- `batch_size`: `1` (default), the batch size for decoding
+- `smooth_size`: `83` (default), the window size used to perform smoothing
+- `dur_threshold`: `10` (default), segments shorter than this threshold (100 ms) will be dropped
+- `out_format`: `vad` (default), the output format, chosen from `["vad", "rttm"]` (see the sketch after this list):
+  - vad format: `spk1: [1.0, 3.0], [5.0, 8.0]`
+  - rttm format: `SPEAKER test1 0 1.00 2.00 <NA> <NA> spk1 <NA> <NA>` and `SPEAKER test1 0 5.00 3.00 <NA> <NA> spk1 <NA> <NA>`
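+
+As a minimal sketch (assuming the same SOND model as in the quick start), the options above can be passed as keyword arguments when defining the pipeline:
+```python
+from modelscope.pipelines import pipeline
+from modelscope.utils.constant import Tasks
+
+# define the pipeline with explicit decoding options;
+# results will also be written to ./results in rttm format
+inference_diar_pipeline = pipeline(
+    mode="sond_demo",
+    num_workers=0,
+    task=Tasks.speaker_diarization,
+    diar_model_config="sond.yaml",
+    model="damo/speech_diarization_sond-zh-cn-alimeeting-16k-n16k4-pytorch",
+    sv_model="damo/speech_xvector_sv-zh-cn-cnceleb-16k-spk3465-pytorch",
+    sv_model_revision="v1.2.2",
+    ngpu=1,
+    output_dir="./results",
+    batch_size=1,
+    out_format="rttm",
+)
+```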
+
+#### Infer pipeline for speaker embedding extraction
+- `audio_in`: the input to process, which can be one of:
+  - a list of URLs, e.g., waveform files hosted on a website
+  - a list of local file paths, e.g., `path/to/a.wav`
+  - a tuple `("wav.scp,speech,sound", "profile.scp,profile,kaldi_ark")`: a script file of waveform files and another script file of speaker profiles (extracted with this [model](https://www.modelscope.cn/models/damo/speech_xvector_sv-zh-cn-cnceleb-16k-spk3465-pytorch/summary)), for example:
+    ```text
+    wav.scp
+    test1 path/to/enroll1.wav
+    test2 path/to/enroll2.wav
+    
+    profile.scp
+    test1 path/to/profile.ark:11
+    test2 path/to/profile.ark:234
+    ```
+    The profile.ark file contains speaker embeddings in a kaldi-like style. 
+    Please refer to [README.md](../../speaker_verification/TEMPLATE/README.md) for more details; an example invocation of this mode is sketched below.
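+
+    A minimal sketch of this script-file mode (`wav.scp` and `profile.scp` are placeholders for your own files):
+    ```python
+    # pass the two script files as a tuple: the first entry describes the
+    # waveforms, the second the corresponding kaldi-style speaker profiles
+    results = inference_diar_pipeline(
+        audio_in=("wav.scp,speech,sound", "profile.scp,profile,kaldi_ark"),
+    )
+    print(results)
+    ```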
+
+### Inference with your data
+For a single input, we recommend the "list of local file paths" mode for inference, as sketched below.
+For multiple inputs, we recommend the last mode with pre-organized wav.scp and profile.scp files.
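+
+A minimal sketch of the local-file mode (the paths are placeholders for your own recordings):
+```python
+# the first item is the recording to diarize; the rest provide the speaker profiles
+audio_list = [
+    "path/to/record.wav",
+    "path/to/spk1.wav",
+    "path/to/spk2.wav",
+]
+results = inference_diar_pipeline(audio_in=audio_list)
+```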
+
+### Inference with multiple threads on CPU
+We recommend the last mode: split wav.scp and profile.scp into parts, then run inference on each part, as in the sketch below.
+Please refer to [README.md](../../speaker_verification/TEMPLATE/README.md) for a similar process.
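+
+A minimal sketch of this scheme (the split file names and the number of parts are assumptions for illustration):
+```python
+from multiprocessing import Process
+
+def run_part(idx):
+    # each worker builds its own pipeline and decodes one pre-split part on CPU
+    from modelscope.pipelines import pipeline
+    from modelscope.utils.constant import Tasks
+    part_pipeline = pipeline(
+        mode="sond_demo",
+        num_workers=0,
+        task=Tasks.speaker_diarization,
+        diar_model_config="sond.yaml",
+        model="damo/speech_diarization_sond-zh-cn-alimeeting-16k-n16k4-pytorch",
+        sv_model="damo/speech_xvector_sv-zh-cn-cnceleb-16k-spk3465-pytorch",
+        sv_model_revision="v1.2.2",
+        ngpu=0,
+        output_dir=f"./results/part{idx}",
+    )
+    part_pipeline(audio_in=(f"wav.{idx}.scp,speech,sound", f"profile.{idx}.scp,profile,kaldi_ark"))
+
+if __name__ == "__main__":
+    workers = [Process(target=run_part, args=(i,)) for i in range(4)]
+    for w in workers:
+        w.start()
+    for w in workers:
+        w.join()
+```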
+
+### Inference with multiple GPUs
+Similar to the multi-threaded CPU case, split the inputs and set `ngpu=1` for inference on GPU.
+Besides, you should set `CUDA_VISIBLE_DEVICES=0` to assign a specific GPU device to each process, as in the sketch below.
+Please refer to [README.md](../../speaker_verification/TEMPLATE/README.md) for a similar process.
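+
+A minimal sketch of pinning one process to one GPU (the device index `0` is an assumption; launch one process per GPU with a different index):
+```python
+import os
+
+# select the GPU before modelscope/CUDA is initialized in this process
+os.environ["CUDA_VISIBLE_DEVICES"] = "0"
+
+# then define and run the pipeline with ngpu=1, as in the quick start
+```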

--
Gitblit v1.9.1