From 28ccfbfc51068a663a80764e14074df5edf2b5ba Mon Sep 17 00:00:00 2001
From: kongdeqiang <kongdeqiang960204@163.com>
Date: Fri, 13 Mar 2026 17:41:41 +0800
Subject: [PATCH] Commit
---
runtime/triton_gpu/README.md | 98 +++++++++++++++++++++++-------------------------
 1 file changed, 47 insertions(+), 51 deletions(-)
diff --git a/runtime/triton_gpu/README.md b/runtime/triton_gpu/README.md
index 48e889c..36bb8f6 100644
--- a/runtime/triton_gpu/README.md
+++ b/runtime/triton_gpu/README.md
@@ -1,85 +1,81 @@
-## Inference with Triton
+## Triton Inference Serving Best Practice for SenseVoice
-### Steps:
-1. Prepare model repo files
+### Quick Start
+Launch the service directly with Docker Compose.
```sh
-git-lfs install
-git clone https://www.modelscope.cn/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch.git
-
-pretrained_model_dir=$(pwd)/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch
-
-cp $pretrained_model_dir/am.mvn ./model_repo_paraformer_large_offline/feature_extractor/
-cp $pretrained_model_dir/config.yaml ./model_repo_paraformer_large_offline/feature_extractor/
-
-# Refer here to get model.onnx (https://github.com/alibaba-damo-academy/FunASR/blob/main/funasr/export/README.md)
-cp <exported_onnx_dir>/model.onnx ./model_repo_paraformer_large_offline/encoder/1/
+docker compose up --build
```
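+
+The compose file is expected to sit next to this README. As a rough, hypothetical sketch of what such a `docker-compose.yml` can look like (the service name and exact settings are assumptions; check the actual file in this directory):
+```yaml
+services:
+  sensevoice:
+    image: soar97/triton-sensevoice:24.05
+    build:
+      context: .
+      dockerfile: Dockerfile/Dockerfile.sensevoice
+    shm_size: 2g
+    network_mode: host
+    deploy:
+      resources:
+        reservations:
+          devices:
+            - driver: nvidia
+              count: all
+              capabilities: [gpu]
+```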
+
+### Build Image
+Build the docker image from scratch.
+```sh
+# build from scratch; run from the parent dir of Dockerfile.sensevoice
+docker build . -f Dockerfile/Dockerfile.sensevoice -t soar97/triton-sensevoice:24.05
+```
+
+### Create Docker Container
+```sh
+your_mount_dir=/mnt:/mnt
+docker run -it --name "sensevoice-server" --gpus all --net host -v $your_mount_dir --shm-size=2g soar97/triton-sensevoice:24.05
+```
+
+### Export SenseVoice Model to ONNX
+Please follow the official FunASR guide to export the SenseVoice ONNX file. You also need to download the tokenizer file (`chn_jpn_yue_eng_ko_spectok.bpe.model`) yourself.
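+
+A hedged sketch of the export step, assuming the `funasr` Python package's `AutoModel.export` API (the model id and export arguments are assumptions and may differ across FunASR versions; follow the official export guide):
+```sh
+# hypothetical: verify the model id and export arguments against the FunASR docs
+pip install funasr
+python3 -c "from funasr import AutoModel; AutoModel(model='iic/SenseVoiceSmall').export(type='onnx', quantize=False)"
+```
+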
+### Launch Server
Log of directory tree:
```sh
-model_repo_paraformer_large_offline/
+model_repo_sense_voice_small
|-- encoder
| |-- 1
-| | `-- model.onnx
+| | `-- model.onnx -> /your/path/model.onnx
| `-- config.pbtxt
|-- feature_extractor
| |-- 1
| | `-- model.py
-| |-- config.pbtxt
| |-- am.mvn
+| |-- config.pbtxt
| `-- config.yaml
-|-- infer_pipeline
+|-- scoring
| |-- 1
+| | `-- model.py
+| |-- chn_jpn_yue_eng_ko_spectok.bpe.model -> /your/path/chn_jpn_yue_eng_ko_spectok.bpe.model
| `-- config.pbtxt
-`-- scoring
+`-- sensevoice
|-- 1
- | `-- model.py
`-- config.pbtxt
-8 directories, 9 files
-```
+8 directories, 10 files
+```
+
+```sh
-2. Follow below instructions to launch triton server
-```sh
-# using docker image Dockerfile/Dockerfile.server
-docker build . -f Dockerfile/Dockerfile.server -t triton-paraformer:23.01
-docker run -it --rm --name "paraformer_triton_server" --gpus all -v <path_host/model_repo_paraformer_large_offline>:/workspace/ --shm-size 1g --net host triton-paraformer:23.01
# launch the service
-tritonserver --model-repository /workspace/model_repo_paraformer_large_offline \
+tritonserver --model-repository /workspace/model_repo_sense_voice_small \
--pinned-memory-pool-byte-size=512000000 \
--cuda-memory-pool-byte-size=0:1024000000
-
```
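+
+Once the server is up, you can check readiness via Triton's standard HTTP health endpoint (assuming the default HTTP port 8000; adjust if your deployment overrides it):
+```sh
+curl -sf localhost:8000/v2/health/ready && echo "server ready"
+```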
-### Performance benchmark
-Benchmark [speech_paraformer](https://www.modelscope.cn/models/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary) based on Aishell1 test set with a single V100, the total audio duration is 36108.919 seconds.
-
+### Benchmark with a Dataset
```sh
-# For client container:
-docker run -it --rm --name "client_test" --net host --gpus all -v <path_host/triton_gpu/client>:/workpace/ soar97/triton-k2:22.12.1 # noqa
-# For aishell manifests:
-apt-get install git-lfs
-git-lfs install
-git clone https://huggingface.co/csukuangfj/aishell-test-dev-manifests
-sudo mkdir -p /root/fangjun/open-source/icefall-aishell/egs/aishell/ASR/download/aishell
-tar xf ./aishell-test-dev-manifests/data_aishell.tar.gz -C /root/fangjun/open-source/icefall-aishell/egs/aishell/ASR/download/aishell/ # noqa
-
-serveraddr=localhost
-manifest_path=/workspace/aishell-test-dev-manifests/data/fbank/aishell_cuts_test.jsonl.gz
-num_task=60
-python3 client/decode_manifest_triton.py \
- --server-addr $serveraddr \
+git clone https://github.com/yuekaizhang/Triton-ASR-Client.git
+cd Triton-ASR-Client
+num_task=32
+python3 client.py \
+ --server-addr localhost \
+ --server-port 10086 \
+ --model-name sensevoice \
--compute-cer \
- --model-name infer_pipeline \
--num-tasks $num_task \
- --manifest-filename $manifest_path
+ --batch-size 16 \
+ --manifest-dir ./datasets/aishell1_test
```
-(Note: The service has been fully warm up.)
-|concurrent-tasks | processing time(s) | RTF |
-|----------|--------------------|------------|
-| 60 (onnx fp32) | 116.0 | 0.0032|
+The benchmark results below are based on the Aishell1 test set with a single V100; the total audio duration is 36108.919 seconds.
+|concurrent-tasks | batch-size-per-task | processing time(s) | RTF |
+|----------|--------------------|------------|---------------------|
+| 32 (onnx fp32) | 16 | 67.09 | 0.0019|
+| 32 (onnx fp32) | 1 | 82.04 | 0.0023|
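+
+RTF here is processing time divided by total audio duration. The first row can be sanity-checked with pure arithmetic (no server needed):
+```sh
+# RTF = processing_time / total_audio_duration
+awk 'BEGIN { printf "%.4f\n", 67.09 / 36108.919 }'   # prints 0.0019
+```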
+
+(Note: for the batch-size-per-task=1 case, Triton server can use dynamic batching to improve throughput.)
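+
+Dynamic batching is configured per model in its `config.pbtxt`. A minimal illustrative fragment (the values are examples for this note, not the shipped config):
+```
+max_batch_size: 16
+dynamic_batching {
+    max_queue_delay_microseconds: 100
+}
+```
+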
## Acknowledgements
This part originates from the NVIDIA CISI project. We also have TTS and NLP solutions deployed on Triton Inference Server. If you are interested, please contact us.
--
Gitblit v1.9.1