From 554a99600c5a7a747f5aacb48eafec8bf5b1db41 Mon Sep 17 00:00:00 2001
From: root <zhangyuekai@foxmail.com>
Date: Mon, 27 Feb 2023 18:14:55 +0800
Subject: [PATCH] add README

---
 funasr/runtime/triton_gpu/README.md |   52 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 52 insertions(+)

diff --git a/funasr/runtime/triton_gpu/README.md b/funasr/runtime/triton_gpu/README.md
new file mode 100644
index 0000000..ebaa819
--- /dev/null
+++ b/funasr/runtime/triton_gpu/README.md
@@ -0,0 +1,52 @@
+## Inference with Triton 
+
+### Steps:
+1. Follow [these steps](https://github.com/alibaba-damo-academy/FunASR/tree/main/funasr/runtime/python/onnxruntime#steps) to export `model.onnx`
+
+2. Follow the instructions below to deploy with Triton:
+```sh
+# Build and run the server image from Dockerfile/Dockerfile.server
+docker build . -f Dockerfile/Dockerfile.server -t triton-paraformer:23.01 
+docker run -it --rm --name "paraformer_triton_server" --gpus all -v <path_host/funasr/runtime/>:/workspace --shm-size 1g --net host triton-paraformer:23.01 
+# Inside the container, move the previously exported model.onnx into the encoder model directory
+mv <path_model.onnx> /workspace/triton_gpu/model_repo_paraformer_large_offline/encoder/1/
+
+# Expected model repository layout:
+model_repo_paraformer_large_offline/
+|-- encoder
+|   |-- 1
+|   |   `-- model.onnx
+|   `-- config.pbtxt
+|-- feature_extractor
+|   |-- 1
+|   |   `-- model.py
+|   |-- config.pbtxt
+|   `-- config.yaml
+|-- infer_pipeline
+|   |-- 1
+|   `-- config.pbtxt
+`-- scoring
+    |-- 1
+    |   `-- model.py
+    |-- config.pbtxt
+    `-- token_list.pkl
+
+8 directories, 9 files
+
+# Launch the service with a 512 MB pinned host memory pool and a 1 GB CUDA memory pool on GPU 0
+tritonserver --model-repository ./model_repo_paraformer_large_offline \
+             --pinned-memory-pool-byte-size=512000000 \
+             --cuda-memory-pool-byte-size=0:1024000000
+
+```
+
+### Performance benchmark
+
+We benchmarked [speech_paraformer](https://www.modelscope.cn/models/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary) on the AISHELL-1 test set with a single V100 GPU; the total audio duration is 36108.919 seconds.
+
+(Note: the service was fully warmed up before measurement.)
+
+| concurrent tasks | processing time (s) | RTF    |
+|------------------|---------------------|--------|
+| 60 (ONNX FP32)   | 116.0               | 0.0032 |
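+
+The RTF (real-time factor) in the table is simply the processing time divided by the total audio duration; a quick sanity check using the numbers above:
+
```python
# RTF = processing time / total audio duration (lower is faster than real time)
processing_time_s = 116.0   # from the benchmark table (60 concurrent tasks)
total_audio_s = 36108.919   # total duration of the AISHELL-1 test set
rtf = processing_time_s / total_audio_s
print(f"RTF = {rtf:.4f}")   # prints: RTF = 0.0032
```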
+
+## Acknowledgement
+This part originates from the NVIDIA CISI project. We also have TTS and NLP solutions deployed on the Triton Inference Server; if you are interested, please contact us.
\ No newline at end of file

--
Gitblit v1.9.1