From 554a99600c5a7a747f5aacb48eafec8bf5b1db41 Mon Sep 17 00:00:00 2001
From: root <zhangyuekai@foxmail.com>
Date: Mon, 27 Feb 2023 18:14:55 +0800
Subject: [PATCH] add README

---
 funasr/runtime/triton_gpu/README.md | 52 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 52 insertions(+), 0 deletions(-)

diff --git a/funasr/runtime/triton_gpu/README.md b/funasr/runtime/triton_gpu/README.md
new file mode 100644
index 0000000..ebaa819
--- /dev/null
+++ b/funasr/runtime/triton_gpu/README.md
@@ -0,0 +1,52 @@
+## Inference with Triton
+
+### Steps:
+1. Follow [these steps](https://github.com/alibaba-damo-academy/FunASR/tree/main/funasr/runtime/python/onnxruntime#steps) to export `model.onnx`.
+
+2. Follow the instructions below to deploy with Triton:
+```sh
+# build the docker image from Dockerfile/Dockerfile.server
+docker build . -f Dockerfile/Dockerfile.server -t triton-paraformer:23.01
+docker run -it --rm --name "paraformer_triton_server" --gpus all -v <path_host/funasr/runtime/>:/workspace --shm-size 1g --net host triton-paraformer:23.01
+
+# inside the docker container, place the previously exported model.onnx
+mv <path_model.onnx> /workspace/triton_gpu/model_repo_paraformer_large_offline/encoder/1/
+
+# the model repository should now look like this:
+model_repo_paraformer_large_offline/
+|-- encoder
+|   |-- 1
+|   |   `-- model.onnx
+|   `-- config.pbtxt
+|-- feature_extractor
+|   |-- 1
+|   |   `-- model.py
+|   |-- config.pbtxt
+|   `-- config.yaml
+|-- infer_pipeline
+|   |-- 1
+|   `-- config.pbtxt
+`-- scoring
+    |-- 1
+    |   `-- model.py
+    |-- config.pbtxt
+    `-- token_list.pkl
+
+8 directories, 9 files
+
+# launch the service
+tritonserver --model-repository ./model_repo_paraformer_large_offline \
+             --pinned-memory-pool-byte-size=512000000 \
+             --cuda-memory-pool-byte-size=0:1024000000
+```
+
+### Performance benchmark
+
+We benchmarked [speech_paraformer](https://www.modelscope.cn/models/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary) on the Aishell1 test set with a single V100 GPU; the total audio duration is 36108.919
+seconds.
+
+(Note: the service has been fully warmed up.)
+
+| concurrent tasks | processing time (s) | RTF    |
+|------------------|---------------------|--------|
+| 60 (onnx fp32)   | 116.0               | 0.0032 |
+
+## Acknowledgement
+This part originates from the NVIDIA CISI project. We also have TTS and NLP solutions deployed on Triton Inference Server. If you are interested, please contact us.
\ No newline at end of file
--
Gitblit v1.9.1
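The RTF value in the benchmark table is simply the total processing time divided by the total audio duration (so lower is better). A minimal sanity check of the reported figures, using only the numbers from the table above:

```python
# RTF (real-time factor) = processing time / total audio duration.
# The figures below are taken from the benchmark table:
# 60 concurrent tasks, onnx fp32, single V100, Aishell1 test set.
audio_duration_s = 36108.919
processing_time_s = 116.0

rtf = processing_time_s / audio_duration_s
speedup = 1.0 / rtf  # how many times faster than real time

print(f"RTF = {rtf:.4f}")        # matches the 0.0032 in the table
print(f"speedup = {speedup:.0f}x faster than real time")
```

This also shows why high concurrency matters for the benchmark: the RTF is aggregated over all 60 concurrent streams, not per stream.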