Inference with Triton

Steps:

Refer here to get model.onnx.
Then follow the instructions below to use Triton:
```sh
# Build the server image from Dockerfile/Dockerfile.server
docker build . -f Dockerfile/Dockerfile.server -t triton-paraformer:23.01

# Start the container; replace <path_host/funasr/runtime/> with the path to funasr/runtime on the host
docker run -it --rm --name "paraformer_triton_server" --gpus all -v <path_host/funasr/runtime/>:/workspace --shm-size 1g --net host triton-paraformer:23.01

# Inside the docker container, put the previously exported model.onnx in place
# (replace <path_model.onnx> with its actual path)
mv <path_model.onnx> /workspace/triton_gpu/model_repo_paraformer_large_offline/encoder/1/

# Expected layout summary (as printed by `tree`): 8 directories, 9 files

# Launch the service (run from /workspace/triton_gpu so the relative path resolves)
tritonserver --model-repository ./model_repo_paraformer_large_offline \
             --pinned-memory-pool-byte-size=512000000 \
             --cuda-memory-pool-byte-size=0:1024000000
```
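Once `tritonserver` is running, it can help to confirm that it is ready before sending requests. The following is a minimal sketch, not part of the original instructions: `wait_for_triton` is a hypothetical helper name, and it assumes Triton's default HTTP port 8000 is reachable on localhost (the container above uses `--net host`) and that `curl` is installed. It polls the standard `v2/health/ready` endpoint once per second.

```sh
# Hypothetical helper: poll Triton's HTTP readiness endpoint until it succeeds.
# Assumption: default HTTP port 8000 on localhost; adjust the URL if yours differs.
# Usage: wait_for_triton [URL] [MAX_ATTEMPTS]
wait_for_triton() {
    url="${1:-http://localhost:8000/v2/health/ready}"
    retries="${2:-30}"
    i=0
    while [ "$i" -lt "$retries" ]; do
        # -s: silent, -f: non-zero exit on HTTP errors, -o: discard the body
        if curl -sf -o /dev/null "$url"; then
            echo "ready"
            return 0
        fi
        i=$((i + 1))
        sleep 1
    done
    echo "not ready" >&2
    return 1
}
```

Calling `wait_for_triton` with no arguments tries the default URL up to 30 times. A non-zero exit status means the server did not report ready in that window, in which case the `tritonserver` log is the first place to look.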

Performance benchmark

speech_paraformer was benchmarked on the Aishell1 test set with a single V100 GPU. The total audio duration is 36108.919 seconds.

(Note: the service was fully warmed up before the measurement.)

| concurrent tasks | processing time (s) | RTF |
|------------------|---------------------|--------|
| 60 (onnx fp32) | 116.0 | 0.0032 |
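The RTF in the table is consistent with the usual definition of real-time factor: processing time divided by total audio duration. The sketch below reproduces it from the two numbers reported above. The function names `rtf` and `speedup` are illustrative only and are not part of the benchmark tooling.

```sh
# rtf PROCESSING_SECONDS AUDIO_SECONDS
# Prints the real-time factor (processing time / audio duration) to 4 decimals.
rtf() {
    awk -v p="$1" -v d="$2" 'BEGIN { printf "%.4f\n", p / d }'
}

# speedup PROCESSING_SECONDS AUDIO_SECONDS
# Prints how many seconds of audio are processed per wall-clock second (1 / RTF).
speedup() {
    awk -v p="$1" -v d="$2" 'BEGIN { printf "%.1f\n", d / p }'
}

rtf 116.0 36108.919      # prints 0.0032
speedup 116.0 36108.919  # prints 311.3
```

In other words, at 60 concurrent tasks the service processes roughly 311 seconds of audio per second of wall-clock time.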

Acknowledgement

This part originates from the NVIDIA CISI project. We also have TTS and NLP solutions deployed on Triton Inference Server. If you are interested, please contact us.
