Inference with Triton

Steps:

  1. Refer here to get model.onnx

  2. Follow the instructions below to use Triton
    ```sh

    # build and run the server image defined in Dockerfile/Dockerfile.server

    docker build . -f Dockerfile/Dockerfile.server -t triton-paraformer:23.01
    docker run -it --rm --name "paraformer_triton_server" --gpus all -v <path_host/funasr/runtime/>:/workspace --shm-size 1g --net host triton-paraformer:23.01

    # inside the docker container, move the previously exported model.onnx into the model repository

    mv <path_model.onnx> /workspace/triton_gpu/model_repo_paraformer_large_offline/encoder/1/

# model_repo_paraformer_large_offline/
# |-- encoder
# |   |-- 1
# |   |   `-- model.onnx
# |   `-- config.pbtxt
# |-- feature_extractor
# |   |-- 1
# |   |   `-- model.py
# |   |-- config.pbtxt
# |   `-- config.yaml
# |-- infer_pipeline
# |   |-- 1
# |   `-- config.pbtxt
# `-- scoring
#     |-- 1
#     |   `-- model.py
#     |-- config.pbtxt
#     `-- token_list.pkl
#
# 8 directories, 9 files

# launch the service

tritonserver --model-repository ./model_repo_paraformer_large_offline \
--pinned-memory-pool-byte-size=512000000 \
--cuda-memory-pool-byte-size=0:1024000000

```
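
As a rough sketch (not part of the original scripts), the helpers below check that the model repository matches the listing above before launching, and then poll Triton's readiness endpoint afterwards. The function names `check_model_repo` and `wait_for_triton` are hypothetical, and the readiness URL assumes Triton's default HTTP port 8000, which the launch command above does not change.

```sh
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of the repository.

# Files expected in the model repository, taken from the directory listing.
REQUIRED_FILES="
encoder/1/model.onnx
encoder/config.pbtxt
feature_extractor/1/model.py
feature_extractor/config.pbtxt
feature_extractor/config.yaml
infer_pipeline/config.pbtxt
scoring/1/model.py
scoring/config.pbtxt
scoring/token_list.pkl
"

# check_model_repo <repo_dir>
# Prints every missing entry and returns non-zero if anything is missing.
check_model_repo() {
    repo_dir="$1"
    missing=0
    for rel in $REQUIRED_FILES; do
        if [ ! -f "$repo_dir/$rel" ]; then
            echo "missing: $rel"
            missing=1
        fi
    done
    # infer_pipeline/1 is listed as a version directory with no files in it.
    if [ ! -d "$repo_dir/infer_pipeline/1" ]; then
        echo "missing: infer_pipeline/1/"
        missing=1
    fi
    return "$missing"
}

# wait_for_triton [base_url] [max_tries]
# Polls GET /v2/health/ready once per second until it answers with HTTP 200.
# Assumption: the server listens on the default HTTP port 8000.
wait_for_triton() {
    base_url="${1:-http://localhost:8000}"
    max_tries="${2:-60}"
    i=0
    while [ "$i" -lt "$max_tries" ]; do
        if curl -sf -o /dev/null "$base_url/v2/health/ready"; then
            echo "ready"
            return 0
        fi
        i=$((i + 1))
        sleep 1
    done
    echo "not ready after $max_tries tries"
    return 1
}
```

One way to use them: run `check_model_repo ./model_repo_paraformer_large_offline` inside the container before starting `tritonserver`, and run `wait_for_triton` from a second shell before sending any requests.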

Performance benchmark

Benchmark of speech_paraformer on the Aishell1 test set with a single V100 GPU. The total audio duration is 36108.919 seconds.

(Note: the service was fully warmed up.)

| concurrent-tasks | processing time (s) | RTF |
|----------|--------------------|------------|
| 60 (onnx fp32) | 116.0 | 0.0032|
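
The RTF column can be reproduced from the other two figures. The sketch below assumes RTF (real-time factor) is defined as processing time divided by total audio duration, which matches the value in the table. The helper names `rtf` and `speedup` are hypothetical.

```sh
#!/bin/sh
# Sketch only: helper names are hypothetical.

# rtf <processing_seconds> <audio_seconds>
# Real-time factor: processing time divided by audio duration (lower is faster).
rtf() {
    awk -v p="$1" -v a="$2" 'BEGIN { printf "%.4f\n", p / a }'
}

# speedup <processing_seconds> <audio_seconds>
# Inverse of RTF: seconds of audio processed per second of wall-clock time.
speedup() {
    awk -v p="$1" -v a="$2" 'BEGIN { printf "%.1f\n", a / p }'
}

rtf 116.0 36108.919       # prints 0.0032
speedup 116.0 36108.919   # prints 311.3
```

In other words, at 60 concurrent tasks the service processes roughly 311 seconds of audio per second of wall-clock time.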

Acknowledgement

This part originates from the NVIDIA CISI project. We also have TTS and NLP solutions deployed on Triton Inference Server. If you are interested, please contact us.