
			@@ -1,16 +1,21 @@
			## Inference with Triton

			### Steps:
			1. Prepare the model repo files. To export `model.onnx`, refer to [get model.onnx](https://github.com/alibaba-damo-academy/FunASR/tree/main/funasr/runtime/python/onnxruntime#steps).
			```sh
			# using docker image Dockerfile/Dockerfile.server
			docker build . -f Dockerfile/Dockerfile.server -t triton-paraformer:23.01
			docker run -it --rm --name "paraformer_triton_server" --gpus all -v <path_host/funasr/runtime/>:/workspace --shm-size 1g --net host triton-paraformer:23.01
			# inside the docker container, place the previously exported model.onnx
			mv <path_model.onnx> /workspace/triton_gpu/model_repo_paraformer_large_offline/encoder/1/
			git-lfs install
			git clone https://www.modelscope.cn/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch.git

			pretrained_model_dir=$(pwd)/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch

			cp $pretrained_model_dir/am.mvn ./model_repo_paraformer_large_offline/feature_extractor/
			cp $pretrained_model_dir/config.yaml ./model_repo_paraformer_large_offline/feature_extractor/

			# Refer here to get model.onnx (https://github.com/alibaba-damo-academy/FunASR/blob/main/funasr/export/README.md)
			cp <exported_onnx_dir>/model.onnx ./model_repo_paraformer_large_offline/encoder/1/
			```
			The resulting directory tree:
			```sh
			model_repo_paraformer_large_offline/
			|-- encoder
			|   |-- 1
			@@ -20,6 +25,7 @@
			|   |-- 1
			|   |   `-- model.py
			|   |-- config.pbtxt
			|   |-- am.mvn
			|   `-- config.yaml
			|-- infer_pipeline
			|   |-- 1
			@@ -27,13 +33,19 @@
			`-- scoring
			    |-- 1
			    |   `-- model.py
			    |-- config.pbtxt
			    `-- token_list.pkl

			8 directories, 9 files
			```
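
Before launching the server, it can help to confirm that the files copied in step 1 landed where the tree above shows them. The following is a minimal sketch and not part of FunASR: the helper name `missing_files` and the `EXPECTED` list are illustrative, and the list covers only paths named in the copy commands and the tree above.

```python
from pathlib import Path

# Illustrative list: relative paths taken from the copy commands and the
# directory tree in this README. Extend it to match your own repo.
EXPECTED = [
    "encoder/1/model.onnx",
    "feature_extractor/am.mvn",
    "feature_extractor/config.yaml",
    "scoring/1/model.py",
    "scoring/config.pbtxt",
    "scoring/token_list.pkl",
]


def missing_files(repo_root, expected=EXPECTED):
    """Return the expected relative paths that are not files under repo_root."""
    root = Path(repo_root)
    return [rel for rel in expected if not (root / rel).is_file()]
```

Calling `missing_files("./model_repo_paraformer_large_offline")` returns an empty list when every listed file is present, and otherwise returns the paths that still need to be copied.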

			2. Follow the instructions below to launch the Triton server:
			```sh
			# using docker image Dockerfile/Dockerfile.server
			docker build . -f Dockerfile/Dockerfile.server -t triton-paraformer:23.01
			docker run -it --rm --name "paraformer_triton_server" --gpus all -v <path_host/model_repo_paraformer_large_offline>:/workspace/ --shm-size 1g --net host triton-paraformer:23.01

			# launch the service
			tritonserver --model-repository /workspace/model_repo_paraformer_large_offline \
			--pinned-memory-pool-byte-size=512000000 \
			--cuda-memory-pool-byte-size=0:1024000000

			@@ -43,10 +55,31 @@

			Benchmark of [speech_paraformer](https://www.modelscope.cn/models/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary) on the Aishell1 test set with a single V100 GPU. The total audio duration is 36108.919 seconds.

			```sh
			# For client container:
			docker run -it --rm --name "client_test" --net host --gpus all -v <path_host/triton_gpu/client>:/workspace/ soar97/triton-k2:22.12.1 # noqa
			# For aishell manifests:
			apt-get install git-lfs
			git-lfs install
			git clone https://huggingface.co/csukuangfj/aishell-test-dev-manifests
			sudo mkdir -p /root/fangjun/open-source/icefall-aishell/egs/aishell/ASR/download/aishell
			tar xf ./aishell-test-dev-manifests/data_aishell.tar.gz -C /root/fangjun/open-source/icefall-aishell/egs/aishell/ASR/download/aishell/ # noqa

			serveraddr=localhost
			manifest_path=/workspace/aishell-test-dev-manifests/data/fbank/aishell_cuts_test.jsonl.gz
			num_task=60
			python3 client/decode_manifest_triton.py \
			--server-addr $serveraddr \
			--compute-cer \
			--model-name infer_pipeline \
			--num-tasks $num_task \
			--manifest-filename $manifest_path
			```
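
Before starting the benchmark client, it may be useful to check that the server is ready to accept requests. The sketch below probes Triton's HTTP readiness endpoint (`/v2/health/ready`). It assumes the HTTP endpoint is enabled on Triton's default port 8000, which this README does not state explicitly; the function names `ready_url` and `is_server_ready` are illustrative and not part of the client scripts.

```python
import http.client
import urllib.request


def ready_url(host, port=8000):
    """Build the URL of Triton's HTTP readiness endpoint (port is an assumption)."""
    return f"http://{host}:{port}/v2/health/ready"


def is_server_ready(host, port=8000, timeout=2.0):
    """Return True if the readiness probe answers with HTTP 200, else False."""
    try:
        with urllib.request.urlopen(ready_url(host, port), timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, HTTP error status, or malformed reply.
        return False
```

For example, `is_server_ready("localhost")` can be polled in a loop until it returns `True`, after which the `decode_manifest_triton.py` command above can be run.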

			(Note: The service had been fully warmed up before the measurement.)

			| concurrent-tasks | processing time (s) | RTF |
			|------------------|---------------------|--------|
			| 60 (onnx fp32) | 116.0 | 0.0032 |
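
The real-time factor (RTF) is commonly defined as processing time divided by audio duration. Assuming the table uses that definition, the reported value can be reproduced from the two numbers given in this section (116.0 s of processing for 36108.919 s of audio). The function name below is illustrative.

```python
def real_time_factor(processing_seconds, audio_seconds):
    """Return processing time divided by audio duration (lower is faster)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


rtf = real_time_factor(116.0, 36108.919)
print(f"RTF = {rtf:.4f}")  # prints "RTF = 0.0032"
```

An RTF of about 0.0032 means the audio is processed roughly 311 times faster than real time under this setup.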

			## Acknowledgement
			This part originates from the NVIDIA CISI project. We also have TTS and NLP solutions deployed on Triton Inference Server. If you are interested, please contact us.