## Inference with Triton

### Steps:
1. Refer here to [get model.onnx](https://github.com/alibaba-damo-academy/FunASR/blob/main/funasr/export/README.md)

2. Follow the instructions below to use Triton
   1. Prepare the model repo files
```sh
# use docker image Dockerfile/Dockerfile.server
docker build . -f Dockerfile/Dockerfile.server -t triton-paraformer:23.01
docker run -it --rm --name "paraformer_triton_server" --gpus all -v <path_host/funasr/runtime/>:/workspace --shm-size 1g --net host triton-paraformer:23.01
# inside the docker container, prepare the previously exported model.onnx
mv <path_model.onnx> /workspace/triton_gpu/model_repo_paraformer_large_offline/encoder/1/
git-lfs install
git clone https://www.modelscope.cn/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch.git

pretrained_model_dir=$(pwd)/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch

cp $pretrained_model_dir/tokens.txt ./model_repo_paraformer_large_offline/scoring/
cp $pretrained_model_dir/am.mvn ./model_repo_paraformer_large_offline/feature_extractor/

# Refer here to get model.onnx (https://github.com/alibaba-damo-academy/FunASR/blob/main/funasr/export/README.md)
cp <exported_onnx_dir>/model.onnx ./model_repo_paraformer_large_offline/encoder/1/
```
Log of directory tree:
```sh
model_repo_paraformer_large_offline/
|-- encoder
|   |-- 1
|   |   `-- model.onnx
|   `-- config.pbtxt
|-- feature_extractor
|   |-- 1
|   |   `-- model.py
|   |-- config.pbtxt
|   |-- am.mvn
|   `-- config.yaml
|-- infer_pipeline
|   |-- 1
|   `-- config.pbtxt
`-- scoring
    |-- 1
    |   `-- model.py
    |-- config.pbtxt
    `-- tokens.txt

8 directories, 10 files
```
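Before launching the server, it can help to confirm the repo on disk matches the tree above. A minimal sketch in plain Python; the expected paths are taken from the directory listing, so adjust the list if your layout differs:

```python
# Sanity-check the Triton model repo layout before starting tritonserver.
# EXPECTED_FILES mirrors the directory tree above.
from pathlib import Path

EXPECTED_FILES = [
    "encoder/1/model.onnx",
    "encoder/config.pbtxt",
    "feature_extractor/1/model.py",
    "feature_extractor/config.pbtxt",
    "feature_extractor/am.mvn",
    "feature_extractor/config.yaml",
    "infer_pipeline/config.pbtxt",
    "scoring/1/model.py",
    "scoring/config.pbtxt",
    "scoring/tokens.txt",
]

def missing_files(repo_dir):
    """Return the expected files that are absent from the model repo."""
    root = Path(repo_dir)
    return [f for f in EXPECTED_FILES if not (root / f).is_file()]
```

Call `missing_files("model_repo_paraformer_large_offline")`; an empty list means every file the pipeline needs is in place.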

2. Follow the instructions below to launch the Triton server
```sh
# use docker image Dockerfile/Dockerfile.server
docker build . -f Dockerfile/Dockerfile.server -t triton-paraformer:23.01
docker run -it --rm --name "paraformer_triton_server" --gpus all -v <path_host/model_repo_paraformer_large_offline>:/workspace/model_repo_paraformer_large_offline --shm-size 1g --net host triton-paraformer:23.01

# launch the service
tritonserver --model-repository /workspace/model_repo_paraformer_large_offline \
             --pinned-memory-pool-byte-size=512000000 \
             --cuda-memory-pool-byte-size=0:1024000000
```
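Once `tritonserver` is up, readiness can be checked over Triton's KServe-compatible HTTP API before sending work to it. A small sketch using only the standard library; the host and default HTTP port 8000 are assumptions, so change them to match your deployment:

```python
# Poll Triton's /v2/health/ready endpoint (KServe-compatible HTTP API)
# until the server reports ready or a timeout expires.
import time
import urllib.error
import urllib.request

def ready_url(host="localhost", port=8000):
    """Build the readiness URL for a Triton HTTP endpoint."""
    return f"http://{host}:{port}/v2/health/ready"

def wait_until_ready(host="localhost", port=8000, timeout_s=60.0):
    """Return True once /v2/health/ready answers 200, False on timeout."""
    deadline = time.monotonic() + timeout_s
    url = ready_url(host, port)
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=2) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not accepting connections yet
        time.sleep(1)
    return False
```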
Benchmark of [speech_paraformer](https://www.modelscope.cn/models/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary) on the AISHELL-1 test set with a single V100; the total audio duration is 36108.919 seconds.

```sh
# For the client container:
docker run -it --rm --name "client_test" --net host --gpus all -v <path_host/triton_gpu/client>:/workspace/ soar97/triton-k2:22.12.1 # noqa
# For the AISHELL manifests:
apt-get install git-lfs
git-lfs install
git clone https://huggingface.co/csukuangfj/aishell-test-dev-manifests
sudo mkdir -p /root/fangjun/open-source/icefall-aishell/egs/aishell/ASR/download/aishell
tar xf ./aishell-test-dev-manifests/data_aishell.tar.gz -C /root/fangjun/open-source/icefall-aishell/egs/aishell/ASR/download/aishell/ # noqa

serveraddr=localhost
manifest_path=/workspace/aishell-test-dev-manifests/data/fbank/aishell_cuts_test.jsonl.gz
num_task=60
python3 client/decode_manifest_triton.py \
    --server-addr $serveraddr \
    --compute-cer \
    --model-name infer_pipeline \
    --num-tasks $num_task \
    --manifest-filename $manifest_path
```
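`decode_manifest_triton.py` handles request batching internally; the core input preparation looks roughly like the sketch below (pad variable-length 16 kHz waveforms to one float32 batch and keep the true lengths so the server can ignore the padding). The actual tensor names and dtypes are defined in `infer_pipeline`'s `config.pbtxt`, so treat this as illustrative only:

```python
# Pad a list of 1-D waveforms into a (batch, max_len) float32 array plus
# an int32 length column, the typical shape for a batched ASR request.
import numpy as np

def pad_batch(waveforms):
    """Zero-pad waveforms to a common length; return (batch, lens)."""
    lens = np.array([len(w) for w in waveforms], dtype=np.int32)
    batch = np.zeros((len(waveforms), int(lens.max())), dtype=np.float32)
    for i, w in enumerate(waveforms):
        batch[i, : len(w)] = w
    return batch, lens.reshape(-1, 1)
```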

(Note: the service has been fully warmed up.)
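The RTF (real-time factor) column in the results is the ratio of processing time to total audio duration, using the 36108.919 s test-set total quoted above; a one-line sketch:

```python
# RTF = processing time / total audio duration.
# 36108.919 s is the AISHELL-1 test-set duration quoted above.
TOTAL_AUDIO_S = 36108.919

def rtf(processing_time_s, audio_duration_s=TOTAL_AUDIO_S):
    """Real-time factor: <1.0 means faster than real time."""
    return processing_time_s / audio_duration_s
```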
|concurrent-tasks | processing time(s) | RTF |
|----------|--------------------|------------|