LLM Concurrency Importance #
I feel benchmarks are misleading; everyone has their own usage patterns. Because of the constraints of my local LLM hardware, I developed methods to work with smaller models that are good enough, and I refuse to spend more money on expensive, power-hungry GPUs. One thing I quickly realized is that instead of running larger, slower models with smaller usable context lengths, it is better to develop good chunking methods and work with really good small, efficient models. And since I feed them so much batched input, concurrency avoids queue buildup and improves my processing efficiency.
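To make the chunking point concrete, here is a minimal sketch of the kind of splitting I mean: break a long document into overlapping pieces that fit a small context window. The 4-characters-per-token estimate and the overlap size are assumptions for illustration, not measurements against any particular tokenizer.

```python
# Rough sketch of context-aware chunking for a small model.
# The chars-per-token estimate and overlap size are assumptions,
# not tied to any specific tokenizer.

def chunk_text(text: str, max_tokens: int = 2048, overlap_tokens: int = 128) -> list[str]:
    """Split text into overlapping chunks that fit a small context window."""
    chars_per_token = 4                      # crude heuristic
    max_chars = max_tokens * chars_per_token
    overlap_chars = overlap_tokens * chars_per_token

    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap_chars          # keep some overlap for continuity
    return chunks


if __name__ == "__main__":
    doc = "some long document " * 2000
    print(len(chunk_text(doc, max_tokens=1024)))
```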
Why Concurrency Matters in Production #
I feel it also matters for LLM service providers because of the metrics they have to hit:
- The reality of serving multiple users at once
- API endpoint scaling
- Cost per request served
- User experience impact
Concurrency Strategies #
- Here again, I feel the more VRAM I have, the more models I can serve concurrently. But I am simply not willing to support Nvidia's current price gouging on high-VRAM GPUs, so I will work with what I have. I don't want to go with Apple, since hardware becomes obsolete very fast in this space. AMD and Intel are stuck copying Nvidia's VRAM strategy while having worse support for LLMs.
- Continuous batching. I have implemented my own scheduling script that monitors GPU usage; a sketch of the idea is after this list.
- Pipeline parallelism. This is possible to do inside Python, and I have a method that allows it; see the second sketch after this list.
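This is not my exact scheduling script, just a minimal sketch of the idea behind it: poll GPU memory with nvidia-smi and only dispatch the next batch when there is headroom. The memory ceiling, the batch size, and the send_batch function are placeholders.

```python
# Sketch of a batch scheduler gated on GPU memory headroom.
# Assumes nvidia-smi is on PATH; send_batch() is a placeholder for
# whatever actually submits prompts to the local inference server.

import subprocess
import time
from collections import deque

def gpu_memory_used_fraction(gpu_index: int = 0) -> float:
    """Return used/total VRAM for one GPU by parsing nvidia-smi output."""
    out = subprocess.check_output([
        "nvidia-smi",
        "--query-gpu=memory.used,memory.total",
        "--format=csv,noheader,nounits",
    ], text=True)
    used, total = map(int, out.splitlines()[gpu_index].split(","))
    return used / total

def send_batch(batch: list[str]) -> None:
    # Placeholder: in reality this would POST the prompts to the server.
    print(f"dispatching batch of {len(batch)} prompts")

def schedule(prompts: list[str], batch_size: int = 8,
             memory_ceiling: float = 0.85, poll_seconds: float = 1.0) -> None:
    """Drain the prompt queue, dispatching only when VRAM usage is below the ceiling."""
    queue = deque(prompts)
    while queue:
        if gpu_memory_used_fraction() < memory_ceiling:
            batch = [queue.popleft() for _ in range(min(batch_size, len(queue)))]
            send_batch(batch)
        else:
            time.sleep(poll_seconds)   # back off while the GPU is saturated
```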
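And here is a sketch of what I mean by pipeline parallelism in plain Python: overlap pre-processing, inference, and post-processing with threads and queues so the GPU-bound stage never sits idle waiting on CPU work. The three stage functions are stand-ins, not my actual code.

```python
# Sketch of pipeline parallelism with threads and queues.
# preprocess/infer/postprocess are stand-ins for the real stages;
# the point is that each stage runs concurrently on its own thread.

import queue
import threading

def preprocess(item: str) -> str:
    return item.strip().lower()            # stand-in for chunking/cleanup

def infer(item: str) -> str:
    return f"response to: {item}"          # stand-in for the GPU-bound call

def postprocess(item: str) -> str:
    return item.upper()                    # stand-in for parsing/validation

def stage(fn, inbox: queue.Queue, outbox: queue.Queue) -> None:
    """Pull items from inbox, apply fn, push to outbox until a None sentinel arrives."""
    while (item := inbox.get()) is not None:
        outbox.put(fn(item))
    outbox.put(None)                       # propagate shutdown downstream

def run_pipeline(items: list[str]) -> list[str]:
    q_in, q_mid, q_out, results = queue.Queue(), queue.Queue(), queue.Queue(), []
    threads = [
        threading.Thread(target=stage, args=(preprocess, q_in, q_mid)),
        threading.Thread(target=stage, args=(infer, q_mid, q_out)),
    ]
    for t in threads:
        t.start()
    for item in items:
        q_in.put(item)
    q_in.put(None)
    while (result := q_out.get()) is not None:
        results.append(postprocess(result))
    for t in threads:
        t.join()
    return results

if __name__ == "__main__":
    print(run_pipeline(["  First prompt ", "Second PROMPT"]))
```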
Hardware Considerations #
- RAM vs VRAM tradeoffs. High-speed quad-channel RAM can approach the bandwidth of older Nvidia GPUs. Maybe an Epyc CPU with a large RAM capacity would let me keep model copies in RAM in the future; a rough bandwidth comparison is after this list.
- CPU threads needed (not a bottleneck for me)
- PCIe bandwidth limits
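A rough back-of-envelope for the RAM vs VRAM point. The 64-bit (8-byte) bus per DDR channel is standard; the specific memory speeds and the GPU comparison are just examples, and these are peak theoretical numbers.

```python
# Back-of-envelope peak memory bandwidth (GB/s).
# Each DDR channel has a 64-bit (8-byte) bus; parts listed are examples.

def ddr_bandwidth_gbs(channels: int, mega_transfers_per_s: int) -> float:
    return channels * mega_transfers_per_s * 8 / 1000   # 8 bytes per transfer

print(ddr_bandwidth_gbs(2, 3200))   # dual-channel DDR4-3200        ~  51 GB/s
print(ddr_bandwidth_gbs(4, 3200))   # quad-channel DDR4-3200        ~ 102 GB/s
print(ddr_bandwidth_gbs(8, 4800))   # 8-channel DDR5-4800 (Epyc)    ~ 307 GB/s
# For comparison, an older GPU like the GTX 1070 peaks around 256 GB/s of GDDR5 bandwidth.
```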
Software Solutions #
- Ollama parallel requests (first sketch after this list)
- vLLM (second sketch after this list)
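With Ollama, parallelism is configured server-side (the OLLAMA_NUM_PARALLEL environment variable, as I understand it), and then you simply fire concurrent requests at it. A minimal sketch using a thread pool against the standard /api/generate endpoint; the model name is just an example of something already pulled.

```python
# Sketch: concurrent requests against a local Ollama server.
# Server-side parallelism is set via OLLAMA_NUM_PARALLEL before starting Ollama;
# the model name ("llama3.2") is an example, not a requirement.

from concurrent.futures import ThreadPoolExecutor
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"

def generate(prompt: str, model: str = "llama3.2") -> str:
    resp = requests.post(
        OLLAMA_URL,
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["response"]

prompts = [f"Summarize chunk {i} in one sentence." for i in range(8)]
with ThreadPoolExecutor(max_workers=4) as pool:   # roughly match OLLAMA_NUM_PARALLEL
    for answer in pool.map(generate, prompts):
        print(answer)
```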
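vLLM handles the continuous batching for you; with the offline API you just hand it a list of prompts. A minimal sketch, assuming vLLM is installed and the model fits in VRAM; the model name is an example.

```python
# Sketch: offline batched generation with vLLM, which performs
# continuous batching internally. Model name is an example; pick
# something small enough for your VRAM.

from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-1.5B-Instruct")
params = SamplingParams(temperature=0.2, max_tokens=128)

prompts = [f"Summarize chunk {i} in one sentence." for i in range(32)]
outputs = llm.generate(prompts, params)     # all prompts batched by the engine

for out in outputs:
    print(out.outputs[0].text.strip())
```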
Metrics That Matter #
- Batched requests processed per second; a quick way to measure it is sketched below.
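The only way I trust that number is by measuring it end to end: time a fixed set of batched requests and divide. A minimal sketch; run_batch is a placeholder for whatever actually submits a batch.

```python
# Sketch: measure batched requests processed per second end to end.
# run_batch() is a placeholder for whatever actually submits a batch.

import time

def run_batch(batch: list[str]) -> None:
    time.sleep(0.1)                    # placeholder for real inference

def requests_per_second(prompts: list[str], batch_size: int = 8) -> float:
    start = time.perf_counter()
    for i in range(0, len(prompts), batch_size):
        run_batch(prompts[i:i + batch_size])
    elapsed = time.perf_counter() - start
    return len(prompts) / elapsed

print(f"{requests_per_second(['p'] * 64):.1f} requests/s")
```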