LLM Concurrency Importance #
I feel benchmarks are misleading; everyone has their own usage patterns. Because of the constraints of my local LLM hardware, I developed methods to work with smaller models that are good enough, and I refuse to spend more money on expensive, power-hungry GPUs. One thing I quickly realized is that instead of running larger, slower models with smaller usable context lengths, it is better to develop good chunking methods and work with really good small, efficient models. And since I feed them so much batched input, concurrency avoids queue buildup and improves my processing efficiency.
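To make the chunking point concrete, here is a minimal sketch of the kind of splitting I mean: break a long document into overlapping pieces that fit a small context window. The 4-characters-per-token estimate and the overlap size are assumptions for illustration, not measurements against any particular tokenizer.

```python
# Rough sketch of context-aware chunking for a small model.
# The chars-per-token estimate and overlap size are assumptions,
# not tied to any specific tokenizer.

def chunk_text(text: str, max_tokens: int = 2048, overlap_tokens: int = 128) -> list[str]:
    """Split text into overlapping chunks that fit a small context window."""
    chars_per_token = 4                      # crude heuristic
    max_chars = max_tokens * chars_per_token
    overlap_chars = overlap_tokens * chars_per_token

    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap_chars          # keep some overlap for continuity
    return chunks


if __name__ == "__main__":
    doc = "some long document " * 2000
    print(len(chunk_text(doc, max_tokens=1024)))
```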
Why Concurrency Matters in Production #
I feel it also matters for LLM service providers because of the metrics they have to hit:
- The reality of serving multiple users at once
- API endpoint scaling
- Cost per request served
- User experience impact
Concurrency Strategies #
- Here again, I feel the more VRAM I have, the more models I can serve concurrently. But I am simply not willing to support Nvidia's current price gouging on high-VRAM GPUs, so I will work with what I have. I don't want to go with Apple, since hardware becomes obsolete very fast in this space. AMD and Intel are stuck copying Nvidia's VRAM strategy while having worse support for LLMs.
- Continuous batching. I have implemented my own scheduling script that monitors GPU usage; a sketch of the idea is after this list.
- Pipeline parallelism. This is possible to do inside Python, and I have a method that allows it; see the second sketch after this list.
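This is not my exact scheduling script, just a minimal sketch of the idea behind it: poll GPU memory with nvidia-smi and only dispatch the next batch when there is headroom. The memory ceiling, the batch size, and the send_batch function are placeholders.

```python
# Sketch of a batch scheduler gated on GPU memory headroom.
# Assumes nvidia-smi is on PATH; send_batch() is a placeholder for
# whatever actually submits prompts to the local inference server.

import subprocess
import time
from collections import deque

def gpu_memory_used_fraction(gpu_index: int = 0) -> float:
    """Return used/total VRAM for one GPU by parsing nvidia-smi output."""
    out = subprocess.check_output([
        "nvidia-smi",
        "--query-gpu=memory.used,memory.total",
        "--format=csv,noheader,nounits",
    ], text=True)
    used, total = map(int, out.splitlines()[gpu_index].split(","))
    return used / total

def send_batch(batch: list[str]) -> None:
    # Placeholder: in reality this would POST the prompts to the server.
    print(f"dispatching batch of {len(batch)} prompts")

def schedule(prompts: list[str], batch_size: int = 8,
             memory_ceiling: float = 0.85, poll_seconds: float = 1.0) -> None:
    """Drain the prompt queue, dispatching only when VRAM usage is below the ceiling."""
    queue = deque(prompts)
    while queue:
        if gpu_memory_used_fraction() < memory_ceiling:
            batch = [queue.popleft() for _ in range(min(batch_size, len(queue)))]
            send_batch(batch)
        else:
            time.sleep(poll_seconds)   # back off while the GPU is saturated
```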
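And here is a sketch of what I mean by pipeline parallelism in plain Python: overlap pre-processing, inference, and post-processing with threads and queues so the GPU-bound stage never sits idle waiting on CPU work. The three stage functions are stand-ins, not my actual code.

```python
# Sketch of pipeline parallelism with threads and queues.
# preprocess/infer/postprocess are stand-ins for the real stages;
# the point is that each stage runs concurrently on its own thread.

import queue
import threading

def preprocess(item: str) -> str:
    return item.strip().lower()            # stand-in for chunking/cleanup

def infer(item: str) -> str:
    return f"response to: {item}"          # stand-in for the GPU-bound call

def postprocess(item: str) -> str:
    return item.upper()                    # stand-in for parsing/validation

def stage(fn, inbox: queue.Queue, outbox: queue.Queue) -> None:
    """Pull items from inbox, apply fn, push to outbox until a None sentinel arrives."""
    while (item := inbox.get()) is not None:
        outbox.put(fn(item))
    outbox.put(None)                       # propagate shutdown downstream

def run_pipeline(items: list[str]) -> list[str]:
    q_in, q_mid, q_out, results = queue.Queue(), queue.Queue(), queue.Queue(), []
    threads = [
        threading.Thread(target=stage, args=(preprocess, q_in, q_mid)),
        threading.Thread(target=stage, args=(infer, q_mid, q_out)),
    ]
    for t in threads:
        t.start()
    for item in items:
        q_in.put(item)
    q_in.put(None)
    while (result := q_out.get()) is not None:
        results.append(postprocess(result))
    for t in threads:
        t.join()
    return results

if __name__ == "__main__":
    print(run_pipeline(["  First prompt ", "Second PROMPT"]))
```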
Hardware Considerations #
- RAM vs VRAM tradeoffs. High-speed quad-channel RAM can approach the bandwidth of older Nvidia GPUs. Maybe an Epyc CPU with a large RAM capacity would let me keep model copies in RAM in the future; a rough bandwidth comparison is after this list.
- CPU threads needed (not a bottleneck for me)
- PCIe bandwidth limits
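A rough back-of-envelope for the RAM vs VRAM point. The 64-bit (8-byte) bus per DDR channel is standard; the specific memory speeds and the GPU comparison are just examples, and these are peak theoretical numbers.

```python
# Back-of-envelope peak memory bandwidth (GB/s).
# Each DDR channel has a 64-bit (8-byte) bus; parts listed are examples.

def ddr_bandwidth_gbs(channels: int, mega_transfers_per_s: int) -> float:
    return channels * mega_transfers_per_s * 8 / 1000   # 8 bytes per transfer

print(ddr_bandwidth_gbs(2, 3200))   # dual-channel DDR4-3200        ~  51 GB/s
print(ddr_bandwidth_gbs(4, 3200))   # quad-channel DDR4-3200        ~ 102 GB/s
print(ddr_bandwidth_gbs(8, 4800))   # 8-channel DDR5-4800 (Epyc)    ~ 307 GB/s
# For comparison, an older GPU like the GTX 1070 peaks around 256 GB/s of GDDR5 bandwidth.
```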
Software Solutions #
- Ollama parallel requests (first sketch after this list)
- vLLM (second sketch after this list)
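With Ollama, parallelism is configured server-side (the OLLAMA_NUM_PARALLEL environment variable, as I understand it), and then you simply fire concurrent requests at it. A minimal sketch using a thread pool against the standard /api/generate endpoint; the model name is just an example of something already pulled.

```python
# Sketch: concurrent requests against a local Ollama server.
# Server-side parallelism is set via OLLAMA_NUM_PARALLEL before starting Ollama;
# the model name ("llama3.2") is an example, not a requirement.

from concurrent.futures import ThreadPoolExecutor
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"

def generate(prompt: str, model: str = "llama3.2") -> str:
    resp = requests.post(
        OLLAMA_URL,
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["response"]

prompts = [f"Summarize chunk {i} in one sentence." for i in range(8)]
with ThreadPoolExecutor(max_workers=4) as pool:   # roughly match OLLAMA_NUM_PARALLEL
    for answer in pool.map(generate, prompts):
        print(answer)
```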
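vLLM handles the continuous batching for you; with the offline API you just hand it a list of prompts. A minimal sketch, assuming vLLM is installed and the model fits in VRAM; the model name is an example.

```python
# Sketch: offline batched generation with vLLM, which performs
# continuous batching internally. Model name is an example; pick
# something small enough for your VRAM.

from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-1.5B-Instruct")
params = SamplingParams(temperature=0.2, max_tokens=128)

prompts = [f"Summarize chunk {i} in one sentence." for i in range(32)]
outputs = llm.generate(prompts, params)     # all prompts batched by the engine

for out in outputs:
    print(out.outputs[0].text.strip())
```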
Metrics That Matter #
- Batched requests processed per second; a quick way to measure it is sketched below.
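The only way I trust that number is by measuring it end to end: time a fixed set of batched requests and divide. A minimal sketch; run_batch is a placeholder for whatever actually submits a batch.

```python
# Sketch: measure batched requests processed per second end to end.
# run_batch() is a placeholder for whatever actually submits a batch.

import time

def run_batch(batch: list[str]) -> None:
    time.sleep(0.1)                    # placeholder for real inference

def requests_per_second(prompts: list[str], batch_size: int = 8) -> float:
    start = time.perf_counter()
    for i in range(0, len(prompts), batch_size):
        run_batch(prompts[i:i + batch_size])
    elapsed = time.perf_counter() - start
    return len(prompts) / elapsed

print(f"{requests_per_second(['p'] * 64):.1f} requests/s")
```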