LLMs at Scale

CIS1902 Python Programming

Agenda

  1. LLM Engineering
  2. LLM Architecture
  3. LLM Parallelization

LLM Engineering

In a system like OpenAI's, there are essentially two pipelines: training the model and serving the model.

Importantly, the distinction between the two is the real-time constraint. Training and testing models can be done slowly, whereas serving the model needs to be as fast as possible.

For training, typically a separate pipeline will exist offline for researchers to test new iterations of the model. This could involve incorporating new training data, formatting and parsing training data, novel changes to the underlying algorithms, etc.

LLM Engineering

How do we answer millions of LLM queries per day in real time?

Serving the model is a scalability and distributed systems problem. The model itself is huge, on the order of multiple terabytes. Furthermore, inference on an LLM can cost 10-100x more compute than serving a traditional web application.

How do we make this efficient?
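To get a sense of the cost, here is a rough back-of-the-envelope sketch in Python. The parameter count and response length are illustrative assumptions, and the ~2 FLOPs per parameter per generated token figure is a common rule of thumb for a dense transformer forward pass.

    # Rough estimate of the compute needed to answer one LLM request.
    # All numbers below are illustrative assumptions, not real production figures.
    params = 175e9                    # assumed model size: 175B parameters
    flops_per_token = 2 * params      # ~2 FLOPs per parameter per generated token
    tokens_generated = 500            # assumed length of a typical response

    flops_per_request = flops_per_token * tokens_generated
    print(f"{flops_per_request:.2e} FLOPs per request")   # ~1.75e+14 FLOPs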

Business Focused Data Science

At a company like Meta or Google, the problem is quite different. These companies generate tons of data on their users every day. Their data scientists are not training a single large model; instead, they need to be able to train many models to answer business questions. Eventually, they may need to serve these models as well.

Typically, these companies will aggregate copies of the data they need for data science into a single location, sometimes called a data lake. This allows for fast iteration when developing smaller models to answer business questions. If they need to serve the models, they do so in a similar way to LLMs.

LLM Architecture

[Figure: LLM architecture]

LLM Architecture

Tokenization and embedding are typically "easy" in the sense that they can be done on a single GPU.

The challenge is computing the transformer blocks. For the largest models, these take up on the order of terabytes of storage just for their parameters.
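As a rough illustration of why (the parameter count below is an assumed, illustrative figure; 2 bytes per parameter corresponds to storing weights in 16-bit precision):

    # Back-of-the-envelope parameter-memory estimate for a very large model.
    params = 1.8e12            # assumed parameter count: 1.8 trillion (illustrative)
    bytes_per_param = 2        # 16-bit (fp16/bf16) weights
    total_bytes = params * bytes_per_param
    print(f"{total_bytes / 1e12:.1f} TB of weights")   # ~3.6 TB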

If we use a single machine, it needs to be massive! How do we distribute the computation for a neural network?

Splitting up a Neural Network

[Figure: neural network diagram]

Splitting up a Neural Network

There are quite a few approaches for parallelizing neural network inference, but we'll just cover pipeline parallelism and tensor parallelism.

Regardless of the method, parallelization typically requires high-bandwidth links between all machines. This is usually the case if everything is set up within the same cloud provider (AWS/Google Cloud).

Pipeline Parallelism

[Figure: pipeline parallelism]

Pipeline Parallelism

  • The idea is very simple: assign contiguous chunks of layers to each GPU. This way, each machine only needs to hold a fraction of the parameters (see the sketch below).
  • During inference, each machine sends its outputs to the machine holding the next chunk of layers.
  • We can parallelize across at most as many machines as there are layers.
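A minimal sketch of the idea, using NumPy and a toy stack of linear layers. The layer sizes and the two-device split are assumptions for illustration; in practice each chunk lives on a separate GPU and the intermediate activations are sent over the network.

    import numpy as np

    # Toy stand-in for a deep network: a stack of linear layers with ReLU.
    rng = np.random.default_rng(0)
    layers = [rng.standard_normal((64, 64)) for _ in range(8)]

    def run_chunk(x, chunk):
        """Run the layers assigned to one device."""
        for w in chunk:
            x = np.maximum(w @ x, 0.0)   # linear layer followed by ReLU
        return x

    # Pipeline parallelism: each device holds a contiguous chunk of layers.
    device0_layers = layers[:4]          # first half of the network on device 0
    device1_layers = layers[4:]          # second half on device 1

    x = rng.standard_normal(64)
    hidden = run_chunk(x, device0_layers)       # device 0 computes its chunk...
    output = run_chunk(hidden, device1_layers)  # ...and sends activations to device 1
    print(output.shape)                         # (64,)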

Tensor Parallelism

[Figure: tensor parallelism]

Tensor Parallelism

  • We leverage the fact that the majority of neural net inference is matrix multiplication. We notice that we can "chunk" the output computation.
  • To compute a chunk of the output y = Wx, say y_i = W_i x, we just need the corresponding rows W_i of the weight matrix and the full input x.
  • This generalizes to chunks of any size, so we can parallelize as much as we need!
  • However, more communication bandwidth is needed: additional computations such as aggregations and non-linear transformations combine the outputs across machines (see the sketch below).
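A minimal sketch of the output chunking with NumPy. The sizes and the two-way split are assumptions for illustration; a real implementation places each chunk on its own GPU and gathers the results with collective communication.

    import numpy as np

    rng = np.random.default_rng(0)
    W = rng.standard_normal((1024, 512))   # weight matrix of one large layer
    x = rng.standard_normal(512)           # input activations

    # Tensor parallelism: split W by rows so each device computes a chunk of y = W @ x.
    W0, W1 = np.vsplit(W, 2)               # device 0 gets the top half, device 1 the bottom
    y0 = W0 @ x                            # computed on device 0
    y1 = W1 @ x                            # computed on device 1

    # The chunks are then gathered: this is the extra communication step.
    y = np.concatenate([y0, y1])
    assert np.allclose(y, W @ x)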