Hybrid Heterogeneous Clusters Can Lower the Energy Consumption of LLM Inference Workloads

Grant Wilkins (gfw27@cam.ac.uk), University of Cambridge, Cambridge, UK
Srinivasan Keshav (sk818@cam.ac.uk), University of Cambridge, Cambridge, UK
Richard Mortier (rmm1002@cam.ac.uk), University of Cambridge, Cambridge, UK

ABSTRACT
Both the training and use of Large Language Models (LLMs) require large amounts of energy. Their increasing popularity, therefore, raises critical concerns regarding the energy efficiency and sustainability of data centers that host them. This paper addresses the challenge of reducing energy consumption in data centers running LLMs. We propose a hybrid data center model that uses a cost-based scheduling framework to dynamically allocate LLM tasks across hardware accelerators that differ in their energy efficiencies and computational capabilities. Specifically, our workload-aware strategy determines whether tasks are processed on energy-efficient processors or high-performance GPUs based on the number of input and output tokens in a query. Our analysis of a representative LLM dataset finds that this hybrid strategy can reduce CPU+GPU energy consumption by 7.5% compared to a workload-unaware baseline.

CCS CONCEPTS
• Computer systems organization → Heterogeneous (hybrid) systems; • Hardware → Impact on the environment.

KEYWORDS
sustainable computing, heterogeneous computing, large language models, artificial intelligence

ACM Reference Format:
Grant Wilkins, Srinivasan Keshav, and Richard Mortier. 2024. Hybrid Heterogeneous Clusters Can Lower the Energy Consumption of LLM Inference Workloads. In The 15th ACM International Conference on Future and Sustainable Energy Systems (E-Energy '24), June 04–07, 2024, Singapore, Singapore. ACM, New York, NY, USA, 8 pages. https://doi.org/10.1145/3632775.3662830

1 INTRODUCTION
Large Language Models (LLMs) such as OpenAI's GPT-4 [24] and Google's PaLM [4] have become emblematic of the AI revolution, driving significant advancements not only in natural language understanding, generation, and translation but also in summarizing and contextualizing large volumes of textual data. Characterized by their extensive scale and depth, their deployment demands substantial computational resources and hence poses significant challenges in terms of energy consumption and operational efficiency [38]. The increasing application of LLMs across diverse sectors further compounds these challenges, because datacenters, which are responsible for a considerable portion of global electricity consumption, must balance performance targets for LLM tasks running on heterogeneous hardware with the need for energy efficiency [7, 21]. Increasing the energy efficiency of LLMs thus emerges as both a technical challenge and an environmental imperative [22].

Traditional data center designs often struggle to best exploit the capabilities of heterogeneous hardware when hosting LLMs, particularly when trying to minimize energy consumption without sacrificing output quality and latency [6]. However, this challenge also presents an opportunity to innovate in datacenter architecture and management.
We show that by rethinking how GPU resources are allocated and managed, there is potential to significantly reduce the energy footprint of LLM deployments while maintaining or even enhancing computational performance. We find that a dynamic task-scheduling model that assigns LLM tasks to GPUs based on the resulting energy efficiency can reduce overall energy. Moreover, implementing a workload-aware system for input and output token processing can further reduce energy usage. Thus, a hybrid datacenter task allocation model, which allocates different tasks to different hardware accelerators based on their system demands, can reduce the overall energy consumption of LLM inference compared to a workload-unaware baseline.

Our contributions are as follows:
(1) We analyze the energy consumption and runtime of several 7B-parameter LLMs across various hardware configurations.
(2) We propose and evaluate a workload-aware scheduler for LLMs that optimizes energy efficiency based on the size of input and output token loads, demonstrating a 7.5% decrease in energy consumption over non-workload-aware baselines.
(3) We release a comprehensive dataset and benchmark suite for evaluating the energy efficiency of LLM inference, enabling researchers and practitioners to assess the impact of their design choices.

Through these contributions, we hope to support more sustainable and cost-effective AI inference deployments.

The remainder of this paper is organized as follows: Section 2 provides background information on LLM inference and energy consumption in AI systems. Section 3 formulates the problem and introduces our cost function. Section 4 details the methods used for benchmarking LLM inference on diverse systems. Section 5 presents the performance results of LLM inference across multiple hardware configurations. Section 6 proposes and evaluates our energy-optimal hybrid data center design. Finally, Section 7 discusses related work, and Section 8 summarizes the conclusions of the paper.

2 BACKGROUND
2.1 Inference Using Large Language Models
Transformer-based neural network architectures have led to impressive gains in the performance of LLMs for language understanding and generation [5]. LLMs such as OpenAI's GPT-4 [24] and Google's Gemini [32] have demonstrated human-level proficiency on many language benchmarks while requiring billions of parameters and massive datasets for training.

The inference phase of LLMs involves utilizing a trained model to make predictions based on new, unseen data. Unlike the training phase, which is typically a one-time, compute-intensive process that occurs offline, inference is an ongoing, real-time process that directly impacts end-user experiences [7]. This phase is critical as it represents the point at which AI capabilities become accessible to users.

Inference in LLMs can be computationally expensive due to several factors:
(1) Model Size: The sheer size of these models, often billions of parameters, necessitates significant computational power to process each query [38].
(2) Latency Expectations: Many applications based on LLMs, such as digital assistants, automated writing aids, and real-time translators, require low-latency responses [35].
(3) Scalability: The ability to scale inference operations to accommodate varying user demands without degradation in response times is crucial.

2.2 Energy Consumption in AI Systems
Recent reports have found that the computational requirements for state-of-the-art AI entail massive energy consumption and carbon emissions [7, 21, 26, 29, 38]. The energy intensity of AI systems can be broadly divided into the energy required for training versus inference after models are deployed [13]. Training complex models on massive datasets is an energy-intensive process, with estimates finding that training GPT-3 required 1,287 megawatt-hours of energy [26]. LLMs can also have huge emissions depending on deployment scale and hardware efficiency [29]. For example, over a year of use, inference by LLMs on cloud infrastructure can consume over 25× more energy than training a model [7]. Optimizing software and hardware specifically for AI workloads is thus essential [3].

2.3 Heterogeneous Systems for Efficient Computing
Modern systems demonstrate a complex interplay between scale, architecture, workload behavior, and efficiency objectives. The architecture of compute nodes can significantly impact the energy efficiency and processing capabilities of large-scale computing systems [18]. Conventional server architectures based on multicore CPUs face energy proportionality and scalability limitations for modern data-intensive workloads [20]. Several researchers have explored heterogeneous server configurations to improve energy efficiency [12, 15, 16, 19]. Distributed solutions can translate to lower energy efficiency when communication overheads dominate [9]. Still, specialized clusters like NVIDIA's DGX show 4× better performance per watt over conventional servers [30].

3 PROBLEM FORMULATION
To model the operational demands of a hybrid, heterogeneous datacenter hosting LLMs, we define a cost function that reflects the workload distribution across different systems. The cost function $U(m, n, s)$ accounts for both energy consumption and runtime:

$$U(m, n, s) = \lambda E(m, n, s) + (1 - \lambda) R(m, n, s),$$

where $m$ and $n$ denote the number of input and output tokens, respectively; $\lambda \in [0, 1]$ is a tunable parameter that balances the weight of energy efficiency versus speed; $E(m, n, s)$ is the energy consumed by system $s$ to process $m$ input tokens and generate $n$ output tokens, measured in joules; and $R(m, n, s)$ is the time required to process these tokens on system $s$, measured in seconds.

Our objective is to minimize the total cost across all tasks and systems:

$$\min_{\{Q_s\}_{s \in S}} \; \sum_{s \in S} \sum_{(m, n) \in Q_s} U(m, n, s) \qquad (1)$$
$$\text{s.t.} \quad \bigcup_{s \in S} Q_s = Q, \qquad (2)$$
$$\qquad\;\; Q_s \cap Q_{s'} = \emptyset \;\; \text{for } s \neq s', \qquad (3)$$

where $S$ is the set of all systems, $Q$ is the total set of queries, and $Q_s$ is the subset of queries assigned to system $s$. This model ensures that each query is processed exactly once, optimizing for energy efficiency or quick response times, depending on the operational needs, as parameterized by $\lambda$.

We note, however, that certain systems may be better suited to specific tasks, based on workload characteristics such as the need for rapid response times. Adjustments in $\lambda$ allow the datacenter to shift its focus between minimizing energy consumption and reducing runtime as operational priorities change.
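To make the assignment concrete, the sketch below assigns each query to the system with the lowest cost $U$. It is illustrative only: the per-system energy and runtime models passed in as `energy` and `runtime` are placeholders for the measurements reported in Section 5, and the function names are our own. Because the objective in Eqn. (1) is a sum of independent per-query costs with no capacity constraints, choosing the cheapest system for each query individually also minimizes the total while satisfying constraints (2) and (3).

```python
# Illustrative sketch of the cost-based assignment in Eqns. (1)-(3).
# E(m, n, s) and R(m, n, s) are supplied as callables fitted to measured data.
from typing import Callable, Dict, List, Tuple

Query = Tuple[int, int]  # (input tokens m, output tokens n)

def assign_queries(
    queries: List[Query],
    energy: Dict[str, Callable[[int, int], float]],   # system -> E(m, n) in joules
    runtime: Dict[str, Callable[[int, int], float]],  # system -> R(m, n) in seconds
    lam: float = 0.5,                                  # lambda in [0, 1]
) -> Dict[str, List[Query]]:
    """Place each query on the system minimizing U = lam*E + (1-lam)*R."""
    partition: Dict[str, List[Query]] = {s: [] for s in energy}
    for (m, n) in queries:
        best = min(
            energy,
            key=lambda s: lam * energy[s](m, n) + (1 - lam) * runtime[s](m, n),
        )
        partition[best].append((m, n))
    return partition
```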
4 METHODS
Here, we describe the methods and tools we use to benchmark LLM inference. In all cases, we use Huggingface's Accelerate [11] to standardize hardware optimization for inference across all platforms. This library takes advantage of the available accelerator resources and shards models accordingly to minimize intermediate communication and maximize the distributed capabilities for computation across the devices.

4.1 Model Selection
Our study employs three 7B-parameter, open-source LLMs for their capabilities and ability to run efficiently on diverse hardware: (1) Falcon [2], (2) Llama-2 [33], and (3) Mistral [17]. These models were selected to represent a spectrum of architectures and training corpora. We subject each model to a series of standardized NLP tasks to evaluate their energy consumption during inference.

4.1.1 Falcon. The Falcon (7B) [2] model utilizes multi-query attention, significantly reducing memory requirements and increasing processing speed. The model's training on the bilingual RefinedWeb dataset enhances its applicability across diverse linguistic contexts.

4.1.2 Llama-2. We select Llama-2 (7B) [33] for its optimization in dialogue tasks and its improvements in safety and helpfulness. The model's unique pretraining methodologies and advanced architectural features, such as grouped-query attention, make it an ideal candidate for analyzing energy efficiency in complex language tasks.

4.1.3 Mistral. We include Mistral (7B) [17] for its grouped-query attention and sliding window attention mechanisms, which contribute to fast and efficient inference. Its superior performance in various benchmarks, especially in reasoning, mathematics, and code generation, makes it an essential model for our analysis.

4.2 Energy Profiling of Diverse Systems
Depending on the platform, we profile each system's energy consumption during inference using customized setups that capture runtime and energy or power metrics. Here, we describe how we monitor the energy usage of NVIDIA GPUs, Apple Silicon CPU/GPU, Intel CPUs, and AMD CPUs.

4.2.1 NVIDIA GPUs. We use PyJoules [27], a Python-based energy measurement library, to quantify the energy consumption associated with inference on NVIDIA GPUs. PyJoules provides an interface to NVML [23], giving a software-defined energy usage assessment for targeted NVIDIA devices. This tool offers real-time energy consumption of GPUs for a given tracked process, which is a critical component of our analysis given the GPU-heavy computation involved in LLM inference.

4.2.2 Apple Silicon CPU/GPU. No standard energy measurement tools are available for profiling energy and power usage on Apple Silicon through an API like PyJoules or RAPL. Therefore, we employ a daemon-based approach that polls macOS's powermetrics utility, providing a detailed view of the energy usage during model inference. To capture the energy consumption of the M1 GPU, we execute the powermetrics command through a Python subprocess. This command returns the percentage of CPU power used by each top CPU process and the total CPU and GPU power consumption in 200ms intervals. This interval was chosen after testing to find the finest-granularity measurement that does not incur significant CPU overhead from buffering the large powermetrics output into memory. The energy monitoring is conducted concurrently with the LLM inference. A separate thread is dedicated to running the powermetrics command, ensuring real-time data collection.
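A minimal sketch of such a polling thread is shown below. It is an assumption-laden illustration rather than our exact instrumentation: the powermetrics sampler names, the output format matched by the regular expressions, and the need for root privileges depend on the macOS version.

```python
# Sketch of a daemon thread that polls macOS powermetrics during inference.
# Assumes powermetrics accepts -i (interval in ms) and --samplers cpu_power,gpu_power;
# the line formats matched below are illustrative and version-dependent.
import re
import subprocess
import threading
import time

samples = []            # (timestamp, cpu_power_W, gpu_power_W)
stop = threading.Event()

def poll_powermetrics(interval_ms: int = 200) -> None:
    proc = subprocess.Popen(
        ["sudo", "powermetrics", "-i", str(interval_ms),
         "--samplers", "cpu_power,gpu_power"],
        stdout=subprocess.PIPE, text=True)
    cpu_w = gpu_w = None
    for line in proc.stdout:
        if m := re.search(r"CPU Power:\s*(\d+)\s*mW", line):
            cpu_w = int(m.group(1)) / 1000.0
        elif m := re.search(r"GPU Power:\s*(\d+)\s*mW", line):
            gpu_w = int(m.group(1)) / 1000.0
        if cpu_w is not None and gpu_w is not None:
            samples.append((time.time(), cpu_w, gpu_w))
            cpu_w = gpu_w = None
        if stop.is_set():
            proc.terminate()
            break

monitor = threading.Thread(target=poll_powermetrics, daemon=True)
monitor.start()
# ... run the LLM inference here, e.g. model.generate(**inputs) ...
stop.set()
monitor.join(timeout=5)
```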
Post-inference, the collected data is processed to extract the recorded power samples and compute the energy consumption by integrating power over the runtime. The GPU energy consumption, $E_{Total,GPU}$, is straightforward to calculate from each recorded power value $P_{GPU,i}$ and its timestep $\Delta t_i$:

$$E_{Total,GPU} = \sum_i P_{GPU,i} \, \Delta t_i .$$

The CPU power draw data is less clear-cut, as many processes run on the CPU. However, the "energy impact factor" reported by powermetrics allows us to infer how much power our Python inference process uses. Therefore, we calculate the CPU energy, $E_{Total,CPU}$, by multiplying $P_{CPU,i}$ by the energy impact factor, denoted $\alpha_i$, at each timestep:

$$E_{Total,CPU} = \sum_i (\alpha_i P_{CPU,i}) \, \Delta t_i .$$

4.2.3 Intel CPUs. For Intel CPUs, we again leverage PyJoules, following an approach similar to the one used for NVIDIA GPUs. This tool supports RAPL (Running Average Power Limit) interfaces, enabling us to obtain fine-grained energy consumption data [36]. We focus on two primary RAPL domains, Package 0 and Package 1, which correspond to the energy consumption of each entire CPU package, including all cores in the package.

PyJoules allows us to capture the energy usage of these domains in real time, enabling us to profile the energy consumption specifically during model inference tasks. To account for base energy consumption unrelated to our inference process, we conduct a pre-analysis phase to measure the CPU's average idle power draw. This idle measurement is then subtracted from the total energy consumption during inference to accurately determine the net energy expenditure attributable to the inference process. We instrument our code to query the RAPL readings at the start and end of the inference task, calculating the energy consumption as follows:

$$E_{Total,CPU} = \sum_i \Big( \big(P_{Package\text{-}0,i} - P_{Package\text{-}0,Idle}\big) + \big(P_{Package\text{-}1,i} - P_{Package\text{-}1,Idle}\big) \Big) \, \Delta t_i ,$$

where $P_{Package\text{-}0,i}$ and $P_{Package\text{-}1,i}$ represent the power draw of Package 0 and Package 1, respectively, and $P_{Package\text{-}0,Idle}$ and $P_{Package\text{-}1,Idle}$ represent the average idle power draw of the respective CPU packages.

4.2.4 AMD CPUs. We adopt a different strategy for AMD CPUs due to the absence of a Python API. Instead, we utilize AMD μProf's timechart feature, which provides detailed power draw metrics for every core on the chip at fine-grained intervals. By polling AMD μProf at 100ms intervals, we can capture the power draw of each physical core throughout the model inference process.

To ensure we accurately attribute the energy consumption to our inference task, we monitor CPU core residency through psutil. This information allows us to identify and record the specific cores actively engaged in the inference process at each time step. The total energy consumption for the inference task is then calculated by summing, over all active cores, the product of each core's power draw and the elapsed time:

$$E_{Total,CPU} = \sum_{core} \Big( \sum_i P_{core,i} \, \Delta t_i \Big),$$

where $P_{core,i}$ represents the power draw of an individual core at time step $i$.
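For the NVIDIA and Intel paths (Sections 4.2.1 and 4.2.3), pyJoules wraps NVML and RAPL behind one interface. The sketch below follows the pattern shown in the pyJoules documentation; the module paths and domain names reflect that documentation rather than our exact scripts, and `run_inference` is a placeholder for the actual generation call. The idle-power baseline of Section 4.2.3 would still be subtracted from the recorded values afterwards.

```python
# Sketch of energy measurement with pyJoules over two RAPL packages and GPU 0.
from pyJoules.energy_meter import EnergyContext
from pyJoules.device.rapl_device import RaplPackageDomain
from pyJoules.device.nvidia_device import NvidiaGPUDomain
from pyJoules.handler.csv_handler import CSVHandler

handler = CSVHandler("inference_energy.csv")

def run_inference(prompt: str):
    # Placeholder for tokenization + model.generate(...) + decoding.
    pass

# Record energy consumed by both CPU packages and the GPU around one call.
with EnergyContext(handler=handler,
                   domains=[RaplPackageDomain(0), RaplPackageDomain(1),
                            NvidiaGPUDomain(0)],
                   start_tag="inference"):
    run_inference("Summarize the abstract of this paper in one sentence.")

handler.save_data()  # one row per tag with the per-domain energy readings
```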
5 LLM INFERENCE PERFORMANCE ON DIVERSE CLUSTERS
5.1 Hardware and Software Versions
The systems we profile are shown in Table 1. We consider these systems as they cover three prominent CPU manufacturers and different generations of GPUs. We utilize PyTorch v2.0.1, Torchvision v0.15.2, Numpy v1.26.0, Huggingface v0.20.2, and Accelerate v0.26.1.

Table 1: Our System Configurations

System Name         | CPU                      | GPU(s) per Node | DRAM per Node | VRAM per GPU
Macbook Pro         | 10-core M1 Pro           | 14-core M1 Pro  | 32GB          | -
Swing AMD+A100      | 2×64-core AMD EPYC 7742  | 8×NVIDIA A100   | 1TB           | 40GB
Palmetto Intel+V100 | 40-core Intel Xeon 6148G | 2×NVIDIA V100   | 376GB         | 16GB

We note that the M1-Pro results only include Llama-2 (7B) and Mistral (7B), as Falcon (7B) generally did not complete tasks in less than two orders of magnitude greater runtime.

5.2 Experimental Strategy
To comprehensively evaluate the performance of different system configurations across various models, we conducted a series of controlled experiments. We systematically varied the number of input and output tokens to measure their effects on runtime and energy consumption under two main experimental conditions. In each experiment, we do not allow key-value caches to be reused, to ensure our testing environment is standardized.

5.2.1 Vary Input Tokens. For the first experimental condition, we executed inference requests with increasing input token sizes, ranging from 8 to 2048 tokens, while maintaining a fixed output token size of 32. This setup allowed us to isolate the impact of input size on the system's performance and energy efficiency.

5.2.2 Vary Output Tokens. In the second set of experiments, we varied the output token limit from 8 to 4096 tokens, keeping the input token size constant at 32. This approach helped us understand how increasing output demands affect the runtime and energy consumption of the systems tested.

5.2.3 Randomization and Stopping Criteria. Each experiment was conducted in a randomized order to mitigate any potential bias introduced by the sequence of tests. To ensure the reliability of our results, we adhered to strict criteria for statistical confidence. Each configuration was tested repeatedly until either of two conditions was met:
(1) the measured mean runtime was within 0.5 seconds of the true mean runtime with 95% confidence, or
(2) a maximum of 25 trials had been conducted for that setting.
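This stopping rule can be written as a small driver loop. The sketch below is our own simplification: `run_trial` stands in for one timed inference at a fixed configuration, and the confidence interval uses a standard Student-t interval on the sample mean.

```python
# Sketch of the repeat-until-stable measurement loop from Section 5.2.3:
# stop once the 95% confidence half-width of the mean runtime is <= 0.5 s,
# or after 25 trials.
import statistics
from scipy import stats

def measure_until_stable(run_trial, half_width_s=0.5, confidence=0.95, max_trials=25):
    runtimes = []
    for _ in range(max_trials):
        runtimes.append(run_trial())           # one timed inference, in seconds
        if len(runtimes) >= 3:
            sem = statistics.stdev(runtimes) / len(runtimes) ** 0.5
            t_crit = stats.t.ppf((1 + confidence) / 2, df=len(runtimes) - 1)
            if t_crit * sem <= half_width_s:   # CI half-width small enough
                break
    return runtimes
```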
5.3 Input Token Analysis
Here, we present the impacts on runtime, energy consumption per token, and throughput for LLMs across different hardware configurations while varying the number of input tokens. We perform these experiments using the suite of systems outlined in Table 1 with the models outlined in Section 4.1. In our experiments on the Palmetto Intel+V100 system, the V100 GPU had an out-of-memory error beyond 1024 output tokens for Falcon (7B).

Our runtime measurements show a significant increase as input tokens grow. As depicted in Figure 1(a), all systems exhibit a nonlinear escalation in runtime with increasing token counts, with the M1-Pro system showing the largest increase. This trend highlights the computational burden imposed by larger input sizes, particularly on smaller systems that are not as well designed to handle extensive workloads.

For all systems, we notice that throughput follows a "roofline model" with increasing input tokens [37]. Figure 1(b) illustrates these dynamics, indicating an increase in throughput for all systems until a certain point where inference becomes bound by compute and not by the overhead of the software, as described by roofline performance models [37].

Energy efficiency varies markedly across different systems. The M1-Pro demonstrates consistently low energy consumption per token, particularly for smaller input sizes, as shown in Figure 1(c). This efficiency reflects the M1-Pro's design optimization for low-power operations. In contrast, the Swing AMD+A100, while capable of handling larger token inputs more efficiently, consumed more energy per token for small workloads yet became more energy efficient at larger input token sizes, underscoring a trade-off between workload size and energy efficiency.

[Figure 1: Performance of Various Systems and Models for Processing Variable Input Tokens. Panels: (a) Runtime (s), (b) Throughput (tokens/s), and (c) Energy per Token (J/token), each plotted against the number of input tokens for the Swing AMD+A100, Palmetto Intel+V100, and M1-Pro systems running Falcon (7B), Llama-2 (7B), and Mistral (7B). Due to the low variance in the data, error bars are too small to be visible.]

5.4 Output Token Analysis
Here we examine the performance trends associated with increasing the number of output tokens for our LLMs and systems of interest, specifically focusing on runtime, energy consumption per token, and throughput. In our experiments, the M1-Pro could not generate more than 512 output tokens without significant runtime penalties. For the Palmetto Intel+V100 system, the V100 GPU had an OOM error beyond 1024 output tokens for Falcon (7B) and for all models beyond 2048 tokens.

Runtime significantly increases with the number of output tokens across all systems. As illustrated in Figure 2(a), the escalation in runtime is pronounced, particularly as the output token count reaches higher magnitudes. This increase is indicative of the substantial computational effort required by LLMs to generate successive tokens.

In Figure 2(b), we observe a decrease in throughput across all systems as the number of output tokens increases. This trend highlights the inherent computational complexity involved in generating larger sequences of tokens in LLM tasks. As the output token count grows, the system must process each additional token, recalculating the context and updating internal model states [34]. This not only increases the total computation per query but also leads to a greater accumulation of processing time per token, which consequently lowers the overall throughput.

Energy consumption per token also shows an increasing trend as the number of output tokens grows. Displayed in Figure 2(c), this trend underscores the energy-intensive nature of producing larger outputs. Systems such as the M1-Pro, while generally more energy-efficient, begin to consume more energy per token as output demands increase, reflecting the intensive processing involved in output generation.
[Figure 2: Performance of Various Systems and Models for Processing Variable Output Tokens. Panels: (a) Runtime (s), (b) Throughput (tokens/s), and (c) Energy per Token (J/token), each plotted against the number of output tokens for the same systems and models as Figure 1. Missing data points for M1-Pro and Palmetto Intel+V100 are due to CUDA out-of-memory errors. Due to the low variance in the data, error bars are too small to be visible.]

5.5 Comparing the Input and Output Analyses
When comparing Figure 1(a) and Figure 2(a), we observe that increases in the number of output tokens result in a more considerable increase in runtime than increases in input tokens. The computational complexity of processing input tokens primarily involves encoding the input context, which occurs once per input sequence and follows a more linear computational trajectory. In contrast, generating output tokens is inherently more complex and iterative. Each new output token requires the model to run through all its layers to predict the next token based on an ever-expanding context, which includes both the initial input and all previously generated tokens [34]. This ongoing computation involves recalculating attention across an increasing number of tokens, updating hidden states, and generating a probability distribution over the vocabulary for each new token. Consequently, as the number of output tokens grows, the computational load increases significantly, leading to larger runtime increases than processing input tokens.

The impacts on runtime also translate to throughput, depicted in Figure 1(b) and Figure 2(b). There is a noticeable decline in throughput as output tokens increase, more so than for input tokens. The decrease in throughput for output tokens is primarily due to the heightened computational requirements for generating subsequent tokens, where each token's generation slows down as the sequence lengthens. Furthermore, the energy per token also increases as output tokens grow, as shown in our analysis. The energy required to generate each output token becomes significant due to longer passes through the transformer network. We contrast this with the energy consumption when processing input tokens, which, despite increasing, does so at a less steep rate.

6 ENERGY-OPTIMAL HYBRID DATACENTER FOR LLM INFERENCE
Considering the performance results we collect from LLM inference across multiple systems, we notice that there is an energy-optimal way to construct a hybrid datacenter with a combination of M1 Pros and A100s. The intuition is that the energy expended per token on the M1 Pro is lower than that of the A100 up to a certain number of input and output tokens, as seen in Figures 1(c) and 2(c). However, the energy efficiency characteristics differ when varying the number of input versus output tokens, and therefore we proceed with separate analyses.

6.1 Number of Input Tokens Analysis
Suppose we have a hybrid data center with M1 Pros and A100s, and an LLM workload consisting of a set of queries and their outputs.
In such a configuration, we implement a scheduling heuristic based on a cutoff threshold, $T_{in}$, on input token length. This heuristic dictates that queries with $m \le T_{in}$ input tokens are processed on M1 Pro systems, which we have shown to be energy efficient when handling smaller computational loads. Conversely, queries with $m > T_{in}$ tokens leverage the greater computational ability of A100 GPUs, which offer greater energy-per-token advantages for larger tasks despite their higher power usage. We point out that this is the same method described in the problem formulation of Eqn. 1, where our queries $Q$ are partitioned into $Q_{M1}$ and $Q_{A100}$ strictly on input and output size.

To find an optimal threshold $T_{in}$ empirically, we analyze the token distribution in prompts from the Alpaca [31] dataset, a benchmark dataset frequently used in model fine-tuning. This dataset comprises 52K prompts, offering a diverse range of lengths akin to a typical workload in systems like GPT-4 [24]. The distribution of input tokens, visualized in our analysis (see Fig. 3(a)), serves as a proxy for understanding the variegated nature of LLM workloads.

[Figure 3: Distribution of Token Counts for Alpaca [31]. Panels: (a) frequency of input token counts, (b) frequency of output token counts.]

The energy component of our cost function, split over the token threshold, is as follows:

$$E_{Total,in} = \sum_{m=1}^{T_{in}} m\, f_{in}(m)\, E_{M1,in}(m) + \sum_{m=T_{in}+1}^{M} m\, f_{in}(m)\, E_{A100,in}(m),$$

where $E_{Total,in}$ represents the total energy consumption for a given dataset of input lengths $m$ with corresponding frequencies $f_{in}(m)$, and $E_{M1,in}(m)$ and $E_{A100,in}(m)$ denote the mean energy per token at input token size $m$ for the M1-Pro and A100 systems, respectively. Utilizing this model with our dataset enables the approximation of total energy consumption for various threshold settings, offering insights into the energy dynamics of hybrid datacenter operation.

In Figure 4, we show the energy and runtime simulation results of performing inference for the input token sizes from the Alpaca dataset. Our findings indicate that a threshold of 32 tokens strikes an optimal balance, significantly reducing energy consumption by relegating the inference of shorter queries to the more energy-efficient M1 Pro systems. This policy not only capitalizes on the inherent energy efficiency of the M1 Pro for smaller tasks but also reserves the computational might of the A100 for queries that necessitate its robust capabilities. However, it is important to note that this energy optimization comes at the cost of increased runtime.

[Figure 4: Performance of Hybrid Datacenter for Input Tokens Processing Alpaca. Panels: (a) total energy consumption (kWh) and (b) runtime (s) as the threshold $T_{in}$ varies; dashed lines show the values for using only one kind of hardware (M1-Pro only or Swing AMD+A100 only) for inference.]
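The threshold sweep behind Figure 4(a) can be reproduced from the Alpaca token histogram and the fitted energy-per-token curves. The sketch below is illustrative: `f_in` and the two energy functions are placeholders for the Alpaca counts and the Figure 1(c) measurements.

```python
# Sketch of the input-token threshold sweep from Section 6.1.
def total_energy_in(threshold, f_in, e_m1_in, e_a100_in):
    """E_Total,in for one candidate threshold T_in.

    f_in[m]   : number of Alpaca prompts with m input tokens
    e_*_in(m) : mean energy per token (J/token) at input length m,
                interpolated from the measurements in Figure 1(c).
    """
    total_joules = 0.0
    for m, count in enumerate(f_in):
        if m == 0 or count == 0:
            continue
        per_token = e_m1_in(m) if m <= threshold else e_a100_in(m)
        total_joules += m * count * per_token
    return total_joules

# Sweep the power-of-two thresholds plotted in Figure 4(a) and keep the best:
# best_T = min((2 ** k for k in range(3, 12)),
#              key=lambda T: total_energy_in(T, f_in, e_m1_in, e_a100_in))
```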
6.2 Number of Output Tokens Analysis
We want to use the same scheduling heuristic and performance model to determine a threshold $T_{out}$ for the number of output tokens. This time, we have different frequencies $f_{out}(n)$ for the $n$ output tokens and different mean energy-per-token values for varying output token sizes, $E_{M1,out}(n)$ and $E_{A100,out}(n)$. We also utilize the distribution of the number of output tokens in the Alpaca dataset (see Fig. 3(b)). We revise our performance model as follows:

$$E_{Total,out} = \sum_{n=1}^{T_{out}} n\, f_{out}(n)\, E_{M1,out}(n) + \sum_{n=T_{out}+1}^{N} n\, f_{out}(n)\, E_{A100,out}(n).$$

As the M1 Pro could only generate up to 512 tokens of a response, we only test $T_{out}$ up to this point. In Figure 5, we show the energy and runtime simulation results of performing inference for the output token sizes from the Alpaca dataset. Fig. 5(a) and Fig. 5(b) assess the energy consumption and runtime implications of various threshold settings for output generation. Our findings suggest that although higher thresholds may leverage the M1 Pro's energy efficiency for smaller outputs, there is an optimal point at 32 output tokens that minimizes energy consumption.

[Figure 5: Performance of Hybrid Datacenter for Output Tokens Processing Alpaca. Panels: (a) total energy consumption (kWh) and (b) runtime (s) as the threshold $T_{out}$ varies; dashed lines show the values for using only one kind of hardware for inference.]

6.3 Balancing Energy Efficiency and Runtime Performance
Our analysis of both input and output token processing within a hybrid, heterogeneous datacenter framework shows that, with thresholds of $T_{in} = 32$ and $T_{out} = 32$, we can strategically allocate tasks to M1 Pro systems or A100 GPUs based on token count, optimizing for energy efficiency. Splitting the token distribution in this way leverages the M1 Pro's superior energy efficiency for input and output tasks up to the threshold, beyond which we utilize the A100's computational power. This policy saves energy because smaller-token tasks are handled by the more efficient M1 Pro. However, this energy optimization comes at the expense of increased runtime, which is particularly noticeable in output token generation, where the M1 Pro, despite its efficiency, does not match the A100's speed.

The energy-runtime trade-off presents a favorable scenario for applications that have low runtime sensitivity. For instance, batch processing of LLM tasks, such as overnight data analyses or non-time-critical computations, can benefit significantly from this energy-efficient configuration. Similarly, free or not directly monetized services, where the cost of computation impacts operational sustainability, stand to gain from minimizing energy expenditures even at the cost of longer processing times.

This approach also opens discussions on Quality of Service (QoS) for LLMs, an area that still needs to be explored [1, 35]. Traditional QoS metrics often prioritize speed and reliability, but energy efficiency may also become a critical QoS dimension for LLM applications, particularly in energy-constrained or cost-sensitive scenarios.
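Put together, the routing rule from this section reduces to a per-query check. The sketch below assumes the output length can be capped or estimated in advance, which is optimistic: in practice the exact output length is not known until generation completes, so a production router would need an estimator or a per-request token limit.

```python
# Sketch of the two-threshold routing rule from Section 6.3 (T_in = T_out = 32).
T_IN, T_OUT = 32, 32

def route(num_input_tokens: int, expected_output_tokens: int) -> str:
    """Send short prompts with short expected completions to the M1 Pro pool,
    everything else to the A100 pool."""
    if num_input_tokens <= T_IN and expected_output_tokens <= T_OUT:
        return "m1-pro"
    return "a100"
```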
7 RELATED WORK
7.1 Hybrid and Energy-Efficient Heterogeneous Data Centers
Recent studies in optimizing data center architectures for deep learning have highlighted the necessity of energy-efficient scheduling and task allocation across diverse hardware. Gu et al. [10] explore energy-efficient scheduling for GPU clusters, revealing substantial improvements in power utilization without considering diverse GPU types for different task requirements. This work highlights a gap in understanding how various GPU configurations could enhance energy efficiency further. Similarly, Patel et al. [25] demonstrate the benefits of hybrid computing environments, emphasizing FPGA over GPU diversity. This focus leaves room to explore the specific impacts of different GPU classes in such settings.

In the realm of LLMs, Zhao et al. [39] introduce strategies like phase-aware partitioning and adaptive quantization in heterogeneous clusters but do not integrate energy considerations into their analysis, which is crucial for understanding the real-world applicability of these models in power-sensitive environments. On the other hand, Radovanović et al. [28] and Chien et al. [7] discuss broader aspects of carbon-aware computing and reducing the carbon impact of AI inference, respectively. These works emphasize the importance of node/device-level energy metrics, often overlooked in typical LLM deployment strategies, thus underscoring the need for detailed energy consumption profiling across different models and hardware types.

7.2 LLM Inference as a Service
Further focusing on energy consumption, Hu et al. [14] analyze deep learning workloads in GPU datacenters, offering insights into energy conservation strategies through workload scheduling. This research aligns with our objectives by confirming the critical role of scheduling in reducing energy footprints. Anderson et al. [3] propose carbon-aware datacenter software that could complement physical hardware adjustments by making energy and carbon metrics visible to application developers, encouraging more energy-efficient coding practices.

Addressing service quality, Wang et al. [35] study the efficiency and reliability of LLM serving, highlighting the challenges of maintaining high-quality service while managing computational loads effectively. This perspective is pertinent as it underscores the trade-off between performance and energy efficiency, which is central to our study. Lastly, Desislavov et al. [8] provide a timely examination of trends in AI inference energy consumption, arguing that while performance has increased dramatically, energy consumption has not escalated at the same pace, thanks to hardware optimizations and algorithmic innovations. This outlook is necessary as it suggests the potential for further optimizations in LLM inference tasks, which are typically energy-intensive.

8 CONCLUSIONS AND FUTURE WORK
By carefully analyzing the energy and runtime of heterogeneous compute hardware hosting LLMs, we show that a hybrid, heterogeneous datacenter and a cost-based scheduling framework can allocate LLM tasks to the accelerators best suited to run them in terms of energy efficiency and computational performance. This decision is based simply on the size of input and output tokens, making the decision process easy to integrate into existing workloads.

Future work will explore minimizing the energy and runtime and maximizing the accuracy of serving differently-sized LLMs. Larger models are generally more accurate but come at the expense of requiring more hardware accelerators and often greater runtime; therefore, exploring this trade-off is highly relevant. We also plan to make our solution for energy-optimal routing of incoming queries an online decision-making heuristic to increase its efficacy. Similarly, we aim to extend our energy model to reflect carbon awareness and water consumption to further decrease the environmental impact of LLM inference.

ACKNOWLEDGMENTS
We gratefully acknowledge the computing resources provided on Swing and Palmetto, high-performance computing clusters operated by the Laboratory Computing Resource Center at Argonne National Laboratory and by Clemson University, respectively. During this work GW was supported by a Churchill Scholarship.

REFERENCES
[1] Megha Agarwal, Asfandyar Qureshi, Linden Li, Nikhil Sardana, Julian Quevedo, and Daya Khudia. 2023. LLM Inference Performance Engineering: Best Practices. https://www.databricks.com/blog/llm-inference-performance-engineering-best-practices
[2] Ebtesam Almazrouei, Hamza Alobeidli, Abdulaziz Alshamsi, et al. 2023. The Falcon Series of Language Models: Towards Open Frontier Models. (2023).
[3] Thomas Anderson, Adam Belay, Mosharaf Chowdhury, Asaf Cidon, and Irene Zhang. 2023. Treehouse: A Case For Carbon-Aware Datacenter Software. SIGENERGY Energy Inform. Rev. 3, 3 (Oct. 2023), 64–70. https://doi.org/10.1145/3630614.3630626
[4] Rohan Anil, Andrew M. Dai, Orhan Firat, et al. 2023. PaLM 2 Technical Report. arXiv:2305.10403 [cs.CL]
[5] Rishi Bommasani, Drew A. Hudson, Ehsan Adeli, et al. 2022. On the Opportunities and Risks of Foundation Models. arXiv:2108.07258 [cs.LG]
[6] Le Chen, Nesreen K. Ahmed, Akash Dutta, et al. 2024. The Landscape and Challenges of HPC Research and LLMs. arXiv:2402.02018 [cs.LG]
[7] Andrew A. Chien, Liuzixuan Lin, Hai Nguyen, Varsha Rao, Tristan Sharma, and Rajini Wijayawardana. 2023. Reducing the Carbon Impact of Generative AI Inference (Today and in 2035). In Proceedings of the 2nd Workshop on Sustainable Computer Systems (HotCarbon '23). Association for Computing Machinery, New York, NY, USA, Article 11, 7 pages. https://doi.org/10.1145/3604930.3605705
[8] Radosvet Desislavov, Fernando Martínez-Plumed, and José Hernández-Orallo. 2023. Trends in AI inference energy consumption: Beyond the performance-vs-parameter laws of deep learning. Sustainable Computing: Informatics and Systems 38 (2023), 100857. https://doi.org/10.1016/j.suscom.2023.100857
[9] Yiannis Georgiou, David Glesser, and Denis Trystram. 2015. Adaptive Resource and Job Management for Limited Power Consumption. In Proceedings of the 2015 IEEE International Parallel and Distributed Processing Symposium Workshop (IPDPSW '15). IEEE Computer Society, USA, 863–870. https://doi.org/10.1109/IPDPSW.2015.118
[10] Diandian Gu, Xintong Xie, Gang Huang, Xin Jin, and Xuanzhe Liu. 2023. Energy-Efficient GPU Clusters Scheduling for Deep Learning. arXiv:2304.06381 [cs.DC]
[11] Sylvain Gugger, Lysandre Debut, Thomas Wolf, et al. 2022. Accelerate: Training and inference at scale made simple, efficient and adaptable. https://github.com/huggingface/accelerate
[12] Yuxiong He and Sameh Elnikety. 2011. Position paper: embracing heterogeneity—improving energy efficiency for interactive services. In Proceedings of the 8th AAAI Conference on AI for Data Center Management and Cloud Computing (AAAIWS'11-08). AAAI Press, New York, NY, 11–14.
[13] Peter Henderson, Jieru Hu, Joshua Romoff, Emma Brunskill, Dan Jurafsky, and Joelle Pineau. 2020. Towards the Systematic Reporting of the Energy and Carbon Footprints of Machine Learning. J. Mach. Learn. Res. 21, 1, Article 248 (Jan. 2020), 43 pages.
[14] Qinghao Hu, Peng Sun, Shengen Yan, Yonggang Wen, and Tianwei Zhang. 2021. Characterization and prediction of deep learning workloads in large-scale GPU datacenters. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '21). Association for Computing Machinery, New York, NY, USA, Article 104, 15 pages. https://doi.org/10.1145/3458817.3476223
[15] Xiaoxuan Hu, Peng Li, and Yanfei Sun. 2021. Minimizing energy cost for green data center by exploring heterogeneous energy resource. Journal of Modern Power Systems and Clean Energy 9, 1 (2021), 148–159.
[16] Mehboob Hussain, Lian-Fu Wei, Abdullah Lakhan, Samad Wali, Soragga Ali, and Abid Hussain. 2021. Energy and performance-efficient task scheduling in heterogeneous virtualized cloud computing. Sustainable Computing: Informatics and Systems 30 (2021), 100517.
[17] Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, et al. 2023. Mistral 7B. arXiv:2310.06825 [cs.CL]
[18] Willis Lang, Jignesh M. Patel, and Srinath Shankar. 2010. Wimpy node clusters: what about non-wimpy workloads?. In Proceedings of the Sixth International Workshop on Data Management on New Hardware (DaMoN '10). Association for Computing Machinery, New York, NY, USA, 47–55. https://doi.org/10.1145/1869389.1869396
[19] Wenyu Liu, Yuejun Yan, Yimeng Sun, Hongju Mao, Ming Cheng, Peng Wang, and Zhaohao Ding. 2023. Online job scheduling scheme for low-carbon data center operation: An information and energy nexus perspective. Applied Energy 338 (2023), 120918.
[20] David Lo, Liqun Cheng, Rama Govindaraju, et al. 2015. Heracles: Improving resource efficiency at scale. In 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA). ACM, New York, NY, 450–462. https://doi.org/10.1145/2749469.2749475
[21] Alexandra Sasha Luccioni, Sylvain Viguier, and Anne-Laure Ligozat. 2022. Estimating the Carbon Footprint of BLOOM, a 176B Parameter Language Model. arXiv:2211.02001 [cs.LG]
[22] David Mytton and Masaō Ashtine. 2022. Sources of data center energy estimates: A comprehensive review. Joule 6, 9 (2022), 2032–2056. https://doi.org/10.1016/j.joule.2022.07.011
[23] NVIDIA. Accessed 2024. NVML: NVIDIA Management Library. https://docs.nvidia.com/deploy/nvml-api/index.html
[24] OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, et al. 2023. GPT-4 Technical Report. arXiv:2303.08774 [cs.CL]
[25] Pratyush Patel, Katie Lim, Kushal Jhunjhunwalla, Ashlie Martinez, Max Demoulin, Jacob Nelson, Irene Zhang, and Thomas Anderson. 2023. Hybrid Computing for Interactive Datacenter Applications. arXiv:2304.04488 [cs.DC]
[26] David Patterson, Joseph Gonzalez, Quoc Le, Chen Liang, Lluis-Miquel Munguia, Daniel Rothchild, David So, Maud Texier, and Jeff Dean. 2021. Carbon Emissions and Large Neural Network Training. arXiv:2104.10350 [cs.LG]
[27] powerapi-ng. 2024. PyJoules: Python-based energy measurement library for various domains including NVIDIA GPUs. https://github.com/powerapi-ng/pyJoules. Accessed: 2024-01-10.
[28] Ana Radovanović, Ross Koningstein, Ian Schneider, Bokan Chen, Alexandre Duarte, Binz Roy, Diyue Xiao, Maya Haridasan, Patrick Hung, Nick Care, et al. 2022. Carbon-aware computing for datacenters. IEEE Transactions on Power Systems 38, 2 (2022), 1270–1280.
[29] Siddharth Samsi, Dan Zhao, Joseph McDonald, Baolin Li, Adam Michaleas, Michael Jones, William Bergeron, Jeremy Kepner, Devesh Tiwari, and Vijay Gadepally. 2023. From Words to Watts: Benchmarking the Energy Costs of Large Language Model Inference. arXiv:2310.03003 [cs.CL]
[30] Matej Špeťko, Ondřej Vysocký, Branislav Jansík, and Lubomír Říha. 2021. DGX-A100 Face to Face DGX-2—Performance, Power and Thermal Behavior Evaluation. Energies 14, 2 (2021). https://doi.org/10.3390/en14020376
[31] R. Taori, I. Gulrajani, T. Zhang, et al. 2024. Stanford Alpaca: An Instruction-Following LLaMA Model. https://github.com/tatsu-lab/stanford_alpaca. Accessed: 2024-01-15.
[32] Google Gemini Team. 2024. Gemini: A Family of Highly Capable Multimodal Models. arXiv:2312.11805 [cs.CL]
[33] Hugo Touvron, Louis Martin, Kevin Stone, et al. 2023. Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv:2307.09288 [cs.CL]
[34] Ashish Vaswani, Noam Shazeer, Niki Parmar, et al. 2017. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS'17). Curran Associates Inc., Red Hook, NY, USA, 6000–6010.
[35] Yuxin Wang, Yuhan Chen, Zeyu Li, Zhenheng Tang, Rui Guo, Xin Wang, Qiang Wang, Amelie Chi Zhou, and Xiaowen Chu. 2024. Towards Efficient and Reliable LLM Serving: A Real-World Workload Study. arXiv:2401.17644 [cs.DC]
[36] Vincent M. Weaver, Matt Johnson, Kiran Kasichayanula, James Ralph, Piotr Luszczek, Dan Terpstra, and Shirley Moore. 2012. Measuring Energy and Power with PAPI. In 2012 41st International Conference on Parallel Processing Workshops. 262–268. https://doi.org/10.1109/ICPPW.2012.39
[37] Samuel Williams, Andrew Waterman, and David Patterson. 2009. Roofline: an insightful visual performance model for multicore architectures. Commun. ACM 52, 4 (April 2009), 65–76. https://doi.org/10.1145/1498765.1498785
[38] Carole-Jean Wu, Ramya Raghavendra, Udit Gupta, et al. 2022. Sustainable AI: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems 4 (2022), 795–813.
[39] Juntao Zhao, Borui Wan, Yanghua Peng, Haibin Lin, and Chuan Wu. 2024. LLM-PQ: Serving LLM on Heterogeneous Clusters with Phase-Aware Partition and Adaptive Quantization. arXiv preprint arXiv:2403.01136 (2024).