Molecular Simulation with
GROMACS on CUDA GPUs
Erik Lindahl
Webinar
We’re comfortably
on the single-μs
scale today
Larger machines often
mean larger systems,
not necessarily longer
simulations
GROMACS is used on a
wide range of resources
Why use GPUs?
Throughput
• Sampling
• Free energy
• Cost efficiency
• Power efficiency
• Desktop simulation
• Upgrade old machines
• Low-end clusters
Performance
• Longer simulations
• Parallel GPU simulation using InfiniBand
• High-end efficiency by using fewer nodes
• Reach timescales not possible with CPUs
Caveat emptor:
It is much easier to get a reference
problem/algorithm to scale,
i.e., you see much better
relative scaling before
introducing any optimization on the CPU side.
When comparing programs:
What matters is absolute performance
(ns/day), not the relative speedup!
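To make the point concrete, here is a minimal sketch comparing two programs. All timings, the 2 fs timestep, and the program names are hypothetical numbers chosen for illustration; none of them come from the slides or from any benchmark.

```python
MS_PER_DAY = 24 * 60 * 60 * 1000  # milliseconds of wallclock time in one day


def ns_per_day(timestep_fs: float, ms_per_step: float) -> float:
    """Simulated nanoseconds per wallclock day for a given cost per MD step."""
    steps_per_day = MS_PER_DAY / ms_per_step
    return steps_per_day * timestep_fs * 1e-6  # 1 fs = 1e-6 ns


def speedup(cpu_ms: float, gpu_ms: float) -> float:
    """Relative speedup of the GPU run over the same program's CPU run."""
    return cpu_ms / gpu_ms


# Hypothetical per-step wallclock times (ms); illustration only.
programs = {
    "A (slow, unoptimized CPU baseline)": {"cpu_ms": 100.0, "gpu_ms": 5.0},
    "B (heavily tuned CPU baseline)": {"cpu_ms": 8.0, "gpu_ms": 4.0},
}

TIMESTEP_FS = 2.0  # assumed timestep

if __name__ == "__main__":
    for name, t in programs.items():
        print(
            f"{name}: speedup {speedup(t['cpu_ms'], t['gpu_ms']):.1f}x, "
            f"GPU run {ns_per_day(TIMESTEP_FS, t['gpu_ms']):.2f} ns/day"
        )
```

In this made-up case program A reports the larger speedup only because its CPU baseline is slow, while program B delivers more ns/day.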
Many GPU programs today
Gromacs-4.5 with OpenMM
Gromacs running
entirely on CPU as
a fancy interface
Actual simulation running
entirely on GPU
using OpenMM kernels
Only a few select algorithms worked
Multi-CPU sometimes beat GPU performance...
Previous version - what was the limitation?
Why don’t we use the CPU too?
~2 TFLOP
0.5-1 TFLOP
Random memory
access OK (not great)
Random memory
access won’t work
Great for
throughput
Great for complex
latency-sensitive stuff
(domain decomposition, etc.)
Programming model
[Diagram: four MPI ranks, each with N OpenMP threads on the CPU (PME) and 1 GPU context. Labels: "Domain decomposition, dynamic load balancing" and "Load balancing".]
Gromacs-4.6 next-generation GPU implementation:
Heterogeneous CPU-GPU acceleration in GROMACS-4.6
Wallclock time for an MD step:
~0.5 ms if we want to simulate 1 μs/day
We cannot afford to lose all previous acceleration tricks!
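The per-step budget follows from simple arithmetic: one day of wallclock time divided by the number of MD steps needed. The sketch below works this out for a 1 μs/day target. The slide does not say which timestep the ~0.5 ms figure assumes, so the timestep values used here are assumptions; a 5 fs step gives about 0.43 ms, which is close to the quoted number.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86400 s of wallclock time


def wallclock_ms_per_step(target_ns_per_day: float, timestep_fs: float) -> float:
    """Wallclock budget per MD step (in ms) needed to hit a target throughput."""
    steps_per_day = target_ns_per_day * 1e6 / timestep_fs  # 1 ns = 1e6 fs
    return SECONDS_PER_DAY / steps_per_day * 1e3  # seconds -> milliseconds


if __name__ == "__main__":
    target = 1000.0  # 1 microsecond per day, expressed in ns/day
    for dt in (1.0, 2.0, 5.0):  # assumed timesteps in fs
        budget = wallclock_ms_per_step(target, dt)
        print(f"dt = {dt:.0f} fs: {budget:.4f} ms per step")
```

Shorter timesteps shrink the budget proportionally, which is why the choice of timestep matters so much for reaching this scale.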
• Δt limited by fast motions - 1 fs
• Remove bond vibrations
• SHAKE (iterative, slow) - 2 fs
• Problematic in parallel (won’t work)
• Compromise: constrain h