[Thesis]. Manchester, UK: The University of Manchester; 2020.
Reducing the latency and increasing the throughput of issued data transfers is a core
requirement for meeting the needs of future systems at scale; fast memory delivery
to applications is therefore a key component that must be optimized to satisfy this
requirement.
Applications' demand for memory capacity has always challenged the available
technologies, and it is important to understand that this demand, together with
the limitations it exposed, drove the appearance of new memory technologies and
system designs. Fundamentally, no single solution has managed to fully solve this
memory capacity challenge.
As argued in this thesis, limitations imposed by physical laws make expanding
local off-chip memory impossible without adopting new approaches. The concept of
Non-Uniform Memory Access (NUMA) provides more system memory by using pools of
processors, each with its own memory, to work around the physical constraints of
a single processor, but the additional system complexity and cost led to various
scalability issues that deter further system expansion along this path.
Computer clusters were the first configurations to provide a Distributed Shared
Memory (DSM) system at linear cost while also being more scalable than traditional
cache-coherent NUMA systems; however, this was achieved through additional software
mechanisms that introduce significant latency when accessing the increased
memory capacity. As this thesis describes, since the initial software DSM systems,
a great deal of effort has been invested in creating simpler, higher-performance
solutions, including software libraries, language extensions, high-performance
interconnects, and abstractions via system hypervisors, each of which enables
more efficient allocation and use of memory resources across the nodes of a
machine cluster.
Despite such efforts, none of the current approaches can efficiently solve
fundamental problems such as maintaining cache coherence across a system scaled
to thousands of nodes, and delivering scalable memory capacity therefore remains
a real challenge for system architects.
New design concepts and technologies, such as 3D stacked RAM and the Unimem architecture,
are promising and can offer a substantial increase in performance and memory capacity,
but as yet there is no generally accepted and effective solution for providing DSM.
On a DSM system, efficient and fast data movement across the network is a major performance
and scalability factor. For that reason, this thesis presents a way to change bus
transactions in a system, through a mechanism that reduces the latency of small-sized
data transfers across system nodes. This is accomplished by implementing and
evaluating a software function that conducts data transfers either through Remote
Direct Memory Access (RDMA) accelerators or through native load/stores issued by
the processor. Measurements show that processor-issued load/stores avoid the setup
overhead that limits RDMA on small transfers, yielding up to a seven-fold reduction
in transfer latency, while RDMA proves superior and should be used beyond a
particular transfer-size threshold.
Therefore, by combining the benefits of both mechanisms, it is possible to
accelerate data movement for small packets on the system's global data bus, thus
delivering the best performance at zero cost in terms of additional resources.
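The hybrid scheme described above can be sketched as a single copy routine that dispatches on transfer size. This is a minimal illustration, not the thesis's implementation: the threshold value, the function names, and the `rdma_transfer` stub (here a plain `memcpy` standing in for programming a real accelerator) are all assumptions, and the actual crossover point would be measured empirically per system.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical crossover size (bytes) above which RDMA amortises its
 * setup cost; the real threshold is determined by measurement. */
#define RDMA_THRESHOLD 256

/* Stand-in for the platform's RDMA engine submission call; a real
 * implementation would program the accelerator's descriptor queue. */
static void rdma_transfer(void *dst, const void *src, size_t len)
{
    memcpy(dst, src, len);
}

/* Hybrid copy: small transfers use processor load/stores, avoiding
 * engine setup latency; large transfers go to the RDMA accelerator. */
static void hybrid_transfer(void *dst, const void *src, size_t len)
{
    if (len <= RDMA_THRESHOLD) {
        /* The copy loop itself issues bus transactions directly
         * to the (remotely mapped) destination memory. */
        volatile uint8_t *d = dst;
        const uint8_t *s = src;
        for (size_t i = 0; i < len; i++)
            d[i] = s[i];
    } else {
        rdma_transfer(dst, src, len);
    }
}
```

The point of the dispatch is that the load/store path has effectively zero setup cost, so it wins for small packets, while the accelerator's higher bandwidth wins once the transfer is large enough to amortise descriptor setup.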
The system on which the evaluation took place consists of state-of-the-art Xilinx
FPGA devices with custom hardware designs created using the tools provided by the
FPGA vendor.
Given the promising results this thesis presents, the findings can be useful in
many application domains. Any parallel application based on stencil computations
will benefit, owing to its frequent updates of global array elements and its
synchronization messages; computational fluid dynamics simulators are one such
example. In such cases RDMA is a poor fit because of its inefficiency at small
transfer sizes, mainly due to the setup time required.
Other domains that can benefit are Machine Learning, Deep Learning, and Distributed
Graph analytics, mainly because of the large number of synchronization messages required
as well as the frequent element updates.
Moreover, in a global address space memory scheme where the address space is
partitioned across nodes and multiple processes run on the system, the memory
access model should provide memory isolation and guarantee non-interference
between the separate virtual address spaces of applications or processes. The
current generation of the ARM addressing scheme (ARMv8/AArch64) simply does not
provide enough address bits, physical or virtual, for a large-scale cluster with
thousands of nodes. Therefore, efforts have been made to lift these limitations with
custom hardware support, but generally, improving this important subsystem is crucial
for future DSM systems with large memory capacities per node.
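A back-of-the-envelope calculation shows why the address bits run out. The figures below are illustrative assumptions, not taken from the thesis: a hypothetical cluster of 4096 nodes with 1 TiB of memory each already requires a 52-bit flat global address space, beyond the 48-bit translation range commonly configured on ARMv8 systems.

```c
#include <stdint.h>

/* Number of address bits required to span a flat global address space
 * of `nodes` nodes with `bytes_per_node` bytes each. Example figures
 * in the accompanying text are illustrative only. */
unsigned global_address_bits(uint64_t nodes, uint64_t bytes_per_node)
{
    uint64_t total = nodes * bytes_per_node;
    unsigned bits = 0;
    while (bits < 64 && (1ULL << bits) < total)
        bits++;                 /* smallest b with 2^b >= total */
    return bits;
}
```

For 4096 nodes (2^12) at 1 TiB (2^40) per node, the total is 2^52 bytes, so 52 address bits are needed; scaling the node count or per-node capacity further only widens the gap.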