Monday, 17 November 2025

UET Data Transfer Operation: Introduction

Introduction

The previous chapter described how an application gathers information about available hardware resources and uses that information to initialize the job environment. During this initialization, hardware resources are abstracted and made accessible to the UET provider as objects.

This chapter explains the data transport process, using gradient synchronization as an example.

Figure 5-1 depicts two GPUs—Rank 0 and Rank 2—participating in the same training job (JobID: 101). Both GPUs belong to the same NCCL topology and are connected to the Scale-Out Backend Network’s rail0.

Because the training model is large, each layer of the neural network is split across the two GPUs using tensor parallelism, meaning that the computation of a single layer is distributed between them.

During the first forward-pass training iteration, the predicted model output does not match the expected result. This triggers the backward pass process, in which gradients—values indicating how much each weight parameter should be adjusted to improve the next forward-pass prediction—are computed.

Rank 0 computes its gradients, which in Figure 5-1 are stored as a 2D matrix with 3 rows and 1024 columns. The results are stored in a memory space registered for the process in local VRAM. The memory region’s base address is 0x20000000.

The first gradient at row 0, column 0 (index [0,0]) is stored at offset 0. In this example, each gradient value is 4 bytes. Thus, the second gradient at [0,1] is stored at offset 4, and so on. All 1024 gradients in row 0 require 4096 bytes (1024 × 4 bytes) of memory, which corresponds to the Scale-Out Backend Network’s Maximum Transmission Unit (MTU). The entire gradient block stored in Rank 0’s VRAM occupies 12,288 bytes (3 rows × 4096 bytes).

Row 0: [G[0,0], G[0,1], G[0,2], ..., G[0,1023]] → offsets 0, 4, 8, ..., 4092 bytes

Row 1: [G[1,0], G[1,1], G[1,2], ..., G[1,1023]] → offsets 4096, 4100, 4104, ..., 8188 bytes

Row 2: [G[2,0], G[2,1], G[2,2], ..., G[2,1023]] → offsets 8192, 8196, 8200, ..., 12284 bytes


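The offsets above follow plain row-major indexing. The small C sketch below reproduces the numbers used in Figure 5-1; the base address, element size, and matrix dimensions are taken directly from the example.

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

int main(void)
{
    const uint64_t base  = 0x20000000;   /* base address of the registered region (Figure 5-1) */
    const size_t   rows  = 3;            /* gradient matrix: 3 rows                             */
    const size_t   cols  = 1024;         /* 1024 gradients per row                              */
    const size_t   esize = 4;            /* 4 bytes per gradient value                          */

    /* Row-major layout: offset = (row * cols + col) * esize */
    for (size_t row = 0; row < rows; row++) {
        size_t first = (row * cols) * esize;
        size_t last  = (row * cols + cols - 1) * esize;
        printf("Row %zu: offsets %5zu .. %5zu, addresses 0x%" PRIX64 " .. 0x%" PRIX64 "\n",
               row, first, last, (uint64_t)(base + first), (uint64_t)(base + last));
    }

    printf("Bytes per row:        %zu\n", cols * esize);         /* 4096 bytes = one MTU */
    printf("Total gradient block: %zu\n", rows * cols * esize);  /* 12,288 bytes         */
    return 0;
}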

After completing its part of the gradient computation, the application on Rank 0 initiates gradient synchronization with Rank 2 by executing an fi_write RMA operation. This operation writes the gradient data from local memory to remote memory. To perform this operation successfully, the application—together with the UET provider, libfabric core, and the UET NIC—must provide the necessary information for the process, UET NIC, and network:

Semantic (What we want to do): The application describes the intent: write 12,288 bytes of data from the local registered memory region starting at base memory address 0x20000000 to the corresponding memory region of a process running on Rank 2.

Delivery (How to transport): The application specifies how the data must be delivered: reliable or unreliable transport, with ordered or unordered packet delivery. In Figure 5-1, the selected mode is Reliable, Ordered Delivery (ROD); a hints sketch showing how this could be requested follows this list.

Forwarding (Where to transport): To route the packets over the Scale-Out Backend Network, the delivery information is encapsulated within Ethernet, IP, and optionally UDP headers.
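As a rough sketch of how the delivery mode could be requested, the fragment below fills in fi_getinfo() hints asking for a reliable datagram endpoint with RMA write capability and ordered delivery. The provider name string "uet" and the choice of FI_ORDER_SAS as the ordering bit are assumptions made for illustration, not confirmed values.

#include <rdma/fabric.h>
#include <rdma/fi_errno.h>
#include <string.h>

/* Sketch: ask libfabric for an endpoint supporting Reliable, Ordered Delivery (ROD)
 * style semantics. The provider name "uet" is an assumed placeholder. */
struct fi_info *request_rod_info(void)
{
    struct fi_info *hints = fi_allocinfo();
    struct fi_info *info  = NULL;

    hints->caps               = FI_RMA | FI_WRITE;  /* remote memory write capability */
    hints->ep_attr->type      = FI_EP_RDM;          /* reliable datagram endpoint     */
    hints->tx_attr->msg_order = FI_ORDER_SAS;       /* ordered delivery (assumed bit) */
    hints->fabric_attr->prov_name = strdup("uet");  /* assumed provider name          */

    if (fi_getinfo(FI_VERSION(1, 20), NULL, NULL, 0, hints, &info))
        info = NULL;                                /* no matching provider found     */

    fi_freeinfo(hints);
    return info;    /* caller releases the list with fi_freeinfo() */
}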


Figure 5-1: High-Level View of Remote Memory Access Operation.



Application: RMA Write Operation


Figure 5-2 illustrates how an application gathers information for an fi_write RMA operation.

The first field, fid_ep, maps the remote write operation to the endpoint object (fid_ep = 0xFIDAE01; AE stands for Active Endpoint). The endpoint type is FI_EP_RDM, which provides reliable datagram delivery. The endpoint is bound to a registered memory region (fid_mr = 0xFIDDA01), and the RMA operation stores the memory descriptor in the desc field. Gradients reside in this memory region, starting from the base address 0x20000000.
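A minimal sketch of how the memory region and descriptor shown in Figure 5-2 could be produced, assuming a domain object and the gradient buffer already exist; the access flag and the zero key are illustrative choices.

#include <rdma/fabric.h>
#include <rdma/fi_domain.h>

#define GRAD_LEN (3 * 1024 * 4)   /* 12,288-byte gradient block from Figure 5-1 */

static struct fid_mr *mr;         /* corresponds to fid_mr = 0xFIDDA01 in Figure 5-2 */
static void *desc;                /* memory descriptor stored in the desc field      */

/* Sketch: register the local gradient buffer as the source of RMA writes.
 * Rank 2 registers its target region with FI_REMOTE_WRITE on its side. */
int register_gradients(struct fid_domain *domain, void *grad_buf)
{
    int ret = fi_mr_reg(domain, grad_buf, GRAD_LEN,
                        FI_WRITE,   /* buffer may be used as an RMA write source  */
                        0,          /* offset                                      */
                        0,          /* requested key; provider may assign its own */
                        0,          /* flags                                       */
                        &mr, NULL);
    if (ret)
        return ret;

    desc = fi_mr_desc(mr);          /* descriptor later passed to fi_write()       */
    return 0;
}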

The length field specifies how many bytes should be written to the remote memory. The target of the write is represented by an fi_addr_t value. In this example, it points to the first entry in the Address Vector (AV), which identifies the remote rank and its fabric address (IP address). The AV also references a Resource Index (RI) entry. The RI entry contains the JobID, rank ID, remote memory address, and key required for access.
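The fi_addr_t target is obtained by inserting the peer's fabric address into the AV. The sketch below assumes an IPv4 sockaddr format and an arbitrary rail0 address for Rank 2; both are illustrative, not values taken from the figure.

#include <rdma/fabric.h>
#include <rdma/fi_domain.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/* Sketch: resolve Rank 2's fabric address into the fi_addr_t used by fi_write().
 * The sockaddr_in format and the 10.0.2.2 address are assumptions for illustration. */
fi_addr_t resolve_rank2(struct fid_av *av)
{
    struct sockaddr_in peer = { 0 };
    fi_addr_t fi_addr = FI_ADDR_UNSPEC;

    peer.sin_family = AF_INET;
    inet_pton(AF_INET, "10.0.2.2", &peer.sin_addr);   /* assumed rail0 address of Rank 2 */

    /* Synchronous AV insert: returns the number of addresses successfully added. */
    if (fi_av_insert(av, &peer, 1, &fi_addr, 0, NULL) != 1)
        return FI_ADDR_UNSPEC;

    return fi_addr;   /* first AV entry, as described above */
}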

After collecting all the necessary information, the application invokes the fi_write RMA operation in the libfabric core.
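Putting the fields of Figure 5-2 together, the call itself could look like the sketch below. The remote virtual address and key are the values the text describes as coming from the Resource Index entry; here they are simply passed in as parameters.

#include <rdma/fabric.h>
#include <rdma/fi_endpoint.h>
#include <rdma/fi_rma.h>
#include <rdma/fi_errno.h>

/* Sketch: write the 12,288-byte gradient block to Rank 2's registered memory.
 * remote_addr and remote_key stand in for the values taken from the RI entry. */
ssize_t write_gradients(struct fid_ep *ep, const void *grad_buf, void *desc,
                        fi_addr_t rank2, uint64_t remote_addr, uint64_t remote_key)
{
    ssize_t ret;

    do {
        ret = fi_write(ep,
                       grad_buf,      /* local source buffer, base 0x20000000         */
                       3 * 1024 * 4,  /* length: 12,288 bytes                         */
                       desc,          /* descriptor returned by fi_mr_desc()          */
                       rank2,         /* fi_addr_t resolved through the AV            */
                       remote_addr,   /* target address in Rank 2's registered region */
                       remote_key,    /* protection key of the remote memory region   */
                       NULL);         /* context echoed back in the completion entry  */
    } while (ret == -FI_EAGAIN);      /* transmit queue full: retry                   */

    return ret;    /* 0 on success; the completion arrives later on the bound CQ      */
}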

Figure 5-2: RMA Operation – Application: fi_write.

Libfabric Core Validation for RMA Operations


Libfabric core validation ensures that an RMA request is structurally correct, references valid objects, and complies with the capabilities negotiated during endpoint creation. It performs three main types of lightweight checks before handing off the request to the UET provider.

Object Integrity and Type Validation


The core first ensures that all objects referenced by the application are valid and consistent. For fi_write, it verifies that fid_ep is a properly initialized endpoint of the expected class (FI_CLASS_EP) and type (FI_EP_RDM). The memory region fid_mr is checked to confirm it is registered and that the desc field correctly points to the local buffer. The target fi_addr_t is validated against the Address Vector (AV) to ensure it corresponds to a valid remote rank and fabric address. Any associated Resource Index (RI) references are verified for consistency.
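The snippet below is not libfabric source code; it only illustrates, as a sketch, what an object-integrity check of this kind can look like using the public fid class constant.

#include <rdma/fabric.h>
#include <rdma/fi_endpoint.h>
#include <rdma/fi_errno.h>

/* Illustrative sketch only: reject an endpoint handle whose fid class is not
 * FI_CLASS_EP. The real libfabric checks are provider- and version-specific. */
static int check_ep_object(struct fid_ep *ep)
{
    if (!ep || ep->fid.fclass != FI_CLASS_EP)
        return -FI_EINVAL;   /* not a properly initialized endpoint object */
    return 0;
}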

Attribute and Capability Compliance

Next, the core verifies that the endpoint supports the requested operation. For fi_write, this means checking that the endpoint provides reliable RMA write capability. It also ensures that any attributes or flags used are compatible with the endpoint’s capabilities, including alignment with memory registration and compliance with the provider’s supported ordering and reliability semantics.
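A similarly hedged sketch of the capability check: the caps bitmask negotiated at endpoint creation is compared against the bits an RMA write requires.

#include <rdma/fabric.h>
#include <rdma/fi_errno.h>
#include <stdint.h>

/* Illustrative sketch only: verify that the endpoint was opened with the
 * capabilities an fi_write operation needs. */
static int check_rma_write_caps(uint64_t ep_caps)
{
    if ((ep_caps & (FI_RMA | FI_WRITE)) != (FI_RMA | FI_WRITE))
        return -FI_EOPNOTSUPP;   /* endpoint lacks RMA write capability */
    return 0;
}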

Basic Parameter Sanity Checks 


The core performs sanity checks on operation parameters. It verifies that length is non-zero, does not exceed the registered memory region, and is correctly aligned. Any flags or optional parameters are checked for validity.
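As a sketch of these parameter checks, assuming the application tracks the registered region's base address and length:

#include <rdma/fi_errno.h>
#include <stdint.h>
#include <stddef.h>

/* Illustrative sketch only: basic sanity checks on the fi_write parameters
 * against the boundaries of the registered memory region. */
static int check_write_params(const void *buf, size_t len,
                              const void *mr_base, size_t mr_len)
{
    uintptr_t start = (uintptr_t)buf;
    uintptr_t base  = (uintptr_t)mr_base;

    if (!buf || len == 0)
        return -FI_EINVAL;                   /* missing or empty source buffer         */
    if (start < base || start + len > base + mr_len)
        return -FI_EINVAL;                   /* transfer exceeds the registered region */

    return 0;
}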

Provider Handoff


After passing all validations, the libfabric core hands off the fi_write operation to the provider. The core does not modify the parameters; it only guarantees that the operation references valid objects, complies with endpoint capabilities, and passes all sanity checks. This ensures that the provider receives a well-formed request and can process the RMA operation safely and efficiently. 

Figure 5-3: RMA Operation – Libfabric Core: Lightweight Sanity Check.

