Thursday, 27 November 2025

UET Relative Addressing and Its Similarities to VXLAN

 Relative Addressing


As described in the previous section, applications use endpoint objects as their communication interfaces for data transfer. To write data from local memory to a target memory region on a remote GPU, the initiator must authorize the local UE-NIC to fetch data from local memory and describe where that data should be written on the remote side.

To route the packet to the correct Fabric Endpoint (FEP), the application and the UET provider must supply the FEP’s IP address (its Fabric Address, FA). To determine where in the remote process’s memory the received data belongs, the UE-NIC must also know:

  • Which job the communication belongs to
  • Which process within that job owns the target memory
  • Which Resource Index (RI) table should be used
  • Which entry in that table describes the exact memory location

This indirection model is called relative addressing.

How Relative Addressing Works

Figure 5-6 illustrates the concept. Two GPUs participate in distributed training. A process on GPU 0 with global rank 0 (PID 0) receives data from GPU 1 with global rank 1 (PID 1). The UE-NIC determines the target Fabric Endpoint (FEP) based on the destination IP address (FA = 10.0.1.11). This IP address forms the first component of the relative address.

Next, the NIC checks the JobID and PIDonFEP to resolve which job and which process the message is intended for. These two fields are the second and third components of the relative address { FA, JobID, PIDonFEP }.

The fourth component is the Resource Index (RI) table descriptor. It tells the NIC which RI table should be consulted for the memory lookup.

Finally, the rkey, although not part of the relative addressing tuple itself, selects the specific entry in that RI table that defines the precise remote memory region. In our example, the complete addressing information is:

{ FA: 10.0.1.11, JobID: 101, PIDonFEP: 0, RI: 0x00a }, and the rkey identifies the specific RI entry to use.



Figure 5-6: Ultra Ethernet: Relative Addressing for Distributed Learning.
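
To make the tuple concrete, the sketch below shows one way a provider could represent the relative address and the accompanying rkey in C. The struct layout and field names are illustrative assumptions, not the UET wire format; the values come from the example above.

#include <stdint.h>
#include <netinet/in.h>

/* Illustrative only: field names and widths are assumptions, not the UET spec. */
struct uet_relative_addr {
    struct in_addr fa;         /* Fabric Address of the target FEP (10.0.1.11) */
    uint32_t       jobid;      /* job the transfer belongs to (101)            */
    uint16_t       pid_on_fep; /* process (rank) within the job (0)            */
    uint16_t       ri;         /* Resource Index table descriptor (0x00a)      */
};

struct uet_mem_target {
    struct uet_relative_addr addr;  /* { FA, JobID, PIDonFEP, RI }                 */
    uint64_t                 rkey;  /* selects the specific entry in the RI table  */
};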


Comparison with VXLAN

Relative addressing in UET has several structural similarities to addressing in VXLAN data planes.

A Fabric Address (FA) attached to a Fabric Endpoint (FEP) serves a role similar to a VTEP IP address in a VXLAN fabric. Both identify the tunnel endpoint used to route the packet across the underlay network toward its destination.

A JobID identifies a distributed job that consists of multiple processes. In VXLAN, the Layer-2 VNI (L2VNI) identifies a stretched Layer-2 segment for endpoints. In both technologies, these identifiers define the logical communication context in which the packet is interpreted.

The combination of PIDonFEP and RI tells the UE-NIC which Resource Index table describes the target memory locations owned by that process. Similarly, in VXLAN, the VNI-to-VLAN mapping on a VTEP determines which MAC address table holds the forwarding entries for that virtual network.

The rkey selects the specific entry in the RI table that defines the exact target memory location. The VXLAN equivalent is the destination MAC address, which selects the exact entry in the MAC table that determines the egress port or remote VTEP.

Figure 5-7 further illustrates this analogy. The Tunnel IP address determines the target VTEP, and the L2VNI-to-VLAN mapping on that VTEP identifies which MAC address table should be consulted for the destination MAC in the original Ethernet header. As a reminder, VXLAN is an Ethernet-in-IP encapsulation method where the entire Layer-2 frame is carried inside an IP/UDP tunnel.


Figure 5-7: Virtual eXtensible LAN: Layer2 Virtual Network Identifier.



Monday, 24 November 2025

UET Data Transfer Operation: Work Request Entity and Semantic Sublayer

Work Request Entity (WRE) 


The UET provider constructs a Work Request Entity (WRE) from a fi_write RMA operation that has been validated and passed by the libfabric core. The WRE is a software-level representation of the requested transfer and semantically describes both the source memory (local buffer) and the target memory (remote buffer) for the operation. Using the WRE, the UET provider constructs the Semantic Sublayer (SES) header and the Packet Delivery Context (PDC) header.

From the local memory perspective, the WRE specifies the address of the data in registered local memory, the length of the data, and the local memory key (lkey). This information allows the NIC to fetch the data directly from local memory when performing the transmission.

From the target memory perspective, the WRE describes the Resource Index (RI) table, which contains information about the destination memory region, including its base address and the offset within that region where the data should be written. The RI table also defines the allowed operations on the region. Because an RI table may contain multiple entries, the actual memory region is selected using the rkey, which is also included in the WRE. The rkey enables the remote NIC to locate the correct memory region within the selected RI table.

To ensure proper delivery, the WRE includes an Address Vector (AV) table entry, identified via the fi_addr_t handle. The AV provides the Fabric Address (FA) of the target and specifies which job and which rank (i.e., PIDonFEP) the data is intended for. The WRE also indicates whether the completion of the transport operation should be reported through a completion queue.

By including pointers to the AV table entry and the remote RI table, the WRE allows the UET provider to access all the transport and remote memory metadata required for the operation without duplicating the underlying AV or RI data structures. Using these indices, the UET provider can efficiently construct the SES and PDC headers for the Work Element (WE), ensuring correct delivery of the data from the initiator’s local memory to the remote target memory.
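
As a rough illustration of the description above, a WRE could be modeled in C as shown below. This is a hypothetical software layout, not the UET provider's actual data structure; the field names simply follow the terms used in the text.

#include <stdbool.h>
#include <stdint.h>
#include <rdma/fabric.h>     /* fi_addr_t */

/* Hypothetical Work Request Entity layout. */
struct uet_wre {
    /* Local (source) memory */
    uint64_t  local_addr;        /* address of the data in registered local memory  */
    uint64_t  length;            /* number of bytes to transfer                      */
    uint64_t  lkey;              /* local memory key                                 */

    /* Remote (target) memory */
    uint16_t  ri;                /* Resource Index table descriptor                  */
    uint64_t  rkey;              /* selects the memory region entry in the RI table  */
    uint64_t  remote_offset;     /* offset within the target region                  */

    /* Transport metadata */
    fi_addr_t dest;              /* AV entry handle: resolves FA, JobID, PIDonFEP    */
    bool      report_completion; /* report completion through a completion queue     */
};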

Figure 5-4 illustrates how the libfabric core passes a fi_write RMA operation request from the application to the UET provider after performing a sanity check. The UET provider then constructs the Work Request Entity, which encapsulates all the information about the local and remote memory, the AV entry identified by the fi_addr_t handle, and the transport metadata required to deliver the operation.

Figure 5-4: RMA Operation – Semantic Sublayer: Work Request Entity (WRE).

Semantic Sublayer (SES)


The UET provider’s Semantic Sublayer (SES) maps application-facing API calls, such as fi_write RMA operations, to UET operations. In our example, the UET provider constructs a SES header where the fi_write request from the libfabric application is mapped to a UET_WRITE operation. The first step of this mapping, described in the previous section, uses the content of the fi_write request to construct a UET Work Request Entity (WRE). Information from the WRE is then used to build the actual SES header, which will later be wrapped within Ethernet/IP/UDP (if an Entropy header is not used). The SES header carries information that the target NIC uses to resolve which job and process the message is targeted to, and which specific memory location the data should be written to. In other words, the SES header provides a form of intra-node routing, indicating where the data should be placed within the target node.

The first element to consider in Figure 5‑5 is the maximum message size for the endpoint. The NIC is abstracted as a Domain object, to which the endpoint is associated. In our example, the NIC has two Ethernet ports, each with an egress buffer size of 8192 bytes. A shared staging buffer describes the maximum number of bytes that can be queued for transmission through the egress ports. Because the data to be written to the target memory (16,384 bytes) exceeds the staging buffer size (8192 bytes), the UET provider must split the operation into two messages, each with a slightly different SES header.
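
The split itself is simple arithmetic; the short C sketch below reproduces it for the 16,384-byte transfer and 8192-byte staging buffer used in Figure 5-5 (the variable names are illustrative).

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint64_t total_len    = 16384; /* bytes to write to the target memory  */
    uint64_t max_msg_size = 8192;  /* staging buffer limit per SES message */

    /* number of SES messages = ceiling(total_len / max_msg_size) = 2 */
    uint64_t num_msgs = (total_len + max_msg_size - 1) / max_msg_size;

    for (uint64_t i = 0; i < num_msgs; i++) {
        uint64_t offset = i * max_msg_size;                          /* 0, 8192    */
        uint64_t len = (i == num_msgs - 1) ? total_len - offset
                                           : max_msg_size;           /* 8192, 8192 */
        printf("message %llu: buffer_offset=%llu length=%llu\n",
               (unsigned long long)(i + 1),
               (unsigned long long)offset,
               (unsigned long long)len);
    }
    return 0;
}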

The first key field in the SES header is the operation code (UET_WRITE), which instructs the target NIC how to handle the received data. The rel bit (relative addressing), when set, indicates that the operation uses a relative address, which includes JobID (101), PIDonFEP (process identifier, rank 2 in our example), and Resource Index (0x00a). Based on this information, the target NIC can identify the job and rank to which the message belongs. The Resource Index may contain multiple entries and the rkey (0xacce5) is used to select the correct RI entry.

The SES header also contains the buffer offset, which specifies the exact location relative to the base address where the data should be written. In Figure 5‑5, the first message will write data starting at offset 0, while the second message will write at offset 8192, immediately following the first message’s payload. The ses.som bit indicates the start of the message, and ses.eom indicates the end of the message. Note that ses.som and ses.eom bits are not used for ordering; message ordering is ensured by the Message ID field, which allows the NIC to process fragments in the correct sequence. 

During the job lifetime, Resource Indices may be updated. To validate that the correct version is used, the SES header includes the ri_generation field, which identifies the initiator’s current RI table version.

The hd (header data present) bit indicates whether the header_data field is present in the SES header. A common use case for header data is when a GPU holds multiple gradient chunks that must be synchronized with a remote process. Each chunk can be identified by a bucket ID stored in the SES header’s header_data field. For example, the first chunk in GPU 0 memory may have bucket_id=11, the second chunk bucket_id=12, and so on. This allows the NIC to distinguish which messages correspond to which chunk. The initiator id identifies the initiator’s rank (its PIDonFEP).

If a gradient chunk exceeds the NIC’s max_msg_size, it must be split into multiple SES messages. Consider the second chunk (bucket_id=12) split into four messages. The first message has ses.som=1, indicating the start of the chunk, and hd=1, signaling that header data is present. The header_data field contains the bucket ID (12). This message also has message_id=1, identifying it as the first SES message of the chunk. The next two messages have message_id=2 and message_id=3, respectively. Both have hd=0, ses.som=0, and ses.eom=0, indicating they are continuation packets. The fourth message is similar but has ses.eom=1, marking it as the last message of the chunk. 
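
The per-fragment flag assignment in this example can be written out as a small helper; the struct below is illustrative and only mirrors the fields discussed in the text (ses.som, ses.eom, hd, header_data, message_id).

#include <stdbool.h>
#include <stdint.h>

struct ses_frag_flags {
    bool     som;          /* start of message: first fragment of the chunk */
    bool     eom;          /* end of message: last fragment of the chunk    */
    bool     hd;           /* header_data field present                     */
    uint32_t header_data;  /* bucket ID of the gradient chunk, e.g. 12      */
    uint32_t message_id;   /* 1..n, used to sequence the fragments          */
};

/* Fill the flags for fragment i (0-based) of a chunk split into num_frags
 * SES messages, e.g. bucket_id=12 split into four messages. */
static void ses_fill_frag(struct ses_frag_flags *f, uint32_t i,
                          uint32_t num_frags, uint32_t bucket_id)
{
    f->som         = (i == 0);
    f->eom         = (i == num_frags - 1);
    f->hd          = (i == 0);               /* header data only in the first message */
    f->header_data = f->hd ? bucket_id : 0;
    f->message_id  = i + 1;                  /* 1, 2, 3, 4 */
}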



Figure 5-5: RMA Operation – Semantic Sublayer: SES Header.


Monday, 17 November 2025

UET Data Transfer Operation: Introduction

Introduction

[Updated 22 November 2025: Handoff Section]

The previous chapter described how an application gathers information about available hardware resources and uses that information to initialize the job environment. During this initialization, hardware resources are abstracted and made accessible to the UET provider as objects.

This chapter explains the data transport process, using gradient synchronization as an example.

Figure 5-1 depicts two GPUs—Rank 0 and Rank 2—participating in the same training job (JobID: 101). Both GPUs belong to the same NCCL topology and are connected to the Scale-Out Backend Network’s rail0.

Because the training model is large, each layer of the neural network is split across the two GPUs using tensor parallelism, meaning that the computations of a single layer are distributed between the GPUs.

During the first forward-pass training iteration, the predicted model output does not match the expected result. This triggers the backward pass process, in which gradients—values indicating how much each weight parameter should be adjusted to improve the next forward-pass prediction—are computed.

Rank 0 computes its gradients, which in Figure 5-1 are stored as a 2D matrix with 3 rows and 1024 columns. The results are stored in a memory space registered for the process in local VRAM. The memory region’s base address is 0x20000000.

The first gradient at row 0, column 0 (index [0,0]) is stored at offset 0. In this example, each gradient value is 4 bytes. Thus, the second gradient at [0,1] is stored at offset 4, and so on. All 1024 gradients in row 0 require 4096 bytes (1024 × 4 bytes) of memory, which corresponds to the Scale-Out Backend Network’s Maximum Transfer Unit (MTU). The entire gradient block stored in Rank 0’s VRAM occupies 12,288 bytes (3 × 4096 bytes).

Row 0: [G[0,0], G[0,1], G[0,2], ..., G[0,1023]] → offsets 0, 4, 8, ..., 4092 bytes

Row 1: [G[1,0], G[1,1], G[1,2], ..., G[1,1023]] → offsets 4096, 4100, 4104, ..., 8188 bytes

Row 2: [G[2,0], G[2,1], G[2,2], ..., G[2,1023]] → offsets 8192, 8196, 8200, ..., 12284 bytes
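
The offsets and the total size follow directly from the matrix shape (3 rows × 1024 columns × 4 bytes); the short check below reproduces the numbers listed above.

#include <stdio.h>

int main(void)
{
    int rows = 3, cols = 1024, elem = 4;   /* 4-byte gradient values */

    for (int r = 0; r < rows; r++) {
        int first = r * cols * elem;               /* 0, 4096, 8192      */
        int last  = first + (cols - 1) * elem;     /* 4092, 8188, 12284  */
        printf("row %d: offsets %d .. %d\n", r, first, last);
    }
    printf("total: %d bytes\n", rows * cols * elem); /* 12288 */
    return 0;
}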


After completing its part of the gradient computation, the application on Rank 0 initiates gradient synchronization with Rank 2 by executing an fi_write RMA operation. This operation writes the gradient data from local memory to remote memory. To perform this operation successfully, the application—together with the UET provider, libfabric core, and the UET NIC—must provide the necessary information for the process, UET NIC, and network:

Semantic (What we want to do): The application describes the intent: Write 12,288 bytes of data from the local registered memory region starting at base memory address 0x20000000 to the corresponding memory region of a process running on Rank 2.

Delivery (How to transport): The application specifies how the data must be delivered: reliable or unreliable transport, with ordered or unordered packet delivery. In Figure 5-1, the selected mode is Reliable, Ordered Delivery (ROD).

Forwarding (Where to transport): To route the packets over the Scale-Out Backend Network, the delivery information is encapsulated within Ethernet, IP, and optionally UDP headers.


Figure 5-1: High-Level View of Remote Memory Access Operation.



Application: RMA Write Operation


Figure 5-2 illustrates how an application gathers information for an fi_write RMA operation.

The first field, fid_ep, maps the remote write operation to the endpoint object (fid_ep = 0xFIDAE01; AE stands for Active Endpoint). The endpoint type is FI_EP_RDM, which provides reliable datagram delivery. The endpoint is bound to a registered memory region (fid_mr = 0xFIDDA01), and the RMA operation stores the memory descriptor in the desc field. Gradients reside in this memory region, starting from the base address 0x20000000.

The length field specifies how many bytes should be written to the remote memory. The target of the write is represented by an fi_addr_t value. In this example, it points to the first entry in the Address Vector (AV), which identifies the remote rank and its fabric address (IP address). The AV also references a Resource Index (RI) entry. The RI entry contains the JobID, rank ID, remote memory address, and key required for access.

After collecting all the necessary information, the application invokes the fi_write RMA operation in the libfabric core.

Figure 5-2: RMA Operation – Application: fi_write.
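
A minimal sketch of how the application might assemble these arguments before invoking the operation is shown below. The libfabric calls (fi_mr_desc, fi_av_insert, fi_write) are real APIs, but the wrapper function, the variable names, and the assumption that the remote address and key were exchanged during job setup are illustrative; error handling is omitted.

#include <stddef.h>
#include <stdint.h>
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>   /* fi_mr_desc(), fi_av_insert() */
#include <rdma/fi_rma.h>      /* fi_write()                   */

/* Hypothetical wrapper: write the local gradient block to the remote peer. */
static void post_gradient_write(struct fid_ep *ep, struct fid_mr *mr,
                                struct fid_av *av, const void *peer_addr,
                                uint64_t remote_addr, uint64_t rkey)
{
    void      *buf  = (void *)(uintptr_t)0x20000000; /* local base address (Figure 5-2) */
    size_t     len  = 12288;                         /* 3 x 1024 gradients x 4 bytes    */
    void      *desc = fi_mr_desc(mr);                /* local memory descriptor         */
    fi_addr_t  dest;

    fi_av_insert(av, peer_addr, 1, &dest, 0, NULL);  /* resolve fi_addr_t for the peer  */
    fi_write(ep, buf, len, desc, dest, remote_addr, rkey, NULL);
}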

Libfabric Core Validation for RMA Operations


Libfabric core validation ensures that an RMA request is structurally correct, references valid objects, and complies with the capabilities negotiated during endpoint creation. It performs three main types of lightweight checks before handing off the request to the UET provider.

Object Integrity and Type Validation


The core first ensures that all objects referenced by the application are valid and consistent. For fi_write, it verifies that fid_ep is a properly initialized endpoint of the expected class (FI_CLASS_EP) and type (FI_EP_RDM). The memory region fid_mr is checked to confirm it is registered and that the desc field correctly points to the local buffer. The target fi_addr_t is validated against the Address Vector (AV) to ensure it corresponds to a valid remote rank and fabric address. Any associated Resource Index (RI) references are verified for consistency.
Attribute and Capability Compliance

Next, the core verifies that the endpoint supports the requested operation. For fi_write, this means checking that the endpoint provides reliable RMA write capability. It also ensures that any attributes or flags used are compatible with the endpoint’s capabilities, including alignment with memory registration and compliance with the provider’s supported ordering and reliability semantics.

Basic Parameter Sanity Checks 


The core performs sanity checks on operation parameters. It verifies that length is non-zero, does not exceed the registered memory region, and is correctly aligned. Any flags or optional parameters are checked for validity.
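
As an illustration of the kind of lightweight check performed here, the helper below verifies length and bounds against a registered region; the function and its parameters are hypothetical, not libfabric core code.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical check: is [buf, buf+len) non-empty and fully inside the
 * registered memory region [mr_base, mr_base+mr_len)? */
static bool rma_params_ok(uint64_t buf, size_t len,
                          uint64_t mr_base, size_t mr_len)
{
    if (len == 0)
        return false;                              /* nothing to transfer       */
    if (buf < mr_base || buf + len > mr_base + mr_len)
        return false;                              /* outside the registered MR */
    return true;
}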

Provider Handoff


After passing all validations, the libfabric core hands off the fi_write operation to the provider, ensuring the request is well-formed.

The UET provider converts the fi_write RMA operation into a Work Request Entity (WRE), from which it creates a Semantic Sublayer (SES) header that specifies where the data will be written on the target, and a Packet Delivery Context (PDC) that describes how the packet is expected to be delivered. It then constructs the Data Descriptor (DD), which includes the local memory address, length, and access key. Next, the UET provider creates a Work Element (WE) from the SES, PDC, and DD. The WE is placed into the NIC’s staging buffer, where the NIC reads it and constructs a packet, copying the data from the local memory.

These processes are explained in the upcoming chapters.


Figure 5-3: RMA Operation – Libfabric Core: Lightweight Sanity Check.


Monday, 10 November 2025

UET Data Transport Part I: Introduction

[Figure updated 13 November 2025]

My previous UET posts explained how an application uses libfabric function API calls to discover available hardware resources and how this information is used to create a hardware abstraction layer composed of Fabric, Domain, and Endpoint objects, along with their child objects — Event Queues, Completion Queues, Completion Counters, Address Vectors, and Memory Regions.

This chapter explains how these objects are used during data transfer operations. It also describes how information is encoded into UET protocol headers, including the Semantic Sublayer (SES) header and the Packet Delivery Context (PDC) header. In addition, the chapter covers how the Congestion Management Sublayer (CMS) monitors and controls send queue rates to prevent egress buffer overflows.

Note: In this book, libfabric API calls are divided into two categories for clarity. Functions are used to create and configure fabric objects such as fabrics, domains, endpoints, and memory regions (for example, fi_fabric(), fi_domain(), and fi_mr_reg()). Operations, on the other hand, perform actual data transfer or synchronization between processes (for example, fi_write(), fi_read(), and fi_send()).
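
The distinction can be seen directly in code: the snippet below first uses setup functions to create objects and then an operation to move data. The calls are standard libfabric APIs, but the fragment is a simplified sketch; error handling and attribute setup are omitted.

#include <stddef.h>
#include <stdint.h>
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>   /* fi_domain(), fi_mr_reg() */
#include <rdma/fi_rma.h>      /* fi_write()               */

/* Functions: create and configure fabric objects. */
static void setup_objects(struct fi_info *info, void *buf, size_t len,
                          struct fid_fabric **fabric, struct fid_domain **domain,
                          struct fid_mr **mr)
{
    fi_fabric(info->fabric_attr, fabric, NULL);
    fi_domain(*fabric, info, domain, NULL);
    fi_mr_reg(*domain, buf, len, FI_WRITE | FI_REMOTE_WRITE, 0, 0, 0, mr, NULL);
}

/* Operation: perform the actual data transfer between processes. */
static void transfer(struct fid_ep *ep, void *buf, size_t len, void *desc,
                     fi_addr_t dest, uint64_t raddr, uint64_t rkey)
{
    fi_write(ep, buf, len, desc, dest, raddr, rkey, NULL);
}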

Figure 5-1 provides a high-level overview of a libfabric Remote Memory Access (RMA) operation using the fi_write function call. When an application needs to transfer data, such as gradients, from its local memory to the memory of a GPU on a remote node, both the application and the UET provider must specify a set of parameters. These parameters ensure that the local RMA-capable NIC can forward packets to the correct destination and that the target node can locate the appropriate memory region using its process and job identifiers.

First, the application defines the operation to perform, in our example, a remote write fi_write(). It then specifies the resources involved in the transfer. The endpoint (fid_ep) represents the communication interface between the process and the underlying fabric. Each endpoint is bound to exactly one domain object, which abstracts the UET NIC. Through this binding, the UET provider automatically knows which NIC the endpoint uses, and the endpoint is automatically assigned to one or more send queues for processing work requests. This means the application does not need to manage NIC queue assignments manually.

Next, the application identifies the registered memory region (desc) that contains the local data to be transmitted. It also specifies where within that region to start reading the payload (buffer pointer: buf) and how many bytes to transfer (length: len). 

To reach the correct remote peer, the application uses a fabric address handle (fi_addr_t). The provider resolves this logical address through its Address Vector (AV) to obtain the peer’s actual fabric address—corresponding, in the UET context, to the remote UET NIC endpoint.

Finally, the application specifies the destination memory information: the remote virtual address (addr) where the data should be written and the remote protection key (key), which authorizes access to that memory region.

The resulting fi_write function call, as described in the libfabric programmer’s manual, is structured as follows:

ssize_t fi_write(struct fid_ep *ep, const void *buf, size_t len, void *desc, fi_addr_t dest_addr, uint64_t addr, uint64_t key, void *context);

Next, the application fi_write operation API call is passed by the libfabric core to the UET provider. Based on the fi_addr_t handle, the provider knows which Address Vector (AV) table entry it should consult. In our example, the handle value 0x0001 corresponds to rank 1 with the fabric address 10.0.1.11.

Depending on the provider implementation, an AV entry may optionally reference a Resource Index (RI) entry. The RI table can associate the JobID with the work request and store an authorization key, if it was not provided directly by the application. It may also define which operations are permitted for the target rank.

Note: The Rank Identifier (RankID) can be considered analogous to a Process Identifier (PID); that is, the RankID defines the PID on the Fabric EndPoint (FEP).

Armed with this information, gathered from the application’s fi_write operation call and from the Address Vector and Resource Index tables, the UET provider creates a Work Request (WR) and places it into the Send Queue (SQ). Each SQ is implemented as a circular buffer in memory, shared between the GPU (running the provider) and the NIC hardware. Writing a WR into the queue does not automatically notify the NIC. To signal that new requests are available, the provider performs a doorbell operation, writing to a special NIC register. This alerts the NIC to read the SQ, determine how many WRs are pending, and identify where to start processing.
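
A simplified model of this mechanism is sketched below: a circular send queue in host memory plus a memory-mapped doorbell register. The structure and the doorbell semantics are illustrative assumptions; real UET NICs define their own descriptor formats and registers.

#include <stdint.h>
#include <string.h>

#define SQ_DEPTH 256

struct work_request { uint8_t bytes[64]; };   /* opaque WR descriptor */

struct send_queue {
    struct work_request ring[SQ_DEPTH];  /* circular buffer shared with the NIC */
    uint32_t            head;            /* producer index (next free slot)     */
    volatile uint32_t  *doorbell;        /* memory-mapped NIC doorbell register */
};

static void sq_post(struct send_queue *sq, const struct work_request *wr)
{
    /* 1. Write the WR into the next free slot of the ring. */
    memcpy(&sq->ring[sq->head % SQ_DEPTH], wr, sizeof(*wr));
    sq->head++;

    /* 2. Ring the doorbell: writing the new producer index tells the NIC
     *    how many WRs are pending and where to start processing. */
    *sq->doorbell = sq->head;
}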

Once notified, the NIC fetches each WR, retrieves the associated metadata, such as the destination fabric address, remote memory key, and SES/PDC header information — and begins executing the data transfer. Some NICs may also periodically poll the SQ, but modern UET NICs typically rely on doorbell notifications to achieve low-latency execution.

Because GPUs and the application are multithreaded, multiple operations may be posted to the SQ simultaneously. Each WR is treated independently and can be placed in separate send queues, allowing the NIC to execute multiple transfers in parallel. This design ensures efficient utilization of both the NIC and GPU resources while maintaining correct ordering and authorization of each operation.


Figure 5-1: Mapping between libfabric Operation, Provider Objects, and Hardware.

Before transmitting packets, the NIC uses the metadata retrieved from the AV and RI tables to construct the necessary protocol headers. The Semantic Sublayer (SES) header is created using information such as the JobID, process context, and authorization key, ensuring that the remote peer can correctly identify and authorize the operation. Simultaneously, the Packet Delivery Context (PDC) header is prepared to control reliable delivery, sequence numbering, and congestion management. Together, these headers allow the NIC to send the payload efficiently and securely, while preserving the correct association with the source operation and enabling proper handling by the remote UET NIC.

Next, we will examine in detail how the UET headers — SES and PDC — are constructed and encapsulated with Ethernet/IP headers, and optionally UDP headers for entropy, so that packets can be efficiently routed by the scale-out backend network switches to the correct target node for further processing. On the sending side, the PDC header provides context that the UET NIC uses to manage reliable delivery, sequence numbering, and congestion control, ensuring that packets are transmitted correctly and in order to the remote peer. On the receiving side, the SES header carries the operation-specific information that tells the remote UET NIC exactly what to do — in our example, it instructs the UET NIC to WRITE a block of data to a memory address registered with the target process participating in JobID 101.