Monday, 15 December 2025

UET Request–Response Packet Flow Overview

This section brings together the processes described earlier and explains the packet flow from the node perspective. A detailed network-level packet walk is presented in the following sections.

Initiator – SES Request Packet Transmission

After the Work Request Entity (WRE) and the corresponding SES and PDS headers are constructed, they are submitted to the NIC as a Work Element (WE). As part of this process, a Packet Delivery Context (PDC) is created, and the base Packet Sequence Number (PSN) is selected and encoded into the PDS header.

Once the PDC is established, it begins tracking which PSNs have been transmitted and which have been acknowledged by the target. In this example, PSN 0x12000 is marked as transmitted.

The NIC then fetches the payload data from local memory according to the address and length information in the WRE. The NIC autonomously performs these steps without CPU intervention, illustrating the hardware offload capabilities of UET.

Next, the NIC encapsulates the data with the required protocol headers (Ethernet, IP, optional UDP, PDS, and SES) and computes the Cyclic Redundancy Check (CRC). The fully formed packet is then transmitted toward the target with the Traffic Class (TC) set to Low.
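The encapsulation order can be sketched in C. This is a minimal illustration with hypothetical header structs; the field layouts are simplified stand-ins, not the wire formats defined by the UET specification.

#include <stdint.h>
#include <string.h>

struct eth_hdr  { uint8_t dst[6], src[6]; uint16_t ethertype; };
struct ipv4_hdr { uint32_t src, dst; /* other fields omitted */ };
struct udp_hdr  { uint16_t sport, dport, len, csum; };  /* optional, for entropy */
struct pds_hdr  { uint8_t type; uint32_t psn; uint16_t spdcid; };
struct ses_hdr  { uint8_t opcode; uint32_t job_id; uint64_t buffer_offset; };

size_t build_packet(uint8_t *wire, const struct eth_hdr *eth,
                    const struct ipv4_hdr *ip, const struct udp_hdr *udp,
                    const struct pds_hdr *pds, const struct ses_hdr *ses,
                    const void *payload, size_t payload_len)
{
    size_t off = 0;
    memcpy(wire + off, eth, sizeof *eth); off += sizeof *eth;
    memcpy(wire + off, ip,  sizeof *ip);  off += sizeof *ip;
    if (udp) { memcpy(wire + off, udp, sizeof *udp); off += sizeof *udp; }
    memcpy(wire + off, pds, sizeof *pds); off += sizeof *pds;
    memcpy(wire + off, ses, sizeof *ses); off += sizeof *ses;
    memcpy(wire + off, payload, payload_len); off += payload_len;
    /* the NIC appends the CRC over the full frame here */
    return off;
}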

Note: The Traffic Class is orthogonal to the PDC; a single PDC may carry packets transmitted with Low or High TC depending on their role (data vs control).

Figure 5-9: Initiator: SES Request Processing.

Target – SES Request Reception and PDC Handling


Figure 5-10 illustrates the target-side processing when a PDS Request carrying an SES Request is received. Unlike the initiator, the target PDS manager identifies the PDC using the tuple {source IP address, destination IP address, Source PDC Identifier (SPDCID)} to perform a lookup in its PDC mapping table.


Because no matching entry exists, the lookup results in a miss, and the target creates a new PDC. The PDC identifier (PDCID) is allocated from the General PDC pool, as indicated by the DPDCID field in the received PDS header. In this example, the target selects PDCID 0x8001.

This PDCID is subsequently used as the SPDCID when sending the PDS Ack Response (carrying the Semantic Response) back to the initiator. Any subsequent PDS Requests from the initiator reference this PDC using the same DPDCID = 0x8001, ensuring continuity of the PDC across messages.
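The lookup-or-create behavior described above can be sketched as follows. The table layout and PDCID allocator are hypothetical simplifications of what a real NIC implements in hardware.

#include <stdint.h>
#include <stddef.h>

struct pdc_key { uint32_t src_ip, dst_ip; uint16_t spdcid; };
struct pdc     { struct pdc_key key; uint16_t pdcid; int in_use; };

#define PDC_TABLE_SIZE 64
static struct pdc pdc_table[PDC_TABLE_SIZE];
static uint16_t next_general_pdcid = 0x8001;   /* first free ID in the example */

struct pdc *target_get_pdc(const struct pdc_key *key)
{
    struct pdc *free_slot = NULL;
    for (int i = 0; i < PDC_TABLE_SIZE; i++) {
        struct pdc *p = &pdc_table[i];
        if (p->in_use && p->key.src_ip == key->src_ip &&
            p->key.dst_ip == key->dst_ip && p->key.spdcid == key->spdcid)
            return p;                          /* hit: PDC already exists */
        if (!p->in_use && !free_slot)
            free_slot = p;
    }
    if (!free_slot)
        return NULL;                           /* table full */
    /* miss: first packet of the exchange, open a new PDC */
    free_slot->key    = *key;
    free_slot->pdcid  = next_general_pdcid++;  /* e.g., 0x8001 */
    free_slot->in_use = 1;
    return free_slot;   /* this PDCID is echoed as SPDCID in the PDS Ack */
}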

After the PDC has been created, the UET NIC writes the received data into memory according to the SES header information. The memory placement process follows several steps:

  • Job and rank identification: The relative address in the SES header identifies the JobID (101) and the PIDonFEP (RankID 2).
  • Resource Index (RI) table lookup: The NIC consults the RI table, indexed by 0x00a, and verifies that the ri_generation field (0x01) matches the current table version. This ensures the memory region is valid and has not been re-registered.
  • Remote key validation: The NIC uses the rkey = 0xacce5 to locate the correct RI table entry and confirm permissions for writing.
  • Data placement: The data is written at base address (0xba5eadd1) + buffer_offset (0). The buffer_offset allows fragmented messages to be written sequentially without overwriting previous fragments.

In Figure 5-10, the memory highlighted in orange shows the destination of the first data fragment, starting at the beginning of the registered memory region.
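The placement steps listed above can be condensed into a small validation routine. The struct fields and the helper below are illustrative assumptions, not the NIC's actual data structures.

#include <stdint.h>
#include <stddef.h>

struct ri_entry {
    uint8_t   ri_generation;   /* must match ses.ri_generation (0x01) */
    uint64_t  rkey;            /* must match ses.rkey (0xacce5)       */
    uint8_t  *base;            /* registered base address (0xba5eadd1) */
    uint64_t  length;
    int       remote_write_allowed;
};

uint8_t *resolve_dst(const struct ri_entry *e, uint8_t ses_gen,
                     uint64_t ses_rkey, uint64_t buffer_offset, size_t len)
{
    if (e->ri_generation != ses_gen)     return NULL; /* stale RI table version */
    if (e->rkey != ses_rkey)             return NULL; /* wrong memory region    */
    if (!e->remote_write_allowed)        return NULL; /* permission check       */
    if (buffer_offset + len > e->length) return NULL; /* out of bounds          */
    return e->base + buffer_offset;      /* fragment lands at base + offset */
}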

Note: The NIC handles all these steps autonomously, performing direct memory placement and verification, which is essential for high-performance, low-latency applications like AI and HPC workloads.

Figure 5-10: Target: Request Processing – NIC → PDS → SES → Memory.


Target – SES Response, PDS Ack Response and Packet Transmission

After completing the write operation, the UET provider uses a Semantic Response (SES Response) to notify the initiator that the operation was successful. The opcode in the SES Response header is set to UET_DEFAULT_RESPONSE, with list = UET_EXPECTED and return_code = RC_OK, indicating that the UET_WRITE operation has been executed successfully and the data has been written to target memory. Other fields, including message_id, ri_generation, JobID, and modified_length, are filled with the same values received in the SES Request, for example, message_id = 1, ri_generation = 0x01, JobID = 101, and modified_length = 16384.

Once the SES Response header is constructed, the UET provider creates a PDS Acknowledgement (PDS Ack) Response. The type is set to PDS_ACK, and the next_header field is set to UET_HDR_RESPONSE, indicating that an SES Response follows. The ack_psn_offset encodes the PSN from the received PDS Request, while the cumulative PSN (cack_psn) acknowledges all PDS Requests up to and including the current packet. The SPDCID is set to the target's Initial PDCID (0x8001), and the DPDCID is set to the SPDCID value received in the PDS Request (0x4001).

Finally, the PDS Ack and SES Response headers are encapsulated with Ethernet, IP, and optional UDP headers and transmitted by the NIC using High Traffic Class (TC). The High TC ensures that these control and acknowledgement messages are prioritized in the network, minimizing latency and supporting reliable flow control.
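The response construction can be summarized with two illustrative structs. Field names and widths are assumptions chosen for readability; they do not reproduce the exact SES Response and PDS Ack wire encodings.

#include <stdint.h>

struct ses_response {
    uint8_t  opcode;          /* UET_DEFAULT_RESPONSE           */
    uint8_t  list;            /* UET_EXPECTED                   */
    uint8_t  return_code;     /* RC_OK                          */
    uint16_t message_id;      /* echoed from the request: 1     */
    uint8_t  ri_generation;   /* echoed: 0x01                   */
    uint32_t job_id;          /* echoed: 101                    */
    uint32_t modified_length; /* echoed: 16384                  */
};

struct pds_ack {
    uint8_t  type;            /* PDS_ACK                           */
    uint8_t  next_header;     /* UET_HDR_RESPONSE: SES follows     */
    uint32_t ack_psn_offset;  /* PSN of the request being acked    */
    uint32_t cack_psn;        /* cumulative ack up to this PSN     */
    uint16_t spdcid;          /* target PDCID: 0x8001              */
    uint16_t dpdcid;          /* initiator PDCID: 0x4001           */
};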

Figure 5-11: Target: Response Processing – SES → PDS → Transmit.

Initiator – SES Response and PDS Ack Response


When the initiator receives a PDS Ack Response that also carries a SES Response, it first identifies the associated Packet Delivery Context (PDC) using the DPDCID field in the PDS header. Using this PDC, the initiator updates its PSN tracking state. The acknowledged PSN—for example, 0x12000—is marked as completed and released from the retransmission tracking state, indicating that the corresponding PDS Request has been successfully delivered and processed by the target.

After updating the transport-level state, the initiator extracts the SES Response and passes it to the Semantic Sublayer (SES) for semantic processing. The SES layer evaluates the response fields, including the opcode and return code, and determines that the UET_WRITE operation associated with message_id = 1 has completed successfully. As this response corresponds to the first fragment of the message, the initiator can mark that fragment as completed and, depending on the message structure, either wait for additional fragment responses or complete the overall operation. In our case, there are three more fragments to be processed.

This separation of responsibilities allows the PDS layer to manage reliability and delivery tracking, while the SES layer handles operation-level completion and status reporting.
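A minimal sketch of this division of labor, with hypothetical names: the PDS layer has already released every acknowledged PSN from retransmission tracking, and the SES layer merely counts outstanding fragments per message.

#include <stdint.h>
#include <stdbool.h>

struct msg_state { uint16_t message_id; int fragments_outstanding; };

/* Called for each received PDS Ack / SES Response pair. By this point the
 * PDS layer has released all PSNs up to cack_psn from retransmission state. */
bool ses_process_response(struct msg_state *m, uint8_t return_code)
{
    if (return_code != 0 /* RC_OK */)
        return false;                       /* error handling not shown   */
    m->fragments_outstanding--;             /* e.g., 4 -> 3 after the     */
    return m->fragments_outstanding == 0;   /* first fragment's response  */
}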

Figure 5-12: Initiator: PDS Response & PDS Ack Processing.

Note: PDS Requests and Responses describe transport-specific parameters, such as the delivery mode (Reliable Unordered Delivery, RUD, or Reliable Ordered Delivery, ROD). In contrast, SES Requests and Responses describe semantic operations. SES Requests specify what action the target must perform, for example, writing data and the exact memory location for that operation, while SES Responses inform the initiator whether the operation completed successfully. In some flow diagrams, SES messages are shown as flowing between the SES and PDS layers, while PDS messages are shown as flowing between the PDS layers of the initiator and the target.



Wednesday, 10 December 2025

UET Protocol: How the NIC Constructs a Packet from the Work Entries (WRE + SES + PDS)

Semantic Sublayer (SES) Operation

[Rewritten 12 December 2025]

After a Work Request Entity (WRE) is created, the UET provider generates the parameters needed by the Semantic Sublayer (SES) headers. At this stage, the SES does not construct the actual wire header. Instead, it provides the header parameters, which are later used by the Packet Delivery Context (PDC) state machine to construct the final SES wire header, as explained in the upcoming PDC section. These parameters ensure that all necessary information about the message, including addressing and size, is available for later stages of processing.

Fragmentation Due to Guaranteed Buffer Limits

In our example, the data to be written to the remote GPU is 16 384 bytes. The dual-port NIC in Figure 5-5 has a total memory capacity of 16 384 bytes, divided into three regions: a 4 096-byte guaranteed per-port buffer for Eth0 and Eth1, and an 8 192-byte shared memory pool available to both ports. Because gradient synchronization requires lossless delivery, all data must fit within the guaranteed buffer region. The shared memory pool cannot be used, as its buffer space is not guaranteed.

Since the message exceeds the size of the guaranteed buffer, it must be fragmented. The UET provider splits the 16 384-byte message into four 4 096-byte sub-messages, as illustrated in Figure 5‑6. Fragmentation ensures that each piece of the message can be transmitted reliably within the available guaranteed buffer space.

Figure 5‑5 also illustrates the main object abstractions in UET. The Domain object represents and manages the NIC, while the Endpoint defines a logical connection between the application and the NIC. The max_msg_size parameter specifies the largest data payload that can be transferred in a single operation. In our example, the NIC’s egress buffer can hold 4 096 bytes, but the application needs to write 16 384 bytes. As a result, the data must be split into multiple smaller chunks, each fitting within the buffer and max_msg_size limits.


Figure 5-5: RMA Operation – NIC’s Buffers and Data Size.

Core SES Parameters


Each of the four fragments in Figure 5-6 carries the same msg_id = 1, allowing the target to recognize them as parts of the same original message. The first fragment sets ses.som = 1 (start of message), while the last fragment sets ses.eom = 1 (end of message). The two middle fragments have both ses.som and ses.eom set to 0. In addition to these boundary markers, the SES parameters also define the source and destination FEPs, the Delivery Mode (RUD – Reliable Unordered Delivery for an fi_write operation), the Job ID (101), the Traffic Class (Low = 0), and the Packet Length (4 096 bytes). The pds.next_hdr field determines which SES base header format the receiver must parse next. For an fi_write operation, a standard SES Request header is used (UET_HDR_REQUEST_STD = 0x3).
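A small, runnable sketch of how a provider could derive these per-fragment parameters; the struct is illustrative, not the UET provider's internal representation.

#include <stdint.h>
#include <stdio.h>

struct ses_params { uint16_t msg_id; int som, eom; uint32_t len; };

int main(void)
{
    const uint32_t total = 16384, frag = 4096;
    uint32_t nfrags = (total + frag - 1) / frag;      /* = 4 */
    for (uint32_t i = 0; i < nfrags; i++) {
        struct ses_params p = {
            .msg_id = 1,                 /* same for all fragments  */
            .som    = (i == 0),          /* first fragment only     */
            .eom    = (i == nfrags - 1), /* last fragment only      */
            .len    = frag,
        };
        printf("frag %u: som=%d eom=%d len=%u\n", i, p.som, p.eom, p.len);
    }
    return 0;
}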



Figure 5-6: RMA Operation – Semantic Sublayer: SES Parameters.


Packet Delivery Sublayer (PDS) Operation

PDS Manager


Once the SES parameters are defined, the provider issues the ses_pds_tx_req() request to the PDS Manager (Figure 5-7). The PDS Manager examines the tuple {Job ID, Destination FEP, Traffic Class, Request Mode} to determine whether a Packet Delivery Context (PDC) already exists for that combination.

Because this request corresponds to the first of the four sub-messages, no PDC exists yet. The PDS Manager selects the first available PDC Identifier from the pre-allocated PDC pool. The pool—created during job initialization—consists of a general PDC pool (used for operations such as fi_write and fi_send) and a reserved PDC pool for high-priority traffic such as Ack messages from the target to the initiator. In our example, the first free PDC ID from the general pool is 0x4001.

After opening the PDC, the PDS Manager associates all subsequent requests carrying the same msg_id = 1 with PDC 0x4001.
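The lookup-or-open behavior of the PDS Manager can be sketched as follows. The one-entry cache and the pool counter are deliberate simplifications; a real provider maintains a full mapping table and pre-allocated general and reserved pools.

#include <stdint.h>

struct pdc_tuple { uint32_t job_id; uint32_t dst_fep; uint8_t tc; uint8_t mode; };

/* a trivial one-entry cache standing in for the PDC mapping table */
static struct pdc_tuple active_tuple;
static uint16_t active_pdcid;                 /* 0 = no PDC open yet */
static uint16_t next_general_pdcid = 0x4001;  /* pre-allocated pool  */

uint16_t ses_pds_tx_req(const struct pdc_tuple *t)
{
    if (active_pdcid &&
        active_tuple.job_id == t->job_id && active_tuple.dst_fep == t->dst_fep &&
        active_tuple.tc == t->tc && active_tuple.mode == t->mode)
        return active_pdcid;              /* sub-messages 2..4 reuse the PDC */

    active_tuple = *t;                    /* first sub-message: open a PDC   */
    active_pdcid = next_general_pdcid++;  /* e.g., 0x4001 from general pool  */
    return active_pdcid;
}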

PDC State Machine


After selecting a PDC Identifier, the PDS Manager forwards the request to the PDC State Machine. This component assigns the base Packet Sequence Number (PSN), which will be used in the PSN field of the first and upcoming fragments. Next, it constructs the PDS and SES headers for the message.

Constructing PDC Header


The Type field in the PDS header in Figure 5-7 is set to 2, indicating Reliable Unordered Delivery (RUD). Reliability is ensured by the Acknowledgement Required (ar) flag, which is mandatory for this transport mode. The Next Header (pds.next_hdr) field specifies that the following header is the Standard SES Request header (0x3 = UET_HDR_REQUEST_STD).

The syn flag remains set until the first PSN is acknowledged by the target (explained in upcoming chapters). Using the PSNs reported in ACK messages, the initiator can determine which packets have been successfully delivered.

A dynamically established PDC works as a communication channel between endpoints. For the channel to operate, both the initiator and the target must use the same PDC type (General or Reserved) and must agree on which local and remote PDCs are used for the exchange. In Figure 5-7, the Source PDC Identifier (SPDCID) in the PDS header is derived from the Initiator PDCID (IPDCID). At this stage, the initiator does not know the Destination PDCID (DPDCID) because the communication channel is still in the SYN state (initializing). Instead of setting a DPDCID value, the initiator simply indicates the type of PDC the target should create for the connection (pdc = rsv). It also specifies the PSN offset (psn_offset = 0), indicating that the offset for this request is zero (i.e., using the base PSN value). The target opens its PDC for the connection after receiving the first packet and when generating the acknowledgment response.
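The fields discussed above can be summarized in an illustrative struct. Bit widths and ordering are simplified assumptions, not the exact PDS wire encoding.

#include <stdint.h>

struct pds_request_hdr {
    uint8_t  type;        /* 2 = RUD (Reliable Unordered Delivery)    */
    uint8_t  next_hdr;    /* 0x3 = UET_HDR_REQUEST_STD                */
    uint8_t  ar  : 1;     /* ack required - mandatory for RUD         */
    uint8_t  syn : 1;     /* set until the first PSN is acknowledged  */
    uint16_t spdcid;      /* initiator PDCID, e.g. 0x4001             */
    uint16_t dpdcid;      /* unknown in SYN state; PDC type requested */
    uint16_t psn_offset;  /* 0 = use the base PSN                     */
    uint32_t psn;         /* base PSN, e.g. 0x12000                   */
};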


Constructing SES Header


The SES header (Figure 5-7) provides the target NIC with information about which job and which process the incoming message is destined for. It also carries a Resource Index table pointer (0x00a) and an rkey (0xacce5) that together identify the entry in the RI table where the NIC can find the target’s local memory region for writing the received data.

The Opcode field (UET_WRITE) instructs the NIC how to handle the data. The rel flag (relative addressing) is set for parallel jobs to indicate that relative addressing mode is used.

Using the relative addressing mode, the NIC resolves the memory location by traversing a hierarchical address structure: Fabric Address (FA), Job ID, PIDonFEP, and Resource Index (RI).

  • The FA, carried as the destination IP address in the IP header, selects the correct FEP when the system has multiple FEPs.
  • The Job ID identifies the job to which the packet belongs.
  • The PIDonFEP identifies the process participating in the job (for parallel jobs, PIDonFEP = rankID).
  • The RI value points to the correct entry in the RI table.

Relative addressing identifies which RI table belongs to the job on the target FEP, but not which entry inside that table. The rkey in the SES header provides this missing information: it selects the exact RI table entry that describes the registered memory region.

While the rkey selects the correct RI entry, the buffer_offset field specifies where inside the memory region the data should be written, relative to the region’s base address. In Figure 5-8, the first fragment writes at offset 0, and the second fragment (not shown) starts at offset 4 096, immediately after the first payload.

The ri_generation field (e.g., 0x01) indicates the version of the RI table in use. This is necessary because the table may be updated while the job is running. The hd (header-data-present) bit indicates whether the header_data field is included. This is useful when multiple gradient buckets must be synchronized, because each bucket can be identified by an ID in header_data (for example, GPU 0 might use bucket_id = 11 for the first chunk and bucket_id = 12 for the second). The initiator_id field specifies the initiator’s PIDonFEP (i.e., rankID).
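The resolution chain can be sketched in C with a one-entry stand-in for the per-job, per-process RI table; all names and the table layout are assumptions.

#include <stdint.h>
#include <stddef.h>

struct ri_entry { uint64_t rkey; uint8_t *base; };

/* a one-job, one-process stand-in for the per-FEP lookup hierarchy */
static uint8_t region[16384];
static struct ri_entry ri_table_job101_rank2[] = {
    { .rkey = 0xacce5, .base = region },   /* entry selected by the rkey */
};

uint8_t *resolve(uint32_t job_id, uint16_t pid_on_fep, uint16_t ri_index,
                 uint64_t rkey, uint64_t buffer_offset)
{
    /* the FA (destination IP) has already steered the packet to this FEP;
     * job_id, pid_on_fep, and ri_index would select the RI table here */
    (void)job_id; (void)pid_on_fep; (void)ri_index;
    size_t n = sizeof ri_table_job101_rank2 / sizeof ri_table_job101_rank2[0];
    for (size_t i = 0; i < n; i++)
        if (ri_table_job101_rank2[i].rkey == rkey)
            return ri_table_job101_rank2[i].base + buffer_offset;
    return NULL;   /* no matching rkey: access rejected */
}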

Finally, note that the SES Standard header has several variants. For example:

  • If ses.som = 1, a Header Data field is present.
  • If ses.eom = 1, the Payload Length and Message Offset fields are included.





Figure 5-8: Packet Delivery Sublayer: PDC and PDS Header.

Work Element (WE)


Once the SES and PDS headers are created, they are inserted, along with the WRE, into the NIC’s Transmit Queue (TxQ) as a Work Element (WE). Figure 5‑9 illustrates the three components of a WE. The NIC fetches the data from local memory based on the instructions described in the WRE, wraps it with Ethernet, IP, optional UDP, PDS, and SES headers, and calculates the Cyclic Redundancy Check (CRC) for unencrypted packets. The packet is then ready for transmission.

The NIC can support multiple Transmit Queues. In our example, there are two: one for Traffic Class Low and another for Traffic Class High. The example WE is placed into the Low TxQ. Work Element 1 (WE1) corresponds to the first ses_pds_tx_req() request, completing the end-to-end flow from WRE creation to packet transmission.
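A compact sketch of a WE and its queue placement by Traffic Class, with illustrative types.

#include <stdint.h>
#include <stddef.h>

struct wre;       /* data fetch instructions (address, length, lkey) */
struct ses_hdr;   /* semantic header parameters                      */
struct pds_hdr;   /* delivery header parameters                      */

struct work_element {
    const struct wre     *wre;
    const struct ses_hdr *ses;
    const struct pds_hdr *pds;
};

/* two transmit queues, one per Traffic Class, as in the example */
enum tc { TC_LOW = 0, TC_HIGH = 1 };

struct txq { struct work_element ring[256]; size_t head; };
static struct txq txqs[2];

void post_we(const struct work_element *we, enum tc tc)
{
    struct txq *q = &txqs[tc];          /* WE1 lands in the Low TxQ */
    q->ring[q->head++ % 256] = *we;     /* the NIC consumes from here */
}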


Figure 5-9: UET NIC: Packetization, Queueing & Transport.



Thursday, 27 November 2025

UET Relative Addressing and Its Similarities to VXLAN

 Relative Addressing


As described in the previous section, applications use endpoint objects as their communication interfaces for data transfer. To write data from local memory to a target memory region on a remote GPU, the initiator must authorize the local UE-NIC to fetch data from local memory and describe where that data should be written on the remote side.

To route the packet to the correct Fabric Endpoint (FEP), the application and the UET provider must supply the FEP’s IP address (its Fabric Address, FA). To determine where in the remote process’s memory the received data belongs, the UE-NIC must also know:

  • Which job the communication belongs to
  • Which process within that job owns the target memory
  • Which Resource Index (RI) table should be used
  • Which entry in that table describes the exact memory location

This indirection model is called relative addressing.

How Relative Addressing Works

Figure 5-6 illustrates the concept. Two GPUs participate in distributed training. A process on GPU 0 with global rank 0 (PID 0) receives data from GPU 1 with global rank 1 (PID 1). The UE-NIC determines the target Fabric Endpoint (FEP) based on the destination IP address (FA = 10.0.1.11). This IP address forms the first component of the relative address.

Next, the NIC checks the JobID and PIDonFEP to resolve which job and which process the message is intended for. These two fields are the second and third components of the relative address { FA, JobID, PIDonFEP }.

The fourth component is the Resource Index (RI) table descriptor. It tells the NIC which RI table should be consulted for the memory lookup.

Finally, the rkey, although not part of the relative addressing tuple itself, selects the specific entry in that RI table that defines the precise remote memory region. In our example, the complete addressing information is:

{ FA: 10.0.1.11, JobID: 101, PIDonFEP: 0, RI: 0x00a }, and the rkey identifies the specific RI entry to use.
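Expressed as an illustrative struct with the example's values (field names are assumptions):

#include <stdint.h>

struct relative_address {
    uint32_t fa;          /* Fabric Address: 10.0.1.11 (0x0a00010b) */
    uint32_t job_id;      /* 101   */
    uint16_t pid_on_fep;  /* 0     */
    uint16_t ri;          /* 0x00a */
};
/* the rkey is carried separately and selects the entry inside the RI table */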



Figure 5-6: Ultra Ethernet: Relative Addressing for Distributed Learning.


Comparison with VXLAN

Relative addressing in UET has several structural similarities to addressing in VXLAN data planes.

A Fabric Address (FA) attached to a Fabric Endpoint (FEP) serves a role similar to a VTEP IP address in a VXLAN fabric. Both identify the tunnel endpoint used to route the packet across the underlay network toward its destination.

A JobID identifies a distributed job that consists of multiple processes. In VXLAN, the Layer-2 VNI (L2VNI) identifies a stretched Layer-2 segment for endpoints. In both technologies, these identifiers define the logical communication context in which the packet is interpreted.

The combination of PIDonFEP and RI tells the UE-NIC which Resource Index table describes the target memory locations owned by that process. Similarly, in VXLAN, the VNI-to-VLAN mapping on a VTEP determines which MAC address table holds the forwarding entries for that virtual network.

The rkey selects the specific entry in the RI table that defines the exact target memory location. The VXLAN equivalent is the destination MAC address, which selects the exact entry in the MAC table that determines the egress port or remote VTEP.

Figure 5-7 further illustrates this analogy. The Tunnel IP address determines the target VTEP, and the L2VNI-to-VLAN mapping on that VTEP identifies which MAC address table should be consulted for the destination MAC in the original Ethernet header. As a reminder, VXLAN is an Ethernet-in-IP encapsulation method where the entire Layer-2 frame is carried inside an IP/UDP tunnel.


Figure 5-7: Virtual eXtensible LAN: Layer2 Virtual Network Identifier.



Monday, 24 November 2025

UET Data Transfer Operation: Work Request Entity and Semantic Sublayer

Work Request Entity (WRE) 

[SES part updated 7 December 2025: text and figure]

The UET provider constructs a Work Request Entity (WRE) from a fi_write RMA operation that has been validated and passed by the libfabric core. The WRE is a software-level representation of the requested transfer and semantically describes both the source memory (local buffer) and the target memory (remote buffer) for the operation. Using the WRE, the UET provider constructs the Semantic Sublayer (SES) header and the Packet Delivery Context (PDC) header.

From the local memory perspective, the WRE specifies the address of the data in registered local memory, the length of the data, and the local memory key (lkey). This information allows the NIC to fetch the data directly from local memory when performing the transmission.

From the target memory perspective, the WRE describes the Resource Index (RI) table, which contains information about the destination memory region, including its base address and the offset within that region where the data should be written. The RI table also defines the allowed operations on the region. Because an RI table may contain multiple entries, the actual memory region is selected using the rkey, which is also included in the WRE. The rkey enables the remote NIC to locate the correct memory region within the selected RI table.

To ensure proper delivery, the WRE includes an Address Vector (AV) table entry, identified via the fi_addr_t handle. The AV provides the Fabric Address (FA) of the target and specifies which job and which rank (i.e., PIDonFEP) the data is intended for. The WRE also indicates whether the completion of the transport operation should be reported through a completion queue.

By including pointers to the AV table entry and the remote RI table, the WRE allows the UET provider to access all the transport and remote memory metadata required for the operation without duplicating the underlying AV or RI data structures. Using these indices, the UET provider can efficiently construct the SES and PDC headers for the Work Element (WE), ensuring correct delivery of the data from the initiator’s local memory to the remote target memory.
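The WRE contents described above can be summarized in an illustrative struct; field names are assumptions, not the provider's actual layout.

#include <stdint.h>

struct wre {
    /* local memory: lets the NIC fetch the payload */
    uint64_t laddr;         /* address in registered local memory */
    uint64_t length;        /* bytes to transfer                  */
    uint64_t lkey;          /* local memory key                   */

    /* target memory: resolved through the RI table on the remote side */
    uint16_t ri;            /* Resource Index table pointer */
    uint64_t rkey;          /* selects the RI table entry   */
    uint64_t buffer_offset; /* offset within the region     */

    /* delivery metadata */
    uint64_t fi_addr;       /* AV entry handle: FA, JobID, PIDonFEP */
    int      report_completion; /* post to a completion queue?      */
};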

Figure 5-4 illustrates how the libfabric core passes a fi_write RMA operation request from the application to the UET provider after performing a sanity check. The UET provider then constructs the Work Request Entity, which encapsulates all the information about the local and remote memory, the AV entry identified by the fi_addr_t handle, and the transport metadata required to deliver the operation.

Figure 5-4: RMA Operation – Semantic Sublayer: Work Request Entity (WRE).

Semantic Sublayer (SES)


The UET provider’s Semantic Sublayer (SES) maps application-facing API calls, such as fi_write RMA operations, to UET operations. In our example, the UET provider constructs a SES header where the fi_write request from the libfabric application is mapped to a UET_WRITE operation. The first step of this mapping, described in the previous section, uses the content of the fi_write request to construct a UET Work Request Entity (WRE). Information from the WRE is then used to build the actual SES header, which will later be wrapped within Ethernet/IP/UDP (if an Entropy header is not used). The SES header carries information that the target NIC uses to resolve which job and process the message is targeted to, and which specific memory location the data should be written to. In other words, the SES header provides a form of intra-node routing, indicating where the data should be placed within the target node.

The first element to consider in Figure 5-5 is the maximum message size (max_msg_size) supported by the endpoint. In our example, the dual-port NIC has a total memory capacity of 16 384 bytes. Half of this memory (8 192 bytes) forms a shared pool accessible to both ports (Eth0 and Eth1). In addition, each port has a guaranteed private memory region of 4 096 bytes.

The maximum message size is determined by the largest amount of memory that the NIC can guarantee for a single message. Although the shared pool can temporarily absorb additional traffic, its availability cannot be guaranteed because both ports may consume it simultaneously. Consequently, the NIC must base max_msg_size solely on the per-port guaranteed memory. Thus, the largest message that the endpoint can safely handle is 4 096 bytes.

Note: Although an Accelerated NIC can fetch data directly from GPU VRAM without staging it through CPU memory, the NIC still needs the data briefly in its own buffer to perform transport framing (for example, adding headers and verifying CRC) before sending it onto the wire.

Because the data to be written to the target memory (16 384 bytes) exceeds the per-packet private buffer size (4 096 bytes), the UET provider must split the operation into four packets (messages).

The first key field in the SES header is the operation code (UET_WRITE), which instructs the target NIC how to handle the received data. The rel bit (relative addressing), when set, indicates that the operation uses a relative address, which includes JobID (101), PIDonFEP (process identifier, rank 2 in our example), and Resource Index (0x00a). Based on this information, the target NIC can identify the job and rank to which the message belongs. The Resource Index may contain multiple entries and the rkey (0xacce5) is used to select the correct RI entry.

The SES header also contains the buffer offset, which specifies the exact location, relative to the base address, where the data should be written. In Figure 5‑5, the first message writes data starting at offset 0, while the second message writes at offset 4096, immediately following the first message's payload. The ses.som bit indicates the start of the message, and ses.eom indicates the end of the message. Note that the ses.som and ses.eom bits are not used for ordering; message ordering is ensured by the Message ID field, which allows the NIC to process fragments in the correct sequence.

During the job lifetime, Resource Indices may be updated. To validate that the correct version is used, the SES header includes the ri_generation field, which identifies the initiator’s current RI table version.

The hd (header data present) bit indicates whether the header_data field is present in the SES header. A common use case for header data is when a GPU holds multiple gradient chunks that must be synchronized with a remote process. Each chunk can be identified by a bucket ID stored in the SES header's header_data field. For example, the first chunk in GPU 0 memory may have bucket_id=11, the second chunk bucket_id=12, and so on. This allows the NIC to distinguish which messages correspond to which chunk. The initiator_id field carries the initiator's rank ID (i.e., its PIDonFEP).

If a gradient chunk exceeds the NIC’s max_msg_size, it must be split into multiple SES messages. Consider the second chunk (bucket_id=12) split into four messages. The first message has ses.som=1, indicating the start of the chunk, and hd=1, signaling that header data is present. The header_data field contains the bucket ID (12). This message also has message_id=1, identifying it as the first SES message of the chunk. The next two messages have message_id=2 and message_id=3, respectively. Both have hd=0, ses.som=0, and ses.eom=0, indicating they are continuation packets. The fourth message is similar but has ses.eom=1, marking it as the last message of the chunk. 
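The four messages of the bucket_id = 12 chunk, written out as data; the struct and the message numbering follow this example, not a normative layout.

#include <stdint.h>

struct ses_msg { uint16_t message_id; int som, eom, hd; uint64_t header_data; };

static const struct ses_msg chunk12[4] = {
    { .message_id = 1, .som = 1, .eom = 0, .hd = 1, .header_data = 12 },
    { .message_id = 2, .som = 0, .eom = 0, .hd = 0 },  /* continuation */
    { .message_id = 3, .som = 0, .eom = 0, .hd = 0 },  /* continuation */
    { .message_id = 4, .som = 0, .eom = 1, .hd = 0 },  /* end of chunk */
};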





Figure 5-5: RMA Operation – Semantic Sublayer: SES Header.


Monday, 17 November 2025

UET Data Transfer Operation: Introduction

Introduction

[Updated 22 November 2025: Handoff Section]

The previous chapter described how an application gathers information about available hardware resources and uses that information to initialize the job environment. During this initialization, hardware resources are abstracted and made accessible to the UET provider as objects.

This chapter explains the data transport process, using gradient synchronization as an example.

Figure 5-1 depicts two GPUs—Rank 0 and Rank 2—participating in the same training job (JobID: 101). Both GPUs belong to the same NCCL topology and are connected to the Scale-Out Backend Network’s rail0.

Because the training model is large, each layer of the neural network is split across two GPUs using tensor parallelism, meaning that the computations of a single layer are distributed between the GPUs.

During the first forward-pass training iteration, the predicted model output does not match the expected result. This triggers the backward pass process, in which gradients—values indicating how much each weight parameter should be adjusted to improve the next forward-pass prediction—are computed.

Rank 0 computes its gradients, which in Figure 5-1 are stored as a 2D matrix with 3 rows and 1024 columns. The results are stored in a memory space registered for the process in local VRAM. The memory region’s base address is 0x20000000.

The first gradient at row 0, column 0 (index [0,0]) is stored at offset 0. In this example, each gradient value is 4 bytes. Thus, the second gradient at [0,1] is stored at offset 4, and so on. All 1024 gradients in row 0 require 4096 bytes (1024 × 4 bytes) of memory, which corresponds to the Scale-Out Backend Network's Maximum Transfer Unit (MTU). The entire gradient block stored in Rank 0's VRAM occupies 12,288 bytes (3 × 4096 bytes).

Row 0: [G[0,0], G[0,1], G[0,2], ..., G[0,1023]] → offsets 0, 4, 8, ..., 4092 bytes

Row 1: [G[1,0], G[1,1], G[1,2], ..., G[1,1023]] → offsets 4096, 4100, 4104, ..., 8188 bytes

Row 2: [G[2,0], G[2,1], G[2,2], ..., G[2,1023]] → offsets 8192, 8196, 8200, ..., 12284 bytes
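The same arithmetic as code: each gradient occupies 4 bytes and each row holds 1024 values.

#include <stdint.h>

static inline uint64_t gradient_offset(uint32_t row, uint32_t col)
{
    return ((uint64_t)row * 1024 + col) * 4;
}
/* gradient_offset(0, 0) == 0, gradient_offset(1, 0) == 4096,
 * gradient_offset(2, 1023) == 12284; total block = 3 * 4096 = 12288 bytes */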


After completing its part of the gradient computation, the application on Rank 0 initiates gradient synchronization with Rank 2 by executing an fi_write RMA operation. This operation writes the gradient data from local memory to remote memory. To perform this operation successfully, the application—together with the UET provider, libfabric core, and the UET NIC—must provide the necessary information for the process, UET NIC, and network:

Semantic (What we want to do): The application describes the intent: Write 12,288 bytes of data from the local registered memory region starting at base memory address 0x20000000 to the corresponding memory region of a process running on Rank 2.

Delivery (How to transport): The application specifies how the data must be delivered: reliable or unreliable transport, with ordered or unordered packet delivery. In Figure 5-1, the selected mode is Reliable, Ordered Delivery (ROD).

Forwarding (Where to transport): To route the packets over the Scale-Out Backend Network, the delivery information is encapsulated within Ethernet, IP, and optionally UDP headers.


Figure 5-1: High-Level View of Remote Memory Access Operation.



Application: RMA Write Operation


Figure 5-2 illustrates how an application gathers information for an fi_write RMA operation.

The first field, fid_ep, maps the remote write operation to the endpoint object (fid_ep = 0xFIDAE01; AE stands for Active Endpoint). The endpoint type is FI_EP_RDM, which provides reliable datagram delivery. The endpoint is bound to a registered memory region (fid_mr = 0xFIDDA01), and the RMA operation stores the memory descriptor in the desc field. Gradients reside in this memory region, starting from the base address 0x20000000.

The length field specifies how many bytes should be written to the remote memory. The target of the write is represented by an fi_addr_t value. In this example, it points to the first entry in the Address Vector (AV), which identifies the remote rank and its fabric address (IP address). The AV also references a Resource Index (RI) entry. The RI entry contains the JobID, rank ID, remote memory address, and key required for access.

After collecting all the necessary information, the application invokes the fi_write RMA operation in the libfabric core.
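An illustrative invocation with this example's values follows; the variable names are assumptions, and fi_mr_desc() returns the descriptor for the registered memory region.

#include <rdma/fabric.h>
#include <rdma/fi_domain.h>
#include <rdma/fi_rma.h>

ssize_t post_gradient_write(struct fid_ep *ep, struct fid_mr *mr,
                            const void *buf, fi_addr_t rank2,
                            uint64_t remote_addr, uint64_t rkey)
{
    return fi_write(ep,
                    buf, 12288,          /* gradients at 0x20000000    */
                    fi_mr_desc(mr),      /* local memory descriptor    */
                    rank2,               /* AV handle for Rank 2       */
                    remote_addr, rkey,   /* remote buffer + access key */
                    NULL);               /* completion context         */
}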

Figure 5-2: RMA Operation – Application: fi_write.

Libfabric Core Validation for RMA Operations


Libfabric core validation ensures that an RMA request is structurally correct, references valid objects, and complies with the capabilities negotiated during endpoint creation. It performs three main types of lightweight checks before handing off the request to the UET provider.

Object Integrity and Type Validation


The core first ensures that all objects referenced by the application are valid and consistent. For fi_write, it verifies that fid_ep is a properly initialized endpoint of the expected class (FI_CLASS_EP) and type (FI_EP_RDM). The memory region fid_mr is checked to confirm it is registered and that the desc field correctly points to the local buffer. The target fi_addr_t is validated against the Address Vector (AV) to ensure it corresponds to a valid remote rank and fabric address. Any associated Resource Index (RI) references are verified for consistency.

Attribute and Capability Compliance

Next, the core verifies that the endpoint supports the requested operation. For fi_write, this means checking that the endpoint provides reliable RMA write capability. It also ensures that any attributes or flags used are compatible with the endpoint’s capabilities, including alignment with memory registration and compliance with the provider’s supported ordering and reliability semantics.

Basic Parameter Sanity Checks 


The core performs sanity checks on operation parameters. It verifies that length is non-zero, does not exceed the registered memory region, and is correctly aligned. Any flags or optional parameters are checked for validity.

Provider Handoff


After passing all validations, the libfabric core hands off the fi_write operation to the provider, ensuring the request is well-formed.

The UET provider converts the fi_write RMA operation into a Work Request Entity (WRE), from which it creates a Semantic Sublayer (SES) header that specifies where the data will be written on the target, and a Packet Delivery Context (PDC) that describes how the packet is expected to be delivered. It then constructs the Data Descriptor (DD), which includes the local memory address, length, and access key. Next, the UET provider creates a Work Element (WE) from the SES, PDC, and DD. The WE is placed into the NIC's staging buffer, where the NIC reads it and constructs a packet, copying the data from the local memory.

These processes are explained in the upcoming chapters.


Figure 5-3: RMA Operation – Libfabric Core: Lightweight Sanity Check.


Monday, 10 November 2025

UET Data Transport Part I: Introduction

[Figure updated 13 November 2025]

My previous UET posts explained how an application uses libfabric function API calls to discover available hardware resources and how this information is used to create a hardware abstraction layer composed of Fabric, Domain, and Endpoint objects, along with their child objects — Event Queues, Completion Queues, Completion Counters, Address Vectors, and Memory Regions.

This chapter explains how these objects are used during data transfer operations. It also describes how information is encoded into UET protocol headers, including the Semantic Sublayer (SES) and the Packet Delivery Sublayer (PDS). In addition, the chapter covers how the Congestion Management Sublayer (CMS) monitors and controls send queue rates to prevent egress buffer overflows.

Note: In this book, libfabric API calls are divided into two categories for clarity. Functions are used to create and configure fabric objects such as fabrics, domains, endpoints, and memory regions (for example, fi_fabric(), fi_domain(), and fi_mr_reg()). Operations, on the other hand, perform actual data transfer or synchronization between processes (for example, fi_write(), fi_read(), and fi_send()).

Figure 5-1 provides a high-level overview of a libfabric Remote Memory Access (RMA) operation using the fi_write function call. When an application needs to transfer data, such as gradients, from its local memory to the memory of a GPU on a remote node, both the application and the UET provider must specify a set of parameters. These parameters ensure that the local RMA-capable NIC can forward packets to the correct destination and that the target node can locate the appropriate memory region using its process and job identifiers.

First, the application defines the operation to perform, in our example, a remote write fi_write(). It then specifies the resources involved in the transfer. The endpoint (fid_ep) represents the communication interface between the process and the underlying fabric. Each endpoint is bound to exactly one domain object, which abstracts the UET NIC. Through this binding, the UET provider automatically knows which NIC the endpoint uses, and the endpoint is automatically assigned to one or more send queues for processing work requests. This means the application does not need to manage NIC queue assignments manually.

Next, the application identifies the registered memory region (desc) that contains the local data to be transmitted. It also specifies where within that region to start reading the payload (buffer pointer: buf) and how many bytes to transfer (length: len). 

To reach the correct remote peer, the application uses a fabric address handle (fi_addr_t). The provider resolves this logical address through its Address Vector (AV) to obtain the peer’s actual fabric address—corresponding, in the UET context, to the remote UET NIC endpoint.

Finally, the application specifies the destination memory information: the remote virtual (addr) address where the data should be written and the remote protection key, which authorizes access to that memory region. 

The resulting fi_write function call, as described in the libfabric programmer’s manual, is structured as follows:

ssize_t fi_write(struct fid_ep *ep, const void *buf, size_t len, void *desc, fi_addr_t dest_addr, uint64_t addr, uint64_t key, void *context);

Next, the application's fi_write API call is passed by the libfabric core to the UET provider. Based on the fi_addr_t handle, the provider knows which Address Vector (AV) table entry it should consult. In our example, the handle value 0x0001 corresponds to rank 1 with the fabric address 10.0.1.11.

Depending on the provider implementation, an AV entry may optionally reference a Resource Index (RI) entry. The RI table can associate the JobID with the work request and store an authorization key, if it was not provided directly by the application. It may also define which operations are permitted for the target rank.

Note: The Rank Identifier (RankID) can be considered analogous to a Process Identifier (PID); that is, the RankID defines the PID on the Fabric EndPoint (FEP).

Armed with this information, gathered from the application’s fi_write operation call and from the Address Vector and Resource Index tables, the UET provider creates a Work Request (WR) and places it into the Send Queue (SQ). Each SQ is implemented as a circular buffer in memory, shared between the GPU (running the provider) and the NIC hardware. Writing a WR into the queue does not automatically notify the NIC. To signal that new requests are available, the provider performs a doorbell operation, writing to a special NIC register. This alerts the NIC to read the SQ, determine how many WRs are pending, and identify where to start processing.
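A sketch of the post-and-doorbell sequence described above; the ring layout and the doorbell register are hypothetical hardware details.

#include <stdint.h>

struct wr { uint64_t desc[8]; };   /* placeholder work-request descriptor */

struct send_queue {
    struct wr         *ring;       /* circular buffer shared with the NIC */
    uint32_t           size;       /* number of slots                     */
    uint32_t           head;       /* producer index                      */
    volatile uint32_t *doorbell;   /* memory-mapped NIC doorbell register */
};

void sq_post(struct send_queue *sq, const struct wr *wr)
{
    sq->ring[sq->head % sq->size] = *wr;   /* write the WR into the ring */
    sq->head++;
    /* a real driver issues a memory barrier here so the NIC never sees
     * the doorbell before the WR itself */
    *sq->doorbell = sq->head;              /* notify the NIC: work pending */
}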

Once notified, the NIC fetches each WR, retrieves the associated metadata (such as the destination fabric address, remote memory key, and SES/PDC header information), and begins executing the data transfer. Some NICs may also periodically poll the SQ, but modern UET NICs typically rely on doorbell notifications to achieve low-latency execution.

Because GPUs and the application are multithreaded, multiple operations may be posted to the SQ simultaneously. Each WR is treated independently and can be placed in separate send queues, allowing the NIC to execute multiple transfers in parallel. This design ensures efficient utilization of both the NIC and GPU resources while maintaining correct ordering and authorization of each operation.


Figure 5-1: Mapping between libfabric operation, Provider Objects, and Hardware.

Before transmitting packets, the NIC uses the metadata retrieved from the AV and RI tables to construct the necessary protocol headers. The Semantic Sublayer (SES) header is created using information such as the JobID, process context, and authorization key, ensuring that the remote peer can correctly identify and authorize the operation. Simultaneously, the Packet Delivery Sublayer (PDS) header is prepared to control reliable delivery, sequence numbering, and congestion management. Together, these headers allow the NIC to send the payload efficiently and securely, while preserving the correct association with the source operation and enabling proper handling by the remote UET NIC.

Next, we will examine in detail how the UET headers — SES and PDS — are constructed and encapsulated with Ethernet/IP headers, and optionally UDP headers for entropy, so that packets can be efficiently routed by the scale-out backend network switches to the correct target node for further processing. On the sending side, the PDS header provides the context that the UET NIC uses to manage reliable delivery, sequence numbering, and congestion control, ensuring that packets are transmitted correctly and in order to the remote peer. On the receiving side, the SES header carries the operation-specific information that tells the remote UET NIC exactly what to do — in our example, it instructs the UET NIC to WRITE a block of data to a memory address registered with the target process participating in JobID 101.


Thursday, 16 October 2025

Ultra Ethernet: Memory Region

Memory Registration and Endpoint Binding in UET with libfabric 

[updated 25-Oct, 2025 - (RIs in the figure)]

In distributed AI workloads, each process requires memory regions that are visible to the fabric for efficient data transfer. The Job framework or application typically allocates these buffers in GPU VRAM to maximize throughput and enable low-latency direct memory access. These buffers store model parameters, gradients, neuron outputs, and temporary workspace, such as intermediate activations or partial gradients during collective operations in forward and backward passes.


Memory Registration and Key Generation

Once memory is allocated, it must be registered with the fabric domain using fi_mr_reg(). Registration informs the NIC that the memory is pinned and accessible for data transfers initiated by endpoints. The fabric library associates the buffer with a Memory Region handle (fid_mr) and internally generates a remote protection key (fi_mr_key), which uniquely identifies the memory region within the Job and domain context.

The local endpoint binds the fid_mr using fi_mr_bind() to define the permitted operations (FI_REMOTE_WRITE in Figure 4-10). This allows the NIC to access local memory efficiently and perform zero-copy operations.

The application retrieves the memory key using fi_mr_key(fid_mr) and constructs a Resource Index (RI) entry. The RI entry serves as a compact, portable identifier of the memory region for remote ranks. It does not expose the local fid_mr, but encapsulates the key along with associated metadata necessary for remote access.
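A minimal registration sketch using the libfabric calls named above; error handling is trimmed, and the domain and buffer are assumed to already exist.

#include <rdma/fabric.h>
#include <rdma/fi_domain.h>

struct fid_mr *mr;
uint64_t rkey;

int register_buffer(struct fid_domain *domain, void *buf, size_t len)
{
    int ret = fi_mr_reg(domain, buf, len,
                        FI_REMOTE_WRITE,  /* permitted remote operation */
                        0 /* offset */, 0 /* requested key */, 0 /* flags */,
                        &mr, NULL);
    if (ret)
        return ret;
    rkey = fi_mr_key(mr);   /* portable key shared with remote ranks */
    return 0;
}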


Distribution of Resource Index Pointers

During UET job initialization, only remotely accessible resources are distributed to peers. Typically, these are registered memory regions that have a local key and an RI assigned.


Each rank sends its memory RI information to the master rank over the control channel. The transmitted metadata for each memory region includes:

  • RI pointer (local identifier for the memory region)
  • Job ID
  • Rank ID
  • Memory Key (rkey)
  • Memory type (DRAM, VRAM, etc.)
  • Access rights (allowed RDMA operations)

Note: The local fid_mr handle is never shared, as it only has meaning inside the owning rank.

The master rank collects all entries from participating ranks and constructs a job-wide Resource Index Table. Each entry in this table corresponds to one remotely accessible memory region and includes all the metadata above. Once distributed back to all ranks, each rank can resolve a remote RI pointer into the corresponding memory key and access rights, enabling safe and efficient RDMA operations without additional control-plane lookups.

Objects that are purely local — such as domain, fabric, completion queues, event queues, or local counters — do not require RI distribution and are therefore excluded from the table. This selective sharing ensures efficiency while giving all ranks sufficient knowledge to access remote memory.


Memory Binding and Accessing Remote Memory

Once the Resource Index table is distributed, each rank binds its local memory regions to the local endpoint using fi_mr_bind(). Binding associates the fid_mr handle with the endpoint and specifies access permissions. This step ensures that the NIC can access the local memory efficiently and perform zero-copy operations.

The Resource Index table, like the AV table, contains only remote memory entries. Conceptually, the endpoint is “bound” to this table: it can automatically resolve remote RI pointers to the corresponding memory key, type, and access rights during RDMA operations. Local binding provides the actual handle and permissions needed for the NIC, while the table provides the mapping required for remote access, just as the AV table provides remote destination addresses for send operations.

When an application wants to send data, it specifies the destination using an fi_addr_t handle — a compact identifier representing the remote rank or endpoint. The application does not need to know the remote Fabric Address (FA) or RI. The local endpoint looks up the fi_addr_t in the AV table, retrieves the remote FA, and uses it to identify the remote rank. For RDMA operations, the NIC references the Resource Index table for that rank to resolve the remote RI into the memory key, type, and access rights. This combination of local fid_mr binding and remote RI table lookup allows the endpoint to safely and efficiently perform zero-copy RDMA operations.


Endpoint Binding to Monitoring and Signaling Objects

After memory regions are bound, the endpoint must also be associated with monitoring and signaling objects:

Event Queues (EQs): Deliver asynchronous fabric events, such as errors or connection state changes. Created with fi_eq_open() and bound to the endpoint using fi_ep_bind(fid_ep, fid_eq, flags).

Completion Queues (CQs): Track completion status of send, receive, or RDMA operations. Created with fi_cq_open() and bound to the endpoint using fi_ep_bind(fid_ep, fid_cq, FI_TRANSMIT | FI_RECV).

Completion Counters (CNTRs): Provide lightweight tracking of operation progress for flow control or synchronization. Created with fi_cntr_open() and bound to the endpoint using fi_ep_bind(fid_ep, fid_cntr, flags).

This allows the endpoint to receive notifications, track operation completions, and integrate with application-level synchronization.


Endpoint Enablement

Once all memory and monitoring bindings are complete, the endpoint is enabled using fi_enable(fid_ep). This final step activates the endpoint for communication, ensuring that all memory regions, Resource Index pointers, and signaling mechanisms are properly connected. After enabling, the endpoint can safely issue and receive messages, perform remote memory operations, and fully participate in the distributed AI job.
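The binding and enablement sequence, condensed into one illustrative routine using the libfabric calls named above; attribute setup and error handling are trimmed for brevity.

#include <rdma/fabric.h>
#include <rdma/fi_domain.h>
#include <rdma/fi_endpoint.h>
#include <rdma/fi_eq.h>

int bind_and_enable(struct fid_fabric *fabric, struct fid_domain *domain,
                    struct fid_ep *ep)
{
    struct fid_eq *eq;  struct fid_cq *cq;  struct fid_cntr *cntr;
    struct fi_eq_attr   eq_attr   = { 0 };
    struct fi_cq_attr   cq_attr   = { 0 };
    struct fi_cntr_attr cntr_attr = { 0 };

    fi_eq_open(fabric, &eq_attr, &eq, NULL);       /* async fabric events   */
    fi_cq_open(domain, &cq_attr, &cq, NULL);       /* operation completions */
    fi_cntr_open(domain, &cntr_attr, &cntr, NULL); /* lightweight counters  */

    fi_ep_bind(ep, &eq->fid, 0);
    fi_ep_bind(ep, &cq->fid, FI_TRANSMIT | FI_RECV);
    fi_ep_bind(ep, &cntr->fid, FI_WRITE);

    return fi_enable(ep);   /* endpoint is now ready for data transfer */
}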

With memory regions registered, Resource Index pointers distributed, and endpoints fully bound and enabled, the application is now ready to perform send, receive, and remote memory operations. This chapter has focused on the high-level setup from the application’s perspective, illustrating how memory and endpoint resources are prepared for communication.




Figure 4-10: Memory Registration.

References


[1] Libfabric Programmer's Manual: Libfabric man pages https://ofiwg.github.io/libfabric/v2.3.0/man/

[2] Ultra Ethernet Specification v1.0, June 11, 2025 by Ultra Ethernet Consortium, https://ultraethernet.org

[3] In-Memory Database, Wikipedia, https://en.wikipedia.org/wiki/In-memory_database



Wednesday, 15 October 2025

Ultra Ethernet: Address Resolution with Address Vector Table

Address Vector


Overview

To enable Remote Memory Access (RMA) operations between processes, each endpoint — representing a communication channel much like a TCP socket — must know the destination process’s location within the fabric. This location is represented by the Fabric Address (FA) assigned to a Fabric Endpoint (FEP).

During job initialization, FAs are distributed through a control-plane–like procedure in which the master rank collects FAs from all ranks and then broadcasts the complete Rank-to-FA mapping to every participant (see Chapter 3 for details). Each process stores this Rank–FA mapping locally as a structure, which can then be inserted into the Address Vector (AV) Table.

When FAs from the distributed Rank-to-FA table are inserted into the AV Table, the provider assigns each entry an index number, which is published to the application as an fi_addr_t handle. After an endpoint object is bound to the AV Table, the application uses this handle — rather than the full address — when referencing a destination process. This abstraction hides the underlying address structure from the application and allows fast and efficient lookups during communication.

This mechanism resembles the functionality of a BGP Route Reflector (RR) in IP networks. Each RR client advertises its best routes, along with the associated Path Attributes, from its local BGP table. The RR collects these routes and redistributes them to other RR clients and eBGP neighbors. Upon receiving the updates, each BGP process validates the routes and installs the eligible entries into its local BGP table, from which the best routes are selected for the routing table. Similarly, in UET, the master rank collects all Fabric Addresses from participating processes, broadcasts the Rank-to-FA mapping, and each process inserts the received entries into its local Address Vector table.

Figure 4-8 illustrates the process by which complete Rank-to-FA mapping information is distributed across all ranks. First (1a-c), each rank uses a control channel connection to send its FA address along with its Rank identifier to the master rank (see Chapter 3 for details). After receiving the Rank-to-FA information from all expected ranks, the master rank gathers the data and creates a complete mapping table that stores all Rank-to-FA mappings. Finally, it broadcasts this table to all ranks over the control channel connection (2).

Once each rank receives the complete Rank-to-FA mapping from the master rank, these entries must be inserted into a local Address Vector (AV) Table. The AV Table provides a compact, indexed representation of remote Fabric Addresses that can be efficiently used by the application during data transfer. The following section describes in detail how the AV Table is created, how mapping entries are inserted, and how endpoints are bound to the table to enable fast and transparent address resolution for communication operations.


Figure 4-8: Fabric Address Distributing Processes.

Constructing Address Vector Table


In distributed AI applications, each process or Rank must be able to reach other Ranks using their corresponding Fabric Addresses. The fabric provider uses a predefined lookup structure called the Address Vector (AV) to manage this mapping. The AV is a Fabric object that stores associations between logical identifiers—such as Rank indices—and their corresponding Fabric Addresses. This allows applications to reference remote endpoints through compact index values instead of full addresses, enabling efficient, low-latency address resolution entirely within user space.

The AV Table is created by calling the fi_av_open() function, as illustrated in Figure 4-9. This function initializes an empty Address Vector Table and returns a handle to the newly created object, here represented as fid_av = 0xF1DA701. In Figure 4-9, the fi_av_attr structure defines the attributes of the object. The type field is set to FI_AV_TABLE, which is the most commonly used AV type in AI applications, while the count field specifies the expected number of address entries that can be inserted into the table. The fi_av_open() call therefore completes the creation of a blank AV Table that is ready to receive mapping entries.

After receiving the complete Rank-to-FA mapping list from the master Rank, each process populates its previously created Address Vector (AV) table. The application does this using the fi_av_insert() function, which inserts the Rank-to-FA mappings into the AV Table. In this example, multiple Fabric Addresses are inserted into the previously created AV Table identified by fid_av = 0xF1DA701. The addresses to be inserted are defined in the addr field, and the count field specifies how many entries are included. During the insertion, the fabric library assigns an index value for each entry and returns these indices through the fi_addr array. Each returned index, of type fi_addr_t, represents a compact reference to a remote endpoint. For example, FA 10.0.1.11 associated with Rank 1 receives index value fi_addr_1, FA 10.0.0.12 for Rank 2 receives fi_addr_2, and FA 10.0.1.12 for Rank 3 receives fi_addr_3. These index values are later used by the application to identify destinations during communication. Instead of storing full Fabric Addresses, the application relies on these short index values, while the underlying address resolution is handled automatically by the fabric library against the entries in the AV Table.

Before the AV Table can be used for communication, it must be associated with an Endpoint. This binding is established by calling the fi_ep_bind() function (step 3). In this step, the Endpoint handle fid_ep = 0xF1DAE01 is bound to the Address Vector object fid_av = 0xF1DA701. Once the binding is complete, the Endpoint can use the AV Table for address lookups during message or RMA operations. This linkage ensures that when a data transfer is initiated, the Endpoint automatically uses the correct Address Vector for destination resolution.

When all three functions—fi_av_open(), fi_av_insert(), and fi_ep_bind()—have been executed, the index values are made available to the application for use in data transfer operations. Figure 4-9 illustrates the process. The application initiates an RMA operation to send data to Rank 2. It first checks the index value corresponding to Rank 2 from the received fi_addr_t list. The operation then proceeds through the Endpoint fid_ep = 0xF1DAE01, which has been bound to the Address Vector Table fid_av = 0xF1DA701. Because the AV Table is bound to the Endpoint, the application does not need to know which specific AV object holds the mapping. The address resolution and forwarding logic are handled transparently by the Endpoint, allowing the application to perform communication using only the lightweight index references.
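The three calls can be combined into a short sketch; error handling is omitted, and the serialized Rank-to-FA list is assumed to be in the format expected by the provider.

#include <rdma/fabric.h>
#include <rdma/fi_domain.h>
#include <rdma/fi_endpoint.h>

int setup_av(struct fid_domain *domain, struct fid_ep *ep,
             const void *rank_fa_list, size_t nranks, fi_addr_t *fi_addrs)
{
    struct fi_av_attr av_attr = {
        .type  = FI_AV_TABLE,   /* indexed table, as in Figure 4-9 */
        .count = nranks,        /* expected number of entries      */
    };
    struct fid_av *av;

    fi_av_open(domain, &av_attr, &av, NULL);          /* 1: create          */
    fi_av_insert(av, rank_fa_list, nranks, fi_addrs,  /* 2: insert mappings */
                 0, NULL);                            /*    returns indices */
    return fi_ep_bind(ep, &av->fid, 0);               /* 3: bind to the EP  */
}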

This abstraction simplifies the design of distributed applications by separating address management from data transfer operations. Once the Endpoint and AV Table are properly linked, the application can perform communication using index-based references that remain valid for the lifetime of the established Address Vector. In practice, the AV Table remains active for as long as the associated Domain exists or until it is explicitly closed by the application. In dynamic environments where Rank membership may change, the AV can also be updated at runtime by reinserting or removing entries using the same insertion function. This allows the communication topology to evolve without reinitializing the entire fabric context.




Figure 4-9: Address Vector Table – Open, Insert & Bind to Endpoint.

Sunday, 12 October 2025

Ultra Ethernet: Creating Endpoint Object

Endpoint Creation and Operation

[Updated 12 October 2025: Figure & UET addressing section]

In libfabric and Ultra Ethernet Transport (UET), the endpoint, represented by the object fid_ep, serves as the primary communication interface between a process and the underlying network fabric. Every data exchange, whether it involves message passing, remote memory access (RMA), or atomic operations, ultimately passes through an endpoint. It acts as a software abstraction of the transport hardware, exposing a programmable interface that the application can use to perform high-performance data transfers.

Conceptually, an endpoint resembles a socket in the TCP/IP world. However, while sockets hide much of the underlying network stack behind a simple API, endpoints expose far more detail and control. They allow the process to define which completion queues to use, what capabilities to enable, and how multiple communication contexts are managed concurrently. This design gives applications, especially large distributed training frameworks and HPC workloads, direct control over latency, throughput, and concurrency in ways that traditional sockets cannot provide.

Furthermore, socket-based communication typically relies on the operating system’s networking stack and consumes CPU cycles for data movement and protocol handling. In contrast, endpoint communication paths can interact directly with the NIC, enabling user-space data transfers and RDMA operations that bypass the kernel and minimize CPU involvement.


Endpoint Types

Libfabric defines three endpoint types: active, passive, and scalable. Each serves a specific role in communication setup and operation and is created using a dedicated constructor function. Their configuration and behavior are largely determined by the information returned by fi_getinfo(), which populates a structure called fi_info. Within this structure, the subfields fi_ep_attr, fi_tx_attr, and fi_rx_attr define how the endpoint interacts with the provider and the transport hardware.



Endpoint Type       Constructor Function   Typical Use Case
Active Endpoint     fi_endpoint()          Actively sends and receives data once configured and enabled
Passive Endpoint    fi_passive_ep()        Listens for and accepts incoming connection requests
Scalable Endpoint   fi_scalable_ep()       Supports multiple transmit and receive contexts for concurrent communication

Table 4-1: Endpoint Types.


Active Endpoint

The active endpoint is the most common and versatile type. It is created using the fi_endpoint() function, which initializes an endpoint that can actively send and receive data once configured and enabled. The attributes describing its behavior are drawn from the provider’s fi_info structure returned by fi_getinfo(), which provides general provider capabilities and limits. The caps field specifies which operations the endpoint supports, such as message transmission (FI_SEND and FI_RECV), remote memory operations (FI_RMA), and atomic instructions (FI_ATOMIC).

The fi_ep_attr substructure defines the endpoint’s communication semantics. For example, FI_EP_MSG describes a reliable, connection-oriented model similar to TCP, whereas FI_EP_RDM represents a reliable datagram model with connectionless semantics. Active endpoints support only a single transmit and a single receive context. In fi_ep_attr, tx_ctx_cnt and rx_ctx_cnt are typically set to 0, which indicates that the default context should be used. In contrast, scalable endpoints can support multiple TX and RX contexts, enabling concurrent operations across several threads or contexts while sharing the same logical endpoint.

The fi_tx_attr and fi_rx_attr substructures define the operational behavior of the endpoint for sending and receiving. They specify which operations are supported (e.g., FI_SEND, FI_RMA), how many transmit or receive commands the provider should allocate resources for (size), and the message ordering guarantees (msg_order). In Figure 4-1, msg_order = 0 indicates no ordering guarantees, meaning the provider does not enforce any specific ordering of operations. These attributes guide the provider in configuring internal queues and hardware resources to meet the requested behavior.

From a transport perspective, active endpoints can operate either in a connection-oriented or connectionless mode. When configured as FI_EP_MSG, the endpoint behaves much like a TCP socket, requiring an explicit connection setup before data transfer can occur. When configured as FI_EP_RDM, the endpoint exchanges messages with peers directly through the fabric. Address resolution is handled through the Address Vector table, and the PIDonFEP ensures that packets reach the correct process on the host, eliminating the need for connection handshakes. This mode is particularly useful for distributed workloads, such as collective operations in AI training.
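A minimal sketch of opening such an endpoint, assuming domain and info were produced by earlier fi_domain() and fi_getinfo() steps and that info->ep_attr->type was set to FI_EP_RDM:

    #include <rdma/fi_endpoint.h>

    static int open_active_ep(struct fid_domain *domain, struct fi_info *info,
                              struct fid_ep **ep)
    {
        /* The provider sizes its queues and contexts from info's
         * fi_ep_attr, fi_tx_attr, and fi_rx_attr sub-structures. */
        return fi_endpoint(domain, info, ep, NULL);
    }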

Before an endpoint can be used, it must be bound to supporting objects such as completion queues, event queues, and the Address Vector. Binding and the role of the Address Vector are described in dedicated sections later.


Passive Endpoint

The passive endpoint, created with fi_passive_ep(), performs a role analogous to a listening socket in TCP. It does not transmit data itself but instead waits for incoming connection requests from remote peers. When a connection request arrives, it triggers an event in the associated Event Queue, which the application can monitor. Based on this event, the application can decide whether to accept or reject the request by calling fi_accept() or fi_reject().

Because passive endpoints exist solely for connection management, they always operate in FI_EP_MSG mode and do not perform data transmission. Consequently, they only need to bind to an Event Queue, which receives notifications of connection attempts, completions, and related state changes. Once a connection request is accepted, a corresponding active endpoint is created from the fi_info carried in the connection event, and that endpoint then sends and receives the data. In this process, the new peer's address becomes part of the connected endpoint's state, ensuring that subsequent communication can occur transparently.
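The condensed sketch below strings these steps together; error handling and completion-queue bindings are omitted for brevity, and all handles are assumed to come from earlier object-creation calls.

    #include <rdma/fi_endpoint.h>
    #include <rdma/fi_eq.h>
    #include <rdma/fi_cm.h>

    static void listen_and_accept(struct fid_fabric *fabric, struct fi_info *info,
                                  struct fid_domain *domain, struct fid_eq *eq)
    {
        struct fid_pep *pep;
        struct fid_ep *ep;
        struct fi_eq_cm_entry entry;
        uint32_t event;

        fi_passive_ep(fabric, info, &pep, NULL);   /* listening object */
        fi_pep_bind(pep, &eq->fid, 0);             /* EQ receives CM events */
        fi_listen(pep);

        /* Block until a connection request (FI_CONNREQ) arrives. */
        fi_eq_sread(eq, &event, &entry, sizeof(entry), -1, 0);

        /* Build the data-carrying active endpoint from the request's
         * fi_info, then accept the connection. */
        fi_endpoint(domain, entry.info, &ep, NULL);
        fi_ep_bind(ep, &eq->fid, 0);               /* CQ bindings omitted */
        fi_enable(ep);
        fi_accept(ep, NULL, 0);
    }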


Scalable Endpoint

The scalable endpoint represents the most advanced and parallelized endpoint type. It is constructed with fi_scalable_ep() and is designed for applications that require massive concurrency, such as distributed AI training jobs running on multiple GPUs or multi-threaded inference servers.

Unlike a regular endpoint, a scalable endpoint can be subdivided into multiple transmit (TX) and receive (RX) contexts. Each context acts like an independent communication lane with its own completion queues and state, allowing multiple threads to perform concurrent operations without locking or contention. In essence, scalability in this context means that a single endpoint object can represent many lightweight communication contexts that share the same underlying hardware resources.

The number of transmit and receive contexts for a scalable endpoint is specified in fi_ep_attr using tx_ctx_cnt and rx_ctx_cnt. The provider validates the requested counts against the domain’s capabilities (max_ep_tx_ctx and max_ep_rx_ctx), ensuring that the requested contexts do not exceed the hardware limits. The capabilities field (caps) indicates support for features such as FI_SHARED_CONTEXT and FI_MULTI_RECV, which are required for efficient scalable endpoint operation.

From a conceptual point of view, this arrangement can be compared to a single process maintaining multiple sockets, each used by a separate thread to communicate independently. However, scalable endpoints achieve this concurrency without the overhead of creating and maintaining separate endpoints per thread. Each context binds to its own completion queue, enabling parallel operation, while the scalable endpoint as a whole binds only once to the Event Queue and Address Vector. This design allows high-throughput, low-latency communication suitable for GPU-parallelized data exchange, where multiple communication streams operate concurrently between the same set of nodes.
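A sketch under these assumptions: NTX is an illustrative context count that must not exceed the domain's max_ep_tx_ctx, and each returned TX context would subsequently bind to its own completion queue.

    #include <rdma/fi_endpoint.h>

    #define NTX 4   /* illustrative per-thread transmit-context count */

    static int open_scalable_ep(struct fid_domain *domain, struct fi_info *info,
                                struct fid_ep **sep, struct fid_ep *tx[NTX])
    {
        int ret = fi_scalable_ep(domain, info, sep, NULL);
        if (ret)
            return ret;

        /* Carve out one independent TX lane per worker thread; passing
         * NULL for the attributes requests the provider defaults. */
        for (int i = 0; i < NTX; i++) {
            ret = fi_tx_context(*sep, i, NULL, &tx[i], NULL);
            if (ret)
                return ret;
        }
        return 0;
    }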


UET Endpoint Address

When a UET endpoint is created, it receives a UET Endpoint Address, which uniquely identifies it within the fabric. Unlike a simple IP or MAC-style address, this is a composite structure that encodes several critical attributes of the endpoint and its execution context. The fields defined in UET Specification v1.0 include Version, Flags, Fabric Endpoint Capabilities, PIDonFEP, Fabric Address (FA), Start Resource Index, Num Resource Indices, and Initiator ID. Together, these fields describe both where the endpoint resides and how it interacts with other endpoints across the transport layer.

Address Assignment Process

The UET address assignment is a multi-stage process that completes when the endpoint is opened and the final address is retrieved. During the fi_ep_open() call, the provider uses the information from the fi_info structure and its substructures, including the transmit and receive attributes, to form the basic addressing context. At this stage, the provider contacts the UET kernel driver and finalizes the internal endpoint address by adding UET-specific fields such as the JobID, PIDonFEP, Initiator ID, and resource indices.

Although these fields are now part of the provider’s internal address structure, the application does not yet have access to them. To retrieve the complete and fully resolved UET address, the application must call the fi_getname() API. This call returns the composite address containing all the fields that were assigned during the fi_ep_open() operation.

In summary, fi_ep_open() finalizes the endpoint’s address internally, while fi_getname() exposes this finalized address to the application.
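The snippet below sketches this retrieval; the 128-byte buffer is an assumed size, and the comment merely names the composite fields listed above rather than reproducing the exact encoding from the specification.

    #include <rdma/fi_cm.h>

    static int read_uet_address(struct fid_ep *ep)
    {
        char buf[128];                 /* assumed large enough for FI_ADDR_UET */
        size_t len = sizeof(buf);

        int ret = fi_getname(&ep->fid, buf, &len);
        if (ret == -FI_ETOOSMALL) {
            /* len now holds the required size; reallocate and retry */
        }
        /* On success, buf carries the finalized composite address:
         * Version, Flags, Capabilities, PIDonFEP, FA, Resource Indices,
         * Initiator ID. */
        return ret;
    }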

Version and Flags

The Version field ensures compatibility between different revisions of the UET library or NIC firmware, allowing peers to interpret the endpoint address structure correctly.

The Flags field defines both the validity of other fields in the endpoint address and the addressing behavior of the endpoint. It indicates whether the Fabric Endpoint Capabilities, PIDonFEP, Resource Index, and Initiator ID fields are valid. It also indicates which addressing mode is used (relative or absolute), the type of address in the Fabric Address field (IPv4 or IPv6), and whether the maximum message size is limited to the domain object's (NIC) MTU.

From a transport perspective, the flags make the endpoint address self-descriptive: peers can parse it to know which fields are valid and how messages should be addressed without additional negotiation. This is especially important in large distributed AI clusters, where thousands of endpoints are initialized and need to communicate efficiently.

Fabric Endpoint Capabilities

The Fabric Endpoint Capabilities field complements the flags by describing what the endpoint can actually do. It indicates the supported operational profile of the endpoint. In UET, three profiles are defined:

  • AI Base Profile: Provides the minimal feature set for distributed AI workloads, including reliable messaging, basic RDMA operations, and standard memory registration.
  • AI Full Profile: Extends AI Base with advanced features, such as optimized queue handling and SES header enhancements, which are critical for large-scale, high-throughput AI training.
  • HPC Profile: Optimized for low-latency, synchronization-heavy workloads typical in high-performance computing.

Each profile requires specific capabilities and parameters to be set in the fi_info structure, which the UET provider returns during fi_getinfo(). These include supported operations (such as send/receive, RDMA, or atomics), maximum message sizes, queue depths, memory registration limits, and other provider-specific settings. 

For a complete list of per-profile requirements, see Table 2‑4, “Per-Profile Libfabric Parameter Requirements,” in the UET Specification v1.0.


PIDonFEP – Process Identifier on Fabric Endpoint

Every Fabric Endpoint (FEP) managed by a UET NIC has a Process Identifier on FEP (PIDonFEP), which distinguishes different processes or contexts sharing the same NIC. For example, each PyTorch or MPI rank can create one or more endpoints, and each endpoint’s PIDonFEP ensures that the NIC can demultiplex incoming packets to the correct process context.

PIDonFEP is analogous to an operating system PID but scoped to the fabric device rather than the OS. It is also critical for resource isolation: multiple AI training jobs can share the same NIC hardware while maintaining independent endpoint contexts.


Fabric Address (FA)

The NIC’s Fabric Address (FA) is used for network-level routing and is inserted into the IP header so that packets are delivered to the correct NIC over the backend network. Once a packet reaches the NIC, the PIDonFEP value, carried within the Session header in the data plane, identifies the target process or endpoint on that host. Together, FA and PIDonFEP form a globally unique identifier for the endpoint within the UET fabric, allowing multiple processes and workloads to share the same NIC without conflicts.

The mode of the address (relative or absolute) and the address type (IPv4 or IPv6) are defined by flag bits. Relative addresses, meaningful only within the scope of a job or endpoint group, are commonly used in AI and distributed training workloads to simplify routing and reduce overhead.

Note: Relative addresses typically include a Job Identifier to distinguish endpoints belonging to different jobs, in addition to the PIDonFEP and, optionally, an index. This ensures uniqueness within the job while keeping addresses lightweight.

Note: Details of data plane encapsulation, including the structure of headers and how fields like PIDonFEP are carried, are explained in upcoming chapters on data transport.


Start Resource Index and Num Resource Indices

Each endpoint is allocated a range of Resource Indices. The Start Resource Index indicates the first index assigned to the endpoint, and Num Resource Indices specifies how many consecutive indices are reserved.

This range acts as a local namespace for all objects associated with the endpoint, including memory regions, completion queues, and transmit/receive contexts. When a remote peer performs an operation such as an RMA write (fi_write()) targeting a specific index, the NIC can resolve the correct resource entirely in hardware without software intervention.

For example, if an endpoint’s Start Resource Index is 300 and Num Resource Indices is 64, all local resources occupy indices 300–363. A remote write to index 312 is immediately mapped to the appropriate memory region or queue, enabling high-throughput, low-latency operations required in large-scale AI training clusters.
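The check itself is trivial, which is what makes full hardware offload feasible. The following conceptual sketch (illustration only, not provider code) uses the numbers from the example above.

    #include <stdbool.h>
    #include <stdint.h>

    #define START_RI 300   /* Start Resource Index from the example */
    #define NUM_RI    64   /* Num Resource Indices from the example */

    static bool ri_is_local(uint32_t idx)
    {
        /* Indices 300..363 belong to this endpoint, so a remote write
         * to index 312 resolves directly to a local resource entry. */
        return idx >= START_RI && idx < START_RI + NUM_RI;
    }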


Initiator ID and Job Identification

The Initiator ID provides an optional identifier for the training job or distributed session owning the endpoint. In AI frameworks, this often corresponds to a Job ID, grouping multiple ranks within the same session. In relative addressing mode, the Job ID forms part of the endpoint address, ensuring that endpoints belonging to different jobs remain distinct even if they share the same NIC and PIDonFEP indices.

Note: Implementation details, including how the Job ID is carried in fi_ep_attr via the auth_key field when auth_key_size = 3, are explained in the upcoming transport chapter.

As described earlier, UET address assignment is a multi-stage process. Figure 4‑7 illustrates the basic flow of how a complete UET address is assigned to an endpoint. While the figure shows only the fi_ep_open() API call, the final fid_ep object box provides an illustrative example of the fully resolved endpoint after the application has also called fi_getname() to retrieve the complete provider-specific UET address.

The focus of this section has been on opening an endpoint, rather than detailing the full addressing schema. Later, in the packet transport chapter, we will examine the UET address structure and its role in data plane operations in greater depth.





Figure 4-7: Objects Creation Process – Endpoint (EP).

Wednesday, 8 October 2025

Ultra Ethernet: Fabric Object – What It Is and How It Is Created

Fabric Object


Fabric Object Overview

In libfabric, a fabric represents a logical network domain, a group of hardware and software resources that can communicate with each other through a shared network. All network ports that can exchange traffic belong to the same fabric domain. In practice, a fabric corresponds to one interconnected network, such as an Ethernet or Ultra Ethernet Transport (UET) fabric.

A good way to think about a fabric is to compare it to a Virtual Data Center (VDC) in a cloud environment. Just as a VDC groups together compute, storage, and networking resources into an isolated logical unit, a libfabric fabric groups together network interfaces, addresses, and transport resources that belong to the same communication context. Multiple fabrics can exist on the same system, just like multiple VDCs can operate independently within one cloud infrastructure.

The fabric object acts as the top-level context for all communication. Before an application can create domains, endpoints, or memory regions, it must first open a fabric using the fi_fabric() call. This creates the foundation for all other libfabric objects.

Each fabric is associated with a specific provider, for example libfabric-uet, which defines how the fabric interacts with the underlying hardware and network stack. Once created, the fabric object maintains provider-specific state, hardware mappings, and resource visibility for all subsequent objects created under it.

For the application, the fabric object is simply a handle to a network domain that other libfabric calls will use. For the provider, it is the root structure that connects all internal data structures and controls how communication resources are managed within the same network fabric.

The following section explains how the application requests a fabric object and how the provider and libfabric core work together to create and publish it.


Creating Fabric Object

After the UET provider populates the fi_info structures for each NIC/port combination during discovery, the application can begin creating objects. It first consults the fi_info list to identify the entry that best matches its requirements. Figure 4-3 shows an illustrative example of how the application calls fi_fabric() to request a fabric object based on the fi_fabric_attr sub-structure of fi_info[0] corresponding to NIC Eth0.

Once the application issues the API call, the request is handed over to the libfabric core, which acts as a lightweight coordinator and translator between the application and the provider.

The UET provider receives the request via the uet_fabric function pointer. It may first verify that the requested fabric name is still valid and supported by NIC 0 before consulting the fi_fabric_attr structure specified in the API call. The provider then creates the fabric object, defining its type, fabric ID (fid), provider, and fabric name.
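A minimal sketch of this call, assuming info points to the fi_info[0] entry selected for NIC Eth0 during discovery:

    #include <rdma/fabric.h>

    static int open_fabric(struct fi_info *info, struct fid_fabric **fabric)
    {
        /* The provider validates fabric_attr (provider and fabric name)
         * and returns the application-visible fid_fabric handle. */
        return fi_fabric(info->fabric_attr, fabric, NULL);
    }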


Figure 4-3: Objects Creation Process – Fabric for Cluster.

Memory Allocation and Object Publication


In libfabric, memory for fabric objects — as well as for all other objects created later, such as domains, endpoints, and memory regions — is allocated by the provider, not the libfabric core. The provider is responsible for creating the actual data structures that represent these objects and for mapping them to underlying hardware resources. This ensures that the object’s state, capabilities, and hardware associations are maintained correctly and consistently.

When the provider creates a fabric object, it allocates memory in its own address space and initializes all internal fields, including type, fabric ID (fid), provider-specific metadata, and associated NIC resources. Once the object is fully initialized, the provider returns a pointer to the libfabric core. This pointer effectively tells the core the location of the object in memory.

The libfabric core then wraps this provider pointer in a lightweight descriptor called fid_fabric, which acts as the application-visible handle for the fabric object. This descriptor contains metadata and a reference to the provider-managed object, allowing the core to track and route subsequent API calls correctly without duplicating the object. The core stores the fid_fabric handle in its internal tables, enabling fast lookup and validation whenever the application references the fabric in later calls.

Finally, the libfabric core returns the fid_fabric handle to the application. From the application's perspective, this handle uniquely identifies the fabric object, while internally the provider maintains the persistent state and hardware mappings.


Tuesday, 7 October 2025

Ultra Ethernet: Discovery

Updated 8 October 2025

Creating the fi_info Structure

Before the application can discover what communication services are available, it first needs a way to describe what it is looking for. This description is built using a structure called fi_info. The fi_info structure acts like a container that holds the application’s initial requirements, such as desired endpoint type or capabilities.

The first step is to reserve memory for this structure in the system’s main memory. The fi_allocinfo() helper function does this for the application. When called, fi_allocinfo() allocates space for a new fi_info structure, which this book refers to as the pre-fi_info, that will later be passed to the libfabric core for matching against available providers.

At this stage, most of the fields inside the pre-fi_info structure are left at their default values. The application typically sets only the most relevant parameters that express what it needs, such as the desired endpoint type or provider name, and leaves the rest for the provider to fill in later.

In addition to the main fi_info structure, the helper function also allocates memory for a set of sub-structures. These describe different parts of the communication stack, including the fabric, domain, and endpoints. The fixed sub-structures include:

  • fi_fabric_attr: Describes the fabric-level properties such as the provider’s name and fabric identifier.
  • fi_domain_attr: Defines attributes related to the domain, including threading and data progress behavior.
  • fi_ep_attr: Specifies endpoint-related characteristics such as endpoint type.
  • fi_tx_attr and fi_rx_attr: Contain parameters for transmit and receive operations, such as message ordering and completion behavior.
  • fi_nic: Represents the properties and capabilities of the UET network interface.

In figure 4-1, these are labeled as fixed sub-structures because their layout and meaning are always the same. They consist of predefined fields and expected value types, which makes them consistent across different applications. Like the main fi_info structure, they usually remain at their default values until the provider fills them in. The information stored in these sub-structures will later be leveraged when the application begins creating actual fabric objects, such as domains and endpoints.

In addition to the fixed parts, the fi_info structure can contain generic sub-structures such as src_addr. Unlike the fixed ones, the generic sub-structure src_addr depends on the chosen addr_format. For example, when using Ultra Ethernet Transport, the address field points to a structure describing a UET endpoint address, which includes fields for Version, Flags, Fabric Endpoint Capabilities, PIDonFEP, Fabric Address, Start Resource Index, Num Resource Indices, and Initiator ID. This compact representation carries both addressing and capability information, allowing the same structure definition to be reused across different transport technologies and addressing schemes. Note that in Figure 4-2 the returned src_addr is only partially filled because the complete address information is not available until the endpoint is created.

In Figure 4-1, the application defines its communication requirements in the fi_info structure by setting the caps (capabilities) field. This field describes the types of operations the application intends to perform through the fabric interface. For example, values such as FI_MSG, FI_RMA, FI_WRITE, FI_REMOTE_WRITE, FI_COLLECTIVE, FI_ATOMIC, and FI_HMEM specify support for message-based communication, remote memory access, atomic operations, and host memory extensions.

When the fi_getinfo() call is issued, the provider compares these requested capabilities against what the underlying hardware and driver can support. Only compatible providers return a matching fi_info structure.

In this example, the application also sets the addr_format field to FI_ADDR_UET, indicating that Ultra Ethernet Transport endpoint addressing is used. This format includes hardware-specific addressing details beyond a simple IP address.

The current Ultra Ethernet Transport specification v1.0 does not define or support the FI_COLLECTIVE capability. Therefore, the UET provider does not return this flag, and collective operations are not offloaded or accelerated by the UET NIC.

After fi_allocinfo() has allocated memory for both the fixed and generic sub-structures, it automatically links them together by inserting pointers into the main fi_info structure. The application can then easily access each attribute through fi_info without manually handling the memory layout. 
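A sketch of this preparation step follows; the capability set mirrors the example above, FI_ADDR_UET is the UET address format named in the text (the exact constant comes from the UET libfabric extensions), and the reliable-datagram endpoint type is an assumption about the workload.

    #include <rdma/fabric.h>

    static struct fi_info *make_hints(void)
    {
        struct fi_info *hints = fi_allocinfo();   /* allocates all sub-structures */
        if (!hints)
            return NULL;

        hints->caps = FI_MSG | FI_RMA | FI_WRITE | FI_REMOTE_WRITE |
                      FI_ATOMIC | FI_HMEM;        /* no FI_COLLECTIVE in UET v1.0 */
        hints->addr_format = FI_ADDR_UET;         /* UET endpoint addressing */
        hints->ep_attr->type = FI_EP_RDM;         /* assumed: reliable datagram */
        return hints;
    }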

Once the structure is prepared, the next step is to request matching provider information with the fi_getinfo() API call, which is described in detail in the following section.

Figure 4-1: Discovery: Allocate Memory and Create Structures – fi_allocinfo.

Requesting Provider Services with fi_getinfo()


After creating a pre-fi_info structure, the application calls the fi_getinfo() API to discover which services and transport features are available on the node’s NIC(s). This function takes a pointer to the pre-fi_info structure, which contains hints describing the application’s requirements, such as desired capabilities and address format.

When the discovery request reaches the libfabric core, the library identifies and loads an appropriate provider from the available options, which may include providers for TCP, verbs (InfiniBand), or sockets (TCP/UDP). For Ultra Ethernet Transport, the core selects the uet-provider. The core invokes the provider's entry points, including the .getinfo callback, which is responsible for returning the provider's supported capabilities. Internally, this callback is wired to the provider's uet_getinfo function pointer.

Inside the UET provider, the uet_getinfo() routine queries the NIC driver or kernel interface to determine what capabilities each NIC can safely and efficiently support. The provider does not access the hardware directly. For multi-GPU AI workloads, the focus is on push-based remote memory access operations:

  • FI_MSG: Used for standard message-based communication.
  • FI_RMA: Enables direct remote memory access, forming the foundation for high-performance gradient or parameter transfers between GPUs.
  • FI_WRITE: Allows a GPU to push local gradient updates directly into another GPU’s memory.
  • FI_REMOTE_WRITE: Signals that remote GPUs can write directly into this GPU’s memory, supporting push-based collective operations.
  • FI_COLLECTIVE: Indicates support for collective operations like AllReduce, though the current UET specification does not implement this capability.
  • FI_ATOMIC: Allows atomic operations on remote memory.
  • FI_HMEM: Marks support for host memory or GPU memory extensions.

Low-level hardware metrics, such as link speed or MTU, are not returned at this phase; the focus is on semantic capabilities that the application can rely on. 

The provider allocates new fi_info structures in CPU memory, creating one structure per NIC that satisfies the hints provided by the application and describes all other supported services.

After the provider has created these structures, libfabric returns them to the application as a linked list. The next pointer links all available fi_info structures, allowing the application to iterate over the discovered NICs. Each fi_info entry contains both top-level fields, such as caps and addr_format, and several attribute sub-structures—fi_fabric_attr, fi_domain_attr, fi_ep_attr, fi_tx_attr, and fi_rx_attr.

Even if the application provides no hints, the provider fills in these attribute groups with its default or supported values. This ensures that every fi_info structure returned by fi_getinfo() contains a complete description of the provider’s capabilities and configuration options. During object creation, the libfabric core passes these attributes to the provider, which uses them to map the requested fabric, domain, and endpoint objects to the appropriate NIC and transport configuration.


The application can then select the most appropriate entry and request the creation of the Fabric object. Creating the Fabric first establishes the context in which all subsequent domains, sub-resources, and endpoints are organized and managed. Once all pieces of the AI Fabric “jigsaw puzzle” abstraction have been created and initialized, the application can release the memory for all fi_info structures in the linked list by calling fi_freeinfo().
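The sketch below condenses the whole discovery exchange; the API version number is illustrative, and the selection logic inside the loop is left to the application.

    #include <rdma/fabric.h>

    static struct fi_info *discover(struct fi_info *hints)
    {
        struct fi_info *list = NULL;

        if (fi_getinfo(FI_VERSION(2, 0), NULL, NULL, 0, hints, &list))
            return NULL;

        /* Walk the linked list: one fi_info entry per matching NIC/port. */
        for (struct fi_info *cur = list; cur; cur = cur->next) {
            /* inspect cur->caps, cur->addr_format, cur->fabric_attr->name
             * and keep the entry that best matches the workload */
        }
        return list;   /* caller releases with fi_freeinfo(list) when done */
    }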





Figure 4-2: Discovery: Discover Provider Capabilities – fi_getinfo.


Note: Fields and values of fi_info and its sub-structures are explained in upcoming chapters.

Next, Fabric Object...