Wednesday, 15 October 2025

Ultra Ethernet: Address Resolution with Address Vector Table

Address Vector


Overview

To enable Remote Memory Access (RMA) operations between processes, each endpoint — representing a communication channel much like a TCP socket — must know the destination process’s location within the fabric. This location is represented by the Fabric Address (FA) assigned to a Fabric Endpoint (FEP).

During job initialization, FAs are distributed through a control-plane–like procedure in which the master rank collects FAs from all ranks and then broadcasts the complete Rank-to-FA mapping to every participant (see Chapter 3 for details). Each process stores this Rank–FA mapping locally as a structure, which can then be inserted into the Address Vector (AV) Table.

When FAs from the distributed Rank-to-FA table are inserted into the AV Table, the provider assigns each entry an index number, which is published to the application as an fi_addr_t handle. After an endpoint object is bound to the AV Table, the application uses this handle — rather than the full address — when referencing a destination process. This abstraction hides the underlying address structure from the application and allows fast and efficient lookups during communication.

This mechanism resembles the functionality of a BGP Route Reflector (RR) in IP networks. Each RR client advertises its best routes, along with the associated Path Attributes, from its local BGP table. The RR collects these routes and redistributes them to other RR clients and eBGP neighbors. Upon receiving the updates, each BGP process validates the routes and installs the eligible entries into its local BGP table, from which the best routes are selected for the routing table. Similarly, in UET, the master rank collects all Fabric Addresses from participating processes, broadcasts the Rank-to-FA mapping, and each process inserts the received entries into its local Address Vector table.

Figure 4-8 illustrates how the complete Rank-to-FA mapping information is distributed across all ranks. First (1a-c), each rank uses a control channel connection to send its FA along with its Rank identifier to the master rank (see Chapter 3 for details). After receiving the Rank-to-FA information from all expected ranks, the master rank builds a complete mapping table that stores all Rank-to-FA mappings. Finally, it broadcasts this table to all ranks over the control channel connection (2).
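The gather-then-broadcast exchange can be sketched as a small in-memory model. This is only an illustration of the flow in Figure 4-8, not UET or libfabric code: the MasterRank class, its method names, and the address assumed for Rank 0 are all hypothetical, and a real job would carry these messages over the control channel rather than through direct method calls.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class MasterRank:
    """Toy model of the master rank's side of the control-channel exchange."""
    expected_ranks: int
    table: Dict[int, str] = field(default_factory=dict)

    def receive(self, rank: int, fabric_address: str) -> None:
        # Steps 1a-c: one rank reports its (Rank, FA) pair.
        self.table[rank] = fabric_address

    def is_complete(self) -> bool:
        # The master waits until every expected rank has reported.
        return len(self.table) == self.expected_ranks

    def broadcast(self) -> Dict[int, str]:
        # Step 2: hand out a copy of the full mapping, ordered by rank.
        if not self.is_complete():
            raise RuntimeError("still waiting for ranks to report")
        return dict(sorted(self.table.items()))


# Ranks 1-3 use the FAs from the AV example later in this section;
# the address for Rank 0 is an assumption made for this sketch.
reports = [(2, "10.0.0.12"), (0, "10.0.0.11"), (3, "10.0.1.12"), (1, "10.0.1.11")]

master = MasterRank(expected_ranks=len(reports))
for rank, fa in reports:          # reports may arrive in any order
    master.receive(rank, fa)

rank_to_fa = master.broadcast()   # what every rank ends up holding
print(rank_to_fa)
```

Sorting by rank before broadcasting is a design choice of this sketch: it means every process later inserts the addresses in the same order, so a rank number and its table position line up.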

Once each rank receives the complete Rank-to-FA mapping from the master rank, these entries must be inserted into a local Address Vector (AV) Table. The AV Table provides a compact, indexed representation of remote Fabric Addresses that can be efficiently used by the application during data transfer. The following section describes in detail how the AV Table is created, how mapping entries are inserted, and how endpoints are bound to the table to enable fast and transparent address resolution for communication operations.


Figure 4-8: Fabric Address Distribution Process.

Constructing Address Vector Table


In distributed AI applications, each process or Rank must be able to reach other Ranks using their corresponding Fabric Addresses. The fabric provider uses a predefined lookup structure called the Address Vector (AV) to manage this mapping. The AV is a Fabric object that stores associations between logical identifiers—such as Rank indices—and their corresponding Fabric Addresses. This allows applications to reference remote endpoints through compact index values instead of full addresses, enabling efficient, low-latency address resolution entirely within user space.

The AV Table is created by calling the fi_av_open() function, as illustrated in Figure 4-9. This function initializes an empty Address Vector Table and returns a handle to the newly created object, here represented as fid_av = 0xF1DA701. In Figure 4-9, the fi_av_attr structure defines the attributes of the object. The type field is set to FI_AV_TABLE, which is the most commonly used AV type in AI applications, while the count field tells the provider how many address entries the application expects to insert into the table. The fi_av_open() call therefore completes the creation of a blank AV Table that is ready to receive mapping entries.

After receiving the complete Rank-to-FA mapping list from the master Rank, each process populates its previously created Address Vector (AV) table. The application does this using the fi_av_insert() function, which inserts the Rank-to-FA mappings into the AV Table. In this example, multiple Fabric Addresses are inserted into the previously created AV Table identified by fid_av = 0xF1DA701. The addresses to be inserted are defined in the addr field, and the count field specifies how many entries are included. During the insertion, the fabric library assigns an index value for each entry and returns these indices through the fi_addr array. Each returned index, of type fi_addr_t, represents a compact reference to a remote endpoint. For example, FA 10.0.1.11 associated with Rank 1 receives index value fi_addr_1, FA 10.0.0.12 for Rank 2 receives fi_addr_2, and FA 10.0.1.12 for Rank 3 receives fi_addr_3. These index values are later used by the application to identify destinations during communication. Instead of storing full Fabric Addresses, the application relies on these short index values, while the underlying address resolution is handled automatically by the fabric library against the entries in the AV Table.
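To make the open-and-insert steps concrete, here is a minimal Python model of a table-style address vector. It is a stand-in for the behaviour described above, not the libfabric API: the AddressVectorTable class and its methods are hypothetical, the Rank 0 address is assumed, and the zero-based, insertion-order index assignment is a property of this model.

```python
from typing import List


class AddressVectorTable:
    """Toy stand-in for a table-type Address Vector (not libfabric itself)."""

    def __init__(self, count: int) -> None:
        self.count = count              # expected number of entries (a sizing hint here)
        self._entries: List[str] = []   # position in this list == index handle

    def insert(self, addrs: List[str]) -> List[int]:
        """Store each Fabric Address and return the index assigned to it."""
        handles = []
        for fa in addrs:
            self._entries.append(fa)
            handles.append(len(self._entries) - 1)
        return handles

    def lookup(self, handle: int) -> str:
        """Resolve an index handle back to the full Fabric Address."""
        return self._entries[handle]


# Rank-to-FA list as received from the master rank, ordered by rank.
# Rank 0's address is hypothetical; Ranks 1-3 match the example above.
rank_to_fa = ["10.0.0.11", "10.0.1.11", "10.0.0.12", "10.0.1.12"]

av = AddressVectorTable(count=len(rank_to_fa))   # plays the role of fi_av_open()
fi_addr = av.insert(rank_to_fa)                  # plays the role of fi_av_insert()

print(fi_addr)                 # one compact handle per inserted address
print(av.lookup(fi_addr[2]))   # the handle for Rank 2 resolves to its FA
```

From this point on the application only needs to keep the small fi_addr list; the full addresses live inside the table.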

Before the AV Table can be used for communication, it must be associated with an Endpoint. This binding is established by calling the fi_ep_bind() function (step 3). In this step, the Endpoint handle fid_ep = 0xF1DAE01 is bound to the Address Vector object fid_av = 0xF1DA701. Once the binding is complete, the Endpoint can use the AV Table for address lookups during message or RMA operations. This linkage ensures that when a data transfer is initiated, the Endpoint automatically uses the correct Address Vector for destination resolution.

When all three functions (fi_av_open(), fi_av_insert(), and fi_ep_bind()) have been executed, the application can use the index values in data transfer operations. Figure 4-9 illustrates the process. The application initiates an RMA operation to send data to Rank 2. It first looks up the index value for Rank 2 in the fi_addr_t array returned by fi_av_insert(). The operation then proceeds through the Endpoint fid_ep = 0xF1DAE01, which has been bound to the Address Vector Table fid_av = 0xF1DA701. Because the AV Table is bound to the Endpoint, the application does not need to name the AV object on each operation. The address resolution and forwarding logic are handled transparently by the Endpoint, allowing the application to perform communication using only the lightweight index references.
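The binding step and the lookup it enables can be modelled in a few lines. Again, this is a hedged sketch rather than libfabric code: the Endpoint and AddressVector classes, the bind() and write() methods, and the Rank 0 address are assumptions made for illustration, and write() only records what would be sent instead of performing a real RMA operation.

```python
from typing import List, Optional, Tuple


class AddressVector:
    """Minimal index -> Fabric Address table (hypothetical model)."""

    def __init__(self, addrs: List[str]) -> None:
        self._entries = list(addrs)

    def resolve(self, handle: int) -> str:
        return self._entries[handle]


class Endpoint:
    """Toy endpoint that resolves destinations through its bound AV."""

    def __init__(self) -> None:
        self._av: Optional[AddressVector] = None
        self.sent: List[Tuple[str, bytes]] = []   # (resolved FA, payload) log

    def bind(self, av: AddressVector) -> None:
        # Plays the role of fi_ep_bind(): attach the AV to this endpoint.
        self._av = av

    def write(self, payload: bytes, dest: int) -> None:
        # The caller passes only the index handle; the endpoint resolves it.
        if self._av is None:
            raise RuntimeError("endpoint has no Address Vector bound")
        self.sent.append((self._av.resolve(dest), payload))


# Rank 0's address is hypothetical; Ranks 1-3 follow the example above.
av = AddressVector(["10.0.0.11", "10.0.1.11", "10.0.0.12", "10.0.1.12"])
fi_addr_by_rank = {rank: rank for rank in range(4)}   # handle == insertion order

ep = Endpoint()
ep.bind(av)
ep.write(b"gradients", fi_addr_by_rank[2])   # send to Rank 2 by handle only
print(ep.sent)
```

Note that write() never receives the AV as an argument. That is the point of the binding: the endpoint already knows which table to consult.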

This abstraction simplifies the design of distributed applications by separating address management from data transfer operations. Once the Endpoint and AV Table are properly linked, the application can perform communication using index-based references that remain valid for the lifetime of the established Address Vector. In practice, the AV Table remains active for as long as the associated Domain exists or until it is explicitly closed by the application. In dynamic environments where Rank membership may change, the AV can also be updated at runtime: new entries are added with fi_av_insert(), and stale entries are removed with the separate fi_av_remove() call. This allows the communication topology to evolve without reinitializing the entire fabric context.
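A runtime membership change can be sketched the same way. In libfabric, removal is its own call (fi_av_remove()) rather than a mode of insertion, so the model below keeps the two operations separate. Everything else here is an assumption for illustration: the DynamicAddressVector class, the choice to leave a removed slot empty instead of reusing its index, and the address 10.0.1.13 given to the replacement rank.

```python
from typing import List, Optional


class DynamicAddressVector:
    """Toy AV that supports inserting and removing entries at runtime."""

    def __init__(self) -> None:
        self._entries: List[Optional[str]] = []

    def insert(self, addrs: List[str]) -> List[int]:
        # New entries are appended; earlier handles keep their meaning.
        start = len(self._entries)
        self._entries.extend(addrs)
        return list(range(start, start + len(addrs)))

    def remove(self, handles: List[int]) -> None:
        # A removed slot is left empty in this model (index reuse is not modelled).
        for handle in handles:
            self._entries[handle] = None

    def lookup(self, handle: int) -> str:
        fa = self._entries[handle]
        if fa is None:
            raise KeyError(f"handle {handle} was removed")
        return fa


av = DynamicAddressVector()
handles = av.insert(["10.0.1.11", "10.0.0.12", "10.0.1.12"])

av.remove([handles[1]])                 # the rank at 10.0.0.12 leaves the job
new_handles = av.insert(["10.0.1.13"])  # a replacement rank joins (made-up FA)

print(handles, new_handles)
print(av.lookup(handles[2]))            # untouched entries still resolve
```

The application has to drop any handle it removed; using it afterwards is an error in this model.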




Figure 4-9: Address Vector Table: Open, Insert, and Bind to Endpoint.
