Address Vector (AV)
The Address Vector (AV) is a provider-managed mapping that connects remote fabric addresses to compact integer handles (fi_addr_t) used in communication operations. Unlike a routing table, the AV does not store IP-to-device mappings. Instead, it converts an opaque Fabric Address (FA), which may contain IP, port, and transport-specific identifiers, into a simple handle that endpoints can use for sending and receiving messages. The application never needs to reference raw IP addresses directly.
Phase 1: Application – Request & Definition
The application begins by requesting an Address Vector (AV) through the fi_av_open() call:
int fi_av_open(struct fid_domain *domain, struct fi_av_attr *attr, struct fid_av **av, void *context);
Before making the call, it defines the desired AV properties in a fi_av_attr structure:
struct fi_av_attr av_attr = {
    .type        = FI_AV_TABLE,
    .count       = 16,
    .rx_ctx_bits = 0,
    .ep_per_node = 1,
    .name        = "my_av",
    .map_addr    = NULL,
    .flags       = 0
};
Example 4-1: structure fi_av_attr.
fi_av_attr Fields
type: Specifies the type of Address Vector. In Ultra Ethernet, the most common choice is FI_AV_TABLE, which organizes peer addresses in a simple, contiguous table. Each inserted Fabric Address (FA) receives a sequential fi_addr_t starting from zero. The type determines how the provider manages and looks up peer addresses during runtime.
count: Indicates the expected number of addresses that will be inserted into the AV. The provider uses this value to optimize resource allocation. It is primarily a sizing hint rather than a hard maximum; most providers grow the AV dynamically, although some may treat it as an upper bound and fail insertions beyond it. The application should therefore set this field according to the anticipated communication pattern, ensuring the AV is sized appropriately without over-allocating memory.
Receive Context Bits (rx_ctx_bits): The rx_ctx_bits field is relevant only for scalable endpoints, which can expose multiple independent receive contexts. Normally, when an address is inserted into the Address Vector (AV), the provider returns a compact handle (fi_addr_t) that identifies the peer. If the peer has multiple receive contexts, the handle must also encode which specific receive context the application intends to target.
The rx_ctx_bits field specifies how many bits of the fi_addr_t are reserved for this purpose. For example, if an endpoint has 8 receive contexts (rx_ctx_cnt = 8), then at least 3 bits are required (2^3 = 8) to distinguish all contexts. These reserved bits allow the application to refer to each receive context individually without inserting duplicate entries into the AV.
The provider uses rx_ctx_bits internally when constructing the fi_addr_t returned by fi_av_insert(). The helper function fi_rx_addr() can then combine a base fi_addr_t with a receive context index to produce the final address used in communication operations.
Example:
Suppose a peer has 4 receive contexts, and the application sets rx_ctx_bits = 2 (2 bits reserved for receive contexts). The base fi_addr_t for the peer returned by the AV might be 0x10. fi_rx_addr() encodes the receive context index in the uppermost rx_ctx_bits bits of the 64-bit fi_addr_t, so the application can address each receive context as follows:
Peer FA          Base fi_addr_t   Receive Context   Final fi_addr_t
10.0.0.11:7500   0x10             0                 0x0000000000000010
10.0.0.11:7500   0x10             1                 0x4000000000000010
10.0.0.11:7500   0x10             2                 0x8000000000000010
10.0.0.11:7500   0x10             3                 0xC000000000000010
This approach keeps the AV compact while allowing fine-grained targeting of receive contexts. The application never needs to manipulate raw IP addresses or transport identifiers; it simply uses the final fi_addr_t values in send, receive, or RDMA operations.
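A minimal sketch of this in code (endpoint, buffer, and handle names are illustrative):
/* Target receive context 2 of a peer whose base handle came from
 * fi_av_insert(); the AV was opened with rx_ctx_bits = 2. */
fi_addr_t base = fi_addrs[0];
fi_addr_t dest = fi_rx_addr(base, 2, 2);   /* context index goes into the upper bits */
ret = fi_send(ep, buffer, length, NULL, dest, NULL);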
ep_per_node: Indicates the expected number of endpoints that will be associated with a given Fabric Address. The provider uses this value to optimize resource allocation. If the number of endpoints per node is unknown, the application can set it to 0. In distributed or parallel applications, this is typically set to the number of processes per node multiplied by the number of endpoints each process will open. This value is a hint rather than a strict limit, allowing the AV to scale efficiently in multi-endpoint configurations.
name: An optional human-readable system name for the AV. If non-NULL and the AV is opened with write access, the provider may create a shared, named AV, which can be accessed by multiple processes within the same domain on the same node. This feature enables resource sharing in multi-process applications. If sharing is not required, the name can be set to NULL.
map_addr: An optional base address for the AV, used primarily when creating a shared, named AV across multiple processes. If multiple processes provide the same map_addr and name, the AV guarantees that the fi_addr_t handles returned by fi_av_insert() are consistent across all processes. The provider may internally memory-map the AV at this address, but this is not required. In single-process AVs, or when the AV is private, this field can be set to NULL.
flags: Behavior-modifying flags. These can control provider-specific features, such as enabling asynchronous operations or special memory access patterns. Setting this field to 0 selects default behavior.
Once the structure is populated, the application calls fi_av_open(), for example (variable names are illustrative):
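ret = fi_av_open(domain, &av_attr, &av, NULL);   /* av is a struct fid_av * */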
The fi_av_open() call allocates a provider-managed fid_av object, which contains a pointer to operations such as insert(), remove(), and lookup(), along with an internal mapping table that translates integer handles into transport-specific Fabric Addresses. At this stage, the AV exists independently of endpoints or peers and contains no entries; it is a blank, ready-to-populate mapping structure.
The AV type, described in Example 4-1, determines how the provider organizes and manages peer addresses. In Ultra Ethernet, FI_AV_TABLE is the most common choice. Conceptually, it is a simple, contiguous table that maps each peer’s Fabric Address to a sequential integer handle (fi_addr_t). Each inserted FA receives a handle starting from zero, and the application never accesses the raw IP or transport identifiers directly, relying solely on the fi_addr_t to refer to peers. From the Ultra Ethernet perspective, FI_AV_TABLE is ideal because it allows constant-time lookup of a peer’s FA. When the application posts a send, receive, or memory operation, the provider translates the fi_addr_t into the transport-specific FA efficiently. Internally, each table entry can also reference provider-managed resources, such as transmit and receive contexts or resource indexes, making runtime operations lightweight and predictable.
The application should select attribute values in line with the provider’s capabilities reported by fi_getinfo(), considering the expected number of peers and any provider-specific recommendations. This ensures that the AV is sized appropriately for the intended communication pattern and supports efficient, opaque addressing throughout the system.
Example: Mapping fi_addr_t index to peer FA and Rank
fi_addr_t   Peer Rank   Fabric Address (FA)
0           1           10.0.0.11:7500
1           2           10.0.0.12:7500
2           3           10.0.0.13:7500
Each inserted Fabric Address receives a sequential integer handle (fi_addr_t) starting from zero. The application then uses this handle in communication operations, for example:
fi_addr_t peer2 = fi_addrs[1]; // index 1 corresponds to rank 2
fi_send(ep, buffer, length, NULL, peer2, 0);
This mapping allows the application to refer to peers purely by fi_addr_t handles, avoiding any direct manipulation of IP addresses or transport identifiers, and enabling efficient, opaque peer referencing in the data path.
Phase 2: Provider – Validation & Limit Check
After the application requests an AV using fi_av_open(), the provider validates the requested attributes against the capabilities reported in the domain attributes (fi_domain_attr) returned earlier by fi_getinfo(). The fid_domain serves as the context for AV creation, but the actual limits and supported features come from the domain attributes.
The provider performs the following checks to ensure the AV configuration is compatible with its supported limits:
- type is checked against fi_info->domain_attr->av_type to verify that the requested AV organization (for example, FI_AV_TABLE) is supported.
- count is checked to ensure it does not exceed the maximum number of entries the provider can manage efficiently.
- rx_ctx_bits and ep_per_node are validated against domain context limits, such as rx_ctx_cnt, tx_ctx_cnt, max_ep_rx_ctx, and max_ep_tx_ctx, to guarantee that the requested receive context and endpoint configuration can be supported.
- flags are compared against fi_info->caps and fi_info->mode to confirm that the requested behaviors are allowed.
Note: The caps field lists the capabilities supported by the provider that an application may request, such as RMA, atomics, or tagged messaging. The mode field, in contrast, specifies mandatory requirements imposed by the provider. If a mode bit is set, the application must comply with that constraint to use the provider. Typical examples include requiring all memory buffers to be explicitly registered (FI_LOCAL_MR) or the use of struct fi_context with operations (FI_CONTEXT).
- name and map_addr are optional; the provider may validate them if the AV is intended to be shared or named, but they can also be ignored for local/private AVs.
If all checks pass, the provider allocates the AV and returns a handle (fid_av *) to the application. At this stage, the AV exists as an empty container, independent of any endpoints or peer Fabric Addresses. It is ready to be populated with addresses in the subsequent phase.
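A minimal sketch of the application-side counterpart to these checks (the provider performs the authoritative validation internally; info, domain, av, and av_attr are illustrative names carried over from the earlier examples):
/* Compare requested attributes against what fi_getinfo() reported. */
struct fi_domain_attr *da = info->domain_attr;
if (da->av_type != FI_AV_UNSPEC && da->av_type != av_attr.type)
    fprintf(stderr, "provider prefers a different av_type\n");

ret = fi_av_open(domain, &av_attr, &av, NULL);
if (ret)
    fprintf(stderr, "fi_av_open failed: %s\n", fi_strerror(-ret));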
Phase 3: Population and Distribution of FAs
Once the AV has been created and validated, it must be populated with the Fabric Addresses (FAs) of all peers that the application will communicate with. Initially, each process only knows its own local FA. To enable inter-process communication, these local FAs must be exchanged and distributed to all ranks, typically using an out-of-band control channel such as a bootstrap TCP connection.
In a common master-collect model, the master process (Rank 0) collects the local FAs from all worker processes and constructs a global map of {Rank → FA}. The master then broadcasts this global mapping back to all ranks. Each process inserts the received FAs into its AV using fi_av_insert().
Each inserted FA is assigned a sequential integer handle (fi_addr_t) starting from zero. These handles are then used in all subsequent communication operations, so the application never directly references IP addresses or transport-specific identifiers.
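A hedged sketch of how each rank obtains its local FA for this exchange (the bootstrap transport itself is omitted; buffer names are illustrative):
/* Each rank retrieves its own FA after enabling the endpoint. */
char local_fa[128];
size_t fa_len = sizeof(local_fa);
ret = fi_getname(&ep->fid, local_fa, &fa_len);
/* local_fa/fa_len are then sent to rank 0 over the bootstrap channel
 * (e.g., a TCP socket); rank 0 gathers and rebroadcasts the full set. */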
Example: Populating the AV with peer FAs
struct sockaddr_in peers[3];
memset(peers, 0, sizeof(peers));                       /* requires <string.h> */
for (int i = 0; i < 3; i++) {
    peers[i].sin_family = AF_INET;
    peers[i].sin_port   = htons(7500);                 /* all ranks listen on 7500 */
}
inet_pton(AF_INET, "10.0.0.11", &peers[0].sin_addr);   /* rank 1 */
inet_pton(AF_INET, "10.0.0.12", &peers[1].sin_addr);   /* rank 2 */
inet_pton(AF_INET, "10.0.0.13", &peers[2].sin_addr);   /* rank 3 */

fi_addr_t fi_addrs[3];
ret = fi_av_insert(av, peers, 3, fi_addrs, 0, NULL);
Note on Local Fabric Address: In most Ultra Ethernet applications, the local Fabric Address (FA) is not inserted into the Address Vector (AV). This is because it is rare for a process to communicate with itself; endpoints typically only need to send or receive messages to remote peers. As a result, the AV contains entries only for remote FAs, each mapped to a sequential fi_addr_t handle. The local FA remains known to the process but does not occupy an AV entry, keeping the mapping compact and efficient. If an application does require self-communication, the local FA can be explicitly inserted into the AV, but this is an uncommon scenario.
Once this step is complete, each process's AV maps every remote rank, including the master from a worker's perspective, to its assigned fi_addr_t handle. The application can now communicate with any peer by passing its fi_addr_t to libfabric operations.
Example: Using fi_addr_t in a send operation
fi_addr_t peer2 = fi_addrs[1]; // index 1 corresponds to rank 2
fi_send(ep, buffer, length, NULL, peer2, 0);
This design allows all communication operations to reference peers via compact, opaque integer handles, keeping the data path efficient and abstracted from transport-specific details.
Phase 4: Runtime Usage
After the AV is populated with all peer Fabric Addresses (FAs), the provider handles all runtime operations transparently. The fi_addr_t handles inserted by the application are used internally to direct traffic to the correct peer, hiding the underlying transport details.
- Send and Receive Operations: When the application posts a send, receive, or RDMA operation, the provider translates the fi_addr_t handle into the transport-specific FA. The application never needs to reference IP addresses, ports, or other identifiers directly.
- Updating the AV: If a peer’s FA needs to be removed or replaced, the application calls fi_av_remove() and fi_av_insert(), which the provider implements through its fi_ops_av operations table, to update the mapping seamlessly. These updates maintain the integrity of the remaining fi_addr_t handles across the application (a sketch follows below).
- Resource References: Each table entry may internally reference provider-managed resources, such as transmit/receive contexts or reserved bits for scalable endpoints. This ensures that runtime operations are lightweight and predictable.
- Consistency Across Processes: For shared or named AVs, the provider ensures that fi_addr_t handles are consistent across processes, allowing communication to function correctly even in distributed, multi-node applications.
At this stage, the AV is fully functional. The application can perform communication with any known peer using the opaque fi_addr_t handles. This design keeps the data path efficient, predictable, and independent of transport-specific details, just like the Completion Queue and Event Queue abstractions.
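A minimal sketch of such an update (new_peer_sa is an illustrative sockaddr holding the peer’s new FA):
/* Remove the stale entry, then insert the peer's new FA; the provider
 * assigns the handle written back into fi_addrs[1]. */
ret = fi_av_remove(av, &fi_addrs[1], 1, 0);
ret = fi_av_insert(av, &new_peer_sa, 1, &fi_addrs[1], 0, NULL);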
Example fid_av (Address Vector) – Illustrative
fid_av {
    fid_type      : FI_AV
    fid           : 0xF1AD01
    parent_fid    : 0xF1DD001
    provider      : "libfabric-uet"
    type          : FI_AV_TABLE
    count         : 16
    rx_ctx_bits   : 0
    ep_per_node   : 1
    flags         : 0x00000000
    name          : "my_av"
    map_addr      : 0x0
    ops           : 0xF1AA10
    fi_addr_table : [0..count-1]
}
Explanation of fields:
Note: Some of these fields were previously explained as part of the fi_av_attr structure.
Identity
- fid_type: Specifies the type of object. For an Address Vector, this is always FI_AV. This allows the provider and application to distinguish AVs from other objects like fid_cq, fid_eq, or fid_domain.
- fid: A unique identifier for this AV instance, typically assigned by the provider. It serves as a handle for internal management and debugging, similar to fid_cq or fid_eq.
- parent_fid: Points to the parent domain (fid_domain) in which this AV was created. This links the AV to its domain’s resources and capabilities and ensures proper hierarchy in the provider’s object model.
- name: Optional human-readable system name for the AV. Used to identify shared AVs across processes within the same domain. Named AVs allow multiple processes to open the same underlying mapping.
Provider Info
- provider: A string identifying the provider managing this AV (e.g., "libfabric-uet").
- type: Reflects the AV organization, usually set to FI_AV_TABLE in Ultra Ethernet. Conceptually, it defines the AV as a simple contiguous table mapping peer FAs to integer handles (fi_addr_t).
- flags: Behavior-modifying options for the AV. These can enable provider-specific features or modify how insert/remove operations behave.
- ops: Pointer to the provider-specific operations table (fi_ops_av). This contains function pointers for AV operations such as insert(), remove(), lookup(), and control().
Resource Limits / Configuration
- count: The expected number of addresses to be inserted into the AV. This guides the provider in allocating resources efficiently and sizing the internal mapping table. Unlike a hard maximum, this is primarily an optimization hint.
- rx_ctx_bits: Used with scalable endpoints. Specifies the number of bits in the returned fi_addr_t that identify a target receive context. Ensures the AV can support multiple receive contexts per endpoint when needed.
- ep_per_node: Indicates how many endpoints are associated with the same Fabric Address. This helps the provider optimize internal data structures when multiple endpoints share the same FA.
- map_addr: Optional base address used when sharing an AV between processes. For shared AVs, this ensures that the same fi_addr_t handles are returned on all processes opening the AV at the same map_addr.
Internal State
- fi_addr_table: Internal array or mapping that connects fi_addr_t handles to actual remote Fabric Addresses (FAs). Initially empty after creation, entries are populated during the AV population phase. Each entry allows the application to refer to peers efficiently without directly using IP addresses or transport-specific identifiers.
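Where needed, for example when debugging, the mapping can also be read back with fi_av_lookup(), which translates a handle into the stored FA. A small sketch with illustrative buffer names:
/* Translate the fi_addr_t handle for rank 2 back into its raw FA. */
struct sockaddr_in fa;
size_t fa_len = sizeof(fa);
ret = fi_av_lookup(av, fi_addrs[1], &fa, &fa_len);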