Tuesday, 30 September 2025

Ultra Ethernet: Completion Queue

Completion Queue Creation (fi_cq_open)


Phase 1: Application – Request & Definition


The purpose of this phase is to define the queue where operation completions will be reported. Completion queues are used to report the completion of operations submitted to endpoints, such as data transfers, RMA accesses, or remote write requests. By preparing a struct fi_cq_attr, the application describes exactly what it needs, so the provider can allocate a CQ that meets its requirements.


Example API Call:

struct fi_cq_attr cq_attr = {
    .size = 2048,
    .format = FI_CQ_FORMAT_DATA,
    .wait_obj = FI_WAIT_FD,
    .flags = FI_WRITE | FI_REMOTE_WRITE | FI_RMA,
    .data_size = 64
};

struct fid_cq *cq;
int ret = fi_cq_open(domain, &cq_attr, &cq, NULL);


Explanation of fields:

.size = 2048: The CQ can hold up to 2048 completions. This determines how many completed operations can be buffered before the application consumes them.

.format = FI_CQ_FORMAT_DATA: This setting determines how much detail each completion entry carries. With FI_CQ_FORMAT_DATA, each entry contains the operation context, completion flags, the buffer pointer, the length of the data, and optional completion data. If the application uses tagged messaging, choosing FI_CQ_FORMAT_TAGGED expands the entries to also include the tag, allowing the application to match completions with specific operations. In effect, the format attribute defines the structure returned when reading the completion queue, letting the application control how much information it receives about each completed operation.

.wait_obj = FI_WAIT_FD: The provider exposes a file descriptor that the application can wait on with poll() or select(). Other options include FI_WAIT_NONE (no wait object; the application busy-polls the CQ) and FI_WAIT_SET.

.flags = FI_WRITE | FI_REMOTE_WRITE | FI_RMA: This field is a bitmask specifying which types of completions the application wants the Completion Queue to report. The provider checks these flags against the capabilities of the domain (fid_domain) to ensure they are supported. If a requested capability is not available, fi_cq_open() will fail. This allows the application to control which events are tracked while the provider manages the underlying resources.

Note: You don’t always need to request every completion type. For example, if your application only cares about locally initiated writes, you can set FI_WRITE and skip FI_REMOTE_WRITE and FI_RMA. Limiting the flags reduces the amount of tracking the provider must do, which can save memory and improve performance, while still giving you the information your application actually needs.

.data_size = 64: Maximum size of immediate data per entry, in bytes, used for RMA or atomic operations.
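To make the effect of the format attribute concrete, the sketch below shows how the entry structure grows with each format level. The field layouts follow the entry structures listed in the libfabric fi_cq man page; the demo_ names, the enum, and the helper functions are hypothetical and exist only so the example compiles without libfabric.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the FI_CQ_FORMAT_* values. */
enum demo_cq_format {
    DEMO_FORMAT_CONTEXT,
    DEMO_FORMAT_MSG,
    DEMO_FORMAT_DATA,
    DEMO_FORMAT_TAGGED
};

/* Each level repeats the previous fields and appends more detail. */
struct demo_cq_entry        { void *op_context; };
struct demo_cq_msg_entry    { void *op_context; uint64_t flags; size_t len; };
struct demo_cq_data_entry   { void *op_context; uint64_t flags; size_t len;
                              void *buf; uint64_t data; };
struct demo_cq_tagged_entry { void *op_context; uint64_t flags; size_t len;
                              void *buf; uint64_t data; uint64_t tag; };

/* Size in bytes of one completion entry for the given format. */
size_t demo_entry_size(enum demo_cq_format format)
{
    switch (format) {
    case DEMO_FORMAT_CONTEXT: return sizeof(struct demo_cq_entry);
    case DEMO_FORMAT_MSG:     return sizeof(struct demo_cq_msg_entry);
    case DEMO_FORMAT_DATA:    return sizeof(struct demo_cq_data_entry);
    case DEMO_FORMAT_TAGGED:  return sizeof(struct demo_cq_tagged_entry);
    }
    assert(0 && "unknown format");
    return 0;
}

/* Bytes the application must provide to read `count` entries at once. */
size_t demo_read_buffer_bytes(enum demo_cq_format format, size_t count)
{
    return demo_entry_size(format) * count;
}
```

An application that reads completions in batches would size its read buffer with something like demo_read_buffer_bytes(), which is why picking the smallest format that still carries the needed fields keeps the per-completion cost down.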


Phase 2: Provider – Validation & Limits Check

When the application calls fi_cq_open() with an fi_cq_attr structure, the provider validates each attribute against the parent domain’s capabilities (fid_domain):


fi_cq_attr.size: compared to the domain’s maximum CQ depth.

fi_cq_attr.data_size: compared to the domain’s supported CQ data size.

The total number of CQs requested: limited by the domain’s CQ count.

fi_cq_attr.flags: each requested capability is checked against the domain’s supported features.

If any requested value exceeds the domain’s limits, the provider may adjust it to the maximum allowed or return an error.
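The checks above can be sketched roughly as follows. This is not the code of any real provider: the structures, field names, and error codes are hypothetical, and the policy shown (clamp numeric limits, reject unsupported flags or an exhausted CQ count) is just one of the behaviours the text allows.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result codes for this sketch. */
enum {
    DEMO_OK = 0,
    DEMO_ERR_UNSUPPORTED = -1,   /* a requested flag is not in the domain caps */
    DEMO_ERR_NO_RESOURCES = -2   /* the domain cannot host another CQ */
};

/* Hypothetical view of the limits a domain advertises. */
struct demo_domain_limits {
    size_t   max_cq_size;       /* maximum CQ depth */
    size_t   max_cq_data_size;  /* maximum immediate-data bytes per entry */
    size_t   max_cq_count;      /* how many CQs the domain may host */
    size_t   open_cq_count;     /* CQs already opened on this domain */
    uint64_t caps;              /* supported capability bits */
};

/* Hypothetical subset of the attributes passed by the application. */
struct demo_cq_request {
    size_t   size;
    size_t   data_size;
    uint64_t flags;
};

/* Validate a request in place: clamp numeric limits, reject the rest. */
int demo_cq_validate(const struct demo_domain_limits *dom,
                     struct demo_cq_request *req)
{
    assert(dom != NULL && req != NULL);

    if (dom->open_cq_count >= dom->max_cq_count)
        return DEMO_ERR_NO_RESOURCES;

    /* Any bit set in flags but not in caps is unsupported. */
    if ((req->flags & ~dom->caps) != 0)
        return DEMO_ERR_UNSUPPORTED;

    if (req->size > dom->max_cq_size)
        req->size = dom->max_cq_size;
    if (req->data_size > dom->max_cq_data_size)
        req->data_size = dom->max_cq_data_size;

    return DEMO_OK;
}
```

Because the request may be adjusted rather than rejected, an application that depends on a specific depth should compare the values it asked for with what it actually received.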


Phase 3: Provider – Creation & Handle Return

The purpose of this phase is to allocate memory and internal structures for the CQ and return a handle to the application. The provider creates the fid_cq object in RAM, associates it with the parent domain (fid_domain), and returns the handle. The CQ is now ready to be bound to endpoints (fi_ep_bind) and used for reporting operation completions.

Example fid_cq (Completion Queue) – Illustrative

fid_cq {
    fid_type        : FI_CQ
    fid             : 0xF1DC601
    parent_fid      : 0xF1DD001
    provider        : "libfabric-uet"
    caps            : FI_WRITE | FI_REMOTE_WRITE | FI_RMA
    size            : 2048
    format          : FI_CQ_FORMAT_DATA
    wait_obj        : FI_WAIT_FD
    flags           : FI_WRITE | FI_REMOTE_WRITE | FI_RMA
    data_size       : 64
    provider_data   : <pointer to provider CQ struct>
    ref_count       : 1
    context         : <app-provided void *>
}

Object Example 4-4: Completion Queue (CQ).

Explanation of fields:

fid_type: Type of object, here CQ.

fid: Unique handle for the CQ object.

parent_fid: Domain the CQ belongs to.

caps: Capabilities supported by this CQ.

size: Queue depth (number of completion entries).

format: Structure format for completion entries.

wait_obj: Mechanism to wait for completions.

flags: Requested capabilities for this CQ.

data_size: Maximum size of immediate data per completion entry.

provider_data: Pointer to provider-internal CQ structure.

ref_count: Tracks references to this object.

context: Application-provided context pointer.

Note: In a Completion Queue (fid_cq), the flags field represents the capabilities requested by the application when calling fi_cq_open() (for example, tracking local writes, remote writes, or RMA operations). The provider checks these flags against the capabilities of the parent domain (fid_domain). The caps field, on the other hand, shows the capabilities that the provider actually granted to the CQ. This distinction is important because the provider may adjust or limit requested flags to match what the domain supports. In short:

flags: what the application asked for.

caps: what the CQ can actually do.


Why EQs and CQs Reside in Host Memory

Event Queues (EQs) and Completion Queues (CQs) are not data buffers in which application payloads are stored. Instead, they are control structures that track the state of communication. When the application posts an operation, such as sending or receiving data, the provider allocates descriptors and manages the flow of that operation. As the operation progresses or completes, the provider generates records describing what has happened. These records typically include information such as completion status, error codes, or connection management events.

Because the application must observe these records to make progress, both EQs and CQs are placed in host memory where the CPU can access them directly. The application typically calls functions like fi_cq_read() or fi_eq_read() to poll the queue, which means that the CPU is actively checking for new records. If these control structures were stored in GPU memory, the CPU would not be able to efficiently poll them, as each access would require a costly transfer over the PCIe or NVLink bus. The design is therefore intentional: the GPU may own the data buffers being transferred, but the coordination, synchronization, and signaling of those transfers are always managed through CPU-accessible queue structures.
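The polling pattern described above can be illustrated with a small ring buffer in ordinary host memory. This is a single-threaded sketch under stated assumptions: the demo_ types and functions are hypothetical, the return convention only imitates fi_cq_read() (number of entries read, or a negative "try again" code when the queue is empty), and a real provider would also need atomics or memory barriers between producer and consumer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_CQ_DEPTH 8      /* queue depth; kept a power of two */
#define DEMO_EAGAIN   (-1)   /* hypothetical stand-in for "no completions yet" */

/* One completion record: a control entry, never application payload. */
struct demo_cq_record {
    void    *op_context;  /* identifies the completed operation */
    uint64_t flags;       /* what kind of operation completed */
    size_t   len;         /* bytes transferred */
};

struct demo_cq {
    struct demo_cq_record ring[DEMO_CQ_DEPTH];
    size_t head;  /* next slot the application reads */
    size_t tail;  /* next slot the provider writes */
};

/* Provider side: append a record. Returns 0, or -1 if the queue is full. */
int demo_cq_post(struct demo_cq *cq, struct demo_cq_record rec)
{
    assert(cq != NULL);
    if (cq->tail - cq->head == DEMO_CQ_DEPTH)
        return -1;
    cq->ring[cq->tail % DEMO_CQ_DEPTH] = rec;
    cq->tail++;
    return 0;
}

/* Application side: copy out up to `count` records without blocking. */
long demo_cq_read(struct demo_cq *cq, struct demo_cq_record *out, size_t count)
{
    size_t n = 0;

    assert(cq != NULL && out != NULL);
    while (n < count && cq->head != cq->tail) {
        out[n++] = cq->ring[cq->head % DEMO_CQ_DEPTH];
        cq->head++;
    }
    return n > 0 ? (long)n : DEMO_EAGAIN;
}
```

Each call to demo_cq_read() is just a handful of loads from CPU-local memory, which is the property the text relies on: the same loop over device memory would pay an interconnect round trip on every empty poll.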

Figure 4-5: Objects Creation Process – Completion Queue.
