Contributed by P4 Member, China Mobile
As a leading telecom operator in China, China Mobile has actively participated in and supported the open source P4 project since P4 was integrated into the Open Networking Foundation (ONF) in 2018, and it has continued to do so since P4 moved under the Linux Foundation umbrella. We recognize that the openness and flexibility of P4 make it very promising for driving network architecture innovation, accelerating the delivery of customized network functions, and improving network operational efficiency, all of which are crucial for building future network infrastructure.
China Mobile and P4 Community: Contributions and Participation
China Mobile not only tracks the latest progress of P4 but also contributes to the evolution of the P4 open source ecosystem by participating in community discussions and contributing code and test cases. China Mobile led the launch of the Open SRv6 open source project, which builds a core routing system on P4 chips. On the data plane, the project uses P4 to support G-SRv6 forwarding and basic SRv6 functionality, with the goals of improving performance and reliability and enabling slicing, application awareness, and other advanced capabilities. The project also verifies the overall solution from data plane to control plane.
Recently, as part of its exploration of future artificial intelligence (AI) networks, China Mobile has been applying P4 to Global Scheduling Ethernet (GSE), a next-generation network protocol for AI datacenters. The following sections introduce the basic principles of GSE and describe in detail how P4 is used to advance GSE-related technological innovation and practical applications.
Global Scheduling Ethernet: A Next-Generation Network Protocol for AI Datacenters
China Mobile, in collaboration with more than 40 global partners including Tencent, China Unicom, Broadcom, and Intel, has jointly proposed Global Scheduling Ethernet (GSE), which aims to improve the performance of AI datacenter networks so that they better serve large-scale AI training and deployment. The core technologies of the GSE network are as follows:
a) Packet Container-Based Load Balancing Mechanism. The GSE switching network performs packet forwarding and dynamic load balancing based on fixed-length packet containers (PKTC). PKTC is a forwarding mechanism that differs from flow-based, packet-based, and CELL-based forwarding. Under this mechanism, Ethernet packets are grouped into logical virtual containers, and the container is the minimum unit of transmission within the switching network. All packets belonging to the same PKTC are load-balanced onto the same path, which ensures in-order delivery of the packets within a container.
b) Grant-Based Proactive Congestion Control Mechanism. Although PKTC-based load balancing can spread traffic evenly across the available paths in steady state, congestion may still occur during traffic bursts and link failures. To control the amount of data sent into the network and reduce tail latency, GSE introduces a grant mechanism based on the Dynamic Global Scheduling Queue (DGSQ), which provides proactive congestion control that takes network state into account. The sender creates DGSQs dynamically on demand and associates each one with a Queue Pair (QP). The sender then requests credits based on the DGSQ status and sends a specified amount of data upon receiving credit. The receiver grants credits periodically based on the DGSQ status. While credits are in transit, certain nodes within the network can also collaboratively adjust the granted credits in response to congestion or link failures.
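The request, grant, and send cycle described in b) can be illustrated with a minimal Python sketch. It is not the GSE implementation: the class and function names, the byte-based credit unit, the per-round grant budget, and the scaling factor applied by an in-network node are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class DGSQ:
    """Hypothetical sender-side queue associated with one queue pair (QP)."""
    qp_id: int
    backlog_bytes: int = 0   # data waiting to be sent
    credit_bytes: int = 0    # data the sender is currently allowed to send

    def request(self) -> int:
        """Request credit for the part of the backlog not yet covered."""
        return max(self.backlog_bytes - self.credit_bytes, 0)

    def on_grant(self, granted: int) -> None:
        """Record credit that arrived from the receiver."""
        self.credit_bytes += granted

    def send(self) -> int:
        """Send only as much data as the held credit allows."""
        sent = min(self.backlog_bytes, self.credit_bytes)
        self.backlog_bytes -= sent
        self.credit_bytes -= sent
        return sent


def receiver_grant(requested: int, per_round_budget: int) -> int:
    """Assumed policy: the receiver grants at most a fixed budget per round."""
    return min(requested, per_round_budget)


def network_adjust(granted: int, congestion_factor: float) -> int:
    """Assumed policy: a node on the path scales the grant by a factor in
    [0.0, 1.0] when it observes congestion or a link failure."""
    return int(granted * congestion_factor)


def run_round(queue: DGSQ, budget: int, congestion_factor: float = 1.0) -> int:
    """One request -> grant -> in-network adjustment -> send cycle."""
    grant = receiver_grant(queue.request(), budget)
    queue.on_grant(network_adjust(grant, congestion_factor))
    return queue.send()


queue = DGSQ(qp_id=1, backlog_bytes=10_000)
sent_per_round = [
    run_round(queue, budget=4096),                         # uncongested
    run_round(queue, budget=4096, congestion_factor=0.5),  # grant halved in transit
    run_round(queue, budget=4096),                         # remaining backlog
]
print(sent_per_round)
```

Because the sender never transmits more than the credit it holds, the amount of data entering the network is bounded by what the receiver and the intermediate nodes have agreed to accept, which is the property the grant mechanism relies on.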
Practical Application of P4 in GSE
As introduced above, GSE overhauls the traditional Ethernet forwarding mechanism and places new demands on the network, such as GSE header processing and packet container construction, that exceed the capabilities of traditional chips. This implies that developing and deploying GSE requires a new generation of chips. However, there is a more streamlined and flexible alternative: programmable switching chips can implement GSE’s core functionality. Compared to traditional chips, the combination of P4 and Domain Specific Architecture (DSA) provides strong support for building efficient, flexible, and customizable networks. The P4 language lets network designers describe data plane processing logic in an abstract manner, while DSA provides the hardware and software framework that supports the P4 processing engine. Together they support global programmability, user-defined functionality, and rapid iteration on requirements, which helps us quickly develop GSE’s network functions and protocols based on practical demands and deploy them into the network.
At present, we have successfully implemented several important GSE functions in P4, including GSE header processing, packet container construction, stateful load balancing, and low-latency forwarding.
a) GSE Header Processing: We use Run-to-Completion (RTC) full-process programmability to define how GSE packets are parsed, enabling encapsulation and decapsulation of GSE packets while continuing to support the evolution of the GSE header.
b) Packet Container Construction: We use the Match-Action Unit (MAU) module of P4 to implement GSE packet containers. A packet is encapsulated with the GSE header and forwarded immediately after its container ID is obtained through a table lookup, rather than waiting for subsequent packets of the same container to arrive.
c) Stateful Load Balancing: To build stateful load balancing on top of packet containers, we use the MAU module of P4 to implement stateful load balancing uniformly across multiple queues on the data plane, and we avoid potential conflicts through congestion awareness. In addition, the port selected for a packet container during load balancing is recorded on the data plane, so that subsequent packets of the same container can quickly look up and reuse that port.
d) Low-Latency Forwarding: Programming against the P4 DSA pipeline architecture allows packet headers to be processed quickly and continuously, achieving microsecond-level (μs) packet forwarding latency. In addition, critical packets that do not need to be edited at forwarding nodes (e.g., credit packets) are handled by a priority processing mechanism that lets them bypass pipeline stages and be forwarded with priority, thereby achieving ultra-low latency.
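The container construction and stateful load balancing steps in b) and c) can be sketched in Python as a pair of lookup tables. This is only an illustration under stated assumptions, not the P4 program itself: the class name, the per-destination container tracking, the byte-based container size, and the least-loaded port choice used as the congestion signal are hypothetical.

```python
class ContainerLoadBalancer:
    """Hypothetical model of container assignment plus per-container port reuse."""

    def __init__(self, num_ports: int, container_bytes: int):
        self.container_bytes = container_bytes
        self.port_load = [0] * num_ports  # bytes sent per port (assumed congestion signal)
        self.container_port = {}          # container ID -> port chosen for it
        self.current = {}                 # destination -> (container ID, bytes filled)
        self.next_id = 0

    def _container_for(self, dst: str, size: int) -> int:
        """Table lookup: return the open container for dst, or open a new one
        when the packet would not fit in the current container."""
        cid, filled = self.current.get(dst, (None, 0))
        if cid is None or filled + size > self.container_bytes:
            cid, filled = self.next_id, 0
            self.next_id += 1
        self.current[dst] = (cid, filled + size)
        return cid

    def forward(self, dst: str, size: int) -> tuple:
        """Forward one packet immediately; never wait for the container to fill."""
        cid = self._container_for(dst, size)
        port = self.container_port.get(cid)
        if port is None:
            # First packet of the container: pick the least-loaded port and
            # record the choice so later packets of the container reuse it.
            port = min(range(len(self.port_load)), key=self.port_load.__getitem__)
            self.container_port[cid] = port
        self.port_load[port] += size
        return cid, port


lb = ContainerLoadBalancer(num_ports=2, container_bytes=3000)
for _ in range(3):
    print(lb.forward("host-a", 1500))
```

In this sketch the first two 1500-byte packets share one container and therefore one port, which preserves their order, while the third packet opens a new container that is free to take a different port. A real data plane would also need to age out entries in the container-to-port table, which the sketch omits.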
We have developed a GSE prototype using P4 and Field Programmable Gate Array (FPGA) chips and have preliminarily validated the performance of GSE with this prototype in a testbed with 32 NICs. Experimental results show that, compared to traditional RoCE, GSE reduces the job completion time (JCT) in the all-to-all communication pattern by a factor of 2 to 3, a significant performance improvement.
In summary, as a domain-specific language for specifying the behavior of programmable data planes, P4 plays a crucial role in the implementation of GSE’s key technologies and functionalities. In the future, China Mobile will continue to promote the deep integration of GSE and P4 and accelerate the practical application of P4 in GSE.