Maps the port and the packet forwarded port label node

2020 6th IEEE International Conference on Network Softwarization (NetSoft)

PolKA: Polynomial Key-based Architecture for Source Routing in Network Fabrics

Cristina Dominicini and Diego Mafioletti	Ana C. Locateli, Rodolfo Villaca,	Alexander Gorodnik
Federal Institute of Education Science	Magnos Martinello, and Mois´es Ribeiro	School of Mathematics
and Technology of Esp´ırito Santo	Federal University of Esp´ırito Santo	University of Bristol
Esp´ırito Santo, Brazil	Esp´ırito Santo, Brazil	Bristol, UK
cristina.dominicini@ifes.edu.br,	magnos@inf.ufes.br, moises@ele.ufes.br {ana.locateli,rodolfo.villaca}@ufes.br,	a.gorodnick@bristol.ac.uk
diego.rossi@ifes.edu.br		a.gorodnick@bristol.ac.uk

architectures evolution. In this sense, an emerging architecture

for software-defined data centers (DCs) and WANs is the

in order to adapt to highly variable traffic patterns.

This is far from a trivial challenge due to the scale, dynam-

However, the resulting designs require large numbers of table

entries [2], are constrained by the limited capacity of switch

1The edge node may be a virtual switch in a server, a hypervisor, a top-of-rack (ToR) switch, or a ingress domain gateway.

Table I: Comparison of routing methods

Edge

1,1,3

Controller

3

	State at core fabric	Forwarding operation	Commodity network hardware?
	table	lookup	�
	packet	pop	�
Integer RNS-based SR	-	integer mod	×
PolKA RNS-based SR	-	polynomial mod	�

Core Network

Figure 1: Example of Port Switching SR for a SDN fabric.

In summary, the contributions of this work are: (i) we propose PolKA, a fully stateless RNS-based SR scheme for network fabrics that is compatible with binary polynomial arithmetic; (ii) we propose a technique to enable the imple-mentation of the polynomial mod operation in P4-enabled programmable switches [11] by reusing Cyclic Redundancy Check (CRC) hardware; (iii) we implement emulated and hardware prototypes to demonstrate that PolKA can achieve similar performance to Port Switching; and (iv) we demon-strate how PolKA can use intrinsic properties from the RNS to enable advanced routing features, like agile path migration, fast failure reaction, and service function chaining (SFC). This paper is structured as follows. Section II discusses related works. Section III presents the mathematical back-ground of PolKA. Section IV presents PolKA, and performs a scalability analysis. Section V studies how to implement PolKA and Port Switching according to P4 architecture, followed by the implementation and evaluation of proof-of-concept prototypes in Sections VI and VII. Finally, Section VIII discusses conclusions and future works.

This section describes the mathematical formulation that supports the proposal of PolKA. More information about finite fields and polynomials rings can be found in [9], [20].

Galois Field of order 2, whose elements are residue classes Polynomial Ring over GF(2): Let GF(2) = {0, 1} be the

Euclidean Division Theorem for Polynomials: Let f(t) and unique polynomials q(t) and r(t) over GF(2) such that f(t) = g(t) be polynomials over GF(2) , where g(t) �= 0. There exist g(t).q(t) + r(t), where either r(t) = 0 or deg(r) < deg(g). The polynomial r(t) is called the remainder of the division of f(t) by g(t), and will be denoted by < f(t) >g(t).

Polynomial congruence: Given f(t), g(t), and h(t) polyno-mials over GF(2), we say that f(t) is congruent to h(t) modulo g(t), and write f(t) ≡ h(t) mod g(t), if h(t) =< f(t) >g(t). Irreducible Polynomials: A non-zero polynomial g(t), is called a divisor of f(t) over GF(2) if f(t) = a(t).g(t), for some polynomial a(t) over GF(2). Two polynomials f(t) and g(t) over GF(2) are coprime if their only common divisor is 1. A non-constant polynomial f(t) over GF(2) is called irreducible over GF(2) if its only divisors are possibly a constant polynomial and itself.

= � M(t)/si(t)
oi(t) · mi(t) · ni(t) (2)

(3)

PolKA architecture is composed of three elements: (i) edge nodes that embed the route labels into the packets, (ii) core nodes that execute basic packet transport, and (iii) a logically centralized SDN Controller that selects routing paths and configures the nodes. In this architecture, the SR relies on three polynomial identifiers over GF(2): (i) routeID: a route identifier, calculated by the controller using the polynomial CRT and embedded into the packet by the edge nodes; (ii) nodeID: an identifier previously assigned to core nodes by the controller in a network configuration phase; and (iii) portID: an identifier assigned to the output ports of each core node.

0	S1	1	1		1	2
0	S1	1	0	S2	0	S3	5
o1 = (1)dec = (1)bin			3		7	6	5
o1 = (1)dec = (1)bin			o2 = (2)dec = (10)bin		o3 = (6)dec = (110)bin

In this paper, all polynomials will be considered as poly-nomials over GF(2). Note that a polynomial f(t) = antn+ an−1tn−1+ . . . + a1t1+ a0t0can be represented by the bit string anan−1 . . . a1a0. Thus, an identifier is represented by a bit string formed by the coefficients of a polynomial, which are either 0 or 1. Also, the bit length of the identifier is len(f). Consider a packet should be routed via a selected path, represented by N core nodes and their respective output ports.

polynomials representing the nodeIDs of the nodes in this path. Let S = {s1(t), s2(t), . . . , sN(t)} be the multiset of the

The routeID is embedded in the packet by the edge element, and the forwarding operation in each core node calculates the output port as the remainder of the euclidean division of the routeID in the packet by its nodeID: oi(t) = < R(t) >si(t) The Controller may proactively compute R(t) or calculate it when the first packet of a flow arrives. On the other hand, core nodes only execute a simple mod operation per packet.

B. Usage example

output port set O is composed by the polynomials: Considering the path defined in Fig. 2 (s1 → s2 → s3), the

o1(t) = 1, o2(t) = t = 10, o3(t) = t2+ t = 110

return length length ← length + deg(elem); � Maximum len(R)

Thus, R(t) must satisfy the conditions of Eq. (5):

M(t) = (t + 1) · (t2+ t + 1) · (t3+ t + 1)
m1(t) = s2(t) · s3(t) = (t2+ t + 1) · (t3+ t + 1) m2(t) = s1(t) · s3(t) = (t + 1) · (t3+ t + 1) m3(t) = s1(t) · s2(t) = (t + 1) · (t2+ t + 1) And solving Eq. (4), we find the polynomials ni(t):

n1(t) = 1,	n2(t) = 1,	n3(t) = t2+ 1

Therefore, packets should embed the routeID 10000, so each node can calculate its portID by dividing this routeID by its own nodeID. For example, the remainder of R(t) = 10000 divided by s3(t) = 1011 is o3(t) = 110 (port label 6).

C. Scalability analysis of the routeID

Topology	nports	diam.	size	Bits for PolKA	Bits for Port Swit.
Two-tier S16 L16*	24	3	32	21	15
Fat-tree 16 pods	16	5	320	55	20
	4	7	20	42	14
	8	7	30	49	21

Table II compares the scalability of PolKA and Port Switch-ing for two common DC topologies (fat-tree and two-tier), and two continental backbone topologies [21] (ARPANET and GEANT2). This topology set covers diverse properties for number of ports, diameter and size. For backbone topologies, as the number of ports varies per node, we considered nports as the maximum number of ports of any node.

The results for the integer RNS-based approach were omit-ted, because they are very close to PolKA. This means that using PolKA instead of the integer version does not incur in reserving extra bits for the routeID header. The Port Switching method consumes less bits for representing the routeID than RNS-based SR approaches, specially in the cases where the size of the topology is high (e.g., fat-tree). These cases demand a large number of irreducible polynomials for PolKA, causing the selection of polynomials of high degree or large integer numbers even when the node does not demand many ports. For the topologies under worst case analysis in Table 1, the maximum len(R) results show that PolKA fits existing packet headers (e.g., 96 bits of Ethernet source and destination addresses, or a stack of MPLS labels with 20 bits per label). When deploying RNS-based SR, the cost to exploit RNS features may be to reserve more bits in the packet header for representing the routeID. Nevertheless, there are known techniques that make optimal assignment of nodeIDs (avoiding the worst case scenario), and reduce routeID length [12], [14].

Authorized licensed use limited to: University of Texas at San Antonio. Downloaded on October 03,2020 at 14:44:07 UTC from IEEE Xplore. Restrictions apply. 329

	start	extract	start	lookahead
	start	ethernet header	start	ethernet header

Ethernet	routeID	IP	data

(b) PolKA header with fixed length
	TYPE_SR?	TYPE_SR?
IPv4 header, encapsulate the SR header, and change the	T	T

		routeID
etherType to TYPE SR=0x1234.		routeID

B. PolKA pipeline

The PolKA header contains a routeID, whose maximum length depends on the network topology, as explained in Section IV-C. Fig. 3(b) shows the format of PolKA header with a fixed length field for storing the routeID, after the Ethernet header. At edge switches, PolKA pipeline for encapsulating the SR header is similar to Sourcey, but the result of the table lookup is an action that sets the output port to the directly connected core switch and encapsulates a single routeID. At core switches, the pipeline of Fig. 4(b) is executed. When etherType is TYPE SR, as PolKA only needs read access to packet headers, it uses the lookahead method of P4 language that evaluates a set of bits from the input packet without advancing the packet index pointer. To discover the output port, the switch has to perform a mod operation between the routeID in the packet and its own nodeID. Besides, in PolKA, there is no information in the header to identify the last hop. This is not a problem in fabric networks, because the packet delivery to an edge switch represents the end of the SR path. At the edge switch, the SR header is removed and the etherType is changed to TYPE IPv4.

end	pop header	end	port = routeID
	pop header		mod nodeID
	set output port		set output port
	& emit packet		& emit packet

Figure 4: P4 pipelines for core switches.

(yellow in Fig. 4(a)). On the other hand, in PolKA (Fig. 4(b)), the packet remains unchanged along the path;• In Sourcey, the output port is directly available in the SR header, while PolKA requires an arithmetic operation over the routeID to calculate the output port.

Authorized licensed use limited to: University of Texas at San Antonio. Downloaded on October 03,2020 at 14:44:07 UTC from IEEE Xplore. Restrictions apply. 330

Therefore, the switch calculates the output port by using two SHIFT, one CRC, and two XOR operations, which is more computationally efficient than executing a division.

With the use of this technique, PolKA can be deployed in P4 targets that allow the configuration of generator polynomials. Since P4 supports CRC operations through the use of external libraries [11], the support for customized polynomials depends on the architecture models and how specific targets implement them. The PSA2and v1model3architectures support cus-tomized polynomials of 16 and 32 bits. In terms of targets, the software switch bmv24and the hardware switch Tofino from Barefoot support customized CRC polynomials, while Netronome SmartNICs only support fixed CRC polynomials.

2https://p4.org/p4-spec/docs/PSA.html
3https://github.com/p4lang/p4c/blob/master/p4include/v1model.p4 4https://github.com/p4lang/behavioral-model
5https://www.netronome.com/
6http://pisces.cs.princeton.edu/
7http://p4.elte.hu/
8https://github.com/intrig-unicamp/macsad/
9https://github.com/p4lang/p4app/tree/rc-2.0.0/

TX	Server S0	P0

	V0
RX	edge.p4	P1	core.p4
RX	V1	P1	core.p4

namespace ns1
SmartNICs

Figure 5: SmartNIC setup.

2) SmartNICs features: Netronome SmartNICs give access to hardware timestamps to P4 programs.They partially imple-ment the functionalities of v1model for P4 16, but it currently supports only a small set of fixed CRC polynomials, which restricts the use of its CRC hardware in PolKA. Nevertheless,

10https://www.sympy.org/

H1		H2	H3	H10
E1		E2	E3	E10
S1		S2	S3	S10

we could use them for measuring forwarding latency in one core node, as proposed in the next section.

3) P4 programs: The P4 programs are adaptations of the codes used in the emulated prototype. The new edge code encapsulates both SR headers and a timestamp header for executing latency measurements, which is composed of an egress (tsout) and an ingress timestamp (tsin).

The test uses the linear fabric topology of Fig. 6 to compare PolKA and Sourcey as the number of hops increases in the core network (e.g., from 0 for path H1 → H11 to 9 for path H1 → H10). The following experiments were executed: (i) round trip time (RTT): host H1 sends 1 ICMP packet/s during 60s to each of the other hosts using ping tool; (ii) jitter: host H1 sends an UDP traffic of 5Mbps (half of link capacity) with big packets to each of the other hosts during 60s using the iperf tool; and (iii) flow completion time (FCT): host H1 transmits a file of 100Mb with big packets over a TCP connection to each of the other hosts using the iperf tool (3 repetitions). Fig. 7 shows the comparison between PolKA and Sourcey solutions for RTT, jitter, and FCT experiments.

In the RTT experiments (Fig. 7(a) and Fig. 7(b)), it is possible to observe that: the RTT grows linearly with the increase of the number of hops for both solutions; (ii) Sourcey solution presents better RTT performance than PolKA solu-tion; and (iii) the standard deviation is small and in the same order of magnitude for both solutions. Besides, there is no significant difference in the results for different packet sizes in the RTT experiments. This is because Mininet does not consider transmission time in the emulation. In addition, jitter (Fig. 7(c)) is small and equivalent for both solutions. Finally, the FCT experiment (Fig. 7(d)) shows that both solutions require approximately the same time to transfer the file and the standard deviation is small.

C. Emulated prototype: Agile path migration in PolKA

This experiment shows how traffic engineering can benefit from SR for bandwidth allocation with agile path migration.

0.15

125

0.05

Number of core hops

Average core latency (us)	14									Average core latency (us)	14										Average core latency (us)	14										Average core latency (us)	14	Sourcey
	14	PolKA									14											14											14
	12										12											12											12
	10										10											10											10
8		1	2	3	4	5	6	7	8	8		1	2	3	4	5	6	7	8	9	8		1	2	3	4	5	6	7	8	9	8		1	2	3	4	5	6	7	8	9
Path length										Path length											Path length											Path length
(a) Low throughput, small packets										(b) Low throughput, big packets

E11	E21	S1	S2	E32	E41		Throughput (Mbps)	10
			L3					5
		E22	E31					5
H11	H21	H22	H31	H32	H41	H42		0
								0	0	25
							Time (s)
		(a) Topology					(b) Throughput at H21

flow A is allocated to blue path (E11 − L1 − S1 − L2 − E21) and flow B is allocated to red path (E12−L1−S1−L3−E31). Thus, these flows compete for bandwidth at link L1−S1 in the interval of 10s to 40s. The effects of TCP congestion control for fair share of total bandwidth can be seen in Fig. 9(b) that shows the throughput at the destination of flow A (H21). At 40s, the traffic engineering decides to exploit idle links and migrates flow A from the blue path to the green path (E11 −L1−S2−L2−E21). From this moment, there is no competition with flow B, so flow A consumes all the link bandwidth.

To perform path migration, the SDN Controller only has to modify a single flow entry at the edge switch E11 for destination H21. The only field that has to be modified is the routeID to embed the new route through green path. Once this single operation is executed, all the packets of flow A that leave H11 will be tagged with the routeID of the new path.

E1	S1	S3	S8		S9		S2
	S1
	2
VMS

	S2
E2	S2	S7		4	S10		S3		arithmetic of PolKA in hardware description languages has
	4						S3		arithmetic of PolKA in hardware description languages has
							S4
		S4	S6		1	E6	S4
		S4	S6		1	E6		S7
VNF		S5	3		VMD		S6		our polynomial scheme using GFs of higher orders.
8		S5	3				S6		ACKNOWLEDGMENT This study was financed by CAPES (Finance Code 001),

									CNPq, FAPES, CTIC, RNP, and EU H2020 (FUTEBOL).

				6
6				6
				4	20	40
				2
0	20	40	60	0
Time (s)					Time (s)
					(d) Throughput at S3

Figure 10: Fast failure reaction for failure of link S4-S6.

veloped a simple control plane application that causes link failures and makes port failure information available to the data plane by populating a table of faulty ports. In addition, we modified the core pipeline to perform a lookup in this table before sending the packet to the output port. If there is a hit, the packet is randomly deflected to one of the other healthy ports. Otherwise, the packet is transmitted normally. The generation of a random value within an interval is provided by v1model and could also be replaced by a hash function if the objective is to always select the same port per flow. Failure detection mechanisms are not in the scope of this work.

[1] M. Casado et al., “Fabric: A retrospective on evolving sdn,” in Proceed-ings of the First Workshop on Hot Topics in Software Defined Networks,
ser. HotSDN ’12.	New York, NY, USA: ACM, 2012, pp. 85–90.


USENIX Association, 2015, p. 15–28. [4] S. A. Jyothi et al., “Towards a flexible data center fabric with source routing,” in Proceedings of the ACM SIGCOMM Symposium on SDN
Research.

			New York, NY,
[6] M. Soliman et al., “Source routed forwarding with software defined con-trol, considerations and implications,” in ACM Conference on CoNEXT
Student Workshop.	New York, NY, USA: ACM, 2012, pp. 43–44.
[14] A. Liberato et al., “RDNA: Residue-Defined Networking Architecture Enabling Ultra-Reliable Low-Latency Datacenters,” IEEE TNSM, Nov 2018. [15] C. K. Dominicini et al., “KeySFC: Traffic steering using strict source routing for dynamic and efficient network orchestration,” Computer Networks, vol. 167, p. 106975, 2020.
[20] I. Herstein, Topics in Algebra, 2nd ed.		John Wiley & Sons, 1975.
[21] S. K. Routray et al., “Statistical model for link lengths in optical transport networks,” J. Opt. Commun. Netw., vol. 5, no. 7, pp. 762–773, Jul 2013.