Network Processor Overview
Introduction
Packet processing has always been a challenge for both hardware and software designers. Network speeds have grown from a few kilobits per second to several gigabits per second, and they will continue to grow. Besides the basic demand for high-speed packet processing units, new services also dictate the requirements for Network Processor architecture. To cater to these requirements, packet processors have evolved from software-based processors to ASIC/FPGA-centric devices to more flexible, configurable and programmable Network Processors.
What is a Network Processor
A Network Processor is an integrated circuit which has a feature set specifically targeted at the networking application domain operating at extremely high speeds.
Packet Processing Overview
To understand the above definition fully, the next few sections analyze the factors that led software-based routers, typically supporting 10Mbps interfaces, to evolve into routers that now handle 10Gbps traffic with processing capabilities in excess of 20 giga-operations per second.
A typical Ethernet packet forwarding path (with additional firewall support) involves the following steps
- The packet is received at the MAC by the driver
- The packet is buffered in external memory
- The driver fetches the layer 2 header
- The driver performs the layer 2 checks that were not performed by the MAC hardware
- The driver hands the packet over to layer 3
- Layer 3 performs the following functions one after another
- Fetch the layer 3 and layer 4 headers
- Verify the layer 3 header. If the header is valid, further steps are taken; otherwise the packet is dropped
- If the packet is self-terminated, hand it over to the higher layers in the protocol stack
- Search the firewall policies and execute the rules associated with the matching policy
- Fetch the layer 3 header and search the routing table to get the next hop IP address and/or interface
- Search the firewall policies and execute the rules associated with the matching policy
- Pass the packet along with the next hop IP address and/or interface to layer 2
- Layer 2 resolves the MAC address
- Add/update the L2 header and copy the packet to the transmit descriptor
- Release all resources allocated for the packet
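The layer 3 part of the steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the routing table entries, the local address, the one-rule firewall policy and the function names lookup_route and forward are all hypothetical, and a single firewall check stands in for the policy searches in the list.

```python
import ipaddress

# Hypothetical pre-configured tables; every entry here is illustrative.
ROUTES = {
    ipaddress.ip_network("10.0.0.0/8"): ("10.255.0.1", "eth1"),
    ipaddress.ip_network("10.1.0.0/16"): ("10.1.0.254", "eth2"),
    ipaddress.ip_network("0.0.0.0/0"): ("192.0.2.1", "eth0"),
}
LOCAL_ADDRESSES = {ipaddress.ip_address("192.0.2.10")}
BLOCKED_DST_PORTS = {23}  # toy firewall policy: drop telnet


def lookup_route(dst):
    """Longest-prefix match over the routing table."""
    matches = [net for net in ROUTES if dst in net]
    best = max(matches, key=lambda net: net.prefixlen)
    return ROUTES[best]


def forward(packet):
    """Return a (verdict, detail) pair for a packet given as a dict."""
    # Layer 3 header verification: drop on bad version or expired TTL.
    if packet["version"] != 4 or packet["ttl"] <= 1:
        return ("drop", "bad header")
    dst = ipaddress.ip_address(packet["dst"])
    # Self-terminated packets are handed to the higher layers.
    if dst in LOCAL_ADDRESSES:
        return ("local", "deliver to higher layers")
    # Firewall policy search.
    if packet["dst_port"] in BLOCKED_DST_PORTS:
        return ("drop", "firewall")
    # Routing table search for the next hop address and interface.
    return ("forward", lookup_route(dst))
```

Note that every decision is a fixed-offset field read followed by a table lookup, which is the property the following discussion relies on.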
Observed closely, all the above steps typically require parsing fields at fixed locations in the packet headers. For such processing, pre-configured tables are used. Some state is maintained per packet (it may span multiple packets in case of fragmentation and reassembly), but there is no complex state machine involved in packet processing. This led to a new understanding of packet processing, in which the monolithic way of processing packets is divided into Control Plane processing and Data Plane processing.
Control Plane
The control plane functionality is related to management of the network node and to the exchange of packet control information (how user packets will be processed) with its peers in the network. These protocols typically have lower speed requirements but more complex logic, implemented in software. They facilitate packet forwarding by populating various tables such as the interface table, L3 routing table, multicast table etc.
Even in a reasonably large network topology, control information exchange is light and mostly happens when a node is inserted, deleted or modified, or when a peering is lost. With some rate limiting in place, most conventional processors (PowerQUICC II, Xeon, Pentium 4) can scale very well.
Data Plane
Decoupling of the control plane from data plane restricts data plane to pure packet forwarding functions. This architecture allows the control plane to change independent of the data plane. In a typical configuration, there can be a Control Plane Manager running control protocols and managing multiple data plane line cards.
From the above discussion, we can conclude that higher processing power is required in the data plane and not in the control plane. ASIC-based routers (deployed by Cisco/Juniper) address the high-speed processing requirement but lack flexibility.
Why a Network Processor Is Needed
Speed
Network Processors address the growing need for fast packet processing. With communication technology approaching wire speeds of OC-192 (roughly 10Gbps), an architecture that can provide packet processing at wire speed is desired. An increase in the wire speed means more packets per second (PPS) to be processed.
To get an idea of the processing rate required at a wire speed of 10Gbps, let's take a 40 byte IP packet carried in a minimum-size 64 byte Ethernet frame.
frame length on the wire = 672 bits (64*8 (frame, including the 4 byte FCS) + 8*8 (preamble) + 96 (interframe gap))
At a line rate of 10Gbps, packet rate = 10e9 / 672 = 14.88e6 packets/sec
Therefore, on a 133MHz processor each packet has to be processed in about 8.9 cycles!
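The cycle budget can be checked with a short calculation. The sketch below assumes the standard Ethernet figures: a 64 byte minimum frame that already includes the 4 byte FCS, an 8 byte preamble and a 12 byte (96 bit) interframe gap. The function name and parameters are illustrative only.

```python
def cycles_per_packet(line_rate_bps, clock_hz, frame_bytes=64,
                      preamble_bytes=8, interframe_gap_bytes=12):
    """Return (packets per second, processor cycles available per packet)."""
    # Bits each minimum-size frame occupies on the wire.
    bits_on_wire = (frame_bytes + preamble_bytes + interframe_gap_bytes) * 8
    packets_per_second = line_rate_bps / bits_on_wire
    return packets_per_second, clock_hz / packets_per_second


pps, budget = cycles_per_packet(line_rate_bps=10e9, clock_hz=133e6)
print(f"{pps / 1e6:.2f} Mpps, {budget:.2f} cycles per packet")
```

Under these assumptions a 133MHz processor has fewer than nine cycles per minimum-size packet at 10Gbps, before any memory latency is counted.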
In addition to the increased speed, the memory latencies incurred while processing a packet must not be forgotten. Even with processor speeds of 1.4GHz, network devices which were capable of working at OC-3 (155 Mbps) need replacement to support a network speed of OC-12 (4 x OC-3, 622 Mbps). Conventional, monolithic packet processing could not scale to the increased PPS (310K PPS at OC-3, 1.2M PPS at OC-12). Supporting a network speed such as OC-192 (roughly 10 Gbps) in this way is practically impossible.
Complexity
Tackling high-speed traffic is just one of the problems. If we remove the firewall steps from the data plane processing above, the remaining steps are fairly straightforward and could possibly be achieved with increased processor speeds and a more efficient memory subsystem to support the increased clock rates. But the complexity mounts further with emerging services that demand more involved packet lookups. These range from QoS, network monitoring and load balancing to services like firewalls and VPNs, which employ complex algorithms for policy lookup along with encryption/decryption techniques, all at wire speed. Besides the header lookup, services like real-time virus scanning require payload lookup, which calls for a rethink of the current processor architecture.
Therefore, to perform complex lookups and operations at wire speed, multiple processing units become a basic requirement of a network processor.
Flexibility
With the ever increasing demand for multimedia services and ongoing efforts to converge telecom/datacom services into all-IP networks, packet switched networks need to cater to more complex service multiplexing/demultiplexing along with the basic requirement of fast packet lookup and forwarding. This only adds to the already difficult-to-achieve wire-speed/cycles-per-packet equilibrium discussed above.
Newly emerging services also drive the requirements for a Network Processor architecture: a typical NP should be extensible so that it can support new services at the same bandwidth. Therefore flexibility achieved through programmability becomes a key concern for an NP based solution. Besides flexibility, scalability is another key factor: a Network Processor architecture should be able to scale to higher line rates or provide more sophisticated packet processing support for QoS/VPN/IPSec solutions.
Let's consider possible system implementations to meet the above packet processing requirements for a network application.
ASIC (Application Specific Integrated Circuit): ASICs are hardwired implementations that can handle high-speed ingress traffic along with fast lookup and packet processing. But their high processing power comes at the expense of flexibility. Logic once built into hardware needs a redesign phase, impacting time-to-market, to scale up to a new requirement.
FPGA (Field Programmable Gate Array): FPGAs provide programmability at the gate level. They are capable of fast packet processing but do not provide the desired flexibility.
Co-processors: Co-processors are hardwired implementations that do not execute instructions themselves but work asynchronously with a processor. Commands are given to a co-processor, which then returns the result asynchronously to the Processing Unit. They can be used to offload specific tasks like complex mathematical computation, encryption/decryption/authentication, table lookup etc.
GPP (General Purpose Processor): GPPs are the most flexible in terms of programmability but do not provide the necessary speed to sustain fast packet processing at line rates. A sequential processing pattern combined with standard synchronous memory load/store operations makes GPPs unsuitable for the desired wire-speed packet processing. Moreover, there is a limit to which clock rates can be increased. Underlying memory technologies have improved, with systems now using QDR SRAM and Rambus DRAM along with memory latency hiding techniques like data pre-fetching and split transaction buses, but the improvement lags behind the rate at which processor speeds are increasing.
Current network system implementations use a combination of GPP and FPGA/ASIC. The GPP provides the control plane functionality and handles exceptional traffic, whereas data plane (layer 2/3) processing is performed by the FPGA/ASIC. These implementations are fast but are designed specifically for one application's needs and therefore cannot scale to new application requirements.
Going by the above discussions it is evident that none of the above system implementations meet all requirements for network packet processing. To meet the requirements of a Network Processor, the architecture of an NP needs to be addressed at both macro and micro level.
Before we explore the possible Micro and Macro Level Architecture for a Network Processor, a few terms require a formal definition.
Pipelining: In a typical RISC processor, execution of an instruction requires the following steps
- Fetch instruction from memory
- Decode instruction
- Execute the instruction or read the address
- Access operand in data memory
- Write back the result in register
- Write the output in the memory
Each of these steps can be treated as a stage in a pipeline, thus enabling the overlapping of stages. For instance, an instruction can be at the decode stage when a new instruction is being fetched by the Instruction Fetch unit and so on. At a given instance of time, multiple instructions are being worked upon in a pipeline. This is the basic principle of a pipeline.
Pipelining at the micro level utilizes the parallelism at the instruction level. Independent instructions can be executed in parallel in a pipeline. Therefore the execution units can be pipelined enabling the execution of multiple instructions at the same time.
An important thing to note is that in a conventional pipeline, there are multiple instructions being executed at the same time but in different stages.
Pipelining at the macro level utilizes the packet level parallelism (each packet is processed independent of the other packet) inherent in a Network Application and enables Processing units in a pipeline to execute multiple packets at the same time.
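The benefit of overlapping stages can be put into numbers. The following sketch assumes an idealized pipeline in which every stage takes exactly one cycle and there are no stalls; the function names are illustrative.

```python
def sequential_cycles(items, stages):
    """Each item passes through all stages before the next one starts."""
    return items * stages


def pipelined_cycles(items, stages):
    """The first item needs `stages` cycles; each later item completes one cycle after the previous."""
    if items == 0:
        return 0
    return stages + items - 1


# 100 instructions (or packets) through the six steps listed above.
print(sequential_cycles(100, 6), pipelined_cycles(100, 6))
```

The same arithmetic applies at the macro level when the items are packets and the stages are Processing Units.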
Processing Unit: A Processing Unit is the processor executing instructions.
Functional Unit: A computational unit that is part of the pipeline of a Processing Unit. A functional unit can be designated to perform integer or floating point computation. The number of Functional Units is sometimes referred to as the machine width. So, a Processing Unit with 4 Functional Units can be called 4-wide or 4-issue wide. Each of the Functional Units is called an Execution Slot. Therefore, a 4-issue wide Processing Unit has 4 slots. If each of the Functional Units can do one integer/floating point operation per cycle, then we have 4 execution slots per cycle. Keeping these slots busy is critical for the performance of a Processing Unit.
Micro Level Architecture
Micro Level Architecture of a network processor employs the same basic performance enhancing techniques that are used to improve performance in a GPP. The idea is to find parallelism to effectively utilize the increasing processing power. Parallelism is the key to increasing performance. The greater the level of parallelism, the more work can be extracted from the Execution slots and the higher the performance. Parallelism exists at multiple levels -
Instruction Level Parallelism (ILP): instructions that are independent can be executed in parallel.
Thread Level Parallelism (TLP): multiple threads of execution can be executed in parallel. Multiple threads help keep the pipeline running if the current thread is stalled waiting for data from memory.
Pipelining
Pipelining is the most elementary and common feature of a processing unit's micro architecture. Processing Units depend on the Instruction Level Parallelism (ILP) and Thread Level Parallelism (TLP) available in a sequential program implementation. Processing Unit implementations can choose between single-issue and multiple-issue instruction pipelining.
1-issue wide Pipeline: Going by the definitions given above, a 1-issue wide pipeline has one Execution slot. Therefore, in each cycle only one computation can be performed.
Multiple issue-wide Pipeline: There is more than one (say n) Execution slot. Therefore, more than one instruction can have its computation done in a single cycle.
An implementation can have a multi-stage pipelined architecture with branch prediction techniques used to boost performance.
Superscalar Architecture
In a Superscalar Architecture implementation, multiple instructions can be initiated simultaneously and executed independently. Superscalar architecture is similar to a pipeline, with the key difference that multiple instructions, instead of one, can be executed at each stage. Therefore, it can initiate multiple instructions during the same clock cycle.
In a Superscalar implementation, there are multiple Functional units of the same type along with additional circuitry to dispatch instructions to the units. Superscalar architecture exploits ILP by executing multiple instructions from a single program in a single cycle. Multiple instructions are fetched and fed to multiple independent Functional Units. These instructions may be fed out of order. Efficiency of the dispatcher is important to achieve the desired performance. Normally superscalar implementations have a single thread of control. This kind of architecture is efficient when a high level of ILP exists.
Multi-threading Architecture
Processes are made up of threads. Each process consists of at least one main thread of execution. Processes can also have multiple threads, with each thread having its own local context and sharing the process's context with other threads. On a non-multithreaded Processing Unit, the operating system emulates multithreading by periodically scheduling each of the threads. Every thread is given a time-slice for execution. In reality, only a single thread is being executed at a time.
A multithreaded Processing Unit, on the other hand, is capable of executing more than one thread at a time. It uses multiple Functional Units to execute multiple threads at the same time. Therefore if there are 4 issue slots, 4 independent instructions from thread A can be executed in one cycle. In the next cycle, 4 independent instructions of thread B are executed, and so on. Here too there is periodic scheduling of the threads, but the time-slice has been reduced to one cycle. This kind of architecture helps in hiding memory latency. If one of the threads is waiting for data from memory, other threads can execute, thus keeping the Execution slots busy.
Due to the restriction of executing independent instructions from the same thread in a cycle, a multi-threaded implementation is less efficient when an application does not have enough Instruction Level Parallelism. For instance, if only 2 independent instructions can be extracted from a thread, then the other 2 execution slots in a 4-issue wide machine are wasted.
This kind of architecture is most efficient when there is a high level of both ILP and TLP in an application. A high level of TLP is definitely required for a multithreaded implementation to display improved performance.
A multithreaded implementation maintains a per-thread Instruction Pointer, decode unit and registers, thereby alleviating context switch overhead. The Instruction Fetch unit, Execution units and queues are shared amongst all threads.
Simultaneous Multi-threading (SMT) Architecture
Simultaneous Multi-threading is a step ahead of Multi-threaded architecture. It removes the restriction present in multithreading by allowing instructions from more than one thread to be executed in the same clock cycle. For instance, let's take two threads A and B executing on a 4-issue wide machine. If thread A can extract only 2 independent instructions from the current instruction window, then the remaining 2 execution slots can be filled by 2 independent instructions from thread B.
Simultaneous Multi-threading allows multiple threads to compete for shared resources every cycle. SMT has the ability to extract maximum performance by using both ILP and TLP. If an application has a high level of TLP but low ILP, then threads compete for all the slots in a clock cycle, thereby reducing the number of unused execution slots. The same can be said of multiple applications having single or multiple threads. In the event of a non-multithreaded application, all resources are dedicated to the single thread.
Resource replication and sharing are the same as in Multithreaded Architecture.
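The difference in slot usage between multithreading and SMT can be illustrated with a toy model. It assumes that each thread offers a constant number of independent instructions per cycle (its ILP) and never stalls; both functions and their parameters are hypothetical and ignore real issue-logic constraints.

```python
def multithreaded_utilization(ilp_per_thread, width, cycles):
    """Fine-grained multithreading: one thread owns all slots in a given cycle."""
    used = 0
    for cycle in range(cycles):
        # Round-robin: the thread scheduled this cycle issues what it can.
        ilp = ilp_per_thread[cycle % len(ilp_per_thread)]
        used += min(ilp, width)
    return used / (width * cycles)


def smt_utilization(ilp_per_thread, width, cycles):
    """SMT: slots left free by one thread are filled by the others in the same cycle."""
    used = 0
    for _ in range(cycles):
        free = width
        for ilp in ilp_per_thread:
            free -= min(ilp, free)
        used += width - free
    return used / (width * cycles)


# Threads A and B each expose 2 independent instructions on a 4-issue machine.
print(multithreaded_utilization([2, 2], 4, 100))  # half of the slots are used
print(smt_utilization([2, 2], 4, 100))            # all slots are used
```

With high ILP in every thread (for example 4 each) the two models give the same utilization, which matches the observation that SMT pays off when TLP is high but ILP is low.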
Chip Multiprocessor (CMP) Architecture
In CMP architecture, multiple independent Processing Units are present on the same chip. Multiprocessors exploit TLP by executing different threads simultaneously on different processors. The processors in a CMP implementation have independent resources such as the Instruction Fetch unit, Instruction Pointer, registers, cache and issue logic. These processors share only external memory. A CMP can also combine multithreading with multiple Processing Units; in such a combination, each Processing Unit has hardware support for multiple threads.
Macro Level Architecture
Embedded Architecture
Network Processors with Embedded Architectures are typically RISC processors with increased clock speeds and hardware optimized for network applications. Network Processors implementing this type of architecture have a central core, possibly with additional co-processors to offload specific tasks like lookup, maintaining statistics etc. Ingress packets are handled sequentially, with one ingress packet processed completely before another is taken up for processing.
Embedded Architectures usually combine features at the micro level. They may employ a multi-stage, multi-issue-wide pipeline. These network processors may also be multi-threaded, with hardware optimized to perform a one-cycle context switch amongst multiple threads. This is achieved by replicating registers, including the GPRs, status register and Instruction Pointer. The instruction set is also optimized, with certain addressing modes such as indirect addressing disabled and with special instructions such as ones' complement add, bit-level manipulation etc.
This architecture has the drawback that the processor is idle while a packet waits for data from external memory.
Parallel Architecture
Network Processors with parallel architecture have multiple identical Processing Units logically arranged in parallel. Dispatcher logic assigns a Processing Unit to each of the ingress packets. As all units are identical, any packet can go to any of the Processing Units. This kind of architecture requires complex and efficient dispatcher logic and coordination between processing units if any inter-packet dependency exists. It also requires some coordination mechanism at the egress to accept packets from multiple parallel units, manage the egress traffic flow and send them out in order.
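The egress coordination mentioned above can be sketched as a small reorder buffer. The sketch assumes that the dispatcher tags every ingress packet with a consecutive sequence number starting at 0; the function name and this tagging scheme are assumptions made for illustration, not a description of any particular device.

```python
import heapq


def reorder(completions):
    """Release packets in ingress order.

    `completions` lists sequence numbers in the order the parallel
    Processing Units finished them; the result is the egress order.
    """
    pending = []      # min-heap of finished packets still waiting for a predecessor
    next_seq = 0      # next sequence number allowed to leave
    released = []
    for seq in completions:
        heapq.heappush(pending, seq)
        # Drain every packet whose predecessors have all been sent.
        while pending and pending[0] == next_seq:
            released.append(heapq.heappop(pending))
            next_seq += 1
    return released


# Packet 2 finishes first but is held until packets 0 and 1 are out.
print(reorder([2, 0, 1, 4, 3]))
```

Holding packets this way costs buffer space at the egress, which is part of the coordination overhead of a parallel design.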
At the micro architecture level, the individual Processing Units may employ multi-staged, multi-issue wide pipeline architecture to utilize instruction level parallelism. Even though there are multiple processing units, packets are processed completely by a unit before it can take up a new packet, therefore leading to cycle waste when packet is waiting for data from an external memory.
Pipeline Architecture
Instead of a Processing Unit processing a complete packet, packet processing is divided into independent stages, which are then executed by each of the Processing Units in a pipeline. Based on the pipeline paradigm, each stage partially processes the packet and passes the result to the next stage. Each stage in the pipeline is independent and the packet state is not maintained across Processing Units. For example, in a three stage PU pipeline, first Processing Unit does the packet lookup and extraction of the desired fields. This result is then passed on to the Policy Processing Unit, which performs pattern matching and takes policy decision. Decision taken, the result is then passed on to the Egress Processing Unit, which modifies the packet accordingly and sends the packet out.
Following steps summarize Pipelined Packet Processing
- A packet is processed in Stage i independent of i-1 and i+1 .
- A Packet moves through the pipeline from Stage i to stage i+1 sequentially.
- A packet has a packet context and meta data associated with it. The packet context can be viewed as a temporary area which is used by the different pipeline stages to exchange information, for example a pointer to a routing entry. Meta data is normally used to store the search keys, for example the 5 tuple (Source IP, Destination IP, Source Port, Destination Port, Ingress Port), and to incrementally build the new packet header.
- Resource allocation and de-allocation is done in a manner that does not break above rules.
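The three-stage example and the role of packet context and meta data can be sketched as follows. The Packet class, the stage functions and the single policy entry are hypothetical, and the sketch runs the stages one after another for a single packet, so it shows how information is handed between stages rather than the overlap between packets.

```python
from dataclasses import dataclass, field


@dataclass
class Packet:
    header: dict
    context: dict = field(default_factory=dict)  # scratch area shared between stages
    meta: dict = field(default_factory=dict)     # search keys and new header fields


# Hypothetical policy table keyed by the 5 tuple described above.
POLICIES = {("10.0.0.1", "10.0.0.2", 1234, 80, 0): "permit"}


def parse_stage(pkt):
    """Stage 1: extract the lookup key from the header into the meta data."""
    h = pkt.header
    pkt.meta["key"] = (h["src_ip"], h["dst_ip"], h["src_port"],
                       h["dst_port"], h["ingress_port"])
    return pkt


def policy_stage(pkt):
    """Stage 2: match the key and leave the decision in the packet context."""
    pkt.context["decision"] = POLICIES.get(pkt.meta["key"], "deny")
    return pkt


def egress_stage(pkt):
    """Stage 3: apply the decision; None means the packet was dropped."""
    if pkt.context["decision"] != "permit":
        return None
    pkt.header["ttl"] -= 1
    return pkt


def run_pipeline(pkt):
    for stage in (parse_stage, policy_stage, egress_stage):
        pkt = stage(pkt)
        if pkt is None:
            return None
    return pkt
```

Each stage reads only what the previous stage left in the context or meta data, which is what allows the stages to run on separate Processing Units.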
Packet processing in this manner immediately boosts system performance many times over (roughly by a factor equal to the number of stages). A packet leaves the pipeline every pipeline cycle, although each packet still spends roughly the same time in the system as in the conventional processing model. Therefore, at any time, there are multiple packets being processed in the Processing Unit pipeline.
It is important to note here that due to the nature of the split processing, the pipeline-stalling problem similar to micro architecture is also faced at the macro level. All stages of the pipeline need to complete packet processing in almost the same time, failing which there would be wasted cycles or bubbles in the pipeline. These bubbles adversely affect the performance of the Network Processor.
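The effect of unbalanced stages can be quantified with a simple model. It assumes that each stage needs a fixed number of cycles per packet and that the pipeline advances at the pace of its slowest stage; the function name and the cycle counts are illustrative.

```python
def pipeline_stats(stage_cycles):
    """Return (cycles between departing packets, speedup over one unit, idle cycles per stage)."""
    interval = max(stage_cycles)                # the slowest stage sets the pace
    speedup = sum(stage_cycles) / interval      # versus one unit doing all the work
    bubbles = [interval - c for c in stage_cycles]
    return interval, speedup, bubbles


print(pipeline_stats([10, 10, 10]))  # balanced: speedup equals the number of stages
print(pipeline_stats([10, 20, 10]))  # one slow stage: bubbles appear in the other two
```

In the second case the three-stage pipeline is only twice as fast as a single unit, and two of the three stages sit idle for half of every interval.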
At the micro architecture level, these Processing Units can implement a multi-threaded architecture, with or without multi-issue. If a Processing Unit is implemented with multiple Functional Units, then at any given time it can handle more than one packet. This is very useful in hiding memory latency when one of the threads is awaiting data from memory.
Hybrid Architecture
Parallel and Pipelined approach can be combined to build a hybrid Network Processor Architecture.
In this architecture, each Processing Unit at a given stage can be replicated to work in parallel. Therefore, depending upon the extent of processing required by each stage to sustain high packet throughput, multiple Processing Units can be employed for a given stage. For example, in the above three-stage pipeline, if one Processing Unit is not sufficient for the first stage, multiple Processing Units can be added in parallel to perform the stage one operations. All these stage one PUs can be synchronized to advance their results further in the pipeline. In a similar manner, stage two and stage three Processing Units can be added in parallel.
Hybrid Architecture in a way attempts to solve the problem of pipeline stalling in a pure Pipelined Architecture. Placing multiple Processing Units at a slow stage can enable the stage to provide enough packet inputs to the subsequent stage, thus keeping the pipeline moving.
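How many Processing Units a slow stage needs can be estimated with a one-line calculation. The sketch assumes fixed per-packet cycle counts for each stage and perfect load sharing among the replicated units of a stage; the function name and the numbers are illustrative.

```python
def units_per_stage(stage_cycles, target_interval):
    """Processing Units needed per stage so that a packet can leave every `target_interval` cycles."""
    # Integer ceiling division: a stage needing 25 cycles against a
    # 10 cycle target requires 3 units working in parallel.
    return [-(-cycles // target_interval) for cycles in stage_cycles]


# A slow policy stage in the middle of the three-stage example.
print(units_per_stage([10, 30, 10], target_interval=10))
```

With three units at the middle stage, the stage as a whole delivers a packet every 10 cycles and no longer stalls its neighbours.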
In Parallel, Pipelined and Hybrid Architectures, memory is shared amongst multiple Processing Units. These designs increase the load on the memory bus, creating a situation where adding Processing Units can even hurt performance because of slow memory access. This issue of memory latency is addressed in the next section.
Memory Access Latency
Memory accesses remain far slower than the Processing Units. Network Processors use one or more of the following ways to hide memory latency.
Multi-threading
Multiple threads can help reduce wasted cycles by allowing the hardware to execute a different thread if the current thread is stalled waiting for data from the memory. It is important to note hardware support for multi-threading is key to achieve the desired result. Multi-threading at the OS level alone would lead to high overhead in switching the context from the currently stalled thread to the other ready-to-run thread. Therefore, many network processors have separate register sets for each of the threads and hardware unit to perform the switch in one cycle.
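A rough estimate of how many hardware threads are needed can be made with a simple model. It assumes that every thread alternates between a fixed number of compute cycles and a fixed memory wait, and that switching threads costs nothing, as with the replicated register sets described above. The function name and the numbers are illustrative.

```python
def utilization(threads, compute_cycles, memory_latency):
    """Fraction of cycles in which some thread is executing."""
    # One thread is busy compute/(compute + latency) of the time; extra
    # threads fill the waiting time until the unit is saturated.
    busy = threads * compute_cycles / (compute_cycles + memory_latency)
    return min(1.0, busy)


for n in (1, 4, 10):
    print(n, utilization(n, compute_cycles=10, memory_latency=90))
```

In this model a unit that computes for 10 cycles and then waits 90 cycles needs about ten threads before the memory latency is fully hidden.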
Split transaction buses
With multiple Processing Units and co-processors accessing the common shared memory, traditional blocking load/store instructions are not feasible. With split transaction buses, a memory access request is placed on the command bus and the response is received asynchronously later. These buses therefore enable data to be pre-fetched well ahead of the time it is processed, thus hiding the memory access latency.
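The gain from issuing requests ahead of time can be illustrated with a small cycle count. The model assumes one memory access per packet, a fixed latency, and a prefetch depth of one, meaning the request for the next packet is issued just before the current packet is processed. All names and numbers are illustrative.

```python
def blocking_cycles(packets, compute, latency):
    """Blocking load: every packet waits for its data, then computes."""
    return packets * (latency + compute)


def split_transaction_cycles(packets, compute, latency):
    """Request for packet i+1 is issued before packet i is processed."""
    now = 0
    ready_at = latency  # the request for packet 0 is issued at cycle 0
    for i in range(packets):
        now = max(now, ready_at)       # wait only if the data has not arrived yet
        if i + 1 < packets:
            ready_at = now + latency   # issue the next request, do not wait for it
        now += compute                 # process the current packet meanwhile
    return now


print(blocking_cycles(4, compute=20, latency=50))
print(split_transaction_cycles(4, compute=20, latency=50))
```

When the compute time per packet is at least as long as the memory latency, only the first access is exposed; deeper prefetching would be needed to hide longer latencies.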
Network Processor Instruction Set
Processing requirements for a Network Processor are different; therefore the instruction set of an NP is also optimized for packet processing. Some of the desired instruction set features are given below -
- Bit manipulation - support for modifying selected bits/bytes in the packet header
- Data movement - In a pure RISC processor, data movement is allowed only between data memory and registers. For network applications, data movement between memory and I/O buffers, registers and I/O buffers, and I/O buffer to I/O buffer is also desired.
- Block load - instructions that can move a block of data to/from memory and I/O buffers.
- Atomic instructions - used to implement synchronization mechanisms: bit test/set, atomic increment/decrement.
- Counting leading zeros/ones
- Instruction to return the first set bit
- Instructions to facilitate memory prefetching
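To make a few of these operations concrete, the sketch below shows in software what such instructions compute in a single cycle in hardware. It covers count leading zeros, first set bit, and the ones' complement add mentioned earlier among the special instructions. The function names, the 32 bit default width and the choice of the least significant bit as the "first" set bit are assumptions made for illustration.

```python
def count_leading_zeros(value, width=32):
    """Number of zero bits above the most significant set bit (0 <= value < 2**width)."""
    return width - value.bit_length()


def first_set_bit(value):
    """Index of the least significant set bit, or -1 if no bit is set."""
    # value & -value isolates the lowest set bit.
    return (value & -value).bit_length() - 1


def ones_complement_add16(a, b):
    """16 bit ones' complement addition with end-around carry, as used by IP checksums."""
    total = a + b
    return (total & 0xFFFF) + (total >> 16)


print(count_leading_zeros(1), first_set_bit(0b1000), hex(ones_complement_add16(0xFFFF, 0x0001)))
```

Operations like these appear in priority queue selection, bitmap-based lookups and checksum updates, which is why packet-oriented instruction sets tend to provide them directly.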
Building Blocks for Network Processor
Both Micro and Macro level architecture features can be combined to form the building blocks of a network Processor
- Pattern Processing Engine
- Policy Engine for Packet Classification
- Statistics Engine
- Routing Engine
- Queuing Engine
- Data Buffer Controller
- CRC/Checksum Engine
- Traffic manager Engine
- Traffic Shaper Engine
Network Processor Implementation Examples
Based on the above discussion, this section briefly describes the macro and micro level features of the various NP architectures present in the market.
- AMCC
- Wintegra
- Bay Microsystems
- Agere
- EZChip
- Intel
--- More to come
maintained by: Anil Kumar (anil.rajput@hsc.com)