Typical Ethernet packet forwarding (with additional firewall support) involves the following steps
The control plane functionality relates to management of the network node and to the exchange of packet control information (how user packets will be processed) with its peers in the network. These protocols typically have lower speed requirements but more complex logic, implemented in software. They facilitate packet forwarding by populating various tables such as the interface table, L3 routing table, multicast table etc.
Even in a network with a reasonably large topology, little control information is exchanged, and mostly when a node is inserted, deleted or modified, or when a peering is lost. With some rate limiting in place, most conventional processors (PQUICC II, XENON, PIV) can scale very well.
Decoupling of the control plane from data plane restricts data plane to pure packet forwarding functions. This architecture allows the control plane to change independent of the data plane. In a typical configuration, there can be a Control Plane Manager running control protocols and managing multiple data plane line cards.
From the above discussion, we can conclude that higher processing power is required in the data plane and not in the control plane. ASIC based routers (deployed by Cisco/Juniper) address the high speed processing requirement but lack flexibility.
Network Processors address the growing need for fast packet processing. With communication technology approaching wire speeds of OC-192 (9.6Gbps), an architecture that can provide packet processing at wire speed is desired. An increase in wire speed means more packets per second (PPS) to be processed.
To get an idea of the processing rate required for a wire speed of 10Gbps, let's take a 40 byte IP packet carried in a 64 byte frame.
frame length = 704 bits (64*8 + 8*8 (preamble) + 4*8 (FCS) + 96 (interframe gap))
At a line rate of 10Gbps, packet rate = 10e9 / 704 = 1.42e7 packets/sec
Therefore, on a 133MHz processor each packet must be processed in about 9.4 cycles, and even a 1.33GHz processor has only about 93.6 cycles per packet!
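A quick way to sanity-check this kind of budget is to recompute it. The sketch below uses the 704-bit frame length and the 10Gbps line rate from the text; the function names and the additional 1.33GHz clock are illustrative assumptions, not part of the original discussion.

```python
def packet_rate(line_rate_bps: float, frame_bits: int) -> float:
    """Packets per second when frames arrive back to back at the line rate."""
    return line_rate_bps / frame_bits


def cycles_per_packet(clock_hz: float, pps: float) -> float:
    """Processor cycles available for each packet at a given clock rate."""
    return clock_hz / pps


# Frame length as counted in the text: 64 byte frame + preamble + FCS + gap.
FRAME_BITS = 64 * 8 + 8 * 8 + 4 * 8 + 96

pps = packet_rate(10e9, FRAME_BITS)
budget_133mhz = cycles_per_packet(133e6, pps)
budget_1330mhz = cycles_per_packet(1.33e9, pps)  # assumed comparison clock

print(f"{pps:.3e} packets/sec")
print(f"{budget_133mhz:.2f} cycles/packet at 133MHz")
print(f"{budget_1330mhz:.2f} cycles/packet at 1.33GHz")
```

With these inputs the packet rate is about 1.42e7 packets/sec, which leaves roughly 9.4 cycles per packet at 133MHz and roughly 93.6 cycles per packet at 1.33GHz.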
In addition to the increased speed, let's not forget the memory latencies incurred while processing a packet. With a processor speed of 1.4GHz, network devices that were capable of working at OC3 (155 Mbits) need replacement to support a network speed of OC12 (4xOC3, 620 Mbits). Conventional, monolithic packet processing could not scale to the increased PPS (310K PPS@OC3, 1.2M PPS@OC12). Supporting a higher network speed, for example OC192 (9.6 Gig), is simply impossible with this approach.
Tackling high-speed traffic is just one of the problems. If we remove the firewall steps from the data plane processing above, the leftover steps are fairly straightforward and could possibly be achieved with increased processor speeds and a more efficient memory subsystem to support the increased clock rates. But the complexity mounts further with emerging services that demand more involved packet lookup. These range from QoS, network monitoring and load balancing to services like firewalls and VPNs, which employ complex algorithms for policy lookup along with encryption/decryption techniques, all at wire speed. Besides header lookup, services like real-time virus scanning require payload lookup, which calls for a rethink of the current processor architecture.
Therefore, to perform complex lookups and operations at wire speeds, multiple processing units become a basic requirement of a network processor.
With the ever increasing demand for multimedia services and ongoing efforts towards convergence of telecom/datacom services leading to all-IP based networks, packet switched networks need to cater to more complex service multiplexing/demultiplexing along with the basic fast packet lookup/forwarding requirement. This only adds to the already difficult-to-achieve wire-speed/cycle-per-packet equilibrium discussed above.
New emerging services also drive the requirements for a Network Processor architecture: a typical NP should be extensible so that it can support new services on the same bandwidth. Therefore flexibility achieved through programmability becomes a key concern for an NP based solution. Besides flexibility, scalability is another key factor: the Network Processor architecture should be able to scale to higher line rates or to provide more sophisticated packet processing support for QoS/VPN/IPSec solutions.
ASIC – Application Specific Integrated Circuit – ASICs are hardwired implementations that can handle high-speed ingress traffic along with fast lookup and packet processing. But their high processing power comes at the expense of flexibility. Therefore, a logic once built into hardware needs a redesign phase, impacting time-to-market, to scale up to a new requirement.
FPGA – Field Programmable Gate Array – FPGAs provide programmability at the gate level. They are capable of fast packet processing but do not provide the desired flexibility.
Co-processors – Co-processors are hardwired implementations that do not execute any instruction but work asynchronously with a processor. Commands are given to a co-processor, which then provides the result asynchronously to the Processing Unit. These can be used to offload certain specific tasks like complex mathematical computation, encryption/decryption/authentication, table lookup etc.
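The command/result pattern described for co-processors can be sketched in software. The example below is only an analogy: a single worker thread stands in for the co-processor, and the names `coprocessor_digest` and `handle_packet`, the `|`-separated header and the choice of SHA-256 as the offloaded task are all assumptions made for illustration.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def coprocessor_digest(payload: bytes) -> str:
    """Hypothetical offloaded task: authenticate a payload with SHA-256."""
    return hashlib.sha256(payload).hexdigest()


def handle_packet(header: bytes, payload: bytes, coprocessor: ThreadPoolExecutor):
    # Give the command to the co-processor; this call does not block.
    pending = coprocessor.submit(coprocessor_digest, payload)
    # The Processing Unit keeps working on the header in the meantime.
    header_fields = header.split(b"|")
    # Collect the asynchronous result only when it is actually needed.
    return header_fields, pending.result()


with ThreadPoolExecutor(max_workers=1) as coprocessor:
    fields, digest = handle_packet(b"10.0.0.1|10.0.0.2", b"payload", coprocessor)

print(fields, digest[:16])
```

The point of the sketch is the ordering: the command is issued first, unrelated work proceeds, and the result is picked up later.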
GPP – General Purpose Processors (GPPs) are the most flexible in terms of programmability but do not provide the necessary speed to sustain fast packet processing at line rates. A sequential processing pattern, combined with standard synchronous memory load/store operations, makes GPPs unsuitable for the desired wire-speed packet processing. Moreover, there is a limit to which clock rates can be increased. Underlying memory technologies have improved, with systems now using QDR SRAM and Rambus DRAM along with memory latency hiding techniques like data pre-fetching and split transaction buses, but the improvement lags behind the rate at which processor speeds are increasing.
Current network system implementations use a combination of GPP and FPGA/ASIC. The GPP provides the control plane functionality and handles exceptional traffic, whereas data plane (layer 2/3) processing is performed by the FPGA/ASIC. These implementations are fast but are designed specifically for one application's needs, and therefore cannot scale to new application requirements.
Going by the above discussions it is evident that none of the above system implementations meet all requirements for network packet processing. To meet the requirements of a Network Processor, the architecture of an NP needs to be addressed at both macro and micro level.
Pipelining – In a typical RISC processor, execution of an instruction requires multiple steps
Pipelining at the micro level utilizes the parallelism at the instruction level. Independent instructions can be executed in parallel in a pipeline. Therefore the execution units can be pipelined enabling the execution of multiple instructions at the same time.
An important thing to note is that in a conventional pipeline, there are multiple instructions being executed at the same time but in different stages.
Pipelining at the macro level utilizes the packet level parallelism (each packet is processed independent of the other packet) inherent in a Network Application and enables Processing units in a pipeline to execute multiple packets at the same time.
Processing Unit – A Processing Unit is the processor executing instructions.
Functional Unit – A computational unit that is part of the pipeline of a Processing Unit. A functional unit can be designated to perform integer or floating point computation. The number of Functional Units is sometimes referred to as the machine width. So, a Processing Unit with 4 Functional Units can be called “4-wide” or “4-issue wide”. Each of the Functional Units provides an Execution Slot. Therefore, a 4-issue wide Processing Unit has 4 slots. If each of the Functional Units can do one integer/floating point operation per cycle, then we have 4 execution slots per cycle. Keeping these slots busy is critical for the performance of a Processing Unit.
The micro level architecture of a network processor employs the same basic performance enhancing techniques that are used to improve performance in a GPP. The idea is to find parallelism so that the increasing processing power is used effectively. Parallelism is the key to increasing performance: the higher the level of parallelism, the more work can be extracted from the Execution slots, and the higher the performance. Parallelism exists at multiple levels -
Instruction Level Parallelism (ILP) – instructions that are independent can be executed in parallel.
Thread Level Parallelism (TLP) – multiple threads of execution can be executed in parallel. Multiple threads help keep the pipeline running if the current thread is stalled waiting for data from memory.
Pipelining is the most elementary and common feature of a processing unit's micro architecture. Processing units depend on the Instruction Level Parallelism (ILP) and Thread Level Parallelism (TLP) available in a sequential program implementation. Processing Unit implementations can choose between single and multiple instruction pipelining.
1-issue wide Pipeline – Going by the above given definitions, a 1-issue wide pipeline has one Execution slot. Therefore, at each cycle only one computation can be performed.
Multiple issue-wide Pipeline – There is more than one (let's say n) Execution slot. Therefore, more than one instruction can have its computation done in a single cycle.
An implementation can have a multi-stage pipelined architecture with branch prediction techniques used to boost performance.
In a Superscalar Architecture implementation, multiple instructions can be initiated simultaneously and executed independently. Superscalar architecture is similar to pipeline with a key difference that in this architecture instead of one, multiple instructions can be executed at each stage. Therefore, it can initiate multiple instructions during the same clock cycle.
In a Superscalar implementation, there are multiple Functional units of the same type along with additional circuitry to dispatch instructions to the units. Superscalar architecture exploits ILP by executing multiple instructions from a single program in a single cycle. Multiple instructions are fetched and fed to multiple independent Functional Units. These instructions may be fed out of order. Efficiency of the dispatcher is important to achieve the desired performance. Normally superscalar implementations have a single thread of control. This kind of architecture is efficient when a high level of ILP exists.
Processes are made up of threads. Each process consists of at least one main thread of execution. Processes can also have multiple threads, with each thread having its own local context and sharing the process's context with the other threads. On a non-multithreaded Processing Unit, the operating system fakes multithreading by periodically scheduling each of the threads. Every thread is given a time-slice for execution. In reality, only a single thread is being executed at a time.
A multithreaded Processing Unit, on the other hand, is capable of executing more than one thread at a time. It uses multiple Functional Units to execute multiple threads at the same time. Therefore, if there are 4 issue slots, 4 independent instructions from thread A can be executed in one cycle. In the next cycle, 4 independent instructions from thread B are executed, and so on. Here too the threads are scheduled periodically, but the time-slice has now been reduced to one cycle. This kind of architecture helps in hiding memory latency. If one of the threads is waiting for data from memory, other threads can execute, thus keeping the Execution slots busy.
Due to the restriction of executing independent instructions from the same thread in a cycle, a multithreaded implementation is less efficient when an application does not have enough Instruction Level Parallelism. For instance, if only 2 independent instructions can be extracted from a thread, then the other 2 execution slots in a 4-issue wide machine will go to waste.
This kind of architecture is most efficient when there is a high level of both ILP and TLP in an application. A high level of TLP is definitely required for a multithreaded implementation to display improved performance.
A multithreaded implementation maintains a per-thread Instruction Pointer, decode unit and registers, thereby alleviating context switch overhead. The instruction fetch unit, execution units and queues are shared amongst all threads.
Simultaneous Multi-threading is a step ahead of the Multi-threaded architecture. It removes the restriction present in multithreading by allowing instructions from more than one thread to be executed in the same clock cycle. For instance, let's take two threads A and B executing on a 4-issue wide machine. If only 2 independent instructions can be extracted for thread A from the current instruction window, then the remaining 2 execution slots can be filled by 2 independent instructions from thread B.
Simultaneous Multi-threading allows multiple threads to compete for shared resources every cycle. SMT has the ability to extract maximum performance by using both ILP and TLP. If an application has high level of TLP but low ILP, then threads compete for all the slots in a clock cycle thereby reducing the unused execution slots. This can also be said of multiple applications having single/multiple threads. In the event of a non-multithreaded application, all resources are dedicated to the single thread.
Resource replication and sharing are the same as in the Multithreaded Architecture.
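The difference in slot usage between multithreading and Simultaneous Multi-threading can be illustrated with a toy model. The sketch below assumes that every thread can supply a fixed number of independent instructions in every cycle, which is a deliberate simplification; the function names and the round-robin priority are likewise illustrative assumptions.

```python
def multithreaded_slots(ilp_per_thread, width, cycles):
    """Multithreading: one thread owns all execution slots in a given cycle."""
    used = 0
    for cycle in range(cycles):
        thread = cycle % len(ilp_per_thread)  # round-robin, one-cycle time-slice
        used += min(ilp_per_thread[thread], width)
    return used


def smt_slots(ilp_per_thread, width, cycles):
    """SMT: slots left unused by one thread are filled from the other threads."""
    used = 0
    n = len(ilp_per_thread)
    for cycle in range(cycles):
        free = width
        for offset in range(n):
            thread = (cycle + offset) % n  # rotate priority every cycle
            issued = min(ilp_per_thread[thread], free)
            free -= issued
            used += issued
    return used


ilp = [2, 2]  # threads A and B each expose 2 independent instructions
width, cycles = 4, 100
mt = multithreaded_slots(ilp, width, cycles)
smt = smt_slots(ilp, width, cycles)
print(f"multithreaded: {mt / (width * cycles):.0%} of slots used")
print(f"SMT:           {smt / (width * cycles):.0%} of slots used")
```

In this model the two-thread example from the text leaves half of the slots empty under multithreading, while SMT fills all four slots in every cycle.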
In a CMP (chip multiprocessor) architecture, multiple independent Processing Units are present on the same chip. Multiprocessors exploit TLP by executing different threads simultaneously on different processors. The processors in a CMP implementation have independent resources such as the instruction fetch unit, Instruction Pointer, registers, cache and issue logic. These processors share only external memory. A CMP can also combine multithreading with multiple Processing Units. In such a combined implementation, each Processing Unit can have hardware support for multiple threads.
Network Processors with Embedded Architectures are typically RISC processors with increased clock speeds and hardware optimized for network applications. Network Processors implementing this type of architecture have a central core, possibly with additional co-processors added to offload specific tasks like lookup, maintaining statistics etc. Ingress packets are handled sequentially, with one ingress packet processed completely before another can be taken up for processing.
An Embedded Architecture usually combines features at the micro level. It may employ a multi-stage, multi-issue-wide pipeline. These network processors may also be multi-threaded, with hardware optimized to perform a one-cycle context switch amongst multiple threads. This is achieved by replicating registers, including the GPRs, status register and Instruction Pointer. The instruction set is also optimized: certain addressing modes, such as indirect addressing, are disabled, and it contains special instructions such as ones' complement add, bit-level manipulation etc.
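To see why a ones' complement add instruction is useful, recall that the Internet checksum (RFC 1071) used in IP headers is a ones' complement sum of 16-bit words. The sketch below shows the operation in software; the function names are illustrative, and a network processor would perform the add with end-around carry in a single instruction rather than the two steps shown here.

```python
def ones_complement_add16(a: int, b: int) -> int:
    """Add two 16-bit values and fold the carry back in (end-around carry)."""
    total = a + b
    return (total & 0xFFFF) + (total >> 16)


def internet_checksum(words) -> int:
    """Ones' complement of the ones' complement sum of 16-bit words."""
    acc = 0
    for word in words:
        acc = ones_complement_add16(acc, word)
    return ~acc & 0xFFFF


# A 20 byte IPv4 header as 16-bit words, with the checksum field set to zero.
header = [0x4500, 0x0073, 0x0000, 0x4000, 0x4011,
          0x0000, 0xC0A8, 0x0001, 0xC0A8, 0x00C7]
checksum = internet_checksum(header)
print(hex(checksum))
```

Placing the computed value in the checksum field and summing the header again gives 0xFFFF, so the checksum of a valid header evaluates to zero.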
This architecture has the drawback that the processor is idle while a packet is waiting for data to arrive from external memory.
Network Processors with parallel architecture have multiple identical Processing Units logically arranged in parallel. Dispatcher logic assigns a Processing Unit to each of the ingress packets. As all units are identical, any packet can go to any of the Processing Units. This kind of architecture requires complex and efficient dispatcher logic and coordination between processing units if any inter-packet dependency exists. It also requires some coordination mechanism at the egress to accept packets from multiple parallel units, manage the egress traffic flow and send them out in order.
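The dispatcher and the egress ordering requirement can be sketched as follows. This is a simplified model under stated assumptions: all packets are already queued at time zero, each packet has a known processing cost in cycles, and the function names `dispatch` and `egress_in_order` are hypothetical.

```python
import heapq


def dispatch(processing_costs, num_units):
    """Give each packet to the unit that frees up first.

    Returns packet sequence numbers in the order in which they finish.
    """
    free_at = [(0, unit) for unit in range(num_units)]
    heapq.heapify(free_at)
    completions = []
    for seq, cost in enumerate(processing_costs):
        start, unit = heapq.heappop(free_at)
        finish = start + cost
        completions.append((finish, seq))
        heapq.heappush(free_at, (finish, unit))
    completions.sort()
    return [seq for _finish, seq in completions]


def egress_in_order(completion_order):
    """Reorder buffer: hold a finished packet until all earlier ones are sent."""
    waiting = set()
    next_seq = 0
    sent = []
    for seq in completion_order:
        waiting.add(seq)
        while next_seq in waiting:
            waiting.remove(next_seq)
            sent.append(next_seq)
            next_seq += 1
    return sent


finished = dispatch([5, 1, 1, 3], num_units=2)
sent = egress_in_order(finished)
print("finished:", finished)
print("sent:    ", sent)
```

Because packet 0 is expensive, packets 1 and 2 finish first on the other unit, and the egress stage has to hold them until packet 0 is ready.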
At the micro architecture level, the individual Processing Units may employ a multi-stage, multi-issue wide pipeline architecture to utilize instruction level parallelism. Even though there are multiple processing units, each unit processes a packet completely before it can take up a new one, which leads to wasted cycles while a packet is waiting for data from external memory.
Instead of a Processing Unit processing a complete packet, packet processing is divided into independent stages, which are then executed by each of the Processing Units in a pipeline. Based on the pipeline paradigm, each stage partially processes the packet and passes the result to the next stage. Each stage in the pipeline is independent and the packet state is not maintained across Processing Units. For example, in a three stage PU pipeline, first Processing Unit does the packet lookup and extraction of the desired fields. This result is then passed on to the Policy Processing Unit, which performs pattern matching and takes policy decision. Decision taken, the result is then passed on to the Egress Processing Unit, which modifies the packet accordingly and sends the packet out.
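The three-stage example can be sketched as three functions driven by a simple clock. Everything concrete here is an assumption made for illustration: the packet layout (source, destination, payload), the blocked-source policy, and the names `lookup_stage`, `policy_stage`, `egress_stage` and `run_pipeline`.

```python
def lookup_stage(raw):
    """Stage 1: look up the packet and extract the desired fields."""
    src, dst, payload = raw
    return {"src": src, "dst": dst, "payload": payload}


def policy_stage(fields, blocked):
    """Stage 2: match against the policy and attach the decision."""
    action = "drop" if fields["src"] in blocked else "forward"
    return {**fields, "action": action}


def egress_stage(decided):
    """Stage 3: act on the decision and send the packet out."""
    if decided["action"] == "drop":
        return None
    return (decided["src"], decided["dst"], decided["payload"])


def run_pipeline(packets, blocked):
    pending = list(packets)
    stage1_out = stage2_out = None
    sent, ticks = [], 0
    while pending or stage1_out is not None or stage2_out is not None:
        # Work back to front so each stage consumes the previous tick's result.
        if stage2_out is not None:
            out = egress_stage(stage2_out)
            if out is not None:
                sent.append(out)
        stage2_out = policy_stage(stage1_out, blocked) if stage1_out is not None else None
        stage1_out = lookup_stage(pending.pop(0)) if pending else None
        ticks += 1
    return sent, ticks


packets = [("10.0.0.1", "10.0.0.9", "a"), ("10.0.0.2", "10.0.0.9", "b"),
           ("10.0.0.3", "10.0.0.9", "c"), ("10.0.0.1", "10.0.0.9", "d")]
sent, ticks = run_pipeline(packets, blocked={"10.0.0.2"})
print(len(sent), "packets sent in", ticks, "ticks")
```

Once the pipeline is full, three packets are in flight at the same time, so four packets need six ticks rather than twelve.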
Following steps summarize Pipelined Packet Processing
It is important to note here that due to the nature of the split processing, the pipeline-stalling problem similar to micro architecture is also faced at the macro level. All stages of the pipeline need to complete packet processing in almost the same time, failing which there would be wasted cycles or “bubbles” in the pipeline. These bubbles adversely affect the performance of the Network Processor.
At the micro architecture level, these Processing Units can implement a multi-threaded architecture, with or without multi-issue. If a Processing Unit is implemented with multiple Functional Units, then at any given time it can handle more than one packet. This is very useful in hiding memory latency when one of the threads is awaiting data from memory.
Parallel and Pipelined approach can be combined to build a hybrid Network Processor Architecture.
In this architecture, each Processing Unit at a given stage can be replicated to work in parallel. Therefore, depending upon the extent of processing required by each stage to sustain high packet throughput, multiple Processing Units can be employed for a given stage. For example, in the above three-stage pipeline, if one Processing Unit is not sufficient for the first stage, multiple Processing Units can be added in parallel to perform the stage one operations. All these stage one PUs can be synchronized to advance their results further in the pipeline. In a similar manner, stage two and stage three Processing Units can be added in parallel.
Hybrid Architecture in a way attempts to solve the problem of pipeline stalling in a pure Pipelined Architecture. Placing multiple Processing Units at a slow stage can enable the stage to provide enough packet inputs to the subsequent stage, thus keeping the pipeline moving.
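A rough way to reason about this is that a pipeline can sustain only the rate of its slowest stage, and that replicating a stage multiplies that stage's rate. The sketch below is a back-of-the-envelope model; the per-stage cycle counts are invented for illustration and the function names are assumptions.

```python
def stage_rates(cycles_per_packet, units_per_stage):
    """Packets per cycle that each stage can sustain."""
    return [units / cycles
            for cycles, units in zip(cycles_per_packet, units_per_stage)]


def pipeline_throughput(cycles_per_packet, units_per_stage):
    """The pipeline as a whole is limited by its slowest stage."""
    return min(stage_rates(cycles_per_packet, units_per_stage))


# Hypothetical costs: lookup 40, policy 120, egress 40 cycles per packet.
costs = [40, 120, 40]
pure = pipeline_throughput(costs, [1, 1, 1])    # one PU per stage
hybrid = pipeline_throughput(costs, [1, 3, 1])  # three PUs at the slow stage
print(f"pure pipeline:   {pure:.4f} packets/cycle")
print(f"hybrid pipeline: {hybrid:.4f} packets/cycle")
```

With these assumed costs, three Processing Units at the policy stage raise the sustained rate from one packet every 120 cycles to one packet every 40 cycles, which matches the other two stages and removes the bubbles.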
In the Parallel, Pipelined and Hybrid Architectures, memory is shared amongst multiple Processing Units. These designs increase the load on the memory communication bus, creating a situation where having multiple Processing Units can even hurt performance because of slow memory access operations. This issue of memory latency is addressed in the next section.
Memory accesses remain far slower than the Processing Units. Network Processors use more than one of the following ways to hide memory latency.
Multi-threading
Multiple threads can help reduce wasted cycles by allowing the hardware to execute a different thread if the current thread is stalled waiting for data from memory. It is important to note that hardware support for multi-threading is key to achieving the desired result. Multi-threading at the OS level alone would lead to high overhead in switching the context from the currently stalled thread to another ready-to-run thread. Therefore, many network processors have a separate register set for each thread and a hardware unit that performs the switch in one cycle.
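The effect of hardware multi-threading on wasted cycles can be estimated with a small cycle-level simulation. The model is an assumption made for illustration: every thread alternates between a fixed burst of compute cycles and a memory access with a fixed latency, and switching to another ready thread costs nothing.

```python
def utilization(num_threads, compute_cycles, memory_latency, total_cycles):
    """Fraction of cycles in which some thread is executing."""
    ready_at = [0] * num_threads                 # cycle when a thread can run again
    remaining = [compute_cycles] * num_threads   # compute left before next access
    busy = 0
    current = 0
    for cycle in range(total_cycles):
        for offset in range(num_threads):
            t = (current + offset) % num_threads
            if ready_at[t] <= cycle:
                busy += 1
                remaining[t] -= 1
                if remaining[t] == 0:
                    # Thread issues a memory access and stalls until it returns.
                    ready_at[t] = cycle + 1 + memory_latency
                    remaining[t] = compute_cycles
                    current = (t + 1) % num_threads
                else:
                    current = t
                break
    return busy / total_cycles


for threads in (1, 2, 5):
    share = utilization(threads, compute_cycles=10, memory_latency=40,
                        total_cycles=1000)
    print(f"{threads} thread(s): {share:.0%} busy")
```

With 10 compute cycles per 40 cycle memory access, a single thread keeps the unit busy only a fifth of the time, while five hardware threads are enough to cover the latency completely in this model.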
Split transaction buses
With multiple Processing Units and co-processors accessing the common shared memory, traditional load/store instructions are not feasible. On a split transaction bus, a memory access request is placed on the command bus and the response is received asynchronously later. These buses therefore enable data to be pre-fetched well ahead of the time it is processed, thus hiding the memory access latency.
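The benefit of issuing requests ahead of time can be shown with a simple cycle count. The model below is an assumption for illustration: one read is needed per packet, the command bus accepts one request per cycle independently of the compute work, and the latency and compute figures are invented.

```python
def blocking_cycles(num_packets, latency, compute):
    """Synchronous load/store: wait for every read before computing."""
    return num_packets * (latency + compute)


def split_transaction_cycles(num_packets, latency, compute):
    """Requests are issued up front; responses arrive asynchronously later."""
    done = 0
    for i in range(num_packets):
        response_at = i + latency  # request i is placed on the bus in cycle i
        done = max(done, response_at) + compute
    return done


blocking = blocking_cycles(8, latency=40, compute=10)
split = split_transaction_cycles(8, latency=40, compute=10)
print("blocking:         ", blocking, "cycles")
print("split transaction:", split, "cycles")
```

In the split transaction case only the first response is waited for in full; the later responses arrive while earlier packets are still being processed.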
The processing requirements of a Network Processor are different; therefore the instruction set of an NP is also optimized for packet processing. Some of the desired instruction set features are given below -
Both micro and macro level architecture features can be combined to form the building blocks of a Network Processor.
Based on the above discussion, this section briefly describes the macro and micro level features of the various NP architectures present in the market.
--- More to come
maintained by: Anil Kumar (anil.rajput@hsc.com)