The DNPCIE_400G_VU_LL is a PCIe-based FPGA board designed to minimize input-to-output processing latency on 10 GbE, 40 GbE, or 100 GbE Ethernet packets. The primary application is low-cost, low-latency, high-throughput trading without CPU intervention. Every possible variable that affects input-to-output latency has been analyzed and minimized. Raw 10/40/100 GbE Ethernet packets can be analyzed and acted upon without a MAC, interrupts, or an operating system adding delay to the process. This configurable hardware computing platform is able to achieve the theoretical minimum Ethernet packet processing latency.
1. The FPGA - Xilinx Virtex UltraScale+/UltraScale
We use a single FPGA from the Xilinx Virtex UltraScale+/UltraScale family in the B2104 package. This package supports 702 I/Os, the majority of which are utilized. Most are dedicated to off-chip memory peripherals, including a single QDRII+ dual-port memory and several banks of DDR4 memory. The Virtex UltraScale/UltraScale+ FPGA contains high-speed transceivers capable of 25 Gb/s. Sixteen of these transceivers are used for a 16-lane GEN3/4 PCIe interface. Four sets of 4 GTY transceivers are connected to QSFP28 sockets for 40/100 GbE Ethernet (or 4 channels of 10 GbE each). Sixteen additional GTY transceivers are attached to Samtec Firefly connectors and can be used for high-speed board-to-board communication using cables, or for more 10/40/100 GbE ports.
Ten possible UltraScale+/UltraScale FPGAs can be stuffed: VU13P, VU11P, VU9P, VU7P, VU5P, VU190, VU160, VU125, VU095, and VU080. Two possible Kintex UltraScale FPGAs can be stuffed: KU115 and KU095, but note some reduced performance on the GTY interfaces. These FPGAs come in a variety of speed grades (-3, -2E/2I, -1/1L), with -3 the fastest. A -2 or faster is required to achieve the highest clock rates on the memory interfaces. Table 1 depicts the resources of the FPGAs with the Xilinx marketing exaggerations excised. These are large FPGAs, with Kintex being the most cost effective. The VU13P is capable of handling ~20M ASIC gates of logic, a figure that does not include the internal FPGA memory and multiplier blocks. UltraScale+ adds large blocks of internal RAM (UltraRAM). Features of the Xilinx UltraScale/UltraScale+ FPGAs include efficient, dual-register 6-input look-up table (LUT) logic, 36 Kb (2 x 18 Kb) block RAMs, and third-generation DSP slices (including 27 x 18 multipliers and a 48-bit accumulator). Floating-point functions can be implemented using these DSP slices.
2. Low Latency Network Interface
4 channels of 40/100 GbE or a mix of 10 GbE via quad QSFP28
The Virtex UltraScale/UltraScale+ FPGA has transceivers capable of 25 Gb/s. The physical interface (PHY) is handled using quad QSFP28 modules for 40/100 GbE. With the proper cable, each socket can be split into 4 separate channels of 10 GbE. Raw Ethernet packets can be accessed directly by bypassing the MAC.
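Bypassing the MAC means the application logic sees raw Ethernet frames and must decode the header fields itself. As a host-side illustration only (on the board this is done in FPGA logic), the sketch below parses an Ethernet II header in Python; the frame bytes are made up for the example.

```python
import struct

def parse_ethernet_header(frame: bytes) -> dict:
    """Split a raw Ethernet II frame into header fields and payload."""
    # 6-byte destination MAC, 6-byte source MAC, 2-byte EtherType (big-endian).
    dst, src, ethertype = struct.unpack("!6s6sH", frame[:14])
    return {
        "dst": dst.hex(":"),
        "src": src.hex(":"),
        "ethertype": ethertype,
        "payload": frame[14:],
    }

# Minimal hand-built frame: broadcast destination, IPv4 EtherType (0x0800),
# followed by the first two bytes of an IPv4 header.
frame = bytes.fromhex("ffffffffffff" "020000000001" "0800") + b"\x45\x00"
hdr = parse_ethernet_header(frame)
print(hex(hdr["ethertype"]))  # 0x800
```

In the FPGA, the equivalent field extraction is a matter of slicing bits out of the incoming data bus on the same clock cycle the data arrives, which is where the latency savings over a software stack come from.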
3. QDR II+ SSRAM - Memory with the lowest latency
We use a single quad data rate static RAM (QDR II+ SSRAM) in the 8M x 18 size (144 Mbit). This type of memory has separate input and output data paths, enabling maximum read/write data bandwidth with minimum latency. The maximum tested frequency of this memory is 550 MHz. To minimize processing latency, we suspect it will be best to clock this QDRII+ SRAM at 312.50 MHz, exactly twice the internal Ethernet controller frequency of 156.25 MHz. The FPGAs are capable of generating internal 2x clocks that are phase synchronous, eliminating the latencies associated with the tricky re-synchronization of data moving between different clock frequencies. The internal controller can be optimized in any way you choose. We, of course, provide several Verilog examples at no charge that you are welcome to use. All functions of the QDR II+ SSRAM can be exploited, including concurrent read and write operations and four-tick bursts. The only real limitation is the amount of time and effort spent customizing the individual memory controllers.
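The 312.50 MHz choice follows from simple arithmetic: the usable phase-synchronous rates are the integer multiples of the 156.25 MHz Ethernet clock that stay at or below the memory's 550 MHz tested maximum. A quick Python check (illustrative only) enumerates them:

```python
BASE_MHZ = 156.25      # internal Ethernet controller clock
QDR_MAX_MHZ = 550.0    # maximum tested QDRII+ SSRAM frequency

# Phase-synchronous candidates are integer multiples of the base clock
# that do not exceed the memory's tested maximum.
candidates = [n * BASE_MHZ for n in range(1, 8) if n * BASE_MHZ <= QDR_MAX_MHZ]
print(candidates)  # [156.25, 312.5, 468.75]
```

Note that 3x (468.75 MHz) also fits under the 550 MHz limit; the 2x rate recommended above simply leaves more timing margin while still doubling the memory bandwidth relative to the Ethernet clock domain.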
4. DDR4 - 16GB of local bulk memory
PC4-2400 DDR4 chips are mounted on the card, providing 5 banks of DDR4 memory. One bank is configured as 1G x 64; four additional banks are configured as 1G x 16. Note that the VU11P loses a single bank of 1G x 16 memory. Using a -2 or -3 speed grade FPGA, these memory banks are tested at the maximum FPGA I/O frequency: 1200 MHz (2400 Mb/s with DDR).
To minimize data synchronization across clock boundaries, it probably makes sense to clock this DDR4 interface at a 7x multiple of the base Ethernet frequency of 156.25 MHz, which is 1093.75 MHz. A 7x phase-synchronous clock can be easily generated internal to the FPGA, allowing zero-latency synchronous data transfers between the Ethernet packet-receiving logic and the DDR4 memory controller. The DDR4 controller can be optimized in any way you choose. We, of course, provide several Verilog examples at no charge that you are welcome to use. All functions of the DDR4 DRAM can be exploited and optimized. Up to 8 banks can be open at once. Timing variables such as CAS latency and precharge can be tailored to the minimum given your operating frequency and the timing specification of the exact DDR4 memory utilized. As with the QDRII+, the only real limitation is the amount of time and effort spent customizing the DDR4 memory controller to your needs.
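The 7x figure is simply the largest integer multiple of 156.25 MHz that fits under the 1200 MHz maximum I/O frequency, as a one-line Python check (illustrative only) confirms:

```python
BASE_MHZ = 156.25        # internal Ethernet controller clock
DDR4_MAX_MHZ = 1200.0    # 2400 Mb/s DDR -> 1200 MHz I/O clock

# Largest phase-synchronous multiple of the Ethernet clock the DDR4
# interface can sustain.
n = int(DDR4_MAX_MHZ // BASE_MHZ)
print(n, n * BASE_MHZ)   # 7 1093.75
```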
5. PCIe - Customizable 16-lane, GEN3/4 PCI Express
PCIe is connected directly to the FPGA via 16 lanes of GTY transceivers. The interface is fully GEN2/GEN3/GEN4 capable. We ship GEN3 PCIe IP that is a full-function, fixed, 16-lane master/target. To gain access to the PCIe interface, this IP must be integrated with your application. The Dini Group PCIe IP provides a flexible interface that allows the user access to multiple DMA engines, scratchpad memories, interrupts, and other endpoint-related functions to maximize performance while utilizing minimal FPGA resources. Drivers (required), with 'C' source, for several operating systems are included at no charge.
6. How Everything Works...
With direct data feeds such as NASDAQ ITCH and OUCH, the DNPCIE_400G_VU_LL contains all of the basic functions required to minimize the amount of time it takes to receive Ethernet packets, process them, and respond deterministically. By using the FPGA to process Ethernet packets, the processor and operating system are removed from the critical path, and traditional sources of latency such as interrupts and context switching no longer hinder performance. Not a single clock cycle is wasted. For algorithms requiring processing, FPGA resources can be hard coded to perform the task. This includes real-time Monte Carlo analysis and floating-point computation.
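As a software analogue of the parsing the FPGA performs in hardware, the sketch below decodes an ITCH-style "Add Order" message in Python. The field layout loosely follows the published NASDAQ TotalView-ITCH 5.0 specification (big-endian, fixed-width fields, price in 1/10000 dollars); treat it as an illustration, not production market-data code, and check field offsets against the current spec.

```python
import struct

# Assumed "Add Order" (type 'A') layout, per ITCH 5.0: message type (1),
# stock locate (2), tracking number (2), timestamp (6), order reference (8),
# buy/sell indicator (1), shares (4), stock symbol (8, space-padded), price (4).
ADD_ORDER = struct.Struct("!c H H 6s Q c I 8s I")  # 36 bytes total

def parse_add_order(msg: bytes) -> dict:
    (mtype, locate, tracking, ts, order_ref,
     side, shares, stock, price) = ADD_ORDER.unpack(msg)
    assert mtype == b"A", "not an Add Order message"
    return {
        "order_ref": order_ref,
        "side": side.decode(),
        "shares": shares,
        "stock": stock.decode().strip(),
        "price": price / 10_000.0,   # price field is in 1/10000 dollars
    }

# Hand-built example message (values are made up for the illustration).
msg = ADD_ORDER.pack(b"A", 1, 0, b"\x00" * 6, 42, b"B",
                     100, b"AAPL    ", 1_501_200)
print(parse_add_order(msg)["price"])  # 150.12
```

In the FPGA, each of these fixed-width fields maps to a fixed bit range of the incoming data bus, so the entire decode can complete in a single clock cycle, which is exactly why hard-coded parsing beats a software stack on latency.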
- Quad QSFP28 sockets. Each socket can be:
  - 4 ports of 10 GbE, or
  - 1 port of 40 GbE, or
  - 1 port of 100 GbE (UltraScale+ only)
- 4 separate Samtec Firefly connectors for MTP
  - 4 GTY lanes per connector
  - Additional 10/40/100 GbE ports or board-to-board connections
- Hosted in a 16-lane GEN3/GEN4 PCIe slot (GEN4 with 8 lanes)
  - Compatible with
  - PCIe full height, GPU length
- Fully compatible with our optional
- Optional FIX board support package ().
- Functioning reference design with:
  - 10 GbE/40 GbE/100 GbE MAC
  - TCP/IP Offload Engine (TOE)
    - Up to 128 sessions
  - FIX protocol parser
  - PCIe interface (16-lane, GEN3)
- QDRII+ Controller
- DDR4 Controller
- Xilinx Virtex UltraScale+/UltraScale FPGA (B2104)
  - , and other third-party debug solutions
- Five FPGA-controlled LEDs
  - 1 RGB tri-color LED piped to the front panel
  - 4 green LEDs on-board
  - Enough debug LEDs to illuminate virtually nothing.