Xilinx Answer 63234 – MIG 7 Series DDR2/DDR3 Performance, Efficiency, and Bus Utilization
Important Note: This downloadable PDF of an Answer Record is provided to enhance its usability and readability. It is important to note that Answer Records are Web-based content that is frequently updated as new information becomes available. You are reminded to visit the Xilinx Technical Support Website and review (Xilinx Answer 63234) for the latest version of this Answer.
Introduction
Because of how DDR2 and DDR3 memories are architected and how the MIG 7 series controller is designed, estimating performance is not straightforward. It requires an understanding of various JEDEC timing parameters and the controller architecture, and you will need to run simulations to obtain estimates. The general principle for determining performance is always the same, but this document provides an easy way to measure efficiency using the MIG example design, with the help of the test bench and stimulus files attached here.
Effective Bandwidth
The DRAM data bus achieves near peak bandwidth only during bursts of reads and writes; various overheads lower the effective data rate.
A few examples of overhead are:
- precharge time when accessing a new row in the same bank (the accessed address is not in the currently open row, i.e., a page miss)
- write recovery time to change from write to read access
- bus turnaround time to change from read to write access
Efficiency (%) = (Clock cycles transferring data / Total clock cycles) × 100
Effective Bandwidth = Peak Bandwidth * Efficiency
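The two formulas above can be sketched as a small calculation. This is a minimal Python sketch with hypothetical numbers (a 64-bit DDR3-1600 interface and 85% efficiency are assumptions for illustration, not values from this Answer Record):

```python
def efficiency_pct(data_cycles, total_cycles):
    """Efficiency (%) = clock cycles transferring data / total clock cycles."""
    return 100.0 * data_cycles / total_cycles

def effective_bandwidth(peak_bw_gbps, eff_pct):
    """Effective Bandwidth = Peak Bandwidth * Efficiency."""
    return peak_bw_gbps * eff_pct / 100.0

# Hypothetical example: a 64-bit DDR3-1600 interface peaks at
# 1600 MT/s * 8 bytes = 12.8 GB/s. At 85% efficiency:
peak = 1600e6 * 8 / 1e9               # 12.8 GB/s peak bandwidth
print(effective_bandwidth(peak, 85))  # ~10.88 GB/s effective
```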
MIG Design Generation
- Refer to UG586 Chapter 1 for step-by-step details on MIG IP and example design generation.
- Before running the MIG 7 Series performance simulation, do the following to make sure your simulation environment is fine.
- Open the MIG example design and map the appropriate libraries, run the simulation, and ensure that you can see the message “test passed” in the transcript.
- To demonstrate the flow, I have generated a MIG IP for xc7vx690tffg1761-2 and invoked the example design.
- Two things that should be noted are memory address bits and memory address mapping selection.
- For example, I have selected MT41J128M8XX-125 under the memory part drop-down options.
For the memory part selected in Figure 1, row = 14, column = 10, and bank = 3, so app_addr_width = row + column + bank + rank = 28.
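The width arithmetic above can be written out explicitly. A minimal sketch, assuming the MT41J128M8 field widths quoted above and a single-rank configuration:

```python
# Field widths for MT41J128M8 (from Figure 1): row=14, column=10, bank=3.
# A single-rank design contributes 1 rank bit.
ROW_WIDTH, COL_WIDTH, BANK_WIDTH, RANK_WIDTH = 14, 10, 3, 1

app_addr_width = ROW_WIDTH + COL_WIDTH + BANK_WIDTH + RANK_WIDTH
print(app_addr_width)  # 28
```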
You can select either BANK_ROW_COLUMN or ROW_BANK_COLUMN.
I have left ROW_BANK_COLUMN, which is the default address mapping.
Example design Simulation with synthesizable test bench
- Under Simulation settings, select QuestaSim/ModelSim Simulator and browse to the compiled libraries location.
- For details on pointing to the third-party tools install path, selecting the target simulator, and compiling and mapping libraries, you can refer to (UG900) Vivado Design Suite User Guide Logic Simulation.
Simulate from the GUI (click the Run Simulation tab in the Project Manager) and make sure you see the “test passed” message in the transcript.
Performance Simulation RTL modifications
- Right click the sources tab, select “add or create simulation sources”, browse to the mig7_perfsim_traffic_generator.sv file and click finish to add it.
- Right click the sources tab, select “add or create simulation sources”, browse to perfsim_stimulus.txt, and click finish adding it.
- Comment out the example_top instantiation in the sim_tb_top.v file.
- Add the RTL lines below to sim_tb_top.v.
- Modify APP_ADDR_WIDTH, APP_DATA_WIDTH, RANK_WIDTH, ROW_WIDTH, and BANK_WIDTH according to your memory part selection. Values can be obtained from the _mig.v file.
- The yellow highlighted instantiation name mig_7series_0_mig can vary based on your component name during IP creation. Verify whether you have chosen a different name and change it accordingly.
- Once the IP is generated, open the _mig.v file, cross-check for any variations in LHS signal names, and correct them.
- app_sr_req, app_ref_req, and app_zq_req should be initialized to 0.
- As example_top.v is commented out and new files are added, you will probably see “?” beside the mig_7series_0_mig.v file under simulation sources.
- To map the correct file, right-click mig_7series_0_mig.v, select “Add Sources”, browse to <Project.directory>/mig_7series_0_example.srcs/sources_1/ip/mig_7series_0/mig_7series_0/user_design/rtl and add the mig_7series_0_mig_sim.v file.
- If you see “?” for the underlying files, add all RTL files in the clocking, controller, ip_top, phy, and UI folders.
- Once the RTL changes are done and all of the required files are added to your Simulation Sources, the hierarchy should be similar to Figure 5.
- The files highlighted in red are newly added, and “?” is expected on ECC-related modules as the selected memory configuration has the ECC option disabled.
Stimulus File Description
Each stimulus pattern is 48 bits, and the format is described in Figures 6-1 through 6-4.
Address Encoding (Address [35:0])
The address is encoded in the stimulus as per Figure 7-1 to Figure 7-6. All of the address fields need to be entered in the hexadecimal format.
All of the address fields have a width that is divisible by four so that they can be entered in hexadecimal format. The test bench only sends the required bits of an address field to the Memory Controller. For example, in an eight-bank configuration, only bank bits [2:0] are sent to the Memory Controller, and the remaining bits are ignored. The extra bits of an address field exist only so that you can enter the address in hexadecimal format. You must confirm that the value entered corresponds to the width of a given configuration.
- Column Address (Column[11:0]) – Column Address in the stimulus is provided to a maximum of 12 bits, but you need to address this based on the column width parameter set in your design.
- Row Address (Row[15:0]) – Row address in the stimulus is provided to a maximum of 16 bits, but you need to address this based on the row width parameter set in your design.
- Bank Address (Bank[3:0]) – The Bank address in the stimulus is provided to a maximum of four bits, but you need to address this based on the bank width parameter set in your design.
- Rank Address (Rank[3:0]) – Rank address in the stimulus is provided to a maximum of four bits, but you need to address this based on the rank width parameter set in your design.
- The address is assembled based on the top-level MEM_ADDR_ORDER parameter and sent to the user interface.
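The assembly step described above can be sketched in Python. This is a simplified model of the two MEM_ADDR_ORDER options, using the hypothetical MT41J128M8 field widths from earlier (the actual RTL packing lives inside the MIG core; this only illustrates the field ordering):

```python
def assemble_app_addr(order, rank, bank, row, col,
                      rank_w=1, bank_w=3, row_w=14, col_w=10):
    """Pack the address fields into a flat app_addr per MEM_ADDR_ORDER.

    Sketch only: widths default to the 28-bit example configuration
    (row=14, col=10, bank=3, rank=1) used earlier in this document.
    """
    if order == "ROW_BANK_COLUMN":
        fields = [(rank, rank_w), (row, row_w), (bank, bank_w), (col, col_w)]
    elif order == "BANK_ROW_COLUMN":
        fields = [(rank, rank_w), (bank, bank_w), (row, row_w), (col, col_w)]
    else:
        raise ValueError(order)
    addr = 0
    for value, width in fields:
        addr = (addr << width) | (value & ((1 << width) - 1))
    return addr

# Same fields, two different flat addresses depending on the mapping:
print(hex(assemble_app_addr("ROW_BANK_COLUMN", 0, 2, 0x000F, 0x00A)))
print(hex(assemble_app_addr("BANK_ROW_COLUMN", 0, 2, 0x000F, 0x00A)))
```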
Command Repeat (Command Repeat [7:0])
- The command repetition count is the number of times the respective command is repeated at the User Interface. The address for each repetition is incremented by 8. The maximum repetition count is 128.
- The test bench does not check for the column boundary, and it wraps around if the maximum column limit is reached during the increments.
- 128 commands fill up a page. For any starting column address other than 0, a repetition count of 128 crosses the column boundary and wraps around to the start of the column address.
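The increment-by-8 and wrap-around behavior described above can be sketched as follows (a simplified model of the test bench behavior, assuming a 10-bit column, i.e., a 1024-location page):

```python
def repeat_columns(start_col, repeat_count, col_width=10):
    """Column addresses produced by one stimulus command with a repeat count.

    Each repetition increments the address by 8 (one BL8 burst). The test
    bench does not check the column boundary, so the column simply wraps
    at the end of the page (2**col_width column locations).
    """
    page = 1 << col_width
    return [(start_col + 8 * i) % page for i in range(repeat_count)]

# Starting near the end of a 1024-column page, the address wraps:
print(repeat_columns(0x3F8, 3))  # [1016, 0, 8]
```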
Bus Utilization
The bus utilization is calculated at the User Interface, taking the total number of reads and writes into consideration, using the following equation:

Bus Utilization (%) = ((Number of Writes + Number of Reads) × 4) / (end_of_stimulus − calib_done) × 100

- A BL8 command takes four memory clock cycles.
- end_of_stimulus is the time when all of the commands are done.
- calib_done is the time when calibration is done.
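The calculation can be sketched directly from those terms. The cycle counts below are made-up illustrative numbers, not values from the Answer Record:

```python
def bus_utilization_pct(num_writes, num_reads, calib_done, end_of_stimulus):
    """Bus utilization at the User Interface.

    Each BL8 command occupies four memory clock cycles; calib_done and
    end_of_stimulus are expressed in memory clock cycles.
    """
    data_cycles = 4 * (num_writes + num_reads)
    total_cycles = end_of_stimulus - calib_done
    return 100.0 * data_cycles / total_cycles

# Hypothetical numbers: 16 writes + 16 reads completing over 440 clocks:
print(round(bus_utilization_pct(16, 16, 1000, 1440), 1))  # 29.1
```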
Example Patterns
These examples are based on the MEM_ADDR_ORDER set to BANK_ROW_COLUMN.
Single Read Pattern
00_0_2_000F_00A_1 – This pattern is a single read from 10th column, 15th row, and second bank.
Single Write Pattern
00_0_1_0040_010_0 – This pattern is a single write to the 16th column, 64th row, and first bank.
Single Write and Read to Same Address
- 00_0_2_000F_00A_0 – This pattern is a single write to the 10th column, 15th row, and second bank.
- 00_0_2_000F_00A_1 – This pattern is a single read from the 10th column, 15th row, and second bank.
Multiple Writes and Reads with Same Address
- 0A_0_0_0010_000_0 – This corresponds to 10 writes with addresses starting from 0 and going up to 80, which can be seen in the column address.
- 0A_0_0_0010_000_1 – This corresponds to 10 reads with addresses starting from 0 and going up to 80, which can be seen in the column address.
Page Wrap During Writes
0A_0_2_000F_3F8_0 – This corresponds to 10 writes with column address wrapped to the start of the page after one write.
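The field layout used by all of the example patterns above can be captured in a small decoder. This sketch assumes the repeat_rank_bank_row_column_command ordering and hex fields shown in the examples (2 + 1 + 1 + 4 + 3 + 1 hex digits = 48 bits):

```python
def decode_stimulus(pattern):
    """Decode one 48-bit stimulus word written as
    repeat_rank_bank_row_column_command (all fields in hex)."""
    repeat, rank, bank, row, col, cmd = pattern.split("_")
    return {
        "repeat": int(repeat, 16),
        "rank": int(rank, 16),
        "bank": int(bank, 16),
        "row": int(row, 16),
        "column": int(col, 16),
        "command": "read" if int(cmd, 16) == 1 else "write",
    }

# The single-read example pattern from above:
print(decode_stimulus("00_0_2_000F_00A_1"))
# -> bank 2, row 15, column 10, read
```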
Simulating the Performance Traffic Generator
At this point, you are done with the MIG example design simulation. This implies that your simulation setup is ready, you have done performance simulation RTL modifications, the new simulation hierarchy is correct, and you have understood the stimulus patterns. Run the simulation once again with 16 writes and reads in perfsim_stimulus.txt.
- Do a run-all, wait until the init_calib_complete signal is asserted, and you will see the proposed number of writes and reads. The simulation will then stop.
- When you are prompted to quit the simulation, select No and go to the transcript window, where you will be able to see the performance statistics.
- If you select “quit simulation,” performance statistics will be written to a file named mig_band_width_output.txt located in the sim_1/behave folder.
- Example directory path: /mig_7series_0_example_perf_sim/mig_7series_0_example.sim/sim_1/behav
You might wonder why the bus utilization percentage is only 29. Rerun the simulation with the same IP settings, changing only the stimulus file to 256 writes and 256 reads:
- ff_0_0_0000_000_0
- ff_0_0_0000_000_1
You will now see the percentage as 85, which implies that DDR3 offers better bus utilization for long sequences of write and read bursts.
General ways to improve Performance
The factors that influence efficiency can be divided into two sections:
- Memory Specific
- Controller Specific
Figure 9 gives you an overview of the terms that are memory-specific.
Unlike SRAMs and Block Memories, DDR2 or DDR3 performance is not just the maximum data rate.
It depends on many timing factors, including:
- tRCD: Row Command Delay (RAS-to-CAS delay).
- tCAS(CL): Column address strobe latency.
- tRP: Row precharge delay.
- tRAS: Row Active Time (activate to precharge).
- tRC: Row cycle time. tRC = tRAS + tRP
- tRAC: Random access delay. tRAC = tRCD + tCAS
- tCWL: CAS Write Latency.
- tZQ: ZQ calibration time.
- tRFC: Row Refresh Cycle Time
- tWTR: Write to Read delay. Last write transaction to Read command time.
- tWR: Write Recovery time. Last write transaction to Precharge time
- The timing of all the listed parameters depends on the type of memory used and the memory part’s speed grade.
- More details on the definitions and timing specifications can be found in the DDR2 and DDR3 JEDEC standards or any memory device datasheet.
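The derived parameters tRC and tRAC follow directly from the definitions above. The nanosecond values below are illustrative assumptions, not datasheet values; always take the real numbers from your memory device datasheet:

```python
# Illustrative DDR3 timing values in nanoseconds (assumptions for the
# example only -- consult the datasheet for your part and speed grade).
tRCD, tCAS, tRP, tRAS = 13.75, 13.75, 13.75, 35.0

tRC = tRAS + tRP     # row cycle time:     tRC  = tRAS + tRP
tRAC = tRCD + tCAS   # random access delay: tRAC = tRCD + tCAS
print(tRC, tRAC)  # 48.75 27.5
```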
Efficiency mainly depends on how memory is accessed. Different address patterns give different efficiency results.
Memory timing overheads
- Minimize row changes – Activation time and precharge time are incurred when opening new banks/rows or changing rows within the same bank. Reducing row changes removes the tRCD and tRP overhead.
- Send continuous write or read commands -Maintaining tCCD timing.
- Minimize write-to-read and read-to-write command changeover – Write recovery time to change to read accesses, and bus turnaround time to change from read to write.
- Set a proper refresh interval.
- a. DDR3 SDRAM requires Refresh cycles at an average periodic interval of tREFI.
- b. A maximum of 8 additional Refresh commands can be issued in advance (“pulled in”). This does not reduce the number of refreshes, but the maximum interval between two surrounding Refresh commands is limited to 9 × tREFI
- Utilize all of the banks – A suitable addressing mechanism is preferable.
- a. Row-Bank-Column: For a transaction occurring over a sequential address space, the core automatically opens up the same row in the next bank of the DRAM device to continue the transaction when the end of an existing row is reached. It is well suited to applications that require bursting of large data packets to sequential address locations.
- b. Bank-Row-Column: When crossing a row boundary, the current row is closed and another row is opened within the same bank. The MSBs form the bank address, which can be used to switch between banks. It is suitable for shorter, more random transactions to one block of memory for some time, followed by a jump to another block (bank).
- Burst Length
- a. BL8 is supported for DDR3 on the 7 series. BC4 has a very low efficiency, which is less than 50%. This is because the execution time of BC4 is the same as BL8; the data is just masked inside the component.
- b. In cases where you do not wish to write a full burst, either data mask or write-after-read can be considered.
- Set a proper ZQ interval (DDR3 only) – The controller sends both ZQ Short (ZQCS) and ZQ Long (ZQCL) Calibration commands.
- a. Adhere to the DDR3 JEDEC Standard.
- b. ZQ Calibration is discussed in section 5.5 of the JEDEC Spec JESD79-3 DDR3 SDRAM Standard
- c. ZQ Calibration calibrates On-Die Termination (ODT) at regular intervals to account for variations across VT
- d. Logic is contained in bank_common.v/vhd
- e. Parameter Tzqcs determines the rate at which a ZQ Calibration command is sent to the memory
- f. It is possible to disable the counter and send the command manually using app_zq_req, similar to manually sending a Refresh. Refer to (Xilinx Answer 47924) for details.
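The address-mapping advice above can be illustrated with a toy model of a sequential sweep. This sketch (my own simplification, not MIG RTL) shows which (bank, row) each successive page of a sequential address sweep lands on under the two orders, using the earlier 14-bit-row / 3-bit-bank configuration:

```python
def page_target(order, page, bank_w=3, row_w=14):
    """(bank, row) that a sequential address sweep opens for page `page`.

    ROW_BANK_COLUMN rotates through the banks first, so consecutive pages
    hit different banks and activations can be overlapped; BANK_ROW_COLUMN
    keeps reopening rows in the same bank, paying tRP + tRCD each time.
    """
    if order == "ROW_BANK_COLUMN":
        return page & ((1 << bank_w) - 1), page >> bank_w
    return page >> row_w, page & ((1 << row_w) - 1)   # BANK_ROW_COLUMN

# First four pages of a sequential sweep under each mapping:
for p in range(4):
    print(p, page_target("ROW_BANK_COLUMN", p), page_target("BANK_ROW_COLUMN", p))
# ROW_BANK_COLUMN walks banks 0,1,2,3...; BANK_ROW_COLUMN stays in bank 0.
```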
Controller Overheads
- Periodic Reads – Refer to (Xilinx Answer 43344) for details.
- a. Do not change the period of the read.
- b. Skip periodic reads during writes and issue the number of missed reads before a true read
- Reordering – Refer to (Xilinx Answer 34392) for details. For User and AXI Interface designs, it is preferable to have this enabled.
- a. Reordering is the logic that looks ahead several commands and changes the user command order so that non-memory commands do not occupy valid bandwidth. Performance is also related to the actual traffic pattern.
- b. Based on the address pattern, reordering helps to skip precharge and activate commands so that tRCD and tRP do not occupy the data bandwidth.
- Try to increase the number of Bank Machines.
- a. Most of the controller’s logic resides in the bank machines, and they correspond to DRAM banks.
- b. A given bank machine manages a single DRAM bank at any given time.
- c. Bank machine assignment is dynamic, so it is not necessary to have a bank machine for each physical bank.
- d. Bank machines can be configured, but it is a tradeoff between area and performance.
- e. The allowable number of bank machines ranges from 2-8.
- f. By default, 4 Bank Machines are configured through RTL parameters.
- g. To change the number of Bank Machines, modify the RTL parameter nBANK_MACHS (for example, nBANK_MACHS = 8) contained in memc_ui_top.
Example for 8 Bank Machines – nBANK_MACHS = 8
You are now aware of the factors that influence performance. Consider an upstream application that gives you 512 data bytes per packet, and you need to save them to different memory locations. As 512 data bytes equal 64 DDR3 data bursts, re-run the example design with a stimulus file containing 512 writes, 512 reads, and row switching for every 64 writes or reads:
At the end of the simulation, you will see that bus utilization is at 77 percent.
Figure 11: Performance Statistics for 512 writes and 512 reads – Row switching for 64 writes or reads.
You can now apply the knowledge learned in the earlier section to improve the efficiency. To utilize all of the banks instead of changing the row, modify the address pattern to change the bank as shown below. This is equivalent to setting ROW_BANK_COLUMN in the memory address mapping setting in the MIG GUI.
At the end of the simulation, you will see that the earlier 77 Percent Bus Utilization is now 87!
If you still require higher efficiency, you can go for large packet sizes of 1024 or 2048 bytes, or consider a manual refresh.
Note: Xilinx does not encourage bypassing controller refresh, as we are unsure whether you will be able to meet the JEDEC auto refresh timing, which affects data reliability. From the controller’s side, you can change nBANK_MACHS to see the performance improvement; however, this may affect your design timing. Please refer to (Xilinx Answer 36505) for details on nBANK_MACHS.
Open the core_name_mig_sim.v file, change the parameter nBANK_MACHS from 4 to 8, and re-run the simulation.
To have the parameter value take effect in hardware, you need to update the core_name_mig.v file. I used the same pattern where we got 87% bus utilization (Figure 2). With nBANK_MACHS set to 8, the efficiency is now 90%.
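The parameter change described above can be scripted if you need to apply it repeatedly to regenerated files. This is a convenience sketch only; the exact parameter line layout in your core_name_mig_sim.v / core_name_mig.v may differ, so verify the result before simulating:

```python
import re

def set_bank_machines(rtl_text, n):
    """Rewrite the nBANK_MACHS parameter value in a string of MIG RTL.

    Sketch under the assumption the parameter appears as
    'parameter nBANK_MACHS = <number>'; raises if no match is found.
    """
    new_text, count = re.subn(
        r"(parameter\s+nBANK_MACHS\s*=\s*)\d+",
        lambda m: m.group(1) + str(n),
        rtl_text)
    if count == 0:
        raise ValueError("nBANK_MACHS parameter not found")
    return new_text

sample = "parameter nBANK_MACHS = 4,"
print(set_bank_machines(sample, 8))  # parameter nBANK_MACHS = 8,
```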
Also, make a note that ½-rate and ¼-rate controllers negatively affect efficiency due to their latencies. For example, since commands can only be sent every 4 CK cycles, there is sometimes extra padding when adhering to minimum DRAM timing specs, which can decrease efficiency from the theoretical value. Try out different controller rates to find the one that suits your efficiency requirement.
References
- Zynq-7000 AP SoC and 7 Series Devices Memory Interface Solutions v2.3 (UG586)
- Xilinx MIG Solution Center http://www.xilinx.com/support/answers/34243.html
Revision History
13/03/2015 – Initial release.