NVIDIA GeForce GTX 750 Ti Whitepaper

Featuring First-Generation Maxwell GPU Technology, Designed for Extreme Performance per Watt

Version 1.1

Introduction

Consumer demand for more stunning graphics, from blockbuster movie special effects to near-photorealistic 3D game environments and high-resolution media, continues to grow. To meet this demand, NVIDIA's graphics processors have evolved, incorporating new features and delivering more performance with each generation. The Kepler GPU architecture, introduced in early 2012, delivered groundbreaking performance and power efficiency, powering gaming PCs, workstations, supercomputers, and cloud gaming servers. It was also implemented in the Tegra K1 system-on-a-chip for mobile devices and automotive infotainment systems.

To achieve the next level of visual realism, NVIDIA engineers focused on making the subsequent architecture even more efficient than Kepler. NVIDIA's first-generation "Maxwell" architecture introduces enhancements designed to extract more performance per watt. The first Maxwell-based GPU, codenamed "GM107," is designed for power-limited environments such as notebooks and small form factor (SFF) PCs, which are often used for gaming and home entertainment, including systems built for Valve Software's Steam Machines initiative. The GeForce GTX 750 Ti is based on the GM107 GPU. Thanks to GM107's architectural efficiency, a GeForce GTX 750 Ti at 1080p resolution can frequently match the performance of the GeForce GTX 480 (a flagship GPU from four years prior) within a 60W TDP, roughly a quarter of the GTX 480's power.

Figure 1: GeForce GTX 750 Ti performs evenly with GTX 480 in many of today's top titles. This bar chart compares the performance of the NVIDIA GeForce GTX 750 Ti and the older GeForce GTX 480 across various games and benchmarks. The chart shows that the GTX 750 Ti performs comparably to, and in some cases exceeds, the GTX 480, while consuming significantly less power. The Y-axis represents relative performance, ranging from 0.6 to 1.2.

The Soul of Maxwell: Improving Performance per Watt

GM107 is the first GPU built using the first-generation Maxwell architecture, focusing on low power operation. NVIDIA plans to introduce higher-performing second-generation Maxwell GPUs for performance and enthusiast segments later.

While adapting Kepler GPUs (used in PCs, workstations, and supercomputers) for mobile chips, NVIDIA learned how to reduce GPU power consumption and extract more performance at the same power level. Those lessons were applied to Maxwell.

Maxwell introduces an all-new Streaming Multiprocessor (SM) design, dramatically improving performance per watt and performance per area. While the Kepler SMX design was efficient, Maxwell SM represents a significant leap in architectural efficiency. Enhancements in control logic partitioning, workload balancing, clock-gating granularity, scheduling, instructions per clock cycle, and other areas allow the Maxwell SM (also called "SMM") to far exceed Kepler SMX efficiency. The new Maxwell SM architecture enabled NVIDIA to implement five SMs in GM107, compared to two in GK107, with only a 25% increase in die area. Further details on the Maxwell SM changes are provided in the "Next-Generation Maxwell SM" section.

Maxwell also features a significantly larger L2 cache design: 2048KB in GM107 versus 256KB in GK107. This larger on-chip cache reduces the need for requests to the graphics card DRAM, lowering overall board power and improving performance.

Additionally, NVIDIA engineers meticulously tuned each unit in the Maxwell GPU down to the transistor level to maximize energy efficiency. The result is that Maxwell delivers 2 times the performance per watt of Kepler, using the same 28nm manufacturing process.
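
As a rough sanity check on the performance-per-watt claim, the sketch below divides peak single-precision throughput by TDP for both chips, using the figures from the comparison table later in this document. Peak GFLOPS per TDP watt is only a proxy chosen for this illustration (it is not NVIDIA's measurement method, and TDP is not measured power draw). The 1.35 per-core factor comes from the "Next-Generation Maxwell SM" section; the variable and function names are hypothetical.

```python
# Figures taken from the Maxwell vs. Kepler comparison table in this document.
GK107 = {"gflops": 812.5, "tdp_w": 64.0, "die_mm2": 118.0}
GM107 = {"gflops": 1305.6, "tdp_w": 60.0, "die_mm2": 148.0}

# 35% more delivered performance per CUDA core on shader-limited workloads
# (from the SM section); applying it to peak GFLOPS is an assumption.
PER_CORE_GAIN = 1.35


def peak_gflops_per_watt(gpu):
    """Peak single-precision GFLOPS divided by TDP (a crude efficiency proxy)."""
    return gpu["gflops"] / gpu["tdp_w"]


peak_ratio = peak_gflops_per_watt(GM107) / peak_gflops_per_watt(GK107)
delivered_ratio = peak_ratio * PER_CORE_GAIN
area_ratio = GM107["die_mm2"] / GK107["die_mm2"]

print(f"Peak GFLOPS/W ratio:      {peak_ratio:.2f}x")
print(f"With per-core gain:       {delivered_ratio:.2f}x")
print(f"Die area ratio:           {area_ratio:.2f}x")
```

Under these assumptions, peak throughput per TDP watt improves by about 1.7 times, and by about 2.3 times once the per-core gain is applied, which is broadly in line with the 2 times figure quoted above. The die area ratio of about 1.25 matches the 25% increase mentioned earlier.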

GM107 Maxwell Architecture In-Depth

From a graphics features perspective, first-generation Maxwell GPUs offer the same API functionality as Kepler GPUs. At a high level, Maxwell implements multiple SM units within a GPC (Graphics Processing Cluster). Each SM includes a Polymorph Engine and Texture Units, while each GPC includes a Raster Engine. ROPs remain aligned with L2 cache slices and Memory Controllers. Internally, all units and crossbar structures have been redesigned, data flows optimized, and power management significantly improved.

The GM107 GPU features one GPC, five Maxwell Streaming Multiprocessors (SMM), and two 64-bit memory controllers (128-bit total). This represents the full implementation of the chip and is the configuration used in the GeForce GTX 750 Ti.

Figure 2: GM107 Full-Chip Block Diagram. This block diagram illustrates the full-chip architecture of the GM107 GPU. It shows the main components including the PCI Express 3.0 Host Interface, GigaThread Engine, a Graphics Processing Cluster (GPC), a Raster Engine, L2 Cache, and Memory Controllers. The GPC contains five Maxwell Streaming Multiprocessors (SMMs). Each SMM is detailed with components like the Polymorph Engine 2.0, Instruction Buffer, Warp Scheduler, Dispatch Units, Register File, Core units, Load/Store (LD/ST) units, Special Function Units (SFU), and Texture/L1 Cache. Below the GPC is the L2 Cache, connected to two Memory Controllers.

Maxwell vs. Kepler GPU Comparison

GPU	GK107 (Kepler)	GM107 (Maxwell)
CUDA Cores	384	640
Base Clock	1058 MHz	1020 MHz
GPU Boost Clock	N/A	1085 MHz
GFLOPS	812.5	1305.6
Texture Units	32	40
Texel fill-rate	33.9 Gigatexels/sec	40.8 Gigatexels/sec
Memory Clock	5000 MHz	5400 MHz
Memory Bandwidth	80 GB/sec	86.4 GB/sec
ROPs	16	16
L2 Cache Size	256KB	2048KB
TDP	64W	60W
Transistors	1.3 Billion	1.87 Billion
Die Size	118 mm²	148 mm²
Manufacturing Process	28-nm	28-nm
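
The derived rows in the table (GFLOPS, texel fill-rate, memory bandwidth) follow from the unit counts and clocks. The sketch below reproduces them under stated assumptions: two floating-point operations per CUDA core per clock (one fused multiply-add), one texel per texture unit per clock, base clock used throughout, and a 128-bit memory interface on both chips. The function names are hypothetical.

```python
def peak_gflops(cuda_cores, base_clock_mhz):
    """Peak single-precision GFLOPS, assuming 2 FLOPs (one FMA) per core per clock."""
    return cuda_cores * 2 * base_clock_mhz / 1000.0


def texel_fill_gtexels(texture_units, base_clock_mhz):
    """Peak texel fill-rate in Gigatexels/sec, assuming 1 texel per unit per clock."""
    return texture_units * base_clock_mhz / 1000.0


def memory_bandwidth_gbs(bus_width_bits, data_rate_mhz):
    """Peak DRAM bandwidth in GB/sec: bytes per transfer times effective data rate."""
    return bus_width_bits / 8 * data_rate_mhz / 1000.0


# GK107 (Kepler) column
gk107_gflops = peak_gflops(384, 1058)
gk107_texels = texel_fill_gtexels(32, 1058)
gk107_bandwidth = memory_bandwidth_gbs(128, 5000)

# GM107 (Maxwell) column
gm107_gflops = peak_gflops(640, 1020)
gm107_texels = texel_fill_gtexels(40, 1020)
gm107_bandwidth = memory_bandwidth_gbs(128, 5400)

print(f"GK107: {gk107_gflops:.1f} GFLOPS, {gk107_texels:.1f} GT/s, {gk107_bandwidth:.1f} GB/s")
print(f"GM107: {gm107_gflops:.1f} GFLOPS, {gm107_texels:.1f} GT/s, {gm107_bandwidth:.1f} GB/s")
```

With these inputs the results round to the values listed in the table, which suggests the table's figures are computed at base clock rather than boost clock.
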

The basic aspects of the architecture, such as dataflow from the host PCI Express interface through the GigaThread engine and the operation of Polymorph and Raster units, are discussed in greater detail in NVIDIA's Kepler and Fermi whitepapers. For additional background information, reading those documents is recommended. More detail on the changes introduced in SMM follows.

Next-Generation Maxwell SM

The primary contributor to Maxwell's improved efficiency is the new Maxwell SM architecture, SMM. This architecture achieves higher power efficiency and delivers 35% more performance per CUDA Core on shader-limited workloads. These results were achieved through significant architectural changes, including a rewritten SM scheduler architecture and algorithms designed to be more intelligent, avoid unnecessary stalls, and reduce energy per instruction for scheduling.

The SM organization has also been updated. Each SM is now partitioned into four separate processing blocks, each with its own instruction buffer, scheduler, and 32 CUDA cores. This eliminates Kepler's approach of using a non-power-of-two number of CUDA cores with some shared. This partitioning simplifies design and scheduling logic, saving area and power, and reducing computation latency.

Pairs of processing blocks share four texture filtering units and a texture cache. The compute L1 cache function is now combined with the texture cache, and shared memory is a separate unit shared across all four blocks, similar to the approach used on the G80, the first CUDA-capable GPU.

Overall, this new design makes each "SM" significantly smaller while delivering about 90% of a Kepler SM's performance. The smaller area allows for more SMs per GPU. Comparing GK107 and GM107 total SM-related metrics, GM107 has five SMs versus GK107's two, offers 25% more peak texture performance, 1.7 times more CUDA cores, and about 2.3 times more delivered shader performance.
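
The ratios in this section can be checked with a few lines of arithmetic. The sketch below uses only counts stated in this document (four blocks of 32 cores per SMM, four texture units per pair of blocks, and the GK107 totals from the comparison table). Multiplying the core ratio by the 35% per-core gain to estimate delivered shader performance is an assumption of this sketch, and the texture ratio compares unit counts only, ignoring clock differences.

```python
# Maxwell SM (SMM) organization as described in this section.
BLOCKS_PER_SMM = 4
CORES_PER_BLOCK = 32
TEX_UNITS_PER_BLOCK_PAIR = 4
GM107_SM_COUNT = 5

# GK107 totals from the comparison table.
GK107_SM_COUNT = 2
GK107_CORES = 384
GK107_TEX_UNITS = 32

# 35% more delivered performance per CUDA core (shader-limited workloads).
PER_CORE_GAIN = 1.35

cores_per_smm = BLOCKS_PER_SMM * CORES_PER_BLOCK                    # 128
tex_per_smm = (BLOCKS_PER_SMM // 2) * TEX_UNITS_PER_BLOCK_PAIR      # 8
gm107_cores = cores_per_smm * GM107_SM_COUNT                        # 640
gm107_tex_units = tex_per_smm * GM107_SM_COUNT                      # 40

cores_per_smx = GK107_CORES // GK107_SM_COUNT                       # 192

# One SMM relative to one Kepler SMX, in delivered shader performance.
smm_vs_smx = cores_per_smm * PER_CORE_GAIN / cores_per_smx

core_ratio = gm107_cores / GK107_CORES
tex_ratio = gm107_tex_units / GK107_TEX_UNITS
shader_ratio = core_ratio * PER_CORE_GAIN

print(f"SMM vs SMX performance: {smm_vs_smx:.0%}")
print(f"CUDA core ratio:        {core_ratio:.2f}x")
print(f"Texture unit ratio:     {tex_ratio:.2f}x")
print(f"Shader perf estimate:   {shader_ratio:.2f}x")
```

The estimate gives 90% for one SMM relative to one SMX, about 1.67 times the CUDA cores, 1.25 times the texture units, and roughly 2.25 times the shader performance, consistent with the rounded figures quoted above.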

Figure 3: Maxwell SM Block Diagram. This block diagram details the internal structure of a single Maxwell Streaming Multiprocessor (SMM). It depicts the PolyMorph Engine 2.0 and its associated units (Tessellator, Viewport Transform, Attribute Setup, Stream Output). The SMM is divided into four processing blocks, each containing an Instruction Buffer, Warp Scheduler, Dispatch Units, and a Register File. Each block includes multiple Core units, LD/ST units, and SFU units, and pairs of blocks share a Texture/L1 Cache block. Additionally, there is a 64 KB Shared Memory block shared across the SMM.

Memory System

For GM107, enhancing the memory system was crucial to achieving significantly higher performance with the same memory width as GK107. On-chip memory system bandwidth was increased, and the design's efficiency was improved. The large 2MB L2 cache configuration, larger than any previous GPU design, is highly effective at reducing memory bandwidth demand and ensuring that DRAM bandwidth is not a bottleneck.
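
To illustrate why the larger cache matters, the sketch below compares how much DRAM bandwidth and L2 capacity each chip has relative to its compute resources, using figures from the comparison table. The two metrics (DRAM bytes per FLOP and L2 capacity per CUDA core) are chosen for this illustration and are not taken from NVIDIA's analysis.

```python
# Figures from the Maxwell vs. Kepler comparison table in this document.
GK107 = {"bandwidth_gbs": 80.0, "gflops": 812.5, "l2_kb": 256, "cores": 384}
GM107 = {"bandwidth_gbs": 86.4, "gflops": 1305.6, "l2_kb": 2048, "cores": 640}


def dram_bytes_per_flop(gpu):
    """Peak DRAM bytes available per peak floating-point operation."""
    return gpu["bandwidth_gbs"] / gpu["gflops"]


def l2_kb_per_core(gpu):
    """L2 cache capacity per CUDA core, in KB."""
    return gpu["l2_kb"] / gpu["cores"]


bytes_per_flop_ratio = dram_bytes_per_flop(GM107) / dram_bytes_per_flop(GK107)
l2_per_core_ratio = l2_kb_per_core(GM107) / l2_kb_per_core(GK107)

print(f"GK107: {dram_bytes_per_flop(GK107):.3f} B/FLOP, {l2_kb_per_core(GK107):.2f} KB L2 per core")
print(f"GM107: {dram_bytes_per_flop(GM107):.3f} B/FLOP, {l2_kb_per_core(GM107):.2f} KB L2 per core")
print(f"DRAM bytes per FLOP ratio: {bytes_per_flop_ratio:.2f}x")
print(f"L2 per core ratio:         {l2_per_core_ratio:.1f}x")
```

By this measure GM107 has about a third less DRAM bandwidth per peak FLOP than GK107, while its L2 capacity per core is nearly five times larger, so more requests need to be served on-chip for DRAM not to become the bottleneck.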

New Video Capabilities

One of Kepler's key innovations over prior GeForce GPUs was its hardware-based H.264 video encoder, NVENC. By integrating dedicated hardware circuitry for video encoding/decoding (rather than using the GeForce GPU's CUDA Cores), NVENC provided a dramatic performance speedup for H.264 encoding while consuming less power. NVIDIA leveraged Kepler's NVENC encoder to introduce ShadowPlay to GeForce GTX 600 and 700 series gamers, allowing them to record and share gaming moments. Since its launch, over 3 million videos have been captured, with gamers posting them to YouTube or streaming live gameplay footage on Twitch.

Maxwell features an improved NVENC block for better video performance, offering faster encode (6-8X real-time for H.264 compared to 4x for Kepler) and 8-10X faster decode. A new local decoder cache also provides higher memory efficiency per stream for video decoding, resulting in lower power consumption during video decode.
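
To put the encode multipliers in concrete terms, the sketch below converts a speed relative to real time into wall-clock encode time. The 60-minute clip length is an arbitrary example and the function name is hypothetical; actual encode times depend on resolution, settings, and the rest of the system.

```python
def encode_minutes(video_minutes, speed_vs_realtime):
    """Wall-clock minutes to encode a clip at a given multiple of real-time speed."""
    return video_minutes / speed_vs_realtime


CLIP_MINUTES = 60  # arbitrary example length

kepler_time = encode_minutes(CLIP_MINUTES, 4)        # Kepler: 4x real-time
maxwell_slow = encode_minutes(CLIP_MINUTES, 6)       # Maxwell: low end of 6-8X
maxwell_fast = encode_minutes(CLIP_MINUTES, 8)       # Maxwell: high end of 6-8X

print(f"Kepler:  {kepler_time:.1f} min")
print(f"Maxwell: {maxwell_fast:.1f} to {maxwell_slow:.1f} min")
```

For a one-hour H.264 clip, this works out to 15 minutes on Kepler versus 7.5 to 10 minutes on Maxwell.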

Maxwell also includes a new GC5 power state, tailored to reduce GPU power consumption for light workload cases like video playback. GC5 is a low power sleep state that offers considerable power savings over prior GPUs in these scenarios.

Conclusion

Given the challenges in developing smaller semiconductor manufacturing process nodes, NVIDIA engineers recognized that Maxwell architecture needed to be more efficient for PC graphics to continue evolving. Simply building a bigger Kepler was insufficient; the focus was on delivering groundbreaking performance per watt with Maxwell.

To improve performance while minimizing wasted power, each SM was reorganized into four processing blocks, each with dedicated resources for scheduling and instruction dispatch. The L2 cache size was dramatically increased, providing a shared storage buffer for texture requests, atomic operations, and other tasks, reducing trips to memory.

With the changes in Maxwell's new SMM, the GPU's hardware units are utilized more effectively, resulting in greater performance and power efficiency. The GeForce GTX 750 Ti delivers over 1.7X more performance than GK107, with a TDP of just 60W.

Twenty years ago, PCs were large towers. Today, PCs are smaller and found everywhere. Gaming on these tiny PCs was previously a mundane experience, often relying on integrated graphics with low frame rates and settings. However, thanks to Maxwell's power efficiency, gamers with home theater and small form factor PCs can now enjoy a good gaming experience at 1080p without compromise.

The GeForce GTX 750 Ti can be plugged into a wide range of PCs, in most cases without a power supply upgrade, transforming a basic PC into a competent gaming machine. Its low power consumption means it runs quietly and generates little heat, making it ideal for home theater PCs. It is also the world's fastest graphics card that does not require a power connector.

NVIDIA's focus on performance per watt makes Maxwell the world's most efficient GPU, allowing gamers to enjoy their favorite titles in virtually any form factor.

Notice

All information provided in this technology brief, including commentary, opinion, NVIDIA design specifications, reference boards, files, drawings, diagnostics, lists, and other documents (collectively, "Materials"), are provided "AS IS." NVIDIA makes no warranties, expressed, implied, statutory, or otherwise, with respect to Materials, and expressly disclaims all implied warranties of NONINFRINGEMENT, MERCHANTABILITY, and FITNESS FOR A PARTICULAR PURPOSE.

Information furnished is believed to be accurate and reliable. However, NVIDIA Corporation assumes no responsibility for the consequences of use of such information or for any infringement of patents or other rights of third parties that may result from its use. No license is granted by implication or otherwise under any patent or patent rights of NVIDIA Corporation. Specifications mentioned in this publication are subject to change without notice. This publication supersedes and replaces all information previously supplied. NVIDIA Corporation products are not authorized for use as critical components in life support devices or systems without express written approval of NVIDIA Corporation.

Trademarks

NVIDIA, the NVIDIA logo, CUDA, FERMI, KEPLER, MAXWELL, and GeForce are trademarks or registered trademarks of NVIDIA Corporation in the United States and other countries. Other company and product names may be trademarks of the respective companies with which they are associated.
