Intel Unveils Lunar Lake Architecture: New P and E cores, Xe2-LPG Graphics, New NPU 4 Brings More AI Performance

Name: Intel Unveils Lunar Lake Architecture: New P and E cores, Xe2-LPG Graphics, New NPU 4 Brings More AI Performance
Item: Intel Unveils Lunar Lake Architecture: New P and E cores, Xe2-LPG Graphics, New NPU 4 Brings More AI Performance
Author: Gavin Bonshor

by Gavin Bonshor on June 3, 2024 11:00 PM EST

91 Comments | Add A Comment

91 Comments

Intel Lunar Lake: New P-Core, Enter Lion Cove

Diving straight into the Performance, or P-Core commonly referred to, has had major architectural updates to increase power efficiency and performance. Bigger of these updates, Intel needed to comprehensively update its classic P-core cache hierarchy.

Key among these improvements is a significant overhaul of Intel's traditional P-core cache hierarchy. The fresh design for Lion Cove uses a multi-tier data cache containing a 48KB L0D cache with 4-cycle load-to-use latency, a 192KB L1D cache with 9-cycle latency, and an extended L2 cache that gets up to 3MB with 17-cycle latency. In total, this puts 240KB of cache within 9 cycles' latency of the CPU cores, whereas Redwood Cove before it could only reach 48KB of cache in the same period of time.

The data translation lookaside buffer (DTLB) has also been revised, increasing its depth from 96 to 128 pages to improve its hit rate.

Intel has also added a third Address Generation Unit (AGU)/Store Unit pair to further boost the performance of data write operations. Intel has also thrown more cache at the problem, and as CPU complexity grows, so does the reliance on the cache subsystems to keep them fed. Intel has also reworked the core-level cache subsystem by adding an intermediate data cache (IDC) between the 48 KB L1 and the L2 level. The original L1D cache is now called the L0 D-cache internally and retires to a 192 KB L1 D-cache.

The latest Lion Cove P-core design also includes a new front-end for handling instructions. The prediction block is 8x larger, fetch is wider, decode bandwidth is higher than on Raptor Cove, and there has been an enormous increase in Uops cache capacity and read bandwidth. The change in Uop queue capacity is designed to enhance the overall performance throughput.

The out-of-order engine in Lion Cove is partitioned in the footprint for Integer (INT) and Vector (VEC) domains Execution Domain with Independent renaming and scheduling. This type of partitioning allows for expandability in the future, independent growth of each domain, and benefits toward reduced power consumption for a domain-specific workload. The out-of-order engine is also improved, going from 6 to 8-wide allocation/rename and 8 to 12-wide retirement, with the deep instruction window increased from 512 to 576 entries and from 12 to 18 execution ports.

Lion Cove's integer execution units have also been improved over Raptor Cove, with execution resources grown from 5 to 6 integer ALUs, 2 to 3 jump units, and 2 to 3 shift units. Scaling from 1 to 3 units, these multiply 64x64 units to 64, which takes 3 units and gives even more compute power for the harder part of computation. Another significant development is transforming the P-core database from a 'sea of fubs' to a 'sea of cells.' This process of migrating the sub-organization of the P-cores structure from fubs to more organized cells essentially increases the density.

Intel has removed Hyper-Threading (HT) from their Lunar Lake SoC, with one potential reason being to enhance power efficiency and single-thread performance. By eliminating HT, Intel reduces power consumption and simplifies thermal management, which should extend battery life in ultra-thin notebooks. Intel does make a couple of claims regarding the Lion Cove P-cores, which are set to offer approximately 15% better performance-to-power and performance-to-area ratios than cores with HT. Intel's hybrid architecture, which effectively utilizes E-cores for multi-threaded tasks, reduces the need for HT, allowing workloads to be distributed more efficiently by the Intel Thread Director.

Power management has also been refined by including AI self-tuning controllers to replace the static thermal guard bands. This lets the system respond dynamically to real-time operating conditions in an adaptive way to achieve higher sustained performance. Intel also implements Lion Cove P-Core clock speeds at tighter 16.67MHz intervals rather than the traditional 100MHz. This means more accurate power management and finer tuning to squeeze as much from the power budget as possible.

Intel's Lion Cove P-Core microarchitecture looks like a nice upgrade over Golden Cove. Lion Cove incorporates improved memory and cache subsystems and better power management while not relying solely on opting for faster P-core frequencies to boost the IPC performance.

Intel Unveils Lunar Lake Architecture: Overview Intel Lunar Lake: New E-Core, Skymont Takes Flight For Peak Efficiency

PRINT THIS ARTICLE

Post Your Comment
Please log in or sign up to comment.

Comments Locked

91 Comments

View All Comments

thestryker - Monday, June 3, 2024 - link
I'm curious what the overall E-core performance is going to look like since the cluster won't have L3 cache access. Chips and Cheese did some analysis of the LP E-cores on MTL and found this specifically to be a big negative. I'm guessing this design is going to be limited to just LNL and is predominantly for the power savings.
ET - Tuesday, June 4, 2024 - link
Interestingly, Intel is comparing Skymont to Raptor Cover. I agree that we have to wonder how the L3 (or lack thereof) affect this, but from the Chips and Cheese figures alongside Intel's performance improvement figures, it looks like Skymont without L3 cache will be faster than Crestmont with L3 cache.
kwohlt - Tuesday, June 4, 2024 - link
There's 8MB of "SOC cache", separate from both the P and E cores, that should in practice function as the E cores' L3
thestryker - Tuesday, June 4, 2024 - link
That's my assumption as well as I think the GPU would be the other part predominantly using it and they shouldn't really both be hitting it at the same time.
sharath.naik - Monday, June 10, 2024 - link
Side cache is not the same as L3, or I think they would have called it that. shared L3 is where the memory sync can happen across cores. if not, it needs to go all the way back to ram. So, side caches really cannot be considered as L3, more like expanded L2 for E-core and expanded l3 for P-Core? is my guess. Yes, it means things that run on both E-Core and P-Core, at the same time, will take a hit on performance. I think they were targeting the majority use case. where most won't need more than 4 threads or threads won't be working on the same data.
powerarmour - Thursday, June 6, 2024 - link
I can see this being an embarrassing launch if it gets slapped around by Qualcomm's SDx Elite
mode_13h - Friday, June 7, 2024 - link
Well, they're on a better node that Qualcomm, so there's that.
sharath.naik - Monday, June 10, 2024 - link
It absolutely will. Because this is going to be slower than meteor lake in CPU. Elite is supposed to be 30% faster. Intel should have released 8 P-core version to compete in performance. But I think they wanted to reserve that to be produced on their own fabs.
lmcd - Monday, June 17, 2024 - link
Snap Elite is supposed to be 30% faster at essentially-undisclosed power. Lunar Lake will ironically undercut the Snapdragon Elite on power and cost while delivering good performance.
Drumsticks - Tuesday, June 4, 2024 - link
I hate to ask this, but was this article fully written by Gavin and proof'ed by another editor? Was there a deadline push to get it out as soon as Intel released the information on Lunar Lake? It just reads so, so disjointed. It feels like there are so many issues in this paragraph alone on the P-core overview; it feels jarring to read.

"This Lion Cove architecture **also aligns with performance increases**, boasting a predicted double-digit bump in IPC over the older Redwood Cove generation. This uplift is noticed, especially **in the betterment of its hyper-threading, whereby improved IPC** by 30%, dynamic power efficiency improved by 20%, **and previous technologies, in balancing**, without increasing the core area, **in a commitment of Intel to better performance**, within existing physical constraints."

I've seen so much better work from Gavin, and Anandtech in general, that I almost hope that this page was heavily written by software. I know it's a press release, and there's not a whole lot of information, but the level of first party detail here feels similar to the Architecture Day 2021 presentations Intel did on Alder Lake, which got fantastic coverage from Andrei and Dr. Cuttress, and here it feels like we are getting a poorly worded restating of the slides with hardly any analysis or greater than surface level understanding.

I've been reading Anandtech since I was 15, and the level of detail in the Sandy Bridge era articles honestly had a huge influence on my choice to pursue a career in CPU Design. I've mountains of respect for what Anandtech has published in the past, but this article feels rushed.

Intel Unveils Lunar Lake Architecture: New P and E cores, Xe2-LPG Graphics, New NPU 4 Brings More AI Performance

Intel Lunar Lake: New P-Core, Enter Lion Cove

Post Your Comment

91 Comments

View All Comments

thestryker - Monday, June 3, 2024 - link

ET - Tuesday, June 4, 2024 - link

kwohlt - Tuesday, June 4, 2024 - link

thestryker - Tuesday, June 4, 2024 - link

sharath.naik - Monday, June 10, 2024 - link

powerarmour - Thursday, June 6, 2024 - link

mode_13h - Friday, June 7, 2024 - link

sharath.naik - Monday, June 10, 2024 - link

lmcd - Monday, June 17, 2024 - link

Drumsticks - Tuesday, June 4, 2024 - link

Log in

Don't have an account? Sign up now