Intel responds to AMD's POV display.
From our favorite rag..
http://theinquirer.net/default.aspx?article=39896
..."That said, the numbers speak for themselves, 4933, almost 5000, with only 8 cores."
I wonder how AMD is going to react to this. Maybe they will release more info about their test system to show if there is going to be enough headroom to beat Intel or not.
I have said it from the start: in the presentation, the guy says that POV-Ray only recognizes 2 sockets, so that demo was 4 K8 cores (2 sockets x 2 cores) against 8 K10 cores (2 sockets x 4 cores), not 8 vs. 16 cores. If that was even the 2.1 GHz Barcelona (not the lowest-clocked 1.9 GHz one), a 2.5 GHz Barcelona will improve that figure by 23%, pulverizing Intel's score and landing above 5000. Again, until we have crisp numbers, these are only assumptions. However, if that is the score of a 3 GHz, 8-core V8, a 2.5 GHz 4x4 Agena will do better IF the system that scored ~4000 was at 2.1 GHz or lower. So, in the end, this is more a point in favor of AMD than Intel.
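For what it's worth, the clock-speed projection in that post is easy to sanity-check. The sketch below assumes the score scales linearly with clock speed (an optimistic assumption, and not something the demo established), takes the ~4000 and 4933 figures from this thread, and tries both Barcelona clocks the post speculates about. The function name `project_score` is made up for illustration.

```python
def project_score(score, base_ghz, target_ghz):
    """Scale a benchmark score linearly with clock speed (optimistic assumption)."""
    return score * target_ghz / base_ghz

INTEL_V8_SCORE = 4933   # figure quoted from the Inquirer article
AMD_DEMO_SCORE = 4000   # "~4000" from the post; approximate
TARGET_GHZ = 2.5        # the hoped-for Barcelona/Agena clock in the post

for base_ghz in (1.9, 2.1):  # the two demo clocks the post speculates about
    projected = project_score(AMD_DEMO_SCORE, base_ghz, TARGET_GHZ)
    gain_pct = (TARGET_GHZ / base_ghz - 1) * 100
    verdict = "above" if projected > INTEL_V8_SCORE else "below"
    print(f"{base_ghz} GHz -> {TARGET_GHZ} GHz: +{gain_pct:.0f}%, "
          f"~{projected:.0f} ({verdict} {INTEL_V8_SCORE})")
```

Under that linear assumption, a 2.1 GHz starting point gives about +19% (roughly 4762, still short of 4933), while a 1.9 GHz starting point gives about +32% (roughly 5263). So whether the projection clears Intel's score depends heavily on which clock the demo system actually ran at.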
http://www.uberpulse.com/us/2007/0 [...] s_fast.php
During what time interval does he say it only recognizes 2 sockets? He says that Intel's Core is only capable of 2 sockets, then shows the 4-socket machines, and then Task Manager clearly shows 8 and 16 cores.
| Quote : I have said it from the start; in the presentation, the guy says that POV-Ray only recognizes 2 sockets, so that demo was 4 (2sockets x 2cores) K8s and 8 (2sockets x 4cores) K10s, not 8 vs 16 cores, and if that was even the 2.1 GHz Barcelona (not the lowest clocked 1.9 one), a 2.5GHz barcy will improve that figure by 23%, pulverizing Intel's score, above 5000. Again, till we have crisp numbers, these are only assumptions. However, if that is the score of a 3 GHz, 8core V8, a 2.5GHz, 4x4 Agena, will do better IF, the one that scored ~4000 was @ 2.1GHz or lower. So, at the end, this is more a point in favor of AMD rather than Intel. |
http://www.povray.org/download/
The most significant change from the end-user point of view between versions 3.6 and 3.7 is the addition of SMP (symmetric multiprocessing) support, which in a nutshell allows the renderer to run on as many CPU's (or cores) as you have installed on your computer. This will be particularly useful for those users who intend purchasing a dual-core CPU or who already have a two (or more) processor machine.
The change from 3.6 to 3.7 refers to memory usage; 3.6 was so bad that each thread allocated its own copy of the object in RAM, so to use 8 cores you used 8 times more RAM than you normally would. However, I haven't seen any POV-Ray documentation referring to a maximum number of cores or sockets; still, even some of the most advanced renderers limit the core count to 8, and a normal copy of Windows won't let them run on more than 2 sockets.
| Quote : I have said it from the start; in the presentation, the guy says that POV-Ray only recognizes 2 sockets, so that demo was 4 (2sockets x 2cores) K8s and 8 (2sockets x 4cores) K10s, not 8 vs 16 cores, and if that was even the 2.1 GHz Barcelona (not the lowest clocked 1.9 one), a 2.5GHz barcy will improve that figure by 23%, pulverizing Intel's score, above 5000. Again, till we have crisp numbers, these are only assumptions. However, if that is the score of a 3 GHz, 8core V8, a 2.5GHz, 4x4 Agena, will do better IF, the one that scored ~4000 was @ 2.1GHz or lower. So, at the end, this is more a point in favor of AMD rather than Intel. |
Wonder where you got that info from? On the other hand I opted to email POV-Ray at this address: warp@iki.fi (on their site)... here's the exchange.
My Email:
| Quote : Hi there,
|
Their Response:
| Quote : Hello,
|
So the whole AMD K10 quad-socket setup would not have had any issues being supported by POV-Ray. Of course, Windows support is another thing entirely, and luckily for us most of these demos are run on server versions of Windows.
OK, it looks like they support any number of cores. However, it looks like rather miraculous scaling, from choppy, not to say prohibitive, multithreaded support in v3.6 to an unlimited number of perfectly smooth threads in 3.7.
I don't think advanced software is limited to 2 sockets (leaving aside that socket count would be a weird limitation).
Even Cinebench, for example, which is useful for, err... zero things, does support as many CPUs as you have.
And don't forget that they most probably used the crappy v3.6 (see the radical changes between that and 3.7) for their tests, because 3.7 is still in beta and I doubt they took beta software for that test:
http://www.povray.org/download/
| Quote : And don't forget that most probably they got the crappy V 3.6 (see the radical changes between that and 3.7) for their tests, because 3.7 is still in beta and I doubt they took beta software for that test:
|
AMD would have taken the beta. AMD and Intel will both use beta software that suits their needs, much like ATi or nVIDIA use beta or alpha versions of games to try and sell their superiority.
AMD wanted to highlight the power of their 16 cores.. so it's only logical that they would have used version 3.7.
CLICK ME There's your info. AMD was using version 3.7 BETA.
| Quote : It turns out that none of these questions is appropriate. Because - (1) PoV-Ray's usage of SSE2 is not SSE (Stream SIMD Execution) at all, but really double-precision FP with random register access; (2) PoV-Ray SSE seems to be optimized more specifically for Core 2 than anything else, where on K8 it is only about 5% faster than PoV-Ray x87. This is also not going to change with K10. |
| Quote : Second, comparing the K10 instruction latency with the K8 instruction latency, we find that K10 has little, if any, improvement on scalar SSE instructions; worse yet, some CVTxx2yy instructions are even downgraded and have longer decode and higher latency. What this shows is that PoV-Ray SSE remains unfriendly to both the K8 and K10 microarchitectures. Thus the fact that 16 cores of K10 can still almost double the speed of 8 cores of K8 actually implies there are some core improvements at work elsewhere inside the K10 design. |
So again it's SSE coming to bite AMD in the butt. And this will probably also be AMD's undoing in many professional apps. They really need to get their SSE performance up to par with Intel.
| Quote : I dont think that an advanced software is limited to 2 sockets (not taking into consideration that socket count would be a weird limitation).
|
If you read how the V3.6 worked in multithreading, you won't think any more it's that advanced; you have, say a 1G object to throw on the RAM and make the render calculations on it?!; with v3.6 you have to have one copy of the object FOR EACH THREAD 8O , so basically, a 16 core barcelona needs 16G of RAM instead of 1 needed by v3.7. So if those systems had 6G of RAM and that scene was near 500MB, the barcelona system should have swapped a lot on the HDD.
Good thing it used version 3.7
| Quote : AMD wanted to highlight the power of their 16 cores.. so it's only logical that they would have used version 3.7. |
AMD only wanted to show 2x scaling within the same TDP. However, having seen the block diagrams of both the K8 and K10, I just can't believe their SSE performance is identical, as one interpretation of that demo might suggest. In the end, it's all assumptions, and it's really stupid to talk about unspecified CPUs at unspecified clock rates running unspecified software.
Well, you should at least read that post, because it does explain why Intel's Core 2 does so well under POV-Ray. Basically, POV-Ray is heavily optimized for SSE (SSE2 etc.). As such, whichever processor executes SSE code fastest will have an advantage, especially a processor such as the Core 2, whose entire design is about efficiency.
Most professional applications are heavily optimized for SSE (like 99.9% of them).
So AMD really needs to do something about their SSE performance... if not, Penryn (with full SSE4) will eat K10 alive in the majority of applications.
What remains to be seen is the per clock performance of K10 which I do believe to be superior without a doubt over Core 2.
| Quote : Good thing it used version 3.7 |
As you please, but until we get our feet on the ground, we're going more or less like this:
| Quote : Well you should at least read that post. Because it does explain why Intel's Core 2 does so well under POV-Ray. Basically POV-Ray is very optimized for SSE (SSE2 etc). As such any processor able to execute SSE code the fastest will have an advantage. Especially a processor such as the Core2 who's entire design is about efficiency. |
And isn't Barcelona built for SSE efficiency too, with double the number of SSE engines, better prefetchers and things like that? Maybe not better than Core 2, but how can that thing perform THE SAME AS A K8?!
| Quote :
|
POV-Ray is only optimized up to SSE2, which AMD has supported since the first K8s, and it scales almost linearly in performance with Intel CPUs, so there's no need to mention SSE4 here because it really does not matter. Current professional software is mostly optimized for SSE2, because this ensures wider compatibility, starting from the first P4s and K8s. SSE3 is just starting to be used in such software, and SSE4 will take its time.
So, you still want us to go like this
There is nothing solid to base this discussion on until we have real numbers.
Gahhh you didn't read the article now did you...
I'll post again...
| Quote : Because - (1) PoV-Ray's usage of SSE2 is not SSE (Stream SIMD Execution) at all, but really double-precision FP with random register access; |
Here's a nice diagram for you.. it contains AMD's K8 vs. Intel Core (not to be confused with Core 2).
Now look at the segment on the lower part of that image where we see "MAX DP FP / CYCLE".
Even Intel's Core architecture can do a better job than AMD's K8. Core 2 can do 5 per core per clock. K10 is no different than K8 in this respect and can only process a maximum of 3 per core per clock.
Then there's the second part of this... random register access. Core 2 is 400% faster than K8 at this; K10 is twice as fast as K8, and thus half the speed of Core 2 per clock.
Not all apps use these optimisations, but MANY professional applications (if not most, as I stated before) do. Games don't really use this particular optimisation, as the graphics card (GPU) handles most of the FP load while the CPU concentrates on AI and, for now, physics (mainly a high integer load).
This explains why AMD chose to target its integer performance and ignored this one downfall, and also why you should read the article.
Wait, who said Core 2 performs 5 operations and K10 only 3? From all the articles I have seen, they theoretically process the same number of SSE instructions per cycle.
You're thinking IPC-wise (instructions per cycle)... not double-precision FP-wise.
| Quote : OK, looks like they support every number of cores, however, it looks a bit of a miraculous scaling from choppy, not to say prohibitive multithreaded support in V3.6 , to an infinity of perfectly smooth, available threads in 3.7 |
That's simply what you get out of correctly supporting SMP in a multithreaded application. Sometimes the tiniest code changes can cause a huge performance and resource-usage impact. POVRay 3.7 exhibits 2x scaling while v. 3.6 as well as Cinebench exhibit ~1.8x scaling at best.
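To put the 2x vs. ~1.8x figures side by side, here is a rough sketch. It computes scaling efficiency for a doubling of cores and, as an extra that nobody in this thread used, backs out the parallel fraction that Amdahl's law would imply if the doubling is treated as going from one unit to two. Both function names are made up for illustration.

```python
def scaling_efficiency(speedup, core_ratio=2.0):
    """Fraction of the ideal speedup actually achieved when cores scale by core_ratio."""
    return speedup / core_ratio

def amdahl_parallel_fraction(speedup, core_ratio=2.0):
    """Solve Amdahl's law S = 1 / ((1 - p) + p / n) for the parallel fraction p.

    Simplification: treats the doubling as a 1 -> n step, which is only a
    rough model when the baseline is already a multi-core system.
    """
    return (1 - 1 / speedup) / (1 - 1 / core_ratio)

# Speedups quoted in the post above for a doubling of cores.
for name, speedup in (("POV-Ray 3.7", 2.0), ("POV-Ray 3.6 / Cinebench", 1.8)):
    eff = scaling_efficiency(speedup)
    p = amdahl_parallel_fraction(speedup)
    print(f"{name}: efficiency {eff:.0%}, implied parallel fraction {p:.3f}")
```

Read this way, ~1.8x corresponds to about 90% efficiency and roughly 11% of the work not scaling, while a true 2x would mean essentially nothing is serialized.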
| Quote : AMD wanted to highlight the power of their 16 cores.. so it's only logical that they would have used version 3.7.
|
That's the best analysis I've seen so far and basically says all the marketing speak about 2x bandwidth, double-width FP units, and single-cycle SSE doesn't tell the full story in an "SSE2-supporting" application.
But another possibility to consider besides writing off K10 SSE units as broken (perhaps poorly reverse engineered) is that AMD hasn't had the resources to develop compilers like Intel does. There may be an implementation that lets K10 shine like C2D, but without a compiler to make it happen, K10 is stuck running code suboptimally.
| Quote : I have said it from the start; in the presentation, the guy says that POV-Ray only recognizes 2 sockets, so that demo was 4 (2sockets x 2cores) K8s and 8 (2sockets x 4cores) K10s, not 8 vs 16 cores, and if that was even the 2.1 GHz Barcelona (not the lowest clocked 1.9 one), a 2.5GHz barcy will improve that figure by 23%, pulverizing Intel's score, above 5000. Again, till we have crisp numbers, these are only assumptions. However, if that is the score of a 3 GHz, 8core V8, a 2.5GHz, 4x4 Agena, will do better IF, the one that scored ~4000 was @ 2.1GHz or lower. So, at the end, this is more a point in favor of AMD rather than Intel. |
Finally a straight answer about the multithreading of POV-Ray. I was wondering because a QFX system got just under what a quad Opteron did in the demo. That means 4-8 core scaling in the same envelope.
| Quote : Finally a straight answer about the multithreading of POV-Ray. I was wondering because a QFX system got just under what a quad Opteron did in the demo. That means 4-8 core scaling in the same envelope. |
I'm rather surprised that AMD chose POV-Ray over Sciencemark to highlight their architectural advantages. I guess scaling was their main criterion for the demo, and maybe Sciencemark isn't showing the same degree of scaling. It's well known that the K8 architecture (and, I would assume, K10) owns the Sciencemark suite.
| Quote : Finally a straight answer about the multithreading of POV-Ray. I was wondering because a QFX system got just under what a quad Opteron did in the demo. That means 4-8 core scaling in the same envelope. |
You should have read the whole thread before responding.
| Quote : Finally a straight answer about the multithreading of POV-Ray. I was wondering because a QFX system got just under what a quad Opteron did in the demo. That means 4-8 core scaling in the same envelope. |
Dum, read the whole thread why don't ya. Point debunked.
| Quote : That's simply what you get out of correctly supporting SMP in a multithreaded application. Sometimes the tiniest code changes can cause a huge performance and resource-usage impact. POVRay 3.7 exhibits 2x scaling while v. 3.6 as well as Cinebench exhibit ~1.8x scaling at best. |
It's not simple speed scaling I am talking about. (I am also overlooking their stupid comments, according to which POV-Ray usually gains 100% from doubling the cores, while those 'inexperienced' sites tested on slower CPUs and got only an ~85% gain; I don't know why the heck a slower CPU would be 20% less efficient than a faster one.)
v3.6 uses a hell of a lot more memory than v3.7: it uses the scene's memory multiplied by the number of cores rendering it (each core needs its own copy), while v3.7 uses only one copy for all the threads.
In simple words; If THAT quad socket K10 system (with 6G of RAM) was using v3.6, the amount of RAM has been multiplied by 16, more or less placing a warranty on heavy HDD swapping.
As mentioned before, Povray 3.6 doesn't support SMP.
Any good version that I can run on my desktop? You know, just to test my old beater.
| Quote : In simple words; If THAT quad socket K10 system (with 6G of RAM) was using v3.6, the amount of RAM has been multiplied by 16, more or less placing a warranty on heavy HDD swapping. |
In even simpler words, if AMD was dumb enough to do that then they deserve all this bad press about 10 times over
| Quote : Here's a nice diagram for you.. it contains AMD's K8 vs. Intel Core (not to be confused with Core 2). Now look at the segment on the lower part of that image where we see "MAX DP FP / CYCLE". |
I think the table you provided has some errors. According to Ars Technica, the P6 has a 12-stage pipeline, and the Pentium-M pipeline depth was never made publicly available. Off the top of my head, I don't know what else, if anything, is inaccurate, but you may want to check the source where that table was obtained. Chances are there are more errors.
Here are the Ars links:
P6 12 stage pipeline
Pentium-M info
More Pentium-M info
Ryan
| Quote : v3.6 uses a hell lot more memory than v3.7 ; it uses the scene's memory multiplied by the cores that render it (each core needs it's own copy), while v3.7 uses only one copy for all the threads. In simple words; If THAT quad socket K10 system (with 6G of RAM) was using v3.6, the amount of RAM has been multiplied by 16, more or less placing a warranty on heavy HDD swapping. |
With all due respect - I believe that most of the people got the point about 3.6 using A LOT of RAM the first time You mentioned it and the rest got it with the second post on the subject. It is a fact that the 3.7 beta was used for the test and it made use of all available cores (i.e. 16). 4933>4000.
I think AMD is sane and wanted to show a scaling improvement. Since we don't know the frequency of any of these CPUs, the demo really isn't showing anything more than the scaling improvement. As has been said, AMD would truly deserve to be kicked if they tried to showcase this as an improvement in performance other than scaling.
The only one who didn't get it is you. Did you miss the:
| Quote : ...more or less placing a warranty on heavy HDD swapping. |
If that system fell back on HDD-swapped memory after it used up ALL the RAM, the render can slow down VEEEEERY much, because memory then bottlenecks the CPU, and once it starts reading from the HD it does not really matter whether one CPU is better or worse than the other.
Short story: last week I had to render a file which, in the pre-render phase, was 1.2 GB. My 3.0 GHz P4 at work, with 2 GB of RAM, crunched it in one shot in ~20 minutes, while my home X2 4200+ (much faster, but with only 1 GB of RAM) took 28 minutes, and if you opened Task Manager you could easily understand why.
| Quote : Any good version that I can run on my desktop? You know, just to test my old beater. |
To get Povray running, you need to get the 3.6 executable package from:
http://www.povray.org/download/
then get the 3.7 zip package:
http://www.povray.org/beta/
Install 3.6 then unzip 3.7.
At least they finally updated the beta so that it wasn't already expired and forced one to change the computer clock just to get it to run.
| Quote : If that system used HDD swapped memory after it finished ALL the RAM, the render time can slow down VEEEEERY much, because the RAM bottlenecks the CPU and does not really matter if the CPU is worse or better than the other; if it starts reading from the HD |
The Povray benchmark uses about 30MB of memory. Here Xbitlabs' V8 system scored 4677 with 4GB of memory:
http://www.xbitlabs.com/articles/c [...] v8_11.html
| Quote : I think the table you provided has some errors. According to ARS Technica, the P6 has a 12 stage pipeline and the Pentium-M pipeline depth was never made publicly available. Off the top of my head, I don't know what else, if anything, is inaccurate, but you may want to check the source where that table was obtained. Chances are there are more errors. |
Wanna know what's funny.. looks like Arstechnica got it wrong. The P6 does in fact have 10 stages.
Virginia.edu
And also here.. Wikipedia P6
| Quote : Superpipelining, which increased from Pentium's 5-stage pipeline to 14 of the Pentium Pro, and eventually morphed into the 10-stage pipeline of the Pentium III, and the 12- to 14-stage pipeline of the Pentium M. |
| Quote : Finally a straight answer about the multithreading of POV-Ray. I was wondering because a QFX system got just under what a quad Opteron did in the demo. That means 4-8 core scaling in the same envelope. |
Is this yet another case of seeing only what you want to see? POVRay has no documented issue scaling past 2 sockets, yet you believe m25 right away.
I don't think there can be a dispute that the machine on the left loaded 8 cores and the machine on the right, 16. Watch the original video again - http://www.youtube.com/watch?v=VGiv9Dtrc5Q. From 0:45 to 1:10, pause or take a snapshot and count up the columns representing cores in task manager - on the left, clearly 8, and on the right, I know the video is blurry, but it's at least pretty close to 16.
Observe that the overall load represented by the separate green bars at the left of task manager remained at around 100%. That proves there was no substantial penalty from disk swapping during the videotaped portion. The benchmark only takes <30MB here on 32-bit Windows 2003 Server (probably the OS they were using); multiplying that by 16 cores and 2x for worst-case 64-bit memory scaling, the total comes out to 960MB RAM, so total memory is a non-issue.
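The memory arithmetic in this thread can be written down as a tiny sketch. It models the claim made earlier in the thread (one scene copy per thread in v3.6, one shared copy in v3.7), which is the posters' description and not something checked against POV-Ray's source. The function name and the `width_factor` parameter (standing in for the "2x for worst-case 64-bit" allowance) are made up for illustration.

```python
def render_memory_mb(scene_mb, threads, per_thread_copy, width_factor=1.0):
    """Estimate scene memory in MB.

    per_thread_copy=True models the thread's claim about v3.6 (one copy per
    thread); False models a single copy shared by all threads (v3.7).
    """
    copies = threads if per_thread_copy else 1
    return scene_mb * copies * width_factor

# Worst case from the post above: 30 MB benchmark, 16 threads, 2x for 64-bit.
worst_case = render_memory_mb(30, 16, per_thread_copy=True, width_factor=2.0)
print(f"benchmark, per-thread copies: {worst_case:.0f} MB")   # 960 MB

# The earlier hypothetical: a ~500 MB scene on a 16-core box with 6 GB of RAM.
big_scene = render_memory_mb(500, 16, per_thread_copy=True)
ram_mb = 6 * 1024
print(f"500 MB scene, per-thread copies: {big_scene:.0f} MB "
      f"({'exceeds' if big_scene > ram_mb else 'fits in'} {ram_mb} MB RAM)")
```

So the swapping scenario only bites for a scene in the hundreds of megabytes; with the ~30 MB benchmark scene, even the pessimistic per-thread model stays under 1 GB.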
| Quote : Wanna know what's funny.. looks like Arstechnica got it wrong. The P6 does in fact have 10 stages. Virginia.edu And also here.. Wikipedia P6 |
| Quote : Superpipelining, which increased from Pentium's 5-stage pipeline to 14 of the Pentium Pro, and eventually morphed into the 10-stage pipeline of the Pentium III, and the 12- to 14-stage pipeline of the Pentium M. |
I wouldn't trust the PowerPoint from UVa either. It doesn't cite any sources or the credentials of the author (or even who the author is for that matter.) Plus it contradicts your table. The PPT says that Netburst had a pipeline depth of 20, yet the table says 21+8.
I don't think Wiki can be trusted here either. (We all know all too well that Wiki is to be taken with a grain of salt.) It states that the Pentium Pro was 14 stages and P3 was 10. Heck, the Wiki article even contradicts itself regarding the length of the PM pipeline. I cannot find any other source that states the pipeline was changed from PP to P3.
I only brought it up because I don't know how relevant the table is to the thread if it's not even accurate.
Ryan
What I meant was, K8 didn't perform well with SSE, and apparently neither does K10.
| Quote : Second, comparing the K10 instruction latency with the K8 instruction latency, we find that K10 has little, if any, improvement on scalar SSE instructions; worse yet, some CVTxx2yy instructions are even downgraded and have longer decode and higher latency. What this shows is that PoV-Ray SSE remains unfriendly to both the K8 and K10 microarchitectures. Thus the fact that 16 cores of K10 can still almost double the speed of 8 cores of K8 actually implies there are some core improvements at work elsewhere inside the K10 design. |
They are scaling somehow, though. Using what improvements? Ideas? I know POV-Ray isn't K8-friendly, so maybe even without any changes (SSE, SMP) it still scales well.
| Quote : I wouldn't trust the PowerPoint from UVa either. It doesn't cite any sources or the credentials of the author (or even who the author is for that matter.) Plus it contradicts your table. The PPT says that Netburst had a pipeline depth of 20, yet the table says 21+8. I only brought it up because I don't know how relevant the table is to the thread if it's not even accurate. |
The table is 100% accurate.
Manchester.edu
INTEL.com <------
| Quote : The idea of out-of-order execution, or executing independent program instructions out of program order to achieve a higher level of hardware utilization, was first implemented in the P6 microarchitecture. Instructions are executed through an out-of-order 10 stage pipeline in program order. |
Netburst Northwood has a 21-stage pipeline (depending on how you look at it).
Intel.com Willamette had a 20-stage pipeline.
The +8 represents the much despised and ridiculed mispredictions. It's not an accurate number but rather an estimate. Although the whole 20-21 stages are misprediction pipelines, the Netburst architecture actually mispredicted quite often, more so than any other architecture thus far. Intel's Rapid Execution Engine (ALUs running at twice the clock speed) helps alleviate some of the performance cost of a misprediction, but it's not enough. The architecture itself is VERY inefficient.
| Quote : Gahhh you didn't read the article now did you...
|
Here's a nice diagram for you.. it contains AMD's K8 vs. Intel Core (not to be confused with Core 2).
Now look at the segment on the lower part of that image where we see "MAX DP FP / CYCLE".
Even Intel's Core architecture can do a better job than AMD's K8. Core 2 can do 5 per core per clock. K10 is no different than K8 in this respect and can only process a maximum of 3 per core per clock.
Then there's the second part of this... Random register access.. Core 2 is 400% faster than K8 at this; K10 is twice as fast as K8 at this, and thus half the speed of Core 2 per clock.
Not all apps use these optimizations, but MANY professional applications (if not most, as I stated before) do. Games don't really use this particular optimization either, since the graphics card (GPU) handles most of the FP load while the CPU concentrates on AI and, for now, physics (mainly a high integer load).
This explains why AMD chose to target its integer performance and ignored this one downfall, and also why you should read the article.
If this is the case, then it might mean we have another r600 here, one that has extreme power in one area, but has a few huge bottlenecks that will completely hamper its performance
Makes me wonder a bit really if we have r600 delays because they can't beat core 2, or if they really do have legit reasons
btw...I fixed most of your spelling errors
| Quote : I have said it from the start; in the presentation, the guy says that POV-Ray only recognizes 2 sockets, so that demo was 4 (2sockets x 2cores) K8s and 8 (2sockets x 4cores) K10s, not 8 vs 16 cores, and if that was even the 2.1 GHz Barcelona (not the lowest clocked 1.9 one), a 2.5GHz barcy will improve that figure by 23%, pulverizing Intel's score, above 5000. Again, till we have crisp numbers, these are only assumptions. However, if that is the score of a 3 GHz, 8core V8, a 2.5GHz, 4x4 Agena, will do better IF, the one that scored ~4000 was @ 2.1GHz or lower. So, at the end, this is more a point in favor of AMD rather than Intel. |
Finally a straight answer about the multithreading of POV-Ray. I was wondering because a QFX system got just under what a quad Opteron did in the demo. That means 4-8 core scaling in the same envelope.
Is this yet another case of seeing only what you want to see? POVRay has no documented issue scaling past 2 sockets, yet you believe m25 right away.
I don't think there can be a dispute that the machine on the left loaded 8 cores and the machine on the right, 16. Watch the original video again - http://www.youtube.com/watch?v=VGiv9Dtrc5Q. From 0:45 to 1:10, pause or take a snapshot and count up the columns representing cores in task manager - on the left, clearly 8, and on the right, I know the video is blurry, but it's at least pretty close to 16.
Observe that the overall load represented by the separate green bars at the left of task manager remained at around 100%. That proves there was no substantial penalty from disk swapping during the videotaped portion. The benchmark only takes <30MB here on 32-bit Windows 2003 Server (probably the OS they were using); multiplying that by 16 cores and 2x for worst-case 64-bit memory scaling, the total comes out to 960MB RAM, so total memory is a non-issue.
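The memory estimate above can be checked with a quick sketch. The 30 MB per-instance figure, the 16 cores, and the 2x factor for 64-bit are all taken from the post; treating memory use as scaling linearly with core count is the post's own worst-case assumption, and the function name is only illustrative.

```python
def worst_case_memory_mb(per_instance_mb, cores, width_factor):
    """Upper bound on benchmark memory, assuming it scales linearly per core."""
    return per_instance_mb * cores * width_factor

# Figures from the post: <30 MB on 32-bit, 16 cores, 2x for worst-case 64-bit scaling.
total_mb = worst_case_memory_mb(30, 16, 2)
print(total_mb, "MB")
```

Since the real per-instance footprint is stated as under 30 MB, this is an upper bound rather than a measurement.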
Typical AMD fanpoi behavior. m25's point was debunked yet you still believe him Baron. I guess you didn't read the other posts?
| Quote : Gahhh you didn't read the article now did you...
|
Here's a nice diagram for you.. it contains AMD's K8 vs. Intel Core (not to be confused with Core 2).
Now look at the segment on the lower part of that image where we see "MAX DP FP / CYCLE".
Even Intel's Core architecture can do a better job than AMD's K8. Core 2 can do 5 per core per clock. K10 is no different than K8 in this respect and can only process a maximum of 3 per core per clock.
Then there's the second part of this... Random register access.. Core 2 is 400% faster than K8 at this; K10 is twice as fast as K8 at this, and thus half the speed of Core 2 per clock.
Not all apps use these optimizations, but MANY professional applications (if not most, as I stated before) do. Games don't really use this particular optimization either, since the graphics card (GPU) handles most of the FP load while the CPU concentrates on AI and, for now, physics (mainly a high integer load).
This explains why AMD chose to target its integer performance and ignored this one downfall, and also why you should read the article.
If this is the case, then it might mean we have another r600 here, one that has extreme power in one area, but has a few huge bottlenecks that will completely hamper its performance
Makes me wonder a bit really if we have r600 delays because they can't beat core 2, or if they really do have legit reasons
btw...I fixed most of your spelling errors
Not really.. as I've tried to explain, this is only relevant to apps that rely on double-precision floating-point operations. POV-Ray is one such app. So yes, Barcelona seems to be bottlenecked so far in this particular app, if the explanation given proves to be true.
It makes sense to me. But make no mistake: this is not an indicator of its processing power in the general usage apps that we enjoy, which are mainly games, video/audio encoding and video/audio transcoding.
Man, I seriously hope that POV-Ray has scaling issues. If it doesn't, and that's the performance of 16 K10 cores compared to 8 Clovertown cores (even though they are clocked roughly 45% higher, if what I've heard is true), then we should be worried about K10's performance.
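For what it's worth, the worry above can be put into rough numbers. This is only a sketch: the scores (4933 for the 8-core Intel box, ~4000 for the 16-core AMD demo) are the ones quoted in this thread, the 3.0 GHz Intel clock and the "45% higher" ratio are unconfirmed guesses from the thread, and it assumes POV-Ray scales linearly with cores and clock, which is exactly what is being disputed here.

```python
def score_per_core_ghz(score, cores, clock_ghz):
    """Normalize a benchmark score by core count and clock speed."""
    return score / (cores * clock_ghz)

# Scores quoted in this thread; the clock speeds are assumptions.
intel_clock_ghz = 3.0                    # assumed, from the "3 GHz V8" guess
amd_clock_ghz = intel_clock_ghz / 1.45   # assumed, from the "roughly 45% higher" rumor

intel = score_per_core_ghz(4933, 8, intel_clock_ghz)
amd = score_per_core_ghz(4000, 16, amd_clock_ghz)

print(f"Intel: {intel:.1f} per core per GHz")
print(f"AMD:   {amd:.1f} per core per GHz")
```

If the benchmark does not actually scale to 16 cores, the AMD figure here understates per-core performance, so treat the comparison as a worst case.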
They would have had to use Windows Server 2003 as the OS, as Windows XP and Vista will only work with two sockets and the POV-Ray 3.7 beta has not been released for Linux quite yet (grumble grumble.) The only build that'll work on OS X or Linux is 3.6, and that does not support SMP no matter the OS.
| Quote : Man, I seriously hope that POV-Ray has scaling issues. If it doesn't, and that's the performance of 16 K10 cores compared to 8 Clovertown cores (even though they are clocked roughly 45% higher, if what I've heard is true), then we should be worried about K10's performance. |
It won't be.... at least I don't think it will. I fully expect K10 to obliterate Core 2 per clock and give Penryn a run for its money.
The Intel link is trustworthy. Add one to the inaccurate Wiki articles total. That is also the first time I can think of that Ars Technica was flat out wrong. I take it the table is yours, based on how defensive you were about it. I meant nothing more than to question an inconsistency I saw. Thank you for clearing it up.
Ryan
Just wondering, does XP Pro x64 support more than 2 sockets? It's based on Server 2003 from what I hear, so it's possible they may have used x64.
Must be Server 2003 because according to the M$ website, XP x64 still only supports 2 sockets.
Link
| Quote : Multiprocessing and multicore processor support
|
Ryan
