Intel responds to AMD's POV display.

From our favorite rag..
http://theinquirer.net/default.aspx?article=39896

..."That said, the numbers speak for themselves, 4933, almost 5000, with only 8 cores."

I wonder how AMD is going to react to this. Maybe they will release more info about their test system to show if there is going to be enough headroom to beat Intel or not.


I have said it from the start; in the presentation, the guy says that POV-Ray only recognizes 2 sockets, so that demo was 4 (2sockets x 2cores) K8s and 8 (2sockets x 4cores) K10s, not 8 vs 16 cores, and if that was even the 2.1 GHz Barcelona (not the lowest clocked 1.9 one), a 2.5GHz barcy will improve that figure by 23%, pulverizing Intel's score, above 5000. Again, till we have crisp numbers, these are only assumptions. However, if that is the score of a 3 GHz, 8core V8, a 2.5GHz, 4x4 Agena, will do better IF, the one that scored ~4000 was @ 2.1GHz or lower. So, at the end, this is more a point in favor of AMD rather than Intel.
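
A quick sanity check of the clock-scaling argument above can be sketched in a few lines. It assumes the score scales linearly with clock speed, which is an assumption and not something the demo established; the ~4000 and 4933 figures are the ones quoted in this thread, and `scaled_score` is a hypothetical helper name.

```python
# Back-of-the-envelope check of the clock-scaling argument.
# Assumption (not from the demo): benchmark score scales linearly with clock.

def scaled_score(score: float, clock_from_ghz: float, clock_to_ghz: float) -> float:
    """Project a benchmark score to a different clock, assuming linear scaling."""
    return score * clock_to_ghz / clock_from_ghz

INTEL_V8_SCORE = 4933   # figure quoted from the Inquirer article
AMD_DEMO_SCORE = 4000   # the "~4000" figure from the post; approximate
TARGET_CLOCK_GHZ = 2.5

for base_clock in (1.9, 2.1):   # candidate demo clocks mentioned in the post
    projected = scaled_score(AMD_DEMO_SCORE, base_clock, TARGET_CLOCK_GHZ)
    gain_pct = (TARGET_CLOCK_GHZ / base_clock - 1) * 100
    verdict = "above" if projected > INTEL_V8_SCORE else "below"
    print(f"{base_clock} GHz -> {TARGET_CLOCK_GHZ} GHz: +{gain_pct:.0f}%, "
          f"projected {projected:.0f} ({verdict} {INTEL_V8_SCORE})")
```

Under this linear assumption, going from 2.1 GHz to 2.5 GHz works out to roughly a 19% gain, so whether the projection clears Intel's score depends heavily on which clock the demo chip actually ran at.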

Reply to m25

http://www.uberpulse.com/us/2007/0 [...] s_fast.php
During what time interval does he say it recognizes only 2 sockets? He says that Intel's Core is only capable of 2 sockets, then shows the 4-socket machines, and Task Manager clearly shows 8 and 16 cores.

Reply to r0ck

Quote :

I have said it from the start; in the presentation, the guy says that POV-Ray only recognizes 2 sockets, so that demo was 4 (2sockets x 2cores) K8s and 8 (2sockets x 4cores) K10s, not 8 vs 16 cores, and if that was even the 2.1 GHz Barcelona (not the lowest clocked 1.9 one), a 2.5GHz barcy will improve that figure by 23%, pulverizing Intel's score, above 5000. Again, till we have crisp numbers, these are only assumptions. However, if that is the score of a 3 GHz, 8core V8, a 2.5GHz, 4x4 Agena, will do better IF, the one that scored ~4000 was @ 2.1GHz or lower. So, at the end, this is more a point in favor of AMD rather than Intel.




http://www.povray.org/download/

The most significant change from the end-user point of view between versions 3.6 and 3.7 is the addition of SMP (symmetric multiprocessing) support, which in a nutshell allows the renderer to run on as many CPU's (or cores) as you have installed on your computer. This will be particularly useful for those users who intend purchasing a dual-core CPU or who already have a two (or more) processor machine.

Reply to bonkers

The change from 3.6 to 3.7 refers to memory usage: 3.6 sucked so much that each thread allocated its own copy of the object in RAM, so to use 8 cores you used 8 times more RAM than you normally would. However, I haven't seen any POV-Ray documentation on the maximum number of cores and sockets, and even some of the most advanced renderers have the core count limited to 8, while a normal copy of Windows won't let them run on more than 2 sockets.

Reply to m25

Quote :

I have said it from the start; in the presentation, the guy says that POV-Ray only recognizes 2 sockets, so that demo was 4 (2sockets x 2cores) K8s and 8 (2sockets x 4cores) K10s, not 8 vs 16 cores, and if that was even the 2.1 GHz Barcelona (not the lowest clocked 1.9 one), a 2.5GHz barcy will improve that figure by 23%, pulverizing Intel's score, above 5000. Again, till we have crisp numbers, these are only assumptions. However, if that is the score of a 3 GHz, 8core V8, a 2.5GHz, 4x4 Agena, will do better IF, the one that scored ~4000 was @ 2.1GHz or lower. So, at the end, this is more a point in favor of AMD rather than Intel.



Wonder where you got that info from? On the other hand I opted to email POV-Ray at this address: warp@iki.fi (on their site)... here's the exchange.

My Email:

Quote :

Hi there,

I just wanted to know if POV Ray supports Multi Core CPUs. And also, how many separate sockets does POV Ray support? (I was thinking about getting a 4-socket Dual Core Opteron setup, 8 cores total.)

Would POV Ray support such a configuration?

Thanks,



Their Response:

Quote :

Hello,

We apologize for any delay in our response to your question. To answer your question POV-Ray supports an x number of sockets. Which means there are no limits to how many sockets one uses. POV-Ray also supports Multi Core CPU's including newer Intel Clovertown Quad Core CPU's.


Thank you for your patience,

Alex
warp@iki.fi



So the whole AMD K10 quad-socket setup would not have had any issues being supported by POV-Ray. Of course Windows support is another thing entirely, and luckily for us most of these demos are run on Server versions of Windows.

Reply to ElMoIsEviL

OK, looks like they support any number of cores. However, it looks like a rather miraculous jump from choppy, not to say prohibitive, multithreaded support in v3.6 to an infinity of perfectly smooth, available threads in 3.7 :roll:

Reply to m25

I don't think advanced software would be limited to 2 sockets (not to mention that socket count would be a weird limitation).

Even Cinebench for example, which is useful for err... zero things does support as many CPUs as you have.

Reply to Ycon

And don't forget that they most probably used the crappy v3.6 (see the radical changes between that and 3.7) for their tests, because 3.7 is still in beta and I doubt they took beta software for that test:
http://www.povray.org/download/

Reply to m25

Quote :

And don't forget that most probably they got the crappy V 3.6 (see the radical changes between that and 3.7) for their tests, because 3.7 is still in beta and I doubt they took beta software for that test:
http://www.povray.org/download/



AMD would have taken the beta. AMD and Intel both will use beta software that suits their needs, much like ATi or nVIDIA use beta or alpha versions of games to try and sell their superiority.

AMD wanted to highlight the power of their 16 cores.. so it's only logical that they would have used version 3.7.

CLICK ME There's your info. AMD was using version 3.7 BETA.

Quote :

It turns out that none of these questions is appropriate. Because - (1) PoV-Ray's usage of SSE2 is not SSE (Stream SIMD Execution) at all, but really double-precision FP with random register access; (2) PoV-Ray SSE seems to be optimized more specifically for Core 2 than anything else, where on K8 it is only about 5% faster than PoV-Ray x87. This is also not going to change with K10.



Quote :

Second, comparing the K10 instruction latency with the K8 instruction latency, we find that K10 has little, if any, improvement on scalar SSE instructions; worse yet, some CVTxx2yy instructions are even downgraded and have longer decode and higher latency. What this shows is that PoV-Ray SSE remains unfriendly to both the K8 and K10 microarchitectures. Thus the fact that 16 cores of K10 can still almost double the speed of 8 cores of K8 actually implies there are some core improvements at work elsewhere inside the K10 design.



So again it's SSE coming to bite AMD in the butt. And this will probably also be AMD's undoing in many professional apps. They really need to get their SSE performance up to par with Intel.

Reply to ElMoIsEviL

Quote :

I dont think that an advanced software is limited to 2 sockets (not taking into consideration that socket count would be a weird limitation).

Even Cinebench for example, which is useful for err... zero things does support as many CPUs as you have.


If you read how v3.6 worked in multithreading, you won't think it's that advanced any more. Say you have a 1G object to load into RAM and run the render calculations on; with v3.6 you have to have one copy of the object FOR EACH THREAD 8O , so basically a 16-core Barcelona needs 16G of RAM instead of the 1G needed by v3.7. So if those systems had 6G of RAM and that scene was near 500MB, the Barcelona system should have swapped a lot to the HDD.
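
The memory argument above can be put into a rough model. This is only a sketch: the per-thread-copy behaviour of v3.6 is taken from the post and not verified against POV-Ray itself, OS and renderer overhead are ignored, and `memory_needed_mb` and `will_swap` are hypothetical helper names.

```python
# Rough memory model for the scenario described in the post.
# Assumptions: per-thread scene copies (as claimed for v3.6) versus a single
# shared copy (as claimed for v3.7); all other memory overhead is ignored.

def memory_needed_mb(scene_mb: float, threads: int, copy_per_thread: bool) -> float:
    """Total scene memory: one copy per thread, or one shared copy."""
    return scene_mb * threads if copy_per_thread else scene_mb

def will_swap(needed_mb: float, ram_mb: float) -> bool:
    """True when the working set no longer fits in physical RAM."""
    return needed_mb > ram_mb

RAM_MB = 6 * 1024   # the 6G figure from the post
THREADS = 16        # 4 sockets x 4 cores

for scene_mb in (500, 1024):
    for per_thread in (True, False):
        needed = memory_needed_mb(scene_mb, THREADS, per_thread)
        label = "per-thread copies" if per_thread else "one shared copy"
        print(f"{scene_mb} MB scene, {label}: {needed:.0f} MB needed, "
              f"swap={will_swap(needed, RAM_MB)}")
```

With per-thread copies, a 500MB scene on 16 threads already exceeds 6G, while a single shared copy fits easily; the conclusion only holds if the renderer really duplicates the scene per thread and the scene is that large.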

Reply to m25

Quote :

I dont think that an advanced software is limited to 2 sockets (not taking into consideration that socket count would be a weird limitation).

Even Cinebench for example, which is useful for err... zero things does support as many CPUs as you have.


If you read how the V3.6 worked in multithreading, you won't think any more it's that advanced; you have, say a 1G object to throw on the RAM and make the render calculations on it?!; with v3.6 you have to have one copy of the object FOR EACH THREAD 8O , so basically, a 16 core barcelona needs 16G of RAM instead of 1 needed by v3.7. So if those systems had 6G of RAM and that scene was near 500MB, the barcelona system should have swapped a lot on the HDD.

Good thing it used version 3.7 :)

Reply to ElMoIsEviL

Quote :

AMD wanted to highlight the power of their 16 cores.. so it's only logical that they would have used version 3.7.


AMD only wanted to show 2X scaling within the same TDP. However, having seen the block diagrams of both the K8 and K10, I just can't believe their SSE performance is identical, as one interpretation of that demo might make us think. In the end, it's all assumptions, and it's really stupid to talk about unspecified CPUs at unspecified clock rates running unspecified software.

Reply to m25

Well you should at least read that post, because it does explain why Intel's Core 2 does so well under POV-Ray. Basically POV-Ray is very optimized for SSE (SSE2 etc). As such, any processor able to execute SSE code the fastest will have an advantage, especially a processor such as the Core 2, whose entire design is about efficiency.

Most professional applications are optimized heavily for SSE (like 99.9% of them).

So AMD really needs to do something about their SSE performance... if not then Penryn will eat K10 alive (full SSE4 on Penryn) under the majority of applications.

What remains to be seen is the per clock performance of K10 which I do believe to be superior without a doubt over Core 2.

Reply to ElMoIsEviL

Quote :

Good thing it used version 3.7 :)


As you please, but till we get our feet on the ground, we're going more or less like this:
http://www.cedesign.com/davidmac/assets/images/Terrace.jpg

Reply to m25

Quote :

Well you should at least read that post. Because it does explain why Intel's Core 2 does so well under POV-Ray. Basically POV-Ray is very optimized for SSE (SSE2 etc). As such any processor able to execute SSE code the fastest will have an advantage. Especially a processor such as the Core2 who's entire design is about efficiency.


And isn't Barcelona built for SSE efficiency too, with double the number of SSE engines, better prefetching and things like that? Maybe not better than Core 2, but how can that thing perform THE SAME AS A K8 :?: :!:

Quote :


Most professional applications are optimizes heavilly for SSE (like 99.9% of them).
So AMD really needs to do something about their SSE perfomance... if not then Penryn will eat K10 alive (full SSE4 on Penryn) under the majority of applications.
What remains to be seen is the per clock performance of K10 which I do believe to be superior without a doubt over Core 2.


POV-Ray is only optimized up to SSE2, which AMD has supported since the first K8s, and it scales almost linearly in performance with Intel CPUs, so there is no need to mention SSE4 here because it really does not matter. Current professional software is mostly optimized for SSE2, because this ensures wider compatibility, starting from the first P4s and K8s. SSE3 is just starting to be used in such software, and SSE4 will take its time.
So, you still want us to go like this
http://www.cedesign.com/davidmac/assets/images/Terrace.jpg
There is nothing solid to base this discussion on till we have real numbers :D

Reply to m25

Gahhh you didn't read the article now did you...

I'll post again...

Quote :

Because - (1) PoV-Ray's usage of SSE2 is not SSE (Stream SIMD Execution) at all, but really double-precision FP with random register access;



Here's a nice diagram for you.. it contains AMD's K8 vs. Intel Core (not to be confused with Core 2).
http://img.photobucket.com/albums/v51/ElMoIsEviL/march-comp-sm.png

Now look at the segment on the lower part of that image where we see "MAX DP FP / CYCLE".

Even Intel's Core architecture can do a better job than AMD's K8. Core 2 can do 5 per core per clock. K10 is no different than K8 in this respect and can only process a maximum of 3 per core per clock.

Then there's the second part of this... random register access. Core 2 is 400% faster than K8 at this; K10 is twice as fast as K8 at this, thus half the speed of Core 2 per clock.

Not all apps use these optimisations, but MANY professional applications (if not most, as I stated before) do. Games don't really use this particular optimisation, as the graphics card (GPU) handles most of the FP load while the CPU concentrates on AI and, for now, physics (mainly high integer load).

This explains why AMD chose to target its integer performance and ignored this one downfall, and also why you should read the article.

Reply to ElMoIsEviL

Wait; who said Core2 performs 5 operations and K10 only 3; from all the articles I have seen, they process theoretically the same number of SSE instructions per cycle :roll:

Reply to m25

You're thinking IPC wise (Integers per Cycle)... not Double Precision FP wise.

Reply to ElMoIsEviL

Quote :

OK, looks like they support every number of cores, however, it looks a bit of a miraculous scaling from choppy, not to say prohibitive multithreaded support in V3.6 , to an infinity of perfectly smooth, available threads in 3.7 :roll:


That's simply what you get out of correctly supporting SMP in a multithreaded application. Sometimes the tiniest code changes can cause a huge performance and resource-usage impact. POVRay 3.7 exhibits 2x scaling while v. 3.6 as well as Cinebench exhibit ~1.8x scaling at best.
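
To put the 2x versus ~1.8x figures in perspective, here is a minimal sketch of parallel efficiency. It assumes both figures describe the speedup obtained when the core count is doubled, which is how the post reads but is not stated explicitly; `scaling_efficiency` is a hypothetical helper name.

```python
# Parallel efficiency when the core count changes.
# Assumption: "2x" and "~1.8x" refer to speedup from doubling the cores.

def scaling_efficiency(speedup: float, core_ratio: float) -> float:
    """Fraction of the ideal (linear) speedup actually achieved."""
    return speedup / core_ratio

for name, speedup in (("POV-Ray 3.7", 2.0), ("POV-Ray 3.6 / Cinebench", 1.8)):
    eff = scaling_efficiency(speedup, core_ratio=2.0)
    print(f"{name}: {speedup}x on 2x cores -> {eff:.0%} of linear scaling")
```

On these numbers, 2x is perfect linear scaling and ~1.8x is about 90% of it, so the gap between the two is real but not dramatic for a single doubling.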

Quote :

AMD wanted to highlight the power of their 16 cores.. so it's only logical that they would have used version 3.7.

http://abinstein.blogspot.com/2007 [...] -amds.html

There's your info. AMD was using version 3.7 BETA.


That's the best analysis I've seen so far and basically says all the marketing speak about 2x bandwidth, double-width FP units, and single-cycle SSE doesn't tell the full story in an "SSE2-supporting" application.

But another possibility to consider besides writing off K10 SSE units as broken (perhaps poorly reverse engineered) is that AMD hasn't had the resources to develop compilers like Intel does. There may be an implementation that lets K10 shine like C2D, but without a compiler to make it happen, K10 is stuck running code suboptimally.

Reply to Wr

Quote :

I have said it from the start; in the presentation, the guy says that POV-Ray only recognizes 2 sockets, so that demo was 4 (2sockets x 2cores) K8s and 8 (2sockets x 4cores) K10s, not 8 vs 16 cores, and if that was even the 2.1 GHz Barcelona (not the lowest clocked 1.9 one), a 2.5GHz barcy will improve that figure by 23%, pulverizing Intel's score, above 5000. Again, till we have crisp numbers, these are only assumptions. However, if that is the score of a 3 GHz, 8core V8, a 2.5GHz, 4x4 Agena, will do better IF, the one that scored ~4000 was @ 2.1GHz or lower. So, at the end, this is more a point in favor of AMD rather than Intel.



Finally a straight answer about the multithreading of POV-Ray. I was wondering because a QFX system got just under what a quad Opteron did in the demo. That means 4-8 core scaling in the same envelope.

Reply to BaronMatrix

Quote :

I have said it from the start; in the presentation, the guy says that POV-Ray only recognizes 2 sockets, so that demo was 4 (2sockets x 2cores) K8s and 8 (2sockets x 4cores) K10s, not 8 vs 16 cores, and if that was even the 2.1 GHz Barcelona (not the lowest clocked 1.9 one), a 2.5GHz barcy will improve that figure by 23%, pulverizing Intel's score, above 5000. Again, till we have crisp numbers, these are only assumptions. However, if that is the score of a 3 GHz, 8core V8, a 2.5GHz, 4x4 Agena, will do better IF, the one that scored ~4000 was @ 2.1GHz or lower. So, at the end, this is more a point in favor of AMD rather than Intel.



Finally a straight answer about the multithreading of POV-Ray. I was wondering because a QFX system got just under what a quad Opteron did in the demo. That means 4-8 core scaling in the same envelope.

I'm rather surprised that AMD chose POV-Ray over Sciencemark to highlight their architectural advantages. I guess scaling was their main criterion for the demo, and maybe Sciencemark isn't showing the same degree of scaling. :? It's inarguably known that the K8 (and I would assume K10) arch. owns the Sciencemark suite. :wink:

Reply to 1Tanker

Quote :

I have said it from the start; in the presentation, the guy says that POV-Ray only recognizes 2 sockets, so that demo was 4 (2sockets x 2cores) K8s and 8 (2sockets x 4cores) K10s, not 8 vs 16 cores, and if that was even the 2.1 GHz Barcelona (not the lowest clocked 1.9 one), a 2.5GHz barcy will improve that figure by 23%, pulverizing Intel's score, above 5000. Again, till we have crisp numbers, these are only assumptions. However, if that is the score of a 3 GHz, 8core V8, a 2.5GHz, 4x4 Agena, will do better IF, the one that scored ~4000 was @ 2.1GHz or lower. So, at the end, this is more a point in favor of AMD rather than Intel.



Finally a straight answer about the multithreading of POV-Ray. I was wondering because a QFX system got just under what a quad Opteron did in the demo. That means 4-8 core scaling in the same envelope.

You should have read the whole thread before responding.

Reply to turpit

Quote :

I have said it from the start; in the presentation, the guy says that POV-Ray only recognizes 2 sockets, so that demo was 4 (2sockets x 2cores) K8s and 8 (2sockets x 4cores) K10s, not 8 vs 16 cores, and if that was even the 2.1 GHz Barcelona (not the lowest clocked 1.9 one), a 2.5GHz barcy will improve that figure by 23%, pulverizing Intel's score, above 5000. Again, till we have crisp numbers, these are only assumptions. However, if that is the score of a 3 GHz, 8core V8, a 2.5GHz, 4x4 Agena, will do better IF, the one that scored ~4000 was @ 2.1GHz or lower. So, at the end, this is more a point in favor of AMD rather than Intel.



Finally a straight answer about the multithreading of POV-Ray. I was wondering because a QFX system got just under what a quad Opteron did in the demo. That means 4-8 core scaling in the same envelope.

Dum, read the whole thread why don't ya. Point debunked.

Reply to r0ck

Quote :

That's simply what you get out of correctly supporting SMP in a multithreaded application. Sometimes the tiniest code changes can cause a huge performance and resource-usage impact. POVRay 3.7 exhibits 2x scaling while v. 3.6 as well as Cinebench exhibit ~1.8x scaling at best.


It's not simple speed scaling I am talking about (I am also overlooking their stupid comments, according to which POV-Ray usually gains 100% 8O from doubling the cores, while those 'inexperienced' sites tested on slower CPUs and got ~85% gain; I don't know why the heck a slower CPU would be 20% less efficient than a faster one). v3.6 uses a hell of a lot more memory than v3.7: it uses the scene's memory multiplied by the number of cores that render it (each core needs its own copy), while v3.7 uses only one copy for all the threads.
In simple words: if THAT quad-socket K10 system (with 6G of RAM) was using v3.6, the amount of RAM needed would have been multiplied by 16, more or less placing a warranty on heavy HDD swapping.

Reply to m25

As mentioned before, Povray 3.6 doesn't support SMP.

Reply to accord99

Any good version that I can run on my desktop? You know, just to test my old beater.

Reply to chuckshissle

Quote :

In simple words; If THAT quad socket K10 system (with 6G of RAM) was using v3.6, the amount of RAM has been multiplied by 16, more or less placing a warranty on heavy HDD swapping.



In even simpler words, if AMD was dumb enough to do that then they deserve all this bad press about 10 times over ;)

Reply to SMU_Pony

Quote :

Gahhh you didn't read the article now did you...

I'll post again...

Because - (1) PoV-Ray's usage of SSE2 is not SSE (Stream SIMD Execution) at all, but really double-precision FP with random register access;



Here's a nice diagram for you.. it contains AMD's K8 vs. Intel Core (not to be confused with Core 2).
http://img.photobucket.com/albums/v51/ElMoIsEviL/march-comp-sm.png

Now look at the segment on the lower part of that image where we see "MAX DP FP / CYCLE".

Even Intel's Core architecture can do a better job then AMD's K8. Core 2 can do 5 per core per clock. K10 is no different then K8 in this respect and can only process a maximum of 3 per core per clock.

Then there's the second part of thise... Random register Access.. Core 2 is 400% faster then K8 at this, K10 is twice as faster as K8 at this thus half the speed of Core 2 per clock.

Not all apps use these optimisations, but MANY professional applications (if not most as I stated before) do. Games don't really use this particular optimisation as well as the Graphics card (GPU) mainly handles most of the FP load while the CPU concentrates on AI and for now physics (Mainly high Integer load).

This exlains why AMD choose to target it's integer performance and ignored this one downfall and also why you should read the article.

I think the table you provided has some errors. According to Ars Technica, the P6 has a 12-stage pipeline and the Pentium-M pipeline depth was never made publicly available. Off the top of my head, I don't know what else, if anything, is inaccurate, but you may want to check the source where that table was obtained. Chances are there are more errors.

Here are the Ars links:
P6 12 stage pipeline
Pentium-M info
More Pentium-M info

Ryan

Reply to rninneman

Quote :

That's simply what you get out of correctly supporting SMP in a multithreaded application. Sometimes the tiniest code changes can cause a huge performance and resource-usage impact. POVRay 3.7 exhibits 2x scaling while v. 3.6 as well as Cinebench exhibit ~1.8x scaling at best.


It's not simple speed scaling I am talking about (I am also overlooking their stupid comments after which POV-Ray usually gains 100% 8O with doubling cores and those 'inexperienced' sites have tested on slower CPUs-don't know why the heck a slower CPU is 20% less efficient than a faster one- and gotten ~85% of gain)
; v3.6 uses a hell lot more memory than v3.7 ; it uses the scene's memory multiplied by the cores that render it (each core needs it's own copy), while v3.7 uses only one copy for all the threads.
In simple words; If THAT quad socket K10 system (with 6G of RAM) was using v3.6, the amount of RAM has been multiplied by 16, more or less placing a warranty on heavy HDD swapping.

With all due respect - I believe that most of the people got the point about 3.6 using A LOT of RAM the first time You mentioned it and the rest got it with the second post on the subject. It is a fact that the 3.7 beta was used for the test and it made use of all available cores (i.e. 16). 4933>4000.

Reply to -xyzak-

I think AMD is sane and wanted to show a scaling improvement. Since we don't know the frequency of any of these CPUs, it really isn't showing anything more than the scaling improvements. As has been said, AMD would truly deserve to be kicked if they tried to showcase this as an improvement in performance other than scaling.

Reply to jaydeejohn

:lol: The only one who didn't get it is you. Did you miss the:

Quote :

...more or less placing a warranty on heavy HDD swapping.


If that system fell back to HDD swap after using up ALL the RAM, the render can slow down VEEEEERY much, because memory bottlenecks the CPU, and once it starts reading from the HD it does not really matter whether one CPU is worse or better than the other :roll:
Short story: Last week I had to render a file which, in the pre-render phase, was 1.2GB. My 3.0GHz P4 at work, with 2G of RAM, crunched it in one shot in ~20 minutes, while my home X2 4200+ (much faster but with only 1G of RAM) only managed to do it in 28 minutes, and if you opened Task Manager, you could easily understand why :wink:

Reply to m25

Quote :

Any good version that I can run on my desktop? You know, just to test my old beater.


To get Povray running, you need to get the 3.6 executable package from:

http://www.povray.org/download/

then get the 3.7 zip package:
http://www.povray.org/beta/

Install 3.6 then unzip 3.7.

At least they finally updated the beta so that it wasn't already expired and forced one to change the computer clock just to get it to run.

Reply to accord99

Quote :

:lol: The only one who didn't get it is you. Did you miss the:
...more or less placing a warranty on heavy HDD swapping.


If that system used HDD swapped memory after it finished ALL the RAM, the render time can slow down VEEEEERY much, because the RAM bottlenecks the CPU and does not really matter if the CPU is worse or better than the other; if it starts reading from the HD :roll:

The Povray benchmark uses about 30MB of memory. Here Xbitlabs' V8 system scored 4677 with 4GB of memory:

http://www.xbitlabs.com/articles/c [...] v8_11.html

Reply to accord99

Quote :

Gahhh you didn't read the article now did you...

I'll post again...

Because - (1) PoV-Ray's usage of SSE2 is not SSE (Stream SIMD Execution) at all, but really double-precision FP with random register access;



Here's a nice diagram for you.. it contains AMD's K8 vs. Intel Core (not to be confused with Core 2).
http://img.photobucket.com/albums/v51/ElMoIsEviL/march-comp-sm.png

Now look at the segment on the lower part of that image where we see "MAX DP FP / CYCLE".

Even Intel's Core architecture can do a better job then AMD's K8. Core 2 can do 5 per core per clock. K10 is no different then K8 in this respect and can only process a maximum of 3 per core per clock.

Then there's the second part of thise... Random register Access.. Core 2 is 400% faster then K8 at this, K10 is twice as faster as K8 at this thus half the speed of Core 2 per clock.

Not all apps use these optimisations, but MANY professional applications (if not most as I stated before) do. Games don't really use this particular optimisation as well as the Graphics card (GPU) mainly handles most of the FP load while the CPU concentrates on AI and for now physics (Mainly high Integer load).

This exlains why AMD choose to target it's integer performance and ignored this one downfall and also why you should read the article.

I think the table you provided has some errors. According to ARS Technica, the P6 has a 12 stage pipeline and the Pentium-M pipeline depth was never made publicly available. Off the top of my head, I don't know what else, if anything, is inaccurate, but you may want to check the source where that table was obtained. Chances are there are more errors.

Here are the ARS links:
P6 12 stage pipeline
Pentium-M info
More Pentium-M info

Ryan

Wanna know what's funny.. looks like Arstechnica got it wrong. The P6 does in fact have 10 stages.
Virginia.edu

And also here.. Wikipedia P6

Quote :

Superpipelining, which increased from Pentium's 5-stage pipeline to 14 of the Pentium Pro, and eventually morphed into the 10-stage pipeline of the Pentium III, and the 12- to 14-stage pipeline of the Pentium M.

Reply to ElMoIsEviL

Quote :

I have said it from the start; in the presentation, the guy says that POV-Ray only recognizes 2 sockets, so that demo was 4 (2sockets x 2cores) K8s and 8 (2sockets x 4cores) K10s, not 8 vs 16 cores, and if that was even the 2.1 GHz Barcelona (not the lowest clocked 1.9 one), a 2.5GHz barcy will improve that figure by 23%, pulverizing Intel's score, above 5000. Again, till we have crisp numbers, these are only assumptions. However, if that is the score of a 3 GHz, 8core V8, a 2.5GHz, 4x4 Agena, will do better IF, the one that scored ~4000 was @ 2.1GHz or lower. So, at the end, this is more a point in favor of AMD rather than Intel.



Finally a straight answer about the multithreading of POV-Ray. I was wondering because a QFX system got just under what a quad Opteron did in the demo. That means 4-8 core scaling in the same envelope.

Is this yet another case of seeing only what you want to see? POVRay has no documented issue scaling past 2 sockets, yet you believe m25 right away.

I don't think there can be a dispute that the machine on the left loaded 8 cores and the machine on the right, 16. Watch the original video again - http://www.youtube.com/watch?v=VGiv9Dtrc5Q. From 0:45 to 1:10, pause or take a snapshot and count up the columns representing cores in task manager - on the left, clearly 8, and on the right, I know the video is blurry, but it's at least pretty close to 16.

Observe that the overall load represented by the separate green bars at the left of task manager remained at around 100%. That proves there was no substantial penalty from disk swapping during the videotaped portion. The benchmark only takes <30MB here on 32-bit Windows 2003 Server (probably the OS they were using); multiplying that by 16 cores and 2x for worst-case 64-bit memory scaling, the total comes out to 960MB RAM, so total memory is a non-issue.

Reply to Wr

Quote :

Gahhh you didn't read the article now did you...

I'll post again...

Because - (1) PoV-Ray's usage of SSE2 is not SSE (Stream SIMD Execution) at all, but really double-precision FP with random register access;



Here's a nice diagram for you.. it contains AMD's K8 vs. Intel Core (not to be confused with Core 2).
http://img.photobucket.com/albums/v51/ElMoIsEviL/march-comp-sm.png

Now look at the segment on the lower part of that image where we see "MAX DP FP / CYCLE".

Even Intel's Core architecture can do a better job then AMD's K8. Core 2 can do 5 per core per clock. K10 is no different then K8 in this respect and can only process a maximum of 3 per core per clock.

Then there's the second part of thise... Random register Access.. Core 2 is 400% faster then K8 at this, K10 is twice as faster as K8 at this thus half the speed of Core 2 per clock.

Not all apps use these optimisations, but MANY professional applications (if not most as I stated before) do. Games don't really use this particular optimisation as well as the Graphics card (GPU) mainly handles most of the FP load while the CPU concentrates on AI and for now physics (Mainly high Integer load).

This exlains why AMD choose to target it's integer performance and ignored this one downfall and also why you should read the article.

I think the table you provided has some errors. According to ARS Technica, the P6 has a 12 stage pipeline and the Pentium-M pipeline depth was never made publicly available. Off the top of my head, I don't know what else, if anything, is inaccurate, but you may want to check the source where that table was obtained. Chances are there are more errors.

Here are the ARS links:
P6 12 stage pipeline
Pentium-M info
More Pentium-M info

Ryan

Wanna know what's funny.. looks like Arstechnica got it wrong. The P6 does in fact have 10 stages.
Virginia.edu

And also here.. Wikipedia P6

Quote :

Superpipelining, which increased from Pentium's 5-stage pipeline to 14 of the Pentium Pro, and eventually morphed into the 10-stage pipeline of the Pentium III, and the 12- to 14-stage pipeline of the Pentium M.



I wouldn't trust the PowerPoint from UVa either. It doesn't cite any sources or the credentials of the author (or even who the author is for that matter.) Plus it contradicts your table. The PPT says that Netburst had a pipeline depth of 20, yet the table says 21+8.

I don't think Wiki can be trusted here either. (We all know all too well that Wiki is to be taken with a grain of salt.) It states that the Pentium Pro was 14 stages and P3 was 10. Heck, the Wiki article even contradicts itself regarding the length of the PM pipeline. I cannot find any other source that states the pipeline was changed from PP to P3.

I only brought it up because I don't know how relevant the table is to the thread if it's not even accurate.

Ryan

Reply to rninneman

What I meant was, K8 didn't perform well with SSE, and apparently neither does K10.

Quote :

Second, comparing the K10 instruction latency with the K8 instruction latency, we find that K10 has little, if any, improvement on scalar SSE instructions; worse yet, some CVTxx2yy instructions are even downgraded and have longer decode and higher latency. What this shows is that PoV-Ray SSE remains unfriendly to both the K8 and K10 microarchitectures. Thus the fact that 16 cores of K10 can still almost double the speed of 8 cores of K8 actually implies there are some core improvements at work elsewhere inside the K10 design.

They are scaling somehow, though. Using what improvements? Ideas? I know POV-Ray isn't K8-friendly, so maybe even without any changes (SSE, SMP) it still scales well.

Reply to jaydeejohn

Quote :

Gahhh you didn't read the article now did you...

I'll post again...

Because - (1) PoV-Ray's usage of SSE2 is not SSE (Stream SIMD Execution) at all, but really double-precision FP with random register access;



Here's a nice diagram for you.. it contains AMD's K8 vs. Intel Core (not to be confused with Core 2).
http://img.photobucket.com/albums/v51/ElMoIsEviL/march-comp-sm.png

Now look at the segment on the lower part of that image where we see "MAX DP FP / CYCLE".

Even Intel's Core architecture can do a better job than AMD's K8. Core 2 can do 5 per core per clock. K10 is no different from K8 in this respect and can only process a maximum of 3 per core per clock.

Then there's the second part of this... random register access. Core 2 is 400% faster than K8 at this; K10 is twice as fast as K8 at this, and thus half the speed of Core 2 per clock.

Not all apps use these optimisations, but MANY professional applications (if not most, as I stated before) do. Games don't really use this particular optimisation, as the graphics card (GPU) handles most of the FP load while the CPU concentrates on AI and, for now, physics (mainly high integer load).

This explains why AMD chose to target its integer performance and ignored this one downfall, and also why you should read the article.

I think the table you provided has some errors. According to Ars Technica, the P6 has a 12-stage pipeline, and the Pentium-M pipeline depth was never made publicly available. Off the top of my head, I don't know what else, if anything, is inaccurate, but you may want to check the source where that table was obtained. Chances are there are more errors.

Here are the Ars links:
P6 12 stage pipeline
Pentium-M info
More Pentium-M info

Ryan

Wanna know what's funny.. looks like Ars Technica got it wrong. The P6 does in fact have 10 stages.
Virginia.edu

And also here.. Wikipedia P6

Quote :

Superpipelining, which increased from Pentium's 5-stage pipeline to 14 of the Pentium Pro, and eventually morphed into the 10-stage pipeline of the Pentium III, and the 12- to 14-stage pipeline of the Pentium M.



I wouldn't trust the PowerPoint from UVa either. It doesn't cite any sources or the credentials of the author (or even who the author is for that matter.) Plus it contradicts your table. The PPT says that Netburst had a pipeline depth of 20, yet the table says 21+8.

I don't think Wiki can be trusted here either. (We all know all too well that Wiki is to be taken with a grain of salt.) It states that the Pentium Pro was 14 stages and P3 was 10. Heck, the Wiki article even contradicts itself regarding the length of the PM pipeline. I cannot find any other source that states the pipeline was changed from PP to P3.

I only brought it up because I don't know how relevant the table is to the thread if it's not even accurate.

Ryan

The table is 100% accurate.

Manchester.edu
INTEL.com <------

Quote :

The idea of out-of-order execution, or executing independent program instructions out of program order to achieve a higher level of hardware utilization, was first implemented in the P6 microarchitecture. Instructions are executed through an out-of-order 10 stage pipeline in program order.



Netburst Northwood has a 21-stage pipeline (depending on how you look at it).
Intel.com Willamette had a 20-stage pipeline.

The +8 represents the much despised and ridiculed mispredictions. It's not an accurate number but rather an estimate. Although the whole 20-21 stages are the misprediction pipeline, the Netburst architecture actually mispredicted quite often, more so than any other architecture thus far. Intel's Rapid Execution Engine (ALUs running at twice the clock speed) helps alleviate some of the performance cost of a misprediction, but it's not enough. The architecture itself is VERY inefficient.

Reply to ElMoIsEviL

Quote :

Gahhh you didn't read the article now did you...

I'll post again...

Because - (1) PoV-Ray's usage of SSE2 is not SSE (Stream SIMD Execution) at all, but really double-precision FP with random register access;



Here's a nice diagram for you.. it contains AMD's K8 vs. Intel Core (not to be confused with Core 2).
http://img.photobucket.com/albums/v51/ElMoIsEviL/march-comp-sm.png

Now look at the segment on the lower part of that image where we see "MAX DP FP / CYCLE".

Even Intel's Core architecture can do a better job than AMD's K8. Core 2 can do 5 per core per clock. K10 is no different from K8 in this respect and can only process a maximum of 3 per core per clock.

Then there's the second part of this... random register access. Core 2 is 400% faster than K8 at this; K10 is twice as fast as K8 at this, and thus half the speed of Core 2 per clock.

Not all apps use these optimizations, but MANY professional applications (if not most, as I stated before) do. Games don't really use this particular optimization, as the graphics card (GPU) handles most of the FP load while the CPU concentrates on AI and, for now, physics (mainly high integer load).

This explains why AMD chose to target its integer performance and ignored this one downfall, and also why you should read the article.
If this is the case, then it might mean we have another R600 here: one that has extreme power in one area but a few huge bottlenecks that will completely hamper its performance.

Makes me wonder a bit, really, if we have R600 delays because they can't beat Core 2, or if they really do have legit reasons.

btw...I fixed most of your spelling errors :wink:

Reply to I_Love_Tacos

Quote :

I have said it from the start; in the presentation, the guy says that POV-Ray only recognizes 2 sockets, so that demo was 4 (2sockets x 2cores) K8s and 8 (2sockets x 4cores) K10s, not 8 vs 16 cores, and if that was even the 2.1 GHz Barcelona (not the lowest clocked 1.9 one), a 2.5GHz barcy will improve that figure by 23%, pulverizing Intel's score, above 5000. Again, till we have crisp numbers, these are only assumptions. However, if that is the score of a 3 GHz, 8core V8, a 2.5GHz, 4x4 Agena, will do better IF, the one that scored ~4000 was @ 2.1GHz or lower. So, at the end, this is more a point in favor of AMD rather than Intel.



Finally a straight answer about the multithreading of POV-Ray. I was wondering because a QFX system got just under what a quad Opteron did in the demo. That means 4-8 core scaling in the same envelope.
Is this yet another case of seeing only what you want to see? POV-Ray has no documented issue scaling past 2 sockets, yet you believe m25 right away.

I don't think there can be a dispute that the machine on the left loaded 8 cores and the machine on the right, 16. Watch the original video again - http://www.youtube.com/watch?v=VGiv9Dtrc5Q. From 0:45 to 1:10, pause or take a snapshot and count up the columns representing cores in task manager - on the left, clearly 8, and on the right, I know the video is blurry, but it's at least pretty close to 16.

Observe that the overall load represented by the separate green bars at the left of task manager remained at around 100%. That proves there was no substantial penalty from disk swapping during the videotaped portion. The benchmark only takes <30MB here on 32-bit Windows 2003 Server (probably the OS they were using); multiplying that by 16 cores and 2x for worst-case 64-bit memory scaling, the total comes out to 960MB RAM, so total memory is a non-issue.
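
The memory estimate above can be checked with a quick back-of-the-envelope calculation. This is only a sketch: the 30MB footprint, the 16 cores and the 2x 64-bit factor are the figures from this post, while the function name is made up for illustration, and it assumes the worst case where every core holds its own full copy of the benchmark data.

```python
def worst_case_memory_mb(per_core_mb, cores, scaling_factor):
    """Upper bound on total RAM if every core held its own copy of the benchmark."""
    return per_core_mb * cores * scaling_factor

# Figures from the post: <30MB footprint, 16 cores, 2x worst-case 64-bit scaling.
total_mb = worst_case_memory_mb(per_core_mb=30, cores=16, scaling_factor=2)
print(f"Worst-case footprint: {total_mb} MB")
```

Even this pessimistic bound stays under 1GB, which is why swapping is unlikely to have played a role in the demo.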

Typical AMD fanboy behavior. m25's point was debunked, yet you still believe him, Baron. I guess you didn't read the other posts?

Reply to clairvoyant129

Quote :

Gahhh you didn't read the article now did you...

I'll post again...

Because - (1) PoV-Ray's usage of SSE2 is not SSE (Stream SIMD Execution) at all, but really double-precision FP with random register access;



Here's a nice diagram for you.. it contains AMD's K8 vs. Intel Core (not to be confused with Core 2).
http://img.photobucket.com/albums/v51/ElMoIsEviL/march-comp-sm.png

Now look at the segment on the lower part of that image where we see "MAX DP FP / CYCLE".

Even Intel's Core architecture can do a better job than AMD's K8. Core 2 can do 5 per core per clock. K10 is no different from K8 in this respect and can only process a maximum of 3 per core per clock.

Then there's the second part of this... random register access. Core 2 is 400% faster than K8 at this; K10 is twice as fast as K8 at this, and thus half the speed of Core 2 per clock.

Not all apps use these optimizations, but MANY professional applications (if not most, as I stated before) do. Games don't really use this particular optimization, as the graphics card (GPU) handles most of the FP load while the CPU concentrates on AI and, for now, physics (mainly high integer load).

This explains why AMD chose to target its integer performance and ignored this one downfall, and also why you should read the article.
If this is the case, then it might mean we have another R600 here: one that has extreme power in one area but a few huge bottlenecks that will completely hamper its performance.

Makes me wonder a bit, really, if we have R600 delays because they can't beat Core 2, or if they really do have legit reasons.

btw...I fixed most of your spelling errors :wink:

Not really.. as I've tried to explain, this is only relevant to apps that rely on double-precision floating-point operations. POV-Ray is one such app. So yes, Barcelona seems to be bottlenecked so far in this particular app, if the explanation given proves to be true.

It makes sense to me. But make no mistake: this is not an indicator of its processing power in the general-usage apps that we enjoy, which are mainly games, video/audio encoding and video/audio transcoding.

Reply to ElMoIsEviL

Man, I seriously hope that POV-Ray has scaling issues. If it doesn't, and that's the performance of 16 K10 cores compared to 8 Clovertown cores (even though they are clocked roughly 45% higher, if what I've heard is true), then we should be worried about K10's performance.
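
To put a rough number on that worry, here is a sketch that normalizes the two scores per core and per clock. The ~4000 (16 K10 cores) and 4933 (8 Intel cores) scores are the ones quoted in this thread; the 45% clock advantage is only the rumor mentioned above, and perfectly linear scaling with core count and clock speed is an assumption that POV-Ray may not actually satisfy.

```python
def per_core_per_clock(score, cores, relative_clock):
    """Score normalized by core count and relative clock (assumes linear scaling)."""
    return score / (cores * relative_clock)

# Scores as quoted in the thread; clocks are relative, with the K10 demo part as 1.0.
k10 = per_core_per_clock(score=4000, cores=16, relative_clock=1.0)
intel = per_core_per_clock(score=4933, cores=8, relative_clock=1.45)  # rumored +45% clock

ratio = k10 / intel
print(f"K10 per-core, per-clock throughput relative to Intel: {ratio:.2f}")
```

Under those assumptions a K10 core would come out at roughly 0.59 of an Intel core per clock in this one benchmark, which is why the question of whether POV-Ray really scales to 16 cores matters so much.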

Reply to I_Love_Tacos

Quote :


So the whole AMD K10 Quad Socket setup would not have had any issues being supported by POV-Ray. Of course Windows support is another thing entirely, and luckily for us most of these demos are run on Server versions of Windows.



They would have had to use Windows Server 2003 as the OS, as Windows XP and Vista will only work with two sockets and the POV-Ray 3.7 beta has not been released for Linux quite yet (grumble grumble). The only build that'll work on OS X or Linux is 3.6, and that does not support SMP no matter the OS.

Reply to MU_Engineer

Quote :

Man, I seriously hope that POV-Ray has scaling issues. If it doesn't, and that's the performance of 16 K10 cores compared to 8 Clovertown cores (even though they are clocked roughly 45% higher, if what I've heard is true), then we should be worried about K10's performance.



It won't be.... at least I don't think it will. I fully expect K10 to obliterate Core 2 per clock and give Penryn a run for its money.

Reply to ElMoIsEviL

The Intel link is trustworthy. Add one to the inaccurate Wiki articles total. That is also the first time I can think of that Ars Technica was flat out wrong. I take it the table is yours, based on how defensive you were about it. I meant nothing more than to question an inconsistency I saw. Thank you for clearing it up.

Ryan

Reply to rninneman

Just wondering, does XP Pro x64 support more than 2 sockets? Because it's based off of Server 2003 from what I hear, it's possible they may have used x64.

Reply to I_Love_Tacos

Must be Server 2003 because according to the M$ website, XP x64 still only supports 2 sockets.

Link

Quote :

Multiprocessing and multicore processor support
Windows XP Professional x64 Edition is designed to support up to two single or multicore x64 processors for maximum performance and scalability.



Ryan

Reply to rninneman