A question for the framework/VM guys...
I have ported one of my simpler benchmarks to the 360, and find that the numeric performance is running 5x slower than on my 2.8GHz P4 desktop. I was expecting a drop in performance on the 360, allowing for the difference in CPU implementation (in-order execution, branch prediction, etc.), but nothing like this magnitude!
The benchmark simply transforms a source array of Vector4s to a destination array of Vector4s, using a 4x4 matrix.
Have you guys done any benchmarking of pure numeric perf, and if so, is this in line with what you get?
Andy.

Numeric performance on 360
emad masri
Judging from my own performance work, raw math performance on 360 tends to be slower than on a Windows machine with a comparable clock speed, but not by a huge margin. The 360 framework is much more sensitive to coding style, though, so things that would maybe only cause a 10% performance drop on Windows can easily cause a huge slowdown on Xbox.
The two biggest things to be aware of when optimising Xbox math code are:
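- Passing structures by value is slow. For performance-critical code that operates on vectors or matrices, you should pass arguments by ref, and use out parameters instead of return values.
- Manually inlining small math operations in your inner loops, rather than calling helper methods, can help.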
Both of those techniques make your code more complex and harder to maintain, so I wouldn't recommend using them everywhere. Judiciously applied in the right parts of your inner loops, manual inlining and reference calling convention can give huge speedups.
I'll also second what Jack says about taking advantage of the 3 processor cores and especially the GPU. The trick to getting really amazing performance on Xbox is finding ways to offload as much computation as possible onto the GPU.
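As a rough sketch of the by-ref calling convention described in this post, here is the same vector-by-matrix transform written both ways. `Vec4`, `Mat4`, `TransformSketch` and the method names are hypothetical stand-ins so the example compiles without XNA; the real `Vector4` and `Matrix` types have their own members, and the row-vector convention used here is an assumption.

```csharp
using System;

// Hypothetical stand-ins for the XNA Vector4 / Matrix types.
public struct Vec4
{
    public float X, Y, Z, W;
    public Vec4(float x, float y, float z, float w) { X = x; Y = y; Z = z; W = w; }
}

public struct Mat4
{
    public float M11, M12, M13, M14;
    public float M21, M22, M23, M24;
    public float M31, M32, M33, M34;
    public float M41, M42, M43, M44;

    public static Mat4 Identity()
    {
        Mat4 m = new Mat4();            // all fields start at zero
        m.M11 = 1f; m.M22 = 1f; m.M33 = 1f; m.M44 = 1f;
        return m;
    }
}

public static class TransformSketch
{
    // By value: both structs are copied in and the result is copied out on every call.
    public static Vec4 TransformByValue(Vec4 v, Mat4 m)
    {
        return new Vec4(
            v.X * m.M11 + v.Y * m.M21 + v.Z * m.M31 + v.W * m.M41,
            v.X * m.M12 + v.Y * m.M22 + v.Z * m.M32 + v.W * m.M42,
            v.X * m.M13 + v.Y * m.M23 + v.Z * m.M33 + v.W * m.M43,
            v.X * m.M14 + v.Y * m.M24 + v.Z * m.M34 + v.W * m.M44);
    }

    // By ref/out: only references are passed; the result is written into the caller's storage.
    public static void TransformByRef(ref Vec4 v, ref Mat4 m, out Vec4 result)
    {
        // Compute into locals first so the call stays correct if v and result alias.
        float x = v.X * m.M11 + v.Y * m.M21 + v.Z * m.M31 + v.W * m.M41;
        float y = v.X * m.M12 + v.Y * m.M22 + v.Z * m.M32 + v.W * m.M42;
        float z = v.X * m.M13 + v.Y * m.M23 + v.Z * m.M33 + v.W * m.M43;
        float w = v.X * m.M14 + v.Y * m.M24 + v.Z * m.M34 + v.W * m.M44;
        result = new Vec4(x, y, z, w);
    }

    // Inner loop over whole arrays: array elements can be passed by ref/out directly.
    public static void TransformArray(Vec4[] source, ref Mat4 m, Vec4[] destination)
    {
        for (int i = 0; i < source.Length; i++)
            TransformByRef(ref source[i], ref m, out destination[i]);
    }
}
```

Whether the by-ref version is actually faster on a given runtime is something to measure rather than assume; the sketch only shows what the calling convention looks like.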
smeets116
In-order execution can be mitigated by a good instruction scheduler within the compiler, even more so by a developer who is willing to write assembly and work around instruction latency. Sadly I don't think either of the .NET VMs does any instruction scheduling...
fafnir
For those of you already running 1.0 on Creator's Club: what kind of information do you get from the Xbox performance analysis tool?
WhitebearJPN
OhDuck
A.M.
Hi again
@Adam M - this code is running on a separate thread behind a splash screen, but isn't itself split into multiple threads. I definitely want to do that, but not until I get to the bottom of why it's so slow at the moment. If I can get a single thread down to ~20secs then it would make sense to try and parallelise that.
I spent a pretty long time today messing with inner loops, and didn't really get anywhere. My one bit of advice to others with similar problems is PROFILE NOW! The big slowdowns weren't at all where I expected them.
@John W - I haven't got onto the perf tool you mention yet, I've been using DateTime.Now and TimeSpan to increment totalTime variables. I'll have a look at the remote perf monitor tomorrow.
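For anyone timing phases by hand in the same way, a minimal sketch of a per-phase accumulator follows. It uses `System.Diagnostics.Stopwatch` instead of `DateTime.Now`, on the assumption that `Stopwatch` is available on your target runtime; the same accumulate-per-phase pattern works with `DateTime` if it is not. `PhaseTimer` and its method names are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

// Hypothetical helper: accumulates elapsed time per named phase.
public class PhaseTimer
{
    private readonly Dictionary<string, long> totals = new Dictionary<string, long>();
    private readonly Stopwatch watch = new Stopwatch();
    private string current;

    // Start timing the given phase.
    public void Begin(string phase)
    {
        current = phase;
        watch.Reset();
        watch.Start();
    }

    // Stop timing and add the elapsed ticks to the running total for the current phase.
    public void End()
    {
        watch.Stop();
        long soFar;
        totals.TryGetValue(current, out soFar);   // soFar stays 0 if the phase is new
        totals[current] = soFar + watch.ElapsedTicks;
    }

    // Total time recorded for a phase, in milliseconds (0 if never timed).
    public double TotalMilliseconds(string phase)
    {
        long ticks;
        totals.TryGetValue(phase, out ticks);
        return ticks * 1000.0 / Stopwatch.Frequency;
    }
}
```

Usage would be along the lines of `timer.Begin("creating mesh"); ... timer.End();` around each routine, then printing `TotalMilliseconds` for each phase at the end of the run.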
The biggest chunk of time (1/3 of the total time) is spent in a routine which adds vertex/index chunks from a library of pieces (parsed from a 3dsmax file during Init(), using my own exporter/importer) to a single mesh vb/ib, adding correct offsets to the indices as necessary. I'm doing various horrible things like using my own array.join() method to combine, e.g., the current array of indices with a new chunk of indices, so there may be big wins there.
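One possible alternative to joining arrays on every chunk, sketched under the assumption that a rough upper bound on the final vertex and index counts is known up front: append into preallocated lists and offset each chunk's indices by the number of vertices already stored. `MeshBuilder` and its members are hypothetical names, not part of XNA or of the poster's code.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical builder: grows one combined mesh from many vertex/index chunks.
public class MeshBuilder<TVertex> where TVertex : struct
{
    private readonly List<TVertex> vertices;
    private readonly List<int> indices;

    // Capacities are estimates; the lists still grow if the estimate is too small.
    public MeshBuilder(int vertexCapacity, int indexCapacity)
    {
        vertices = new List<TVertex>(vertexCapacity);
        indices = new List<int>(indexCapacity);
    }

    // Append a chunk, rebasing its indices onto the combined vertex list.
    public void AddChunk(TVertex[] chunkVertices, int[] chunkIndices)
    {
        int baseVertex = vertices.Count;          // offset for this chunk's indices
        vertices.AddRange(chunkVertices);
        for (int i = 0; i < chunkIndices.Length; i++)
            indices.Add(chunkIndices[i] + baseVertex);
    }

    // Copy out to arrays once, at the end, ready for vertex/index buffer creation.
    public TVertex[] GetVertices() { return vertices.ToArray(); }
    public int[] GetIndices() { return indices.ToArray(); }
}
```

The point of the design is that each chunk costs one append rather than a fresh allocation and copy of everything added so far.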
Other methods are still between factors of 10 and 40 times slower though, so I'm a bit worried. A final idea is to decode a small bit of the world initially, then stick all this stuff in a separate thread which runs while you're playing, and decode the further reaches of the world before you actually get there. I initially did this for the PC, but after a healthy dose of refactoring and getting it down from 1 minute to 9 seconds, I hoped this wouldn't be necessary. Live and learn : )
Anyone, please ask if you'd like further info on what I'm trying to achieve. I'm not in a position to open source this stuff, but would love to share as much info as possible.
Cheers!
dalterio
Okay, some hard numbers from my game. Basically, I'm decoding some data, creating meshes from some previously-loaded building blocks (then welding verts etc), and creating shadow volume meshes from these. Also, this is all using arrays of my own vert structs and ints for indices - vertex and index buffers get built right at the end.
PC, creating 386 objects -
decoding : 1437ms (15.65%)
creating mesh : 6312ms (68.71%)
creating shadow mesh : 1437ms (15.65%)
Xbox, creating 386 objects -
decoding : 5275ms (2.5%)
creating mesh : 189978ms (89.71%)
creating shadow mesh : 16508ms (7.8%)
Yeesh...
I'm going to see what works and what doesn't in terms of speeding this up, and report back.
Cheers
laqula
http://dpad.gotfrag.com/portal/story/35372/ spage=1
Quote:
Both the 360 and PS3’s CPUs are heavily stripped down compared to what most of us are probably using on our desktop computers to view this article. Both consoles are labeled as 3.2GHZ, but they don’t offer performance comparable to that of a typical Athlon 64 3200+ or better than even an Athlon XP 2800+ CPU. The CPUs inside the Xbox 360 and PS3 are “In-Order Execution” CPUs with narrow execution cores, whereas what we use on our computers are classified as “Out-of-Order Execution” CPUs with wider execution cores.
GS80
Jim Mace
My apparently-incorrect "structs on the heap" info was from work I did with Compact Framework 1.0, back in the day. Glad to hear it's not a problem for CF 2.0.
lushdog
Fair enough.
Shame MSFT stopped development of the PowerPC version of Windows NT, perhaps then the desktop VM would have been different!
charles C
Hey dude
I've noticed similar things today - my number crunching stuff (procedurally creating a world from some small data) takes 8s on my year-old PC, and about 4 minutes on the 360.
Will be digging into it tomorrow to see if there are any easy wins, and will report back. My first port of call will be to try passing as much stuff as possible by ref/out to avoid copies.
Perhaps some console coders out there may have some useful hints about what not to do?
Cheers
jankowiak
The Xbox 360 CLR runtime is based upon the .NET Compact Framework 2.0 runtime, which is not currently optimized for either floating point performance or for processing structs. In particular, structs are heap allocated on the .NET Compact Framework, rather than being held on the stack. This makes structs much slower to create and pass around to subroutines than on the Windows PC.
If your heavily-used code creates and passes around lots of structs, you may find a performance improvement on the 360 by rewriting it to use individual floats & ints. (Yes, this is a super pain. I'm just mentioning it as something that might help.)
A second thing to do is try and take advantage of the 360's 3 CPUs and 6 hardware threads. Try rewriting your code to take advantage of multiple threads. I bet you could get a 2x speedup by doing this.
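A minimal sketch of the multithreading suggestion, assuming the work is an independent per-element loop (such as transforming an array) that can be cut into contiguous ranges. `ParallelSketch`, `RangeAction` and `ForRanges` are hypothetical names. Placing each thread on a particular hardware thread is platform-specific and is not shown here.

```csharp
using System;
using System.Threading;

public static class ParallelSketch
{
    // Hypothetical delegate: process elements in [start, end).
    public delegate void RangeAction(int start, int end);

    // Split [0, count) into contiguous ranges and run each range on its own thread.
    public static void ForRanges(int count, int threadCount, RangeAction body)
    {
        if (threadCount < 1) throw new ArgumentOutOfRangeException("threadCount");

        Thread[] workers = new Thread[threadCount];
        int chunk = (count + threadCount - 1) / threadCount;   // ceiling division

        for (int t = 0; t < threadCount; t++)
        {
            // Declared inside the loop so each thread captures its own copy.
            int start = Math.Min(count, t * chunk);
            int end = Math.Min(count, start + chunk);
            workers[t] = new Thread(delegate() { body(start, end); });
            workers[t].Start();
        }

        // Wait for every range to finish before returning.
        for (int t = 0; t < threadCount; t++)
            workers[t].Join();
    }
}
```

Each range writes only to its own slice of the output, so no locking is needed in the loop body. Creating threads per call is kept for brevity; long-lived worker threads would avoid the start-up cost.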
Another approach would be to try to shift computation to the GPU by using pixel shaders and render-to-texture. For some kinds of algorithms the pixel shader can be much faster than the .NET Compact Framework 2.0 runtime.
Airan