August 27th, 2009
When the same image is generated on both the CPU and the GPU, it is easy to see that the GPU uses a simplified IEEE-754 single-precision (float) implementation (the upper image is CUDA).
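To get a feel for how sensitive this kind of image is to floating-point precision, here is a minimal sketch that runs the same escape-time iteration in float and in double on the CPU and counts the pixels where the two disagree. It is not the code from JuliaTracer: the plain z -> z^2 + c iteration, the function names `escape_iterations` and `count_precision_mismatches`, and the sampled region are all assumptions made for illustration. It also only compares two CPU precisions; it does not reproduce the GPU's own rounding behaviour.

```cpp
#include <cassert>
#include <cstddef>

// Escape-time iteration z -> z^2 + c, parameterised on the floating-point type.
// Returns the number of iterations performed before |z|^2 exceeds 4,
// or max_iter if the point never escapes within the budget.
template <typename T>
int escape_iterations(T zr, T zi, T cr, T ci, int max_iter) {
    for (int i = 0; i < max_iter; ++i) {
        T r2 = zr * zr;
        T i2 = zi * zi;
        if (r2 + i2 > T(4)) return i;
        T new_zr = r2 - i2 + cr;
        zi = T(2) * zr * zi + ci;
        zr = new_zr;
    }
    return max_iter;
}

// Samples a size x size grid over [-1.5, 1.5]^2 (an assumed region) and
// counts the points where float and double give different iteration counts.
int count_precision_mismatches(int size, double cr, double ci, int max_iter) {
    assert(size > 1);
    int mismatches = 0;
    for (int y = 0; y < size; ++y) {
        for (int x = 0; x < size; ++x) {
            double zr = -1.5 + 3.0 * x / (size - 1);
            double zi = -1.5 + 3.0 * y / (size - 1);
            int d = escape_iterations<double>(zr, zi, cr, ci, max_iter);
            int f = escape_iterations<float>(float(zr), float(zi),
                                             float(cr), float(ci), max_iter);
            if (d != f) ++mismatches;
        }
    }
    return mismatches;
}
```

Calling something like `count_precision_mismatches(512, -0.8, 0.156, 200)` gives the number of differing pixels for one arbitrarily chosen constant. Points near the boundary of the set are the ones most likely to differ, because small rounding errors get amplified by every iteration.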

I tested the program on an 8-core 3 GHz computer with a "rather slow" graphics card, a Quadro NVS 290 (it had to run in a PCIe x1 slot, and such cards are expensive and not game-oriented).
With only one core at work, the GPU clearly wins and my asm comes second (yay). The GCC-generated 32-bit and 64-bit versions, both using SSE, do not seem to differ much; the 64-bit one was a tiny bit faster. x87 loses.

When run with one thread per core, the CPUs were much faster than my GPU, and my asm was mostly faster than the plain GCC SSE version, yet 8 CPUs + GPU outperformed them all. Pity CUDA didn't support cross-compilation; I'd have tried my asm + CUDA.
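The "one thread per core" setup can be sketched roughly as follows. This is not how JuliaTracer actually distributes its work (the post does not say); it is a minimal illustration assuming interleaved rows, a hypothetical per-pixel function `shade_pixel`, and a hypothetical driver `render_parallel`. Each worker writes only to its own rows, so no locking is needed.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical per-pixel kernel: escape-time count for z -> z^2 + c
// with an arbitrarily chosen constant c. Stands in for the real tracer.
int shade_pixel(int x, int y, int width, int height) {
    double zr = -1.5 + 3.0 * x / width;
    double zi = -1.5 + 3.0 * y / height;
    const double cr = -0.8, ci = 0.156;
    int i = 0;
    for (; i < 100 && zr * zr + zi * zi <= 4.0; ++i) {
        double t = zr * zr - zi * zi + cr;
        zi = 2.0 * zr * zi + ci;
        zr = t;
    }
    return i;
}

// Renders the image with `workers` threads; 0 means "one per hardware thread".
// Worker w handles rows w, w + workers, w + 2 * workers, ...
std::vector<int> render_parallel(int width, int height, unsigned workers = 0) {
    assert(width > 0 && height > 0);
    if (workers == 0)
        workers = std::max(1u, std::thread::hardware_concurrency());
    std::vector<int> image(static_cast<std::size_t>(width) * height);
    std::vector<std::thread> pool;
    for (unsigned w = 0; w < workers; ++w) {
        pool.emplace_back([&image, width, height, workers, w] {
            for (int y = static_cast<int>(w); y < height;
                 y += static_cast<int>(workers))
                for (int x = 0; x < width; ++x)
                    image[static_cast<std::size_t>(y) * width + x] =
                        shade_pixel(x, y, width, height);
        });
    }
    for (auto& t : pool) t.join();  // wait for every worker to finish
    return image;
}
```

Interleaving rows rather than handing each thread one contiguous band is a simple way to even out the load, since expensive regions of the image tend to be clustered. On some toolchains this needs to be built with `-pthread`.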

Rendering 200 frames of a movie. Not sure why CPU+GPU takes such a long time; I've probably messed something up.

Profiler output. The first six functions (one bar = one function) were implemented in asm. The vertical axis shows the percentage of total program time.

Presentation (In Polish! But has some math and other pictures; made with LaTeX-Beamer): presentation.pdf
Project summary (In Polish!): sprawozdanie.pdf
Pack of code (without docs etc., GPL3+): JuliaTracer.tar.bz2 and signature.
Disclaimer: this was my first CUDA program ever. It was fun to write, but I'm sure any CUDA magician could have written this part much better, so don't take these benchmarks too seriously.