David Oguns' Computer Graphics II Webpage
Project Report

Here is the MS Word version of my project report.

Project Midquarter Update

David Oguns
Computer Graphics II (4003-571)
Warren R. Carithers
April 17, 2008

Optimizing Ray Tracing on the Cell Microprocessor

The objectives of this project are to learn how to write programs for an asymmetric multicore processor; to learn how to write SIMD-based code that increases the performance of graphics applications; and to explore how much performance is available in the Cell processor's architecture.

The division of labor for this project has not changed, since I am working on it alone. To port the project to the Cell microprocessor, I had to use the netpbm library to produce the ray tracer's output, because Linux on the PlayStation 3 lacks 3D acceleration features. I have yet to fully explore what these limitations are. If they can be resolved, I could potentially show a real-time demo using GLUT, assuming performance allows for it.

One of the two “meaty” tasks in this project is SIMDizing the ray tracing code and spreading the work across the SPEs and the PPE, using the AltiVec intrinsics on the PPE and the SPU intrinsics on the SPEs, both made available through IBM's Cell SDK 3.0. The other is establishing communication between the SPEs and the PPE to divide the work efficiently and load the scene data into each SPE's local store. This task seems reasonable enough, but I expect it to be the most difficult part of the project, given the size and alignment restrictions on DMA transfers in the Cell Broadband Engine Architecture and the performance implications of the different transfer methods.
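To illustrate the netpbm output path described above, here is a minimal sketch of writing a binary PPM ("P6") image. It uses plain stdio instead of the netpbm library calls, so it is a stand-in and not the project's actual output code. The function name `write_ppm` and its parameters are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper (not part of netpbm): writes a binary PPM ("P6")
 * image from a tightly packed RGB buffer, 3 bytes per pixel, with rows
 * stored top to bottom. Returns 0 on success and -1 on any I/O error. */
int write_ppm(const char *path, const unsigned char *rgb,
              unsigned width, unsigned height)
{
    assert(path != NULL && rgb != NULL);

    FILE *fp = fopen(path, "wb");
    if (fp == NULL)
        return -1;

    /* Header: magic number, dimensions, maximum channel value. */
    if (fprintf(fp, "P6\n%u %u\n255\n", width, height) < 0) {
        fclose(fp);
        return -1;
    }

    size_t count = (size_t)width * height * 3;
    size_t written = fwrite(rgb, 1, count, fp);

    /* fclose can also report a failed flush, so check it as well. */
    if (fclose(fp) != 0 || written != count)
        return -1;
    return 0;
}
```

Because the output is just a file, the same code path works on the PPE without any display or 3D acceleration support.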

So far, I have added netpbm output to the ray tracer alongside the existing GLUT output. This lets me recompile the ray tracer (written entirely in C) with little effort and run it on the Cell's PPE. I have done this and checked the output. In that sense, I have accomplished what I expected to complete by week 5, except that no code is running on the SPEs yet. I have also started to SIMDize some of the vector functions using the C language intrinsics, which let me use the SIMD operations of the PPE and SPEs without writing assembly language.

Now that I have made some progress and have a better feel for the tools involved, I have a finer-grained idea of what needs to be done. I expect to have all of the vector functions SIMDized and ready to run on either the SPEs or the AltiVec hardware on the PPE by the end of week 7. I will also have at least a first SPE program executing on an SPE. By the end of week 8, or halfway through week 9, I hope to have the scene loaded properly into each SPE's local store, with each SPE ready to run the ray tracing process and write the resulting pixel values directly to the frame buffer in main memory.

In the last week or week and a half before the presentation, I hope to progressively increase performance and debug any issues that come up along the way. If there is time, I could set up a real-time demo for class with the light and/or one of the spheres moving. Such a demo may not be feasible, though, as the Linux environment on the PS3 runs slowly and has limited memory with Xorg running.
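As a rough illustration of what SIMDizing a vector function can look like, the sketch below computes four 3D dot products at once. It is written with the GCC/Clang `vector_size` extension as a portable stand-in for the Cell SDK's vector types and intrinsics, so it is not the project's actual code. The structure-of-arrays layout and the names `v4f`, `vec3x4`, `pack4`, and `dot3x4` are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Four packed floats. vector_size is a GCC/Clang extension, used here as
 * a portable stand-in for the Cell SDK's vector float type. */
typedef float v4f __attribute__((vector_size(16)));

/* Hypothetical structure-of-arrays layout: lane i of x, y and z together
 * holds the i-th 3D vector. */
typedef struct {
    v4f x, y, z;
} vec3x4;

/* Packs four xyz triples into the structure-of-arrays layout. */
vec3x4 pack4(float v[4][3])
{
    assert(v != NULL);
    vec3x4 out;
    for (int i = 0; i < 4; ++i) {
        out.x[i] = v[i][0];
        out.y[i] = v[i][1];
        out.z[i] = v[i][2];
    }
    return out;
}

/* Computes four dot products at once: each arithmetic operator below acts
 * on all four lanes, so lane i of the result is dot(a_i, b_i). */
v4f dot3x4(vec3x4 a, vec3x4 b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z;
}
```

With this layout, one call replaces four scalar dot products and needs no shuffling across lanes, which is one reason a structure-of-arrays arrangement is often preferred for ray tracing math on SIMD hardware.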

So far I am relatively pleased with the progress I have made. I am behind where I hoped to be at this point, but I have gotten a lot more familiar with the tools and have discovered approaches that can significantly reduce the amount of work I have to do. As I move forward, progress should speed up.

Project Proposal

David Oguns
Computer Graphics II (4003-571)
Warren Carithers
March 23, 2008

Optimizing Ray Tracing with the Cell Microprocessor

The purpose of this project is to implement a ray tracer that runs on the Cell BE architecture. The Cell BE architecture was developed by the STI group (Sony, Toshiba, IBM) to offer efficient high-performance computing at low cost. It is an asymmetric multicore processor with one general-purpose core based on IBM's 64-bit Power Architecture, called the Power Processor Unit (PPU), and an array of SIMD (vector) cores called Synergistic Processor Units (SPUs). This makes the Cell a great fit for applications that have high levels of thread-level and data-level parallelism.

I will be rendering the same 3D scene as the one used for the primary ray tracing assignment in class. For this project, I will essentially have to implement the same functionality as that assignment, so I will be touching on the same aspects of the rendering pipeline. This project will not emphasize any one part of the rendering pipeline; rather, it will focus on parallelizing the ray tracing algorithm to run across multiple cores and on accelerating some of the math code to run optimally on SIMD hardware.

The primary objective of this project is to explore the computing power of the Cell microprocessor in 3D rendering applications. The first step toward this is distributing work across multiple cores to accomplish a single task. This can be done in multiple ways given the internal architecture of the Cell. I will attempt to do it simply by assigning each SPU the task of calculating the color of a pixel or group of pixels. The PPU will divide the work across the SPUs and wait for them to complete. There will be no SPU-to-SPU communication, although there is the potential to explore methods of ray tracing that use it for greater performance.
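One simple way to assign groups of pixels is to give each SPU a contiguous band of scanlines. The sketch below shows that partitioning in portable C. It is only an illustration under that assumption, and the names `row_range` and `rows_for_worker` are hypothetical.

```c
#include <assert.h>

/* Hypothetical description of one worker's share of the image. */
typedef struct {
    unsigned first_row;  /* first scanline assigned to this worker */
    unsigned row_count;  /* number of consecutive scanlines */
} row_range;

/* Splits `height` scanlines into `num_workers` contiguous bands whose
 * sizes differ by at most one row. The first (height % num_workers)
 * workers each receive one extra row. */
row_range rows_for_worker(unsigned height, unsigned num_workers,
                          unsigned worker)
{
    assert(num_workers > 0 && worker < num_workers);

    unsigned base = height / num_workers;
    unsigned extra = height % num_workers;

    row_range r;
    r.row_count = base + (worker < extra ? 1u : 0u);
    /* Workers before this one contribute `base` rows each, plus one extra
     * row for each of them that has an index below `extra`. */
    r.first_row = worker * base + (worker < extra ? worker : extra);
    return r;
}
```

For example, a 480-row image split across the six SPEs that Linux on the PlayStation 3 exposes gives each SPE 80 rows. Contiguous bands are easy to reason about, but some rows cost more to trace than others, so interleaving rows or handing out smaller tiles on demand may balance the load better.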

This project requires computing hardware with the Cell microprocessor in order to produce accurate performance results. I will be using a PlayStation 3 as the production machine to run the final product. Developing applications for the Cell requires specific tools and libraries that allow applications written in C or C++ to be compiled to run on the PPU and SPUs. These toolchains provide special intrinsics that allow vector operations to be executed on the AltiVec hardware in the PPU or on the SPUs without writing assembly code. These tools are part of IBM's Cell BE SDK, which is currently at version 3.0. Most of the development will be done in Fedora Core 7 using this SDK and tested using both the PlayStation 3 and IBM's System Simulator environment, since the simulator has enhanced debugging features.

The components of this project are very simple: there will be two main executables. One is the driving PPU program, which will handle setup, initialization, and production of the ray tracer's final output. The other is the SPU executable, which will be loaded dynamically by the PPU executable and will run the actual ray tracing algorithm that calculates the pixel color values. When an SPU finishes calculating the final value of a pixel, it will write the output directly to main memory through DMA transfers.
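DMA transfers between an SPU's local store and main memory are subject to size and alignment rules: a single transfer moves 1, 2, 4, or 8 bytes, or a multiple of 16 bytes up to 16 KB. The sketch below encodes those rules as a portable check that could be used while planning how pixel data is written back. It does not issue any DMA itself, and the names `MFC_MAX_DMA_BYTES`, `dma_transfer_ok`, and `dma_chunks_needed` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical name for the 16 KB limit on a single DMA transfer. */
#define MFC_MAX_DMA_BYTES 16384u

/* Returns true if one transfer of `size` bytes between local store address
 * `ls_addr` and effective (main memory) address `ea` satisfies the size
 * and alignment rules. Zero-length transfers are rejected in this sketch. */
bool dma_transfer_ok(uint32_t ls_addr, uint64_t ea, uint32_t size)
{
    if (size == 1 || size == 2 || size == 4 || size == 8) {
        /* Small transfers: naturally aligned, and the low four bits of
         * the two addresses must match. */
        return (ea % size) == 0 && (ls_addr & 15u) == (ea & 15u);
    }
    if (size == 0 || size > MFC_MAX_DMA_BYTES || size % 16u != 0)
        return false;
    /* Larger transfers: both addresses must be 16-byte aligned. */
    return (ls_addr & 15u) == 0 && (ea & 15u) == 0;
}

/* Number of transfers needed to move `bytes` bytes (a multiple of 16)
 * when each transfer is limited to MFC_MAX_DMA_BYTES. */
uint32_t dma_chunks_needed(uint32_t bytes)
{
    assert(bytes % 16u == 0);
    return (bytes + MFC_MAX_DMA_BYTES - 1u) / MFC_MAX_DMA_BYTES;
}
```

These rules suggest buffering a whole scanline or tile in local store and writing it back in 16-byte multiples, since writing each pixel separately would mean many small transfers with stricter alignment requirements.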

Since this project is supposed to produce the same output as the ray tracing assignment given the same input, the milestones could be aligned with that assignment. However, due to the additional complexities of DMA transfers and of parallelizing the code at the instruction level, I expect the Cell version of the ray tracer to fall significantly behind the in-class assignment. I expect to have checkpoint 2's functionality running on the Cell by week 5. After that, each checkpoint should be progressively easier to implement, unless the ray tracing code and its data exceed the capacity of the SPU local store. By the end of week 9, I expect to have the level of functionality required by checkpoint 5 or 6 running on the Cell processor.
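Each SPU has 256 KB of local store, which must hold the program code, the stack, and all data at once. The sketch below estimates how many scene objects fit after the other consumers are accounted for. The 32-byte `sphere` record and the function name are assumptions made for this example and are not taken from the project.

```c
#include <assert.h>
#include <stddef.h>

/* Size of one SPU's local store: 256 KB. */
#define LOCAL_STORE_BYTES (256u * 1024u)

/* Hypothetical sphere record for illustration only (32 bytes when float
 * is 4 bytes wide). */
typedef struct {
    float center[3];
    float radius;
    float color[3];
    float reflectivity;
} sphere;

/* Returns how many sphere records fit in local store after reserving room
 * for the SPU program's code, its stack, and its pixel output buffers. */
size_t max_spheres_in_local_store(size_t code_bytes, size_t stack_bytes,
                                  size_t buffer_bytes)
{
    /* A record size that is a multiple of 16 keeps the scene array
     * friendly to 16-byte-granular DMA transfers. */
    assert(sizeof(sphere) % 16 == 0);

    size_t reserved = code_bytes + stack_bytes + buffer_bytes;
    if (reserved >= LOCAL_STORE_BYTES)
        return 0;
    return (LOCAL_STORE_BYTES - reserved) / sizeof(sphere);
}
```

With 64 KB of code, 16 KB of stack, and 32 KB of output buffers, for instance, 144 KB remains for scene data. A scene of a few spheres fits easily, but later checkpoints that add code and data reduce that margin.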

This project is not well suited to interesting visuals during the presentation, so I plan on explaining the basic algorithm I implemented to run the ray tracer on the Cell. I will cover why the Cell lends itself well to the task of ray tracing and what degree of optimization I believe I obtained with my implementation. I will also talk about some of the development challenges I had porting it and show some performance comparisons between the two versions of the ray tracer I implemented.