An enhanced OpenGL renderer for Unreal Tournament News Archive


1-15-2006
Version 3.2 is released. It has some new SSE2 code in a few places. Minor improvements were made to some of the existing assembly code. A few rdtsc instructions used for profiling that negatively impacted performance a bit too much in some cases were removed. A few other mostly minor changes were made.

While looking through older renderer code (D3D/Glide stuff), I noticed that the DrawTile path was clamping colors. I didn't do this in the BufferTileQuads path up until now. Although I never saw it causing any problems, I added clamping code in version 3.2 just in case. This is a little slower of course (even though still faster than previous similar code), but pulling the rdtsc instructions in the same area should help balance things out. I also added some SSE2 code to the DrawTile path for buffering color data. This code can do the clamping at no added cost and is a bit faster than the previous code, both clamping and no clamping versions.
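
As a rough illustration of why an SSE2 path can clamp at no added cost, here is a minimal sketch. It is not the renderer's actual code. It assumes color components arrive as floats nominally in the 0 to 1 range and get packed to bytes, and the name PackColorClamped is made up for this example. The SSE2 pack instructions saturate, so out of range values end up clamped as a side effect of the conversion. A plain scalar fallback is included for builds without SSE2.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define SKETCH_USE_SSE2 1
#endif

// Hypothetical helper: converts four float color components (nominally
// 0.0 to 1.0) to four bytes, clamping out-of-range values to 0..255.
inline void PackColorClamped(const float in[4], std::uint8_t out[4]) {
    assert(in != nullptr && out != nullptr);
#ifdef SKETCH_USE_SSE2
    // Unaligned load, then scale to the 0..255 range.
    __m128 scaled = _mm_mul_ps(_mm_loadu_ps(in), _mm_set1_ps(255.0f));
    __m128i i32 = _mm_cvttps_epi32(scaled);    // truncate to 32-bit ints
    __m128i i16 = _mm_packs_epi32(i32, i32);   // signed saturate to 16 bits
    __m128i u8 = _mm_packus_epi16(i16, i16);   // unsigned saturate to 8 bits
    std::uint32_t packed = static_cast<std::uint32_t>(_mm_cvtsi128_si32(u8));
    std::memcpy(out, &packed, sizeof(packed)); // x86 is little-endian
#else
    // Scalar fallback: the clamp costs explicit compares here.
    for (int i = 0; i < 4; ++i) {
        float v = in[i] * 255.0f;
        if (v < 0.0f) v = 0.0f;
        if (v > 255.0f) v = 255.0f;
        out[i] = static_cast<std::uint8_t>(v);
    }
#endif
}
```

In the SSE2 branch there is no separate compare or min/max step. The two pack instructions are needed for the float to byte conversion anyway, which is the sense in which the clamp is free.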

The source code now includes VC6 project files with various updates. Unfortunately, this release won't build correctly with the VC8 compiler version 14.00.50727.42 due to a code generation bug with optimizations enabled. The register allocator problem in previous compiler releases, where it does far too many extra moves with SSE/SSE2 intrinsics, appears to have been at least partially fixed in VC8, well sort of. Now it generates incorrect code when telling it to do unaligned loads. I see no good way to work around this since the problem still occurs with just a simple single load followed by single store code snippet. So, if you try to build the renderer with any VC8 compiler with this problem, either disable all of the SSE/SSE2 code with the ifdef, or make the necessary modifications to only remove the SSE2 code in the DrawTile path.

12-4-2005
The last thing I was working on was replacing a couple major functions in render.dll. I never got it to be completely stable and it didn't handle a few special effects correctly, but I was able to run a few tests with it on frames that it did draw identically. After a while, I decided to not bother with trying to finish it since I don't play anymore, but I was able to test a few things of interest. Most of the details don't really matter, but without spending too much time optimizing parts of it (after spending a lot of time trying to make it work the same as the original...), I was able to increase the frame rate by up to 5% in mesh heavy frames.

Fixing the TruForm problem with incorrectly applying it to non-character meshes only took adding a new flag bit and a few simple checks. The other problem with corrupt triangles that spanned the edge of the screen when TruForm was enabled was automatically fixed as part of the optimization to not spend time clipping these in software. Of course now that ATI has pulled TruForm support from their drivers, fixing these glitches isn't so important anymore.

I was supposed to be done with renderer updates, but I might put together one more test build. Although it may have some other (hopefully minor) side effects, I might know how to avoid one of the major remaining game speed problems. This is the one that causes problems on systems that dynamically vary the speed of the CPU's timestamp counter. The major classes of systems affected by this include ones with Pentium M, certain newer P4, and certain newer K8 CPUs.

This isn't really anything that's fixable nicely in the renderer, but by tweaking the right internal flag, this problem might be avoidable, and it's easiest for me to build a setting to do this into a renderer. Unfortunately it will end up with only an up to 1 ms resolution timer instead. So at least it will be stable, but I'm not sure how smoothly it'll work. I observed significant interactions with the frame rate limiter in the renderer in some tests I ran, but the game still seemed to run okay. The better fix is to add an option to use QPC if present, but this code isn't in the renderer, so I can't fix it there (though it would likely be easy to patch the right part of some other binary for this one).
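
To make the tradeoff concrete, here is a small arithmetic sketch. It assumes, as an illustration only, that elapsed time is computed by dividing a tick count by a rate measured once at startup. The function names are made up for this example and are not from the engine.

```cpp
#include <cassert>
#include <cstdint>

// Converts a tick count to seconds using a rate measured once at startup.
// If the counter later runs at a different rate, the result is wrong by
// the same factor.
inline double TicksToSeconds(std::uint64_t ticks, double calibratedTicksPerSecond) {
    assert(calibratedTicksPerSecond > 0.0);
    return static_cast<double>(ticks) / calibratedTicksPerSecond;
}

// What a timer with 1 ms resolution would report for an interval: the
// fractional millisecond is lost.
inline std::uint32_t ToWholeMilliseconds(double seconds) {
    assert(seconds >= 0.0);
    return static_cast<std::uint32_t>(seconds * 1000.0);
}
```

With a rate calibrated at 2 GHz, a counter that has dropped to 1 GHz produces 1,000,000,000 ticks in one real second, and TicksToSeconds reports half a second, so the game would run at the wrong speed. The 1 ms timer stays stable but is coarse: a frame at 85 Hz lasts about 11.76 ms and ToWholeMilliseconds reports it as 11, which is one plausible source of interactions with a frame rate limiter.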

5-15-2005
I ran a few tests on the D3D8 renderer built for D3D9, which only required minor modifications. With D3D9, control of V Sync in a window is available and it has access to a more rational z-bias implementation. It also tends to run slightly slower.

5-7-2005
I built a new version of the D3D8 renderer. This one is a bit faster with interleaved vertex/color data, larger vertex buffers, and the BufferTileQuads code added. BufferTileQuads is enabled by default in this renderer since not having it hurts D3D a lot more than OpenGL, and I don't have to be concerned about any backwards compatibility issues. I also added a few more features and some minor optimizations. The file is utd3d8r10.zip.

I didn't add paletted texture support to this renderer, so if you have a GeForce1-4 series video card, you should make sure to use the OpenGL renderer and enable the settings that tell it to use paletted textures (these are disabled by default). Also, on other video cards with good enough OpenGL driver support, the OpenGL renderer may be better.

Performance differences between this renderer and the OpenGL renderer are fairly small on my system, though it does tend to be a little slower. It may be possible to improve this in some cases by interleaving the texture arrays, but this is a lot of extra work, so I may not try it. It doesn't help that D3D seems to have poor small batch performance in general due to intrinsic design/implementation characteristics. There's no avoiding this after a certain point since UT has fairly low geometric complexity.

So, D3D is far simpler compared to OpenGL in the feature set it supports on the API side and yet ends up with far worse small batch performance. In various places in the renderer, it's possible to get moderate performance with a minor amount of work using OpenGL, but with D3D, it requires extra work just to make it work at all and end up with poor performance. With either API, it's possible to get higher performance by adding more advanced buffering schemes such as actor triangle buffering, clipped actor triangle buffering, BufferTileQuads, etc. This D3D renderer will be far slower than the OpenGL renderer for line drawing since it lacks advanced buffering in this area. This shouldn't be a problem with the editor because I don't support it with this renderer anyway since selection support is not implemented. Hopefully line drawing isn't used too heavily, or at all, outside the editor.
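
The general idea behind a buffering scheme like BufferTileQuads can be sketched as follows. This is a simplified illustration under my own assumptions, not the renderer's code: TileQuad, TileQuadBuffer, and the integer texture id are all hypothetical, and the flush only counts batches where a real renderer would issue one draw call.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical screen-space quad.
struct TileQuad { float x, y, w, h; };

class TileQuadBuffer {
public:
    explicit TileQuadBuffer(std::size_t capacity) : capacity_(capacity) {
        assert(capacity > 0);
        quads_.reserve(capacity);
    }

    // Queues a quad. Flushes first if the texture changes or the buffer is full.
    void Add(const TileQuad& quad, int textureId) {
        if (!quads_.empty() &&
            (textureId != currentTexture_ || quads_.size() == capacity_)) {
            Flush();
        }
        currentTexture_ = textureId;
        quads_.push_back(quad);
    }

    // Submits all queued quads as one batch. A real renderer would issue a
    // single draw call here; this sketch only counts batches.
    void Flush() {
        if (quads_.empty()) return;
        ++drawCalls_;
        quadsDrawn_ += quads_.size();
        quads_.clear();
    }

    std::size_t DrawCalls() const { return drawCalls_; }
    std::size_t QuadsDrawn() const { return quadsDrawn_; }

private:
    std::size_t capacity_;
    std::vector<TileQuad> quads_;
    int currentTexture_ = -1;
    std::size_t drawCalls_ = 0;
    std::size_t quadsDrawn_ = 0;
};
```

Ten quads with one texture followed by five with another end up as two batches instead of fifteen separate submissions. The higher the per-batch overhead of the API, the more this kind of merging matters.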

z-buffer issues
Like the OpenGL renderer, this D3D renderer may have problems with far away decals flickering due to z-buffer precision issues if only a 24-bit z-buffer is available. It doesn't support w-buffering either, though it looks like a lot of newer video cards don't support this feature anyway. It's probably possible to work around this problem in the renderer, though it may not be anything I'll add. Of course if all these new GPUs/VPUs didn't drop support for 32-bit z-buffers, this wouldn't be a problem.
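
For a feel of why far away decals are the first thing to flicker, the following sketch estimates how much distance one step of a fixed-point z-buffer covers at a given depth with a standard perspective projection. The function name is made up, and the near and far plane values used in the note after it are illustrative numbers, not the ones UT uses.

```cpp
#include <cassert>
#include <cmath>

// Approximate eye-space distance covered by one step of a fixed-point depth
// buffer at distance z, for a standard perspective projection.
inline double DepthStepAt(double z, double zNear, double zFar, int bits) {
    assert(zNear > 0.0 && zFar > zNear && z >= zNear && z <= zFar);
    assert(bits > 0 && bits <= 32);
    const double steps = std::ldexp(1.0, bits) - 1.0;  // 2^bits - 1
    // Window depth is d(z) = zFar / (zFar - zNear) * (1 - zNear / z), so
    // dd/dz = zFar * zNear / ((zFar - zNear) * z^2). One step is 1 / steps.
    return (zFar - zNear) * z * z / (zFar * zNear * steps);
}
```

With a hypothetical near plane of 1 and far plane of 32768, a 24-bit buffer resolves roughly 6 units at a distance of 10000, which is far coarser than the offset between a decal and the wall behind it. The step grows with the square of the distance, and each extra bit of depth halves it.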

4-26-2005
I finally decided to learn Direct3D in case knowing it would be good for a future job. Porting the renderer only added a few days, with a lot of that time spent dealing with things D3D makes difficult, so I tried building one that uses D3D. D3D has gotten better in recent versions, but some areas are still problematic. I'm sure glad I never used D3D7 or earlier.

This renderer will most likely be slower than the OpenGL one on ATI, NVIDIA, or other graphics cards that at least have reasonably good OpenGL drivers. I also left out a few likely significant optimizations in the current build that may limit its performance. I guess I'll find out later if fixing these can bring it up to the speed of the OpenGL renderer on my system. It uses D3D8 and since it uses certain advanced features, it will not function on various older video cards. Also, due to certain SDK complications, I think it ends up requiring at least DirectX 8.1, which I believe means it will not support Win95.

I added single pass fog mode to this one, since it happened to be easy with D3D. The required blend mode on the OpenGL side requires one extension for NVIDIA, another extension for ATI, and probably just isn't there for various other video cards since providing a standard way to access it on the fixed function side seems to have been forgotten about. It's too bad some of the other vendors didn't at least add support for the ATI version of the extension since it doesn't really add much and their hardware probably supports it all. I'll check the standard extensions again sometime, but I don't think the functionality required for single pass fog in UT is there.

I'm checking a large number of caps bits/values in this build, but a few checks are still missing. I'll probably fix a few of these later, but may leave a few of the more complicated ones out.

Windowed mode, windowed mode resizing, and surviving through various mode switches should work, but some things in this area get awfully difficult to support and test when using D3D. Windowed mode screen shots hopefully work okay, including without crashing in various special cases when the window isn't fully within the screen. D3D still makes something basic like grabbing a copy of what got rendered far too difficult in cases like this.

This initial build of this renderer supports a large number of features, but some are missing at this time.
- Selection support for UnrealEd isn't there. I may never add it, so don't use it with the editor (other functionality should work, but it's not really usable there without this feature).
- S3TC support is there.
- 16 bit texture support is there, but I did the conversions using simple clipping instead of proper rounding.
- Not checking texture aspect ratio restrictions yet, so if there are any specific requirements here, it may just crash when trying to load certain textures (good chance this may not be an issue on any new enough cards to run this renderer though).
- V Sync on or off request only works full screen. D3D8 doesn't allow something basic like V Sync on or off to be requested when in windowed mode. I believe this got fixed in D3D9.
- All the texture filtering modes and LOD bias should work.
- No paletted texture support, and I'm not sure I'll ever add it to this one.
- Lots of other features are supported, but a few others are not.

11-29-2004
Version 2.8 is released. It contains a couple bug fixes, basic support for 16-bit textures, and various other changes.

The rare SinglePassDetail with OneXBlending disabled bug is fixed. The fix may also optimize away a few low cost state changes.

The bug with a few incorrect gradients showing up in the console that can occur when precaching is enabled is fixed. It actually resulted in a number of textures getting unnecessary higher quality filtering, so this fix could speed things up in some cases, though without higher quality filtering modes enabled, it may make little to no difference on a number of video cards. This one was broken due to previous optimizations, though in a number of cases, the CPU savings may still have been more beneficial than any potential loss due to unnecessary high quality texture filtering. It was also somewhat difficult to fix, which is one reason why it remained broken for so long.

The new option for 16-bit textures is Use16BitTextures. The Use4444Textures option is gone. If mostly video card limited rather than CPU limited, using this new option should speed things up at the expense of reduced texture quality, which varies from case to case. In many cases, there is only minor quality loss. In other cases, like with various skyboxes and coronas, there is often major quality loss.

This basic 16-bit texture support was kept simple by just sending BGRA8 textures to the OpenGL driver and telling it to use RGB5, or RGB5_A1 if masked. It could be made faster if the renderer converted the textures to 16-bit before sending them to the driver, but I didn't want to deal with added complexity in this area right now for various reasons. So, the performance of some aspects of this new feature relies on good format conversion code in the driver, and in some cases it's not there. Enabling this feature will also reduce brightness a little bit, though it's fairly minor (much more noticeable with the old 4444 textures option). From reading the OpenGL specification, it sounds like the color components are supposed to be rounded to nearest during the conversion, but with the NVIDIA, ATI, and Intel drivers I tested, they were truncated, which causes the slight brightness reduction on average.
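
The brightness effect of truncation versus rounding can be checked with a few lines of integer arithmetic. This sketch assumes that truncation means dropping the fraction of the component after scaling it to the smaller range. I don't know exactly how each driver does the conversion, so treat that as an assumption. The function names are made up.

```cpp
#include <cassert>
#include <cstdint>

// Converts an 8-bit component to `bits` bits by truncating the scaled value.
inline std::uint32_t QuantizeTruncate(std::uint32_t v8, int bits) {
    assert(v8 <= 255 && bits >= 1 && bits <= 8);
    const std::uint32_t maxOut = (1u << bits) - 1u;
    return (v8 * maxOut) / 255u;
}

// Same conversion, rounded to nearest: floor(v8 * maxOut / 255 + 0.5).
inline std::uint32_t QuantizeRound(std::uint32_t v8, int bits) {
    assert(v8 <= 255 && bits >= 1 && bits <= 8);
    const std::uint32_t maxOut = (1u << bits) - 1u;
    return (v8 * maxOut * 2u + 255u) / 510u;
}

// Sum of the quantized values over all 256 possible inputs.
inline std::uint32_t SumOverAllInputs(std::uint32_t (*quantize)(std::uint32_t, int),
                                      int bits) {
    std::uint32_t sum = 0;
    for (std::uint32_t v = 0; v <= 255; ++v) sum += quantize(v, bits);
    return sum;
}
```

Comparing SumOverAllInputs(QuantizeTruncate, 5) against SumOverAllInputs(QuantizeRound, 5) shows the truncating version coming out lower, by about half a step out of 31 on average. With truncation only an input of exactly 255 reaches the maximum 5-bit value, so full white survives but almost everything else is nudged darker.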

I ran some specific tests on BGRA8 to RGB5 and RGB5_A1 conversion performance on NVIDIA, ATI, and Intel OpenGL drivers. The results are:

Year or so old NVIDIA drivers on my old system: Good
Current ATI drivers: Bad
Current Intel drivers: Worse

I'm not really surprised that old NVIDIA OpenGL drivers are still superior to current drivers from various other vendors in a number of areas. What did surprise me is just how bad certain parts of ATI's and Intel's OpenGL drivers are. Although this may not be the highest priority path when it comes to performance compared to 16-bit textures coming from 16-bit source data, I'd still consider it to be of moderate importance and something one would hope would be handled reasonably efficiently by the OpenGL driver.

Fortunately, since only a subset of textures is converted to 16-bit when Use16BitTextures is enabled, using precaching should catch most of them. However, it looks like animated textures fall into the 16-bit conversion group, but if going for speed over quality, these may already be disabled.

I didn't add the more complicated dynamic scaling of 16-bit textures that the D3D and Glide renderers have. I don't think this will be a major loss since lightmaps are still 32-bit even with this new option enabled (and they're low resolution, so not converting them to 16-bit impacts speed and memory usage less compared to the other textures that are converted). Also, with the coronas and skyboxes that tend to take the largest quality hit with 16-bit textures, the dynamic scaling code may have made little to no difference due to them often having a wide dynamic range (or more specifically, high maximum color values).

I ran a few benchmarks on an Intel 865G integrated graphics subsystem with dual channel DDR400. Although it's of course fairly slow, if using the right combination of a low enough resolution and not too many high quality features, it's quite useable. Unfortunately, single pass detail texture mode ran a lot slower. Even though there are some tradeoffs with this mode, I got the feeling that quad texture performance was unusually low for some reason or another. That's too bad because in theory it was supposed to help by trading quad texturing on a larger number of pixels against what is likely to be more expensive read/modify/write blending on a fairly large portion of these pixels when doing dual texture two pass rendering. Vertex program mode didn't work correctly with the latest drivers, though at least it didn't cause a system lockup. It would have been interesting to see how it compared.

I changed a few other things in this build of the renderer. This includes some tweaks to the vertex programs, cleaning up some old junk in the multipass detail texture code, and various minor general optimizations.

8-15-2004
I don't think there's any easy way to fix the 16-bit z-buffer problems without using a w-buffer. I can sort of half fix it, but it's not really good enough to be of much use, so I'll either leave the new code ifdef'ed out for a bit or just delete it. W-buffers are supported through D3D on some cards, but I've never seen them supported through OpenGL.

3-15-2004
Although I wasn't specifically looking for it, I noticed a minor blending problem with detail textures when not using the single pass mode. It's a subtle line that keeps a constant distance from the viewpoint and is caused by a very minor brightness difference. I doubt I'll take the time to try to confirm it, but I wonder if this might be due to the minor blending bug I've read is present in R300 family ASICs. This anomaly does not show up on my GeForce4.

1-21-2004
It looks like ATI has left hardware gamma ramp support for various full screen OpenGL apps a bit broken in their 4.1 drivers. This problem affects UT OpenGL and first showed up in their 3.10 drivers. Using the start button to switch back to the desktop and then switching back to full screen UT may be able to work around this problem.

11-29-2003
I may add frame rate limiting. V Sync should work though. I just spent an hour or so play testing V Sync at 75 Hz on my primary system and it works just fine. It's an Intel P4, ATI 9800, and WinXP. I also had no problems with V Sync on my old system with an NVIDIA Ti4200.

If you don't want to use V Sync for whatever reason, for online games, just use a lower netspeed of around 10000 or so to prevent UT from running too fast, which can cause it to work incorrectly. For many online games, I'd expect this to make zero difference besides frame rate limiting, as I don't think online servers with a MaxClientRate of over 10000 are very common. Even on servers that do have a higher MaxClientRate, the server would actually have to want to send more than 10000 bytes per second for a netspeed of 10000 to limit anything. I'm not sure what kind of tick rate and gameplay situation would be required for this, as I've never seen it happen. Of course I've probably never played on a server with a high enough player count and/or high enough tick rate, and with a high enough MaxClientRate to ever find out.

ATI's current drivers are not good about sharing the CPU while waiting on V Sync. NVIDIA fixed this a long time ago, though I haven't checked lately if it has stayed fixed. Not sharing the CPU is bad for multitasking performance. On modern PCs, keeping the processor busy doing some sort of spin wait while waiting on V Sync can also increase power consumption.

The frame rate limiting code I've been experimenting with will share the CPU. In this case, whether or not it does so is controlled by the implementation of the sleep function in the UT engine. It could either spin or pass the sleep call down to the OS. The following numbers from whatever my motherboard uses to get CPU temperature are from just running around CTF-Coret single player with 16 bots. The game was run at 85 Hz in a window.
With V Sync: ~53° C
With V Sync and with frame rate limit of 85: ~49° C
Without V Sync and with frame rate limit of 85: ~46° C
Under heavier load, there will be less idle CPU time available and these numbers should eventually converge. But even in games with a lot going on, there is still a lot of potential for wasted energy or otherwise useful CPU time on average unless the frame rate is constantly stuck below 85 Hz, or some other set limit.
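
A frame rate limiter that shares the CPU can be sketched with standard C++ as below. This is not the code in the renderer, which goes through the UT engine's sleep function. Here std::this_thread::sleep_until stands in for a sleep call that gets passed down to the OS, and the FrameLimiter class is a made-up name.

```cpp
#include <cassert>
#include <chrono>
#include <thread>

class FrameLimiter {
public:
    using Clock = std::chrono::steady_clock;

    explicit FrameLimiter(double framesPerSecond)
        : period_(ToPeriod(framesPerSecond)), next_(Clock::now() + period_) {}

    // Call once per frame after rendering. Sleeps until the next frame is due.
    void Wait() {
        const Clock::time_point now = Clock::now();
        if (now < next_) {
            std::this_thread::sleep_until(next_);  // hands the CPU back to the OS
        } else {
            next_ = now;  // running behind: do not try to catch up
        }
        next_ += period_;
    }

private:
    static Clock::duration ToPeriod(double framesPerSecond) {
        assert(framesPerSecond > 0.0);
        return std::chrono::duration_cast<Clock::duration>(
            std::chrono::duration<double>(1.0 / framesPerSecond));
    }

    Clock::duration period_;
    Clock::time_point next_;
};
```

Typical use is to construct one FrameLimiter with the desired limit, such as 85, and call Wait() at the end of each frame. A spin wait would hit the same frame times but keep the processor busy the whole time, which is the difference the temperature numbers above are pointing at.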

5-4-2003
I'm probably not going to be fixing that assertion that pops up when switching 16/32 bit color depth in the Video tab in game. For a couple of reasons, it's not easy to fix. It is possible to change this setting by manually editing the ini file of course.

For at least a few releases now, the DLL MSVCP60.dll is required. If you get any error message about trouble finding this file, send me an email. I may upload a copy of it eventually after researching how common it is to have this file installed. In many cases, you'll have it installed already from some other piece of software.

In version 1.5, I added a new option to convert all DXT1 compressed textures to DXT3 format on upload. This can be used to work around bad DXT1 texture quality on NVIDIA GeForce1 - GeForce4 series video cards. The DXT3 textures take twice as much texture memory as the DXT1 textures though. If you are interested in playing around with this setting, take a look at the TexDXT1ToDXT3 option in the [New options] section. If you're looking for a good comparison texture, the sky texture in dm-kgalleon looks particularly bad on the NVIDIA cards with the bad DXT1 quality. On the other hand, many of the other DXT1 textures still look very good, so it might not be worth the performance hit in some cases.
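
The doubling in size follows directly from the block layouts: a DXT1 block is 8 bytes for 4x4 pixels, and a DXT3 block is 8 bytes of explicit 4-bit alpha followed by the same kind of 8 byte color block. Below is a simplified sketch of converting one block. It is not the renderer's code and the function name is made up. It only handles opaque DXT1 blocks in four color mode (first endpoint greater than the second) and reports failure for the others, since DXT3 always decodes its color block in four color mode and those blocks would need to be re-encoded.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper: expands one 8-byte DXT1 block to a 16-byte DXT3 block.
// Returns false for DXT1 blocks in three-color mode, which cannot be copied
// as-is because DXT3 color blocks are always decoded in four-color mode.
inline bool ConvertDxt1BlockToDxt3(const std::uint8_t src[8], std::uint8_t dst[16]) {
    assert(src != nullptr && dst != nullptr);
    // The two RGB565 endpoints are stored little-endian at the block start.
    const std::uint16_t c0 = static_cast<std::uint16_t>(src[0] | (src[1] << 8));
    const std::uint16_t c1 = static_cast<std::uint16_t>(src[2] | (src[3] << 8));
    if (c0 <= c1) {
        return false;  // three-color (possibly transparent) mode: needs re-encoding
    }
    std::memset(dst, 0xFF, 8);     // 16 pixels of 4-bit alpha, all fully opaque
    std::memcpy(dst + 8, src, 8);  // endpoints and 2-bit indices are unchanged
    return true;
}
```

Every converted block carries 8 extra bytes of constant alpha, which is where the doubled texture memory comes from.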

This Unreal Developer Network page has some good examples of bad looking DXT1 textures on NVIDIA cards.


Copyright 2002-2006 Chris Dohnal