Friday, February 12, 2021

On the allocation of photon-budgets for light sources.

I am making a real-time Photon Mapper, called Practica Boxel.

Photon Mapping is an algorithm that computes Global Illumination for a scene. This means that unlike traditional real-time rendering techniques, it does not merely compute the direct illumination, but also considers light that is diffusely reflected from one object onto another. This is an incredibly expensive technique, since there is a near-infinite number of paths that a photon can take from a light source into your eye. Hence, most GI images are either very slow to compute, or very noisy.

The Photon Mapping algorithm is O(N) in the number of photons (linear). Traditionally it is even worse than that, since you cannot store a photon in a kD-tree in O(1). But my underlying photon storage does offer O(1) insertion, so in this case the total algorithm stays at O(N) and thus scales linearly.

Henrik Wann Jensen, the author of the Photon Mapping algorithm, mentions that you can keep the photon count independent of the number of light sources. If you increase the number of light sources in your scene, you do not necessarily need to increase the number of photons that you shoot. So far so good.

But still, I would like to prioritize my light sources, so that the sources that influence the scene illumination the most, shoot the most photons. I need to allocate a photon budget for each light source.

And this brings me to an interesting problem to solve. If my camera roams freely in the world, I need to pick out which light sources are closest to the camera. These sources are more likely to affect the illumination of the parts of the scene that are close to the camera.

My photon mapper has an additional requirement: for fast indexing, I need my light-source count to be a power of 2. So if there are 9 sources in the world, I actually need to feed my computational kernel 16 sources, some of which are copies. A source that is represented twice can simply be scaled down in brightness, so that both copies combine to the same power output.

It would be beneficial to have the light sources that are furthest away be the ones that are not duplicated, so that they shoot fewer photons (only half as many as the duplicated ones).

But I don't have to stop at mere duplication. I could create 256 virtual light sources out of the original 9, and do this in such a way that a very nearby light (e.g. a head-mounted light) gets many copies; let's say 20 of them. Meanwhile, the street lamp all the way at the far end of the street is represented by just one. Once fired, those photons have to be scaled, of course: even though the street lamp only shoots a few photons, those that it does shoot have higher power, whereas our 20-fold duplicated head-lamp shoots many more, much fainter photons. That high count means the noise up close to the camera is lower.

So here we have my interesting problem:

  • Given N light sources L1..LN...
  • With distances to the camera D1..DN...
  • How to map them onto M (which is 2ᵏ) virtual sources LV1..LVM...
  • So that if source Q is F times further from the camera than source R...
  • Q is mapped roughly F times less often onto the LV set than R.
  • k can be chosen freely, up to, let's say, 10.

This seems a non-trivial problem to solve, so I am leaning towards a simplified version of it, where I classify each original source L as:

  • FARTHEST
  • FAR
  • MED
  • NEAR
And subsequently assign them 1, 2, 3 or 4 virtual sources. After adding up the assignments, round up to a power of 2, and distribute the remainder T over the T nearest sources. Finally, scale down the power of each virtual source depending on its representation count. A sketch of what I have in mind follows below.
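
Here is a minimal sketch of that plan in C++; the names and types are mine, not Practica Boxel's. It assumes the sources are sorted nearest-first, so that the remainder can be handed to the nearest sources:

#include <vector>

// Distance classes, ordered so that nearer classes get more copies.
enum class Dist { FARTHEST, FAR, MED, NEAR };

// FARTHEST is assigned 1 virtual source, NEAR gets 4.
static int virtualCount(Dist d) { return static_cast<int>(d) + 1; }

// Round a count up to the next power of two.
static int nextPow2(int v) { int p = 1; while (p < v) p *= 2; return p; }

// One entry per original source, sorted nearest-first.
// Returns how many virtual sources each original source gets.
static std::vector<int> allocate(const std::vector<Dist>& classes)
{
    if (classes.empty()) return {};
    std::vector<int> counts;
    int total = 0;
    for (Dist d : classes)
    {
        counts.push_back(virtualCount(d));
        total += counts.back();
    }
    // Distribute the remainder T over the T nearest sources.
    // (The modulo guards against T exceeding the source count.)
    const int remainder = nextPow2(total) - total;
    for (int i = 0; i < remainder; ++i)
        counts[i % static_cast<int>(counts.size())] += 1;
    return counts;
}

Each virtual source would then emit at its original power divided by its representation count, so the total emitted power stays unchanged.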

I'll report back once I have implemented this, and evaluated how it behaves under a roaming camera in a larger scene.

UPDATE

In the end, I went for a slightly different approach: allocate (k / distance) emitters for each light source. I do a binary search for a value of k that results in a total emitter count of 128. Done.
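
For reference, a minimal sketch of that search in C++ (names are mine; assume distances are positive). Because emitter counts move in integer steps, the search converges on the k whose total first reaches the target, and a final adjustment may be needed to hit 128 exactly:

#include <algorithm>
#include <vector>

// Total emitters for a given scale k: each light gets roughly
// k / distance emitters, and at least one.
static int totalEmitters(const std::vector<float>& distances, float k)
{
    int total = 0;
    for (float d : distances)
        total += std::max(1, static_cast<int>(k / d));
    return total;
}

// Binary search for the k that brings the total emitter count
// up to the target. totalEmitters() is monotonic in k.
static float findK(const std::vector<float>& distances, int target)
{
    float lo = 0.0f, hi = 1e9f;
    for (int it = 0; it < 64; ++it)
    {
        const float mid = 0.5f * (lo + hi);
        if (totalEmitters(distances, mid) < target)
            lo = mid;
        else
            hi = mid;
    }
    return hi;
}

Calling findK(distances, 128) then yields the scale factor for the emitter allocation.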

Saturday, January 30, 2021

Practica Boxel

Today, I unveiled my latest work-in-progress: "Practica Boxel." Here is the video introducing the tool, which does WYSIWYG Global Illumination.

In a way, it is a continuation of my SIMD raytracer of a few years back. But while working on that (on the CPU, with AVX) I realised that firing only one primary ray per pixel and only one shadow ray, with optionally a specular bounce, gets you limited advantages over rasterizing triangles. Shadows are easier, with no need for kludges like shadow maps. But that's about it.

The real big win when leaving the rasterizing world is when you compute indirect light, with diffuse bounces. You get that awesome soft look known from radiosity solutions, for example. But calculating indirect diffuse is prohibitively expensive for a CPU. For that, you need the power of thousands of GPU cores.

So I wrote CUDA code to do this, leveraging the algorithm by Henrik Wann Jensen: Photon Mapping. A straight up path tracer is too slow for 60FPS use, but a Photon Mapper can be made fast enough, when your scene is simple enough.

So how do we get a simple scene? Use boxes! And thus, Practica Boxel was born. The name is a pun on MagicaVoxel, but instead of voxels, arbitrarily sized (still axis-aligned) boxes are used.

Since no proper box-editing tool seems to exist, I had to write my own. This task is made a little easier when you have Dear ImGui do the heavy lifting with respect to the user interface.

So, what does accurate lighting bring to games? Well, for starters, you can prominently feature Darkness in your games. No ambient light, no black screens, but actual darkness, as can be witnessed in this 2-second clip:

Where will this project go next? I could turn it into an indie game, but my preference would be to get external funding to create a complete tool-suite around the technology and let others make the games with it.

Tuesday, January 12, 2021

CUDA on an RTX 3070, in a nutshell.

CUDA terminology can overwhelm you. SM, Warp, Core, Thread, Block, Scheduler, Grid. I have been writing CUDA kernels for my RTX 3070, and I thought I would write down what I learned.

First things first: when you write a compute kernel, you write scalar code. So in my case, doing Photon Mapping, my kernel code handles a single ray, and one primitive at a time.

Compare this with writing SIMD CPU code: there you explicitly work on arrays, writing vector code, not scalar code. For single-precision floating-point operations, that means a vector of 8 (AVX) or 16 (AVX512) values at a time.
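
For contrast, here is what that explicit, 8-wide style looks like with AVX intrinsics (a minimal sketch; the function name is mine):

#include <immintrin.h>

// Explicit SIMD on the CPU: one AVX instruction processes 8 floats,
// and the programmer spells out the vector width.
void add8(const float* b, const float* c, float* a)
{
    const __m256 vb = _mm256_loadu_ps(b);
    const __m256 vc = _mm256_loadu_ps(c);
    _mm256_storeu_ps(a, _mm256_add_ps(vb, vc));
}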

This does not mean that the hardware executes your kernel as scalar code, though. The GPU will execute your code in batches of 32. Such a batch is called a warp. Theoretically, warp sizes could be something other than 32, but for GeForces they are always 32. So in a way, what the CPU does explicitly, 8- or 16-wide, the GPU does 32-wide, implicitly. Warps are executed (by 32 CUDA cores) in lock-step, much like the lanes of a SIMD register in a CPU. (Caveat emptor: only half the CUDA cores of Ampere GPUs can do integer operations, so throughput is halved for those.)

The workhorses inside an Ampere GPU are the Streaming Multiprocessors, or SMs for short. An Ampere GPU like the RTX 3070 has SMs that can execute 4 of these warps at the same time. To do this, each SM has 4 schedulers.

Let's go back to the software side: your kernel. If you have a million photons that you want to trace, your kernel will be executed a million times. In CUDA terminology, there are a million threads of execution. Threads are grouped into blocks, and all the blocks together form the grid. Every block is assigned to an SM for computation.

When a warp is scheduled and running, the 32 threads in it can diverge. If one of the threads is blocked waiting on something, the whole warp is blocked. This is bad. But luckily, the scheduler will intervene and switch out the blocked warp for a non-blocked one. Each of the 4 schedulers in an SM can keep up to 12 warps in flight, so often there will be at least one warp ready to run. The net effect is that the latencies are hidden.

When a scheduler has not a single warp that is ready to advance, compute throughput is lost. The Nsight Compute profiling tool can detect how often this happens for your kernel.

Even though your kernel is written in scalar form, each instance of your kernel is still responsible for selecting the right work at run-time! That is why nearly every CUDA program will contain the following code:

const uint32_t index = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's global index in the grid
...
foo = bar[ index ];

The upshot of all this is that if your kernel does a = b + c, then, provided the memory bandwidth is there, every SM executes 4×32 = 128 instances of it, and with the 46 SMs of the 3070, that means 5888 instances. So 5888 b values are added to 5888 c values and assigned to 5888 a values.
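
The scalar CUDA counterpart of the AVX snippet above could look like this (again a sketch, with my own names; note how the code itself is width-agnostic):

// Implicit SIMD on the GPU: scalar-looking code, executed 32-wide per warp.
__global__ void add(const float* b, const float* c, float* a, int n)
{
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)              // guard: the grid is rounded up to whole blocks
        a[i] = b[i] + c[i];
}

// Hypothetical launch: 128 threads per block, one thread per element.
// add<<<(n + 127) / 128, 128>>>(d_b, d_c, d_a, n);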

Note that the best I could hope for on my Xeon W-2140B workstation with AVX512 is 8 cores each computing 16 floats, which is 128 instances (or possibly double that, if both execution units can be kept fed).

Let me conclude by showing what you can do with all that compute power. Here is a scene that is globally illuminated by 2M photons that bounced once. I can compute this in a handful of milliseconds on the RTX 3070. Not bad!

Tuesday, January 5, 2021

Revisiting Android Studio

Yay, it is the year 12021 HE already. A palindrome year! And this is my first blog post for the year.

So you are still writing Objective-C? Not going with the times, my friend. Or you are still writing C/C++ Android apps? Again, you are not making it easy on yourself. But alas, you are in good company. You are in the company of me. Cheers!

Holding off on Kotlin and Swift can make things a little trickier, though. For instance, today's exercise is to install Android Studio from scratch on a new machine, and build my C/C++ game (it has a little Java too).

So, the download was easy enough. But I must say the Welcome Screen of Android Studio does not inspire confidence. This is what the welcome screen looks like when resized. Oops... not repainting the entire screen.

Seeing that this is cosmetic, let's gloss over it and load up my Little Crane project. When I do this, after some loading and downloading, I am greeted by an error:

Gradle sync failed: Cause: com.android.builder.sdk.LicenceNotAcceptedException: Failed to install the following Android SDK packages as some licences have not been accepted.

Again, easily rectified, as the optional NDK needs to be installed first. For this, close the project and start the SDK Manager from the Welcome screen. In the SDK Manager, choose the SDK Tools tab and select the latest NDK.

Let's try again! This time:

Gradle sync failed: Cause: executing external native build for cmake /home/bram/apps/LittleCrane/AndroidStudio/jni/CMakeLists.txt

Although it does not say so directly, my hunch is that this is a case of the cmake tool not being found, as opposed to something being wrong with my CMakeLists.txt file. Let's try installing cmake from the SDK Manager. I am prompted with two options for the cmake version, but let's try the newest: cmake 3.10.2

WARN - ues.SyncIssueUsageReporterImpl - Multiple sync failures reported. Discarding: SDK_BUILD_TOOLS_TOO_LOW 
WARN - ues.SyncIssueUsageReporterImpl - Multiple sync failures reported. Discarding: SDK_BUILD_TOOLS_TOO_LOW 
WARN - e.project.sync.GradleSyncState - Gradle sync failed: Cause: executing external native build for cmake /home/bram/apps/LittleCrane/AndroidStudio/jni/CMakeLists.txt

Consult IDE log for more details (Help | Show Log) (1 s 373 ms) 
With that IDE log being:
    2021-01-05 11:09:09,260 [ 387004]   INFO - .project.GradleProjectResolver - Gradle project resolve error 
org.gradle.tooling.BuildActionFailureException: The supplied phased action failed with an exception.
	at org.gradle.tooling.internal.consumer.connection.PhasedActionAwareConsumerConnection.run(PhasedActionAwareConsumerConnection.java:58)
	at org.gradle.tooling.internal.consumer.connection.ParameterValidatingConsumerConnection.run(ParameterValidatingConsumerConnection.java:62)
	at org.gradle.tooling.internal.consumer.DefaultPhasedBuildActionExecuter$1.run(DefaultPhasedBuildActionExecuter.java:78)
	at org.gradle.tooling.internal.consumer.DefaultPhasedBuildActionExecuter$1.run(DefaultPhasedBuildActionExecuter.java:70)
	at org.gradle.tooling.internal.consumer.connection.LazyConsumerActionExecutor.run(LazyConsumerActionExecutor.java:87)
	at org.gradle.tooling.internal.consumer.connection.CancellableConsumerActionExecutor.run(CancellableConsumerActionExecutor.java:45)
	at org.gradle.tooling.internal.consumer.connection.ProgressLoggingConsumerActionExecutor.run(ProgressLoggingConsumerActionExecutor.java:61)
	at org.gradle.tooling.internal.consumer.connection.RethrowingErrorsConsumerActionExecutor.run(RethrowingErrorsConsumerActionExecutor.java:38)
	at org.gradle.tooling.internal.consumer.async.DefaultAsyncConsumerActionExecutor.lambda$run$0(DefaultAsyncConsumerActionExecutor.java:55)
	at org.gradle.internal.concurrent.ExecutorPolicy$CatchAndRecordFailures.onExecute(ExecutorPolicy.java:64)
	at org.gradle.internal.concurrent.ManagedExecutorImpl$1.run(ManagedExecutorImpl.java:48)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at org.gradle.internal.concurrent.ThreadFactoryImpl$ManagedThreadRunnable.run(ThreadFactoryImpl.java:56)
	at java.lang.Thread.run(Thread.java:748)
  

I think it is time to call in the cavalry at Stack Overflow for this one. The highest-voted suggestion: Build -> Refresh Linked C++ Projects. Alas, that option is greyed out in my menu. Probably because the native build was never created in the first place?

At this point, I am tempted to go back to my old machine with its old copy of Android Studio, and just make sure I never update it. Because it did work when I last used it, which I'm guessing was a year ago or so.

I notice that SDK Command Line Tools are not installed. Maybe that's the cause? Adding Android SDK Command-line Tools 4.0.0rc1 to see if that's it. But sadly, no.

Ok, maybe we need to spec a higher Gradle version, per Gradle Plugin Release Notes?

Excellent. It still won't sync, but at least the newer Gradle will print out the exception that occurred! "No version of NDK matched the requested version 21.0.6113669. Versions available locally: 22.0.7026061"

    * Exception is:
...
Caused by: org.gradle.api.InvalidUserDataException: NDK not configured. Download it with SDK manager. Preferred NDK version is '21.0.6113669'. 
	at com.android.build.gradle.internal.cxx.model.TryCreateCxxModuleModelKt$tryCreateCxxModuleModel$1.invoke(TryCreateCxxModuleModel.kt:138)
	at com.android.build.gradle.internal.cxx.model.TryCreateCxxModuleModelKt.tryCreateCxxModuleModel(TryCreateCxxModuleModel.kt:148)
	at com.android.build.gradle.internal.cxx.model.TryCreateCxxModuleModelKt.tryCreateCxxModuleModel(TryCreateCxxModuleModel.kt:246)
	at com.android.build.gradle.internal.TaskManager.createExternalNativeBuildJsonGenerators(TaskManager.java:1385)
	at com.android.build.gradle.internal.ApplicationTaskManager.createTasksForVariantScope(ApplicationTaskManager.java:183)
	at com.android.build.gradle.internal.VariantManager.createTasksForVariant(VariantManager.java:331)
	at com.android.build.gradle.internal.VariantManager.createVariantsAndTasks(VariantManager.java:207)
	at com.android.build.gradle.internal.plugins.BasePlugin.createAndroidTasks(BasePlugin.java:671)
...
  

Huh... why does it want NDK v21? I can't find any reference in my project source insisting on NDK v21. I understand that the default NDK version is determined by the Gradle plugin. So let's push the plugin version up further, to the latest: 4.1.0.

Yes! Progress. Now: Minimum supported Gradle version is 6.5. Current version is 6.1.1.

And now the Gradle sync is successful! Although it still wants NDK v21 and not v22. I wonder why? Anyway, I have a new build!

Lessons learned:

  • Always set your Android Gradle plugin version to the latest, before doing anything, in the build.gradle file. Mine is now set to 4.1.1 at the time of writing.
  • Copy over your debug.keystore from your old machine, because a fresh copy of Android Studio will create a new one that is not yet registered in your Google Play developer console.

Monday, November 23, 2020

NVMe

I've built a Xeon workstation, and had a lot of issues with the NVMe drive not being found by the BIOS. It would take roughly 3 reboots before the WD SN550 would be found.

I've replaced that drive with a Crucial NVMe drive, and since I did that, the boot issue went away. Unfortunately, the Amazon return period has passed, so now I am test-driving the suspect WD SN550 drive as a secondary drive in a PCIe extension card.

With two drives in my system, I can do some comparative performance tests.

[Benchmark screenshots: the Crucial P1 M.2 2280 1000GB, and the Western Digital SN 550.]

I will report back with an assessment of the SN 550's reliability as a non-boot drive. Its read performance does seem better than the Crucial's. As I am using the Crucial as boot and root disk, I have not been able to compare the write performance between them. The Crucial does indeed feel slower.

UPDATE: The SN 550 is reliably detected as a 2nd drive by Linux.

Thursday, November 12, 2020

Setting up a new Ubuntu box.

This is for my own benefit... when setting up a new Ubuntu distribution, adjust the following...

  • Choose minimal install, then create an EFI partition, a swap partition, and a root partition.
  • After reboot, pin the terminal to the bar.
  • gsettings set org.gnome.desktop.interface enable-animations false
  • apt-get update ; apt-get dist-upgrade ; apt autoremove
  • Make it accessible: apt-get install openssh-server
  • Booting should not be silent, edit /etc/default/grub
  • Copy over .ssh/ directory from another machine, so that I have my ssh keys.
  • Copy over .vim/ directory from another machine, so that I have my vim setup.
  • Add to .bashrc file: export PKG_CONFIG_PATH=$HOME/lib/pkgconfig:/opt/lib/pkgconfig
  • Add repositories for dbgsym packages.
  • To allow fan control: sudo nvidia-xconfig -a --cool-bits=4
  • apt-get install psensor traceroute imagemagick
  • apt-get install vim git cmake clang-10 libsdl2-dev opengl-4-man-doc
  • apt-get install inkscape gimp wings3d
  • Install CUDA
  • Set git identity: git config --global user.email "EMAIL" ; git config --global user.name "NAME"
  • Get rid of Gnome's indexing by tracker.
  • Create a file /etc/modprobe.d/nsight.conf containing: options nvidia "NVreg_RestrictProfilingToAdminUsers=0"

Thursday, November 5, 2020

Screen Recording and Ubuntu.

Recording gameplay videos on Ubuntu can be a big frustration. Here are some lessons learned.

WM

Compositing window managers are garbage. Get rid of the default window manager 'Mutter' and go with 'Openbox' instead. GDM3 refuses to launch a different window manager, so replace that too, with lightdm. *sigh*

  $ sudo apt-get install lightdm
  $ sudo apt-get install openbox

Log off, select openbox, and log on again. Good riddance to the crapfest called compositing. You don't need it. Eye candy and transparent windows are for noobs.

Simple Screen Recorder

So far, the best results I have gotten are with Simple Screen Recorder, but with caveats: the docs say OpenGL recording is best. Well, it most certainly is not. It is choppy, glitchy and unreliable, and it also causes OpenGL errors. Instead, record the window. Make sure to select the 'superfast' preset, otherwise the encoding will cause hitches in the framerate, especially when trying to make 60Hz videos.

In the end, I got there, and managed to record gameplay for my video, which I made with OpenShot: