Rasmus' blog
Graphics and game development.
Home All posts About Follow (@rasmusbarr) RSS
JUNE 17, 2017

Data-oriented rigid body physics

This post discusses my experiments with making a tiny (single cpp-file) data-oriented and SIMD-optimized 3D rigid body physics library. You can find the code with a sample application (win/mac/linux) here. The library allocates no memory by itself and holds no hidden state. If you are interested, I'll discuss the process and design decisions below.

I will not discuss the mathematical details of collision detection and physics because there are other sources for that information. Instead, I will be very specific and describe the iterations that the library have gone through to this point. In addition, I'll describe some specific optimizations and future work at the very end.

Library overview

Very broadly speaking, the purpose of a physics library is to move objects around and having them not penetrate each other, while fulfilling other constraints such as joints. To this end, it first determines all collisions between objects and generates a set of contact points. The contact points are then used as constraints on the objects' velocities, in addition to any joints, before advancing their positions. In practice, collision detection is divided into a broadphase and a narrow phase. The broadphase determines pairs of potentially colliding objects, e.g., using their bounding boxes, and the narrow phase produces actual contact points. The constraint solving part is based on sequential impulses, which I first saw in a GDC presentation by Erin Catto several years ago.

The library I made certainly has limitations in its current form. The library does not deal with joints explicitly and instead opens up for custom constraints, as will be discussed later. It currently only provides box and sphere colliders (in any configuration per body). It is up to the application to handle complicated world geometry, such as a triangle soup, by adding custom contact points.

Iteration 1: Buffer in, buffer out

The main idea of making a data-oriented library is, at least from my point of view, to communicate through data buffers in an efficient format:

This simplistic view ensures that the library holds no hidden world state. Instead, state is clearly visible to and managed by the user, allowing, e.g., the current physics state to be serialized with a few memcpy's. Contrast this with many physics libraries that has large class hierarchies and carries around a complicated world state.

I started by defining the basic data required to do rigid body simulation. The most important data for the physics library holds the state of rigid bodies, and the arrangement of colliders. Colliders are attached to rigid bodies with a relative transform, and multiple colliders can be attached to the same body. A body has a mass, a moment of inertia, a velocity, and an angular velocity. This information resides in user managed memory with a mixed SoA/AoS layout based on access patterns. The user is free to swap elements for efficient deletion, or add multiple objects at once using memcpy.

In C/C++ code, the data format was as follows:

struct ColliderData {
struct {
BoxCollider* data;
Transform* transforms;
uint32_t count;
} boxes;
struct {
SphereCollider* data;
Transform* transforms;
uint32_t count;
} spheres;
};
struct BodyData {
Transform* transforms;
BodyProperties* properties;
BodyMomentum* momentum;
uint32_t count;
};

The BoxCollider/SphereCollider structs simply hold information about the geometry of the collider (size/radius). BodyProperties holds mass and moment of inertia while BodyMomentum holds linear and angular velocity. The Transform struct describes both the transform and a parent body (only valid for colliders).

Given these data definitions, the library simply had two functions in the beginning:

void collide(​ContactData* contacts, BodyData bodies, ColliderData colliders, Arena temporary);

void simulate(​ContactData contacts, BodyData bodies, float dt, Arena temporary);

The first function, collide, takes all colliders and determines contacts between them. The contacts are stored in an output buffer, allocated by the user. Body data is needed in order to find the absolute transform of each collider, which is specified relative to the body it is attached to. The second function, simulate, takes all contacts and corrects the linear and angular velocity of each body so that contacts aren't penetrating. Last, the position of each body is advanced using the time step. Forces, such as gravity, is applied by the user as a velocity impulse before calling simulate.

The split between the functions makes sense because it allows the user to add additional contacts, or filter contacts out. The added flexibility, and what it can accomplish, is something I may talk about in a future post.

A note on the last parameter to the functions: I have found that you typically need temporary buffers when doing data-oriented programming. This is simply because you operate on many objects at the same time, and storing thousands of data items on the stack isn't ideal. The temporary arena, passed to each function, is simply a memory pointer and an integer holding the size of the memory that the physics library can use to store temporary buffers within the call, thus avoiding dynamic memory allocation.

Iteration 2: Contact cache

After the first iteration I was able to simulate thousands of objects at 60 fps on my 15W laptop. Next, I wanted to improve the stability when objects are stacked. Warm starting is a way to do that, and requires some state between simulation updates.

The simulate function corrects the velocity of bodies by iteratively applying corrective impulses between bodies in contact. A higher number of iterations means a more stable result. Warm starting involves starting out by applying the impulse from the previous simulation update. This typically results in increasing stability across simulation updates, as long as bodies stay in contact.

Two things are needed to implement warm starting:

In order to match up contacts between simulation updates, it is common to store which features between two colliders each contact comes from. For example, if the contact point is from an edge-edge contact between two boxes, the two edges would form a unique feature pair. Physics engines may then store a persistent manifold for each pair of colliders across simulation updates in some kind of world state. In the interest of avoiding hidden state, I instead define a very specific contact cache:

struct ContactCache {
uint64_t* tags;
CachedContactImpulse* data;
uint32_t capacity;
uint32_t count;
};

Warm starting impulses stored in this contact cache are identified with 64-bit tags that are used to match up contacts. For efficiency reasons the tags are guaranteed to be in ascending order.

The cache tag is a combination of two collider identifiers and an identifier for the feature pair. Because we allow the user to rearrange colliders as necessary in the input data, it's not reliable to use indices into the collider data arrays directly. Instead, I opted to add a 16-bit tag to each collider that the user manages explicitly. It is thus the user's responsibility that there are no duplicate tags, and that tags aren't reused before the cache is invalidated (by the next call to collide). The collider data was extended with tags as follows:

struct ColliderData {
struct {
uint16_t* tags;
BoxCollider* data;
Transform* transforms;
uint32_t count;
} boxes;
struct {
uint16_t* tags;
SphereCollider* data;
Transform* transforms;
uint32_t count;
} spheres;
};

The user-managed tags provide some added flexibility too. Contacts can be explicitly invalidated in the cache when a collider is changed by updating its tag, for example. Memory for the cache is allocated by the user and the new library interface came down to this:

void collide(​ContactData* contacts, BodyData bodies, ColliderData colliders, Arena temporary);

void simulate(​ContactData contacts, BodyData bodies, ContactCache* cache, float dt, Arena temporary);

Iteration 3: Rest

After the previous iteration, stability was good enough for my purposes. However, I wasn't satisfied with the fact that the CPU load was the same no matter if all objects were moving or at rest.

Resting can be handled by detecting groups of objects that have been immobile for some time, and then excluding them from further computations. For this reason, I add a uint8_t counter to each body in order to track how long each body have been below the movement threshold, like so:

struct BodyData {
Transform* transforms;
BodyProperties* properties;
BodyMomentum* momentum;
uint8_t* idle_counters;
uint32_t count;
};

A counter is incremented each simulation update the corresponding object is below the movement threshold, and cleared if above. What remains is determining connected sets of objects, so called islands, and exclude islands where all objects are at rest, i.e., where the counter has reached a certain value. I detect islands in two steps. First directly after the broadphase, allowing detailed collision detection to be skipped for many resting objects. Then again after detailed collision detection.

This results in a slight complication of the library interface since collisions determine the set of active bodies, and that set needs to be communicated in yet another buffer:

void collide(​ActiveBodies* active_bodies, ContactData* contacts, BodyData bodies, ColliderData colliders, Arena temporary);

void simulate(​ActiveBodies active_bodies, ContactData contacts, BodyData bodies, ContactCache* cache, float dt, Arena temporary);

A simulation update is now restricted to the set of active bodies, and the user should only apply gravity and other forces to objects within the active set. If a body should be kept active and simulated even when resting, its counter must be cleared every frame by the user.

Iteration 4: Extending without callbacks

At this point I wanted to allow for integrating custom joints in user code. A nice thing about a buffer-based interface is that it is possible to extend the physics library without any kind of plug-in architecture. We have already seen an example of that with collision detection and how more contacts can be added to the contact buffer before the simulation update.

I wanted to keep the simplicity of having user code dictate control flow, and thus avoid callbacks from the library, while still allowing custom joints. To this end, the simulation update function was split into multiple parts ranging from constraint solving to position advancement.

I first separated out reading from and writing to the contact cache. The two relevant functions are found below. This is really in preparation for being able to work on a subset of contacts at a time while having a single cache, and can be ignored for now.

ContactImpulseData* read_cached_impulses(​ContactCache contact_cache, ContactData contacts, Arena* memory);

void write_cached_impulses(​ContactCache* contact_cache, ContactData contacts, ContactImpulseData* contact_impulses);

The rest of the simulation update is really a sequence of events where a setup phase gathers data needed for efficient contact constraint solving, an iteration phase iterates a number of times to fulfill the constraints, a caching phase where the cached impulses are updated, and last, an advancement phase where bodies are moved. The meat of the simulate function was thus broken up into the following:

ContactConstraintData* setup_contact_constraints(​ActiveBodies active_bodies, ContactData contacts, BodyData bodies, ContactImpulseData* contact_impulses, Arena* memory);

void apply_impulses(​ContactConstraintData* constraints, BodyData bodies);

void update_cached_impulses(​ContactConstraintData* constraints, ContactImpulseData* contact_impulses);

void advance(​ActiveBodies active_bodies, BodyData bodies, float dt);

The function setup_contact_constraints is called once and apply_impulses is repeated several times to satisfy the contact constraints. Adding support for other constraints, such as joints, is now possible by adding more corrective impulses after each call to apply_impulses. Lastly, update_cached_impulses is called to store the latest contact impulses and advance is used to move the bodies.

As you may notice, the result of read_cached_impulses and setup_contact_constraints both return opaque data, but since it's only temporary I do not consider it hidden world state. Both of these functions allocate the returned data from the arena parameter, thus avoiding dynamic memory allocation.

There's still one complication to take care of. In order for resting and wake-up to function correctly, the collision system must know about bodies connected by joints during island forming. Therefore, a body connection buffer was added as input that must contain that information, based on active joints:

void collide(​ActiveBodies* active_bodies, ContactData* contacts, BodyData bodies, ColliderData colliders, BodyConnections body_connections, Arena temporary);

Implementation

In this section, I'll focus on some specific implementation details that may be of interest.

Broadphase

The purpose of the broadphase is to quickly find potentially intersecting pairs of colliders that should be tested in a more accurate manner. My initial approach, before attempting any SIMD optimization, was to make a hash grid and put axis aligned bounding boxes into it to find potential collision pairs. When vectorizing the broadphase, the first thing I tried was to do a brute force O(n^2) double loop in a vectorized manner in a few lines of code. It turned out that for 1000 colliders, the brute force SIMD version was about 2x faster than the more involved hash grid. Scaling as the number of colliders increased was obviously worse, but that knowledge nudged me into the current approach which I will describe below.

Basically I wanted the same code structure as the brute force loop which was easily vectorized, but with better scaling for a larger number of colliders (~8000). I ended up generating a morton code for each collider, and sorting them in ascending order using a radix sort. This resulted in a somewhat coherent linear order for the colliders with very low implementation effort. I then divided this array into chunks of 8 colliders and computed the axis aligned bounding box for each chunk. In a first step, all chunks are tested against each other in a brute force manner, resulting in coarse pairs. For each coarse pair, the contents are then tested per collider, using its axis aligned bounding box, resulting in collider pairs.

Detail collisions

The most enabling part of data-oriented design is the ability to process many objects at once. This makes code easy to parallelize and optimize in most cases. In order to enable this kind of batch processing for detail collisions, all collider pairs from the broadphase was first sorted based on the kinds of colliders involved. This results in arrays of collider pairs sent to the same test.

Box-box collisions are currently the most involved test. I use the separating axis theorem, and in this case axes to test involves face normals and all combination of edges. I divided the test into a few streaming loops for simplicity and performance. First, the faces are tested over all pairs, quickly filtering out most separating colliders. Then, edge-edge axes are tested in a second pass. At that point, edge-edge features are deferred, while face-face tests are performed immediately. Edge-edge contacts are then computed in a separate pass.

Note that none of the current collision routines require any state to be stored between frames. I expect that if support was added for convex polyhedra, I would need to cache information about the previous contacts to get good performance. In this case, I will likely add another explicit cache for that data.

Scheduling of contact constraints

Iterating over contact constraints is one of the most costly parts of the physics engine and vectorizing it is important. Doing so is problematic, however, since loop iterations have dependencies. For example, two subsequent contacts may refer to the same bodies which causes conflicts between SIMD lanes. My approach to solving it is very simple. The contact constraint setup is prefaced with a simple scheduling routine that reorders contacts to ensure that no conflicting contacts end up within a SIMD width.

Future work

The library is by no means finished. Certain parts need more optimization effort. Box vs. box collisions are only 4-wide currently and the other collider combinations are not SIMD-optimized. The broadphase is not great for 4-wide SSE and works much better with 8-wide AVX currently.

The current API makes one fact painfully obvious: Each impulse iteration goes through all constraints, blowing through a lot of memory. It would be much better if we could reduce the working set by performing all iterations on one independent subset of constraints and then move on to the next subset. Part of the problem is finding independent subsets, but most of that work has already been done when identifying islands to support putting bodies to rest. In the future I want to expose functionality to partition constraints and bodies into independent subsets for improved locality and independence.

A related problem is the lack of multithreading. In theory, it would be simple to parallelize the collision routines and, given independent islands for constraint solving, so would updating the bodies. The problem that remains to be solved relates to library design. I would not want the physics library to allocate threads of its own, but rather expose some kind of job interface so it can be plugged in to an existing solution.

The order of contact constraint evaluation seems important for the stability of the simulation. Currently, contacts are always evaluated in tag order, which is probably far from optimal. I briefly tried to randomize the contact order which seems to give good behavior but more experiments and research are needed.