Massive iOS Particle System Performance Gains with Accelerate Framework vDSP

By Gregory Wieber


A little while ago I was working on updating Microcosm and found that I had already hit the wall with CPU usage. The bulk of the processing power was being spent on the granular synthesis engine and the particle system calculations. I used the techniques outlined here to gain enormous speed increases in both the audio and visual processing engines.

Profiling before optimizing is, of course, a good rule to follow. And, to be honest, the vectorized code can be a little harder to grok at a glance, so it's worth keeping your un-vectorized code around as a reference. But when you need every last drop of performance, I've found vDSP to be indispensable.

The problem: Getting from 3fps to 60fps

As soon as I started adding more interesting physics to the simulation, particularly magnetic attraction, the frame rate slowed to a crawl on an iPhone 4. It was really bad, at about 3 fps. Disheartening, considering that it took a while just to get the magnetic attraction working.

The reason it was so slow was that each particle in the simulation had to look at every other particle in the simulation. The first thing I did was implement an Octree algorithm. By recursively dividing the space wherever there are more particles, and then only comparing particles in the same cell, I immediately got the frame rate up to around 35fps. Later, I would implement a second, coarser Octree (fewer subdivisions) and use it to average the magnetic attraction/repulsion of entire areas of space. This gave me a system with both close-range collisions and long-range attractions.
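
To make the subdivision step concrete, here's a minimal sketch of the decision an octree repeats at every level: which of a cell's eight children does a particle belong to? This is an illustration only, not the Microcosm code; the name octant_index and the bit layout are made up for the example.

```c
/* Hypothetical helper: which of the 8 child cells of an octree node
   contains the point (x, y, z), given the node's center (cx, cy, cz).
   Bit 0 is set for the +x half, bit 1 for +y, bit 2 for +z. */
static int octant_index(float x, float y, float z,
                        float cx, float cy, float cz)
{
    int index = 0;
    if (x >= cx) index |= 1;
    if (y >= cy) index |= 2;
    if (z >= cz) index |= 4;
    return index;
}
```

Particles that get the same index at every level end up in the same leaf cell, and those are the only ones that need to be compared directly.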

Here's a video of the Octree in action, pre-optimizations:


With a decent algorithmic optimization in place, there was still quite a ways to go in order to get the frame rate to where I wanted it.

Enter the Accelerate Framework.

The gains possible with the vDSP functions are hard to overstate. They work on whole arrays of values, and when used correctly they let the processor's vector (SIMD) unit apply an operation to several values per instruction instead of one value at a time.

Here's a simplified look at a typical verlet loop:

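  • for (int i = 0; i < n_particles; i++)
  • {
  •     temp = x[i];                                             // save the current position
  •     x[i] += x[i] - oldx[i] + a[i] * fTimeStep * fTimeStep;   // advance the position
  •     oldx[i] = temp;                                          // the saved position becomes the old one
  • }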

(See Advanced Character Physics if you need a refresher on Verlet)

The first line inside the loop stores the current position in a temp variable. The last line assigns that saved value to the old position. Whatever the current position was before the update, that's now the old position.

So, the first optimization I made was to move that swap out of the loop.

My Verlet class has a few arrays for holding positions:

  • int n_particle_vertices = n_particles * 4; // xyzw
  • GLfloat particlesX[n_particle_vertices];
  • GLfloat particlesOldX[n_particle_vertices];
  • GLfloat particlesTempX[n_particle_vertices];

I'm storing 4 floating point numbers for each particle: x, y, z, w. These values are simply packed consecutively in each array. For example:

  • GLfloat firstParticleX = particlesX[0];
  • GLfloat firstParticleY = particlesX[1];
  • GLfloat firstParticleZ = particlesX[2];
  • GLfloat firstParticleW = particlesX[3];
  • GLfloat secondParticleX = particlesX[4];
  • GLfloat secondParticleY = particlesX[5];
  • // etc...
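
Put another way, component c of particle i lives at index i * 4 + c. A tiny helper makes that explicit; the enum and function names here are hypothetical, chosen just for this sketch:

```c
#include <stddef.h>

enum { COMP_X = 0, COMP_Y = 1, COMP_Z = 2, COMP_W = 3, COMPS_PER_PARTICLE = 4 };

/* Index of one component of one particle in a packed xyzw array. */
static size_t component_index(size_t particle, int component)
{
    return particle * COMPS_PER_PARTICLE + (size_t)component;
}
```

With this, particlesX[component_index(1, COMP_Y)] is the same element as particlesX[5] in the listing above.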

So, now the swap becomes this:

  • vDSP_vswap(particlesOldX, 1, particlesTempX, 1, n_particle_vertices);
  • memcpy(particlesTempX, particlesX, sizeof(GLfloat) * n_particle_vertices);

This is done once per update, outside the loop. So, we've turned a separate temp assignment for every particle into a couple of bulk operations over the whole array.
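
Here's a sketch of that restructuring in plain C, so it builds without Accelerate. It assumes the temp array starts each frame holding the positions saved before the previous update. Because temp is overwritten immediately afterwards, a plain memcpy stands in for the vDSP_vswap call here; begin_frame and integrate are hypothetical names.

```c
#include <stddef.h>
#include <string.h>

/* Bulk bookkeeping done once per frame, before the integration loop.
   On entry, temp holds the positions saved before the previous update. */
static void begin_frame(float *old, float *temp, const float *x, size_t n)
{
    memcpy(old, temp, n * sizeof(float));  /* last frame's snapshot becomes "old" */
    memcpy(temp, x, n * sizeof(float));    /* snapshot the current positions */
}

/* The loop body no longer needs a per-particle temp variable. */
static void integrate(float *x, const float *old, const float *a,
                      float dt, size_t n)
{
    for (size_t i = 0; i < n; i++)
        x[i] += x[i] - old[i] + a[i] * dt * dt;
}
```

Calling begin_frame and then integrate once per frame produces the same positions as the original loop with its per-particle temp.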

The form that most vector-to-vector vDSP operations follow is that you provide the arrays, a stride for each array, and the number of elements you want to process. A stride other than 1 is used when you're not processing every consecutive element in an array. For instance, say you wanted to subtract all of the y positions of particlesA from all the y positions of particlesB and save the results to yDiff.

Here's what that would look like:

  • int n_y_values = n_particle_vertices / 4;   // one y value per particle
  • float yDiff[n_y_values];                    // variable-length arrays can't take an initializer in C
  • vDSP_vsub(&particlesA[1], 4, &particlesB[1], 4, yDiff, 1, n_y_values);

Couple of things to notice. One, the vDSP operations actually want pointers to the first element of the array that you want to process. In the swap example above, you'll notice that I just provided the name of the array. That works because, as you'll remember, an array name decays to a pointer to its first element when you pass it to a function. In this case, we're explicitly pointing to the first y position of particlesA and particlesB.

Next, notice that we're subtracting A from B: the call computes yDiff = particlesB - particlesA. It might not read like that, because if you read from left to right you might think (looking at the above) that we're subtracting B from A. It's really important to look carefully at the documentation to learn the order in which to provide your parameters to vDSP functions. Sometimes, the docs can be incorrect too -- so if you're out in the woods and getting strange results, that's something to check for.

Our input stride is now 4. It's 4 because we're dealing with xyzw vertices -- 4 values per particle. So, it essentially says: start at the first y coordinate, then move 4 values over to reach the next y coordinate.

The output stride is 1. We're storing all of the subtracted values consecutively in an array called yDiff.
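
If the strides and operand order are hard to picture, it can help to write out what the call computes as an ordinary loop. The function below is a scalar stand-in for illustration (ref_vsub is a made-up name, and it only handles positive strides). It is not how vDSP is implemented; it just produces the same values as the vDSP_vsub call above:

```c
#include <stddef.h>

/* Scalar stand-in for a strided vector subtract.
   Note the operand order: out = b - a, following the
   (a, stride_a, b, stride_b, out, stride_out, n) argument layout. */
static void ref_vsub(const float *a, size_t stride_a,
                     const float *b, size_t stride_b,
                     float *out, size_t stride_out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i * stride_out] = b[i * stride_b] - a[i * stride_a];
}
```

Called as ref_vsub(&particlesA[1], 4, &particlesB[1], 4, yDiff, 1, n_y_values), it reads every fourth float starting at the first y value and writes the differences back to back into yDiff.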

Note:

I've noticed that you can get unpredictable results if you try to store the output of a vDSP operation in one of the arrays that you've provided as a parameter. It's a good idea to use a completely separate array for output. Where might you be tempted to do such a thing? Well, you might be vectorizing a '+=' operation, like ParticlesX+= ParticlesX for instance.
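
One way to follow that advice for a '+=' is to write the sum into a scratch array and copy it back afterwards. Here's a minimal sketch in plain C; ref_vadd is a made-up scalar stand-in for whichever vector add you'd actually call, and add_via_scratch is likewise a hypothetical name:

```c
#include <stddef.h>
#include <string.h>

/* Scalar stand-in for an element-wise add: out = a + b. */
static void ref_vadd(const float *a, const float *b, float *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] + b[i];
}

/* x += v, done as "add into scratch, then copy back", so the add
   never writes into one of its own inputs. */
static void add_via_scratch(float *x, const float *v,
                            float *scratch, size_t n)
{
    ref_vadd(x, v, scratch, n);
    memcpy(x, scratch, n * sizeof(float));
}
```

The extra copy costs a little, but the scratch array can be allocated once and reused every frame.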

Distance calculation

Here's another real-world example. Not everything can be easily moved outside of a loop. But there are still improvements that can be made. Here's how you might speed up a distance-to-center calculation for a single particle:

  • float center[4] = {0.0f, 0.0f, 0.0f, 0.0f};
  • float distance_v[4];        // only x, y, z are used
  • float distance_dot_prod;
  • float distance_length;
  • vDSP_vsub(center, 1, particleX, 1, distance_v, 1, 3);              // distance_v = particleX - center
  • vDSP_dotpr(distance_v, 1, distance_v, 1, &distance_dot_prod, 3);   // squared length of distance_v
  • distance_length = sqrtf(distance_dot_prod);
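
For reference, this is the same calculation as a plain loop, stopping just before the square root. It's a scalar stand-in for the vsub + dotpr pair, assuming each pointer refers to one packed xyzw value; ref_distance_squared is a hypothetical name:

```c
#include <stddef.h>

/* Squared distance from a particle's xyz to a center point.
   Both arguments point at 4 packed floats (xyzw); w is ignored. */
static float ref_distance_squared(const float *particle_xyzw,
                                  const float *center_xyzw)
{
    float sum = 0.0f;
    for (size_t i = 0; i < 3; i++) {
        float d = particle_xyzw[i] - center_xyzw[i];  /* the vsub step */
        sum += d * d;                                 /* the dotpr step */
    }
    return sum;
}
```

Passing the result to sqrtf gives the same distance_length as the listing above. It's also handy for checking the vectorized version against known values.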

How did it work out?

Before vectorization, the particle system's verlet function was using around 15% of the CPU, and the granular synthesis engine's synthesize function around 35%. By replacing loops and operations with vDSP functions wherever possible, I got the particle system's verlet calculation down to around 5% and the synthesize function down to around 12%. The frame rate jumped up to 60fps and stayed there even with the magnetic attraction physics added. Overall, it's probably safe to say that many parts of the system became 2 to 3 times faster.

Here's a rough demo of the new physics. Notice the swirling eddies caused by the magnetic attraction and collisions of the particles:



The granular synthesis saw a big improvement because audio needs to process 44100 samples per second. That's a lot of frames per second. I may write separately about that later, but the biggest bottleneck was accumulating all of the grain samples together, i.e. addition. By simply vectorizing the synthesizer code, I achieved enormous speed gains.
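
To give a feel for that accumulation step, here's a scalar sketch of summing several grains into one output buffer. The synthesizer code isn't shown here, so the buffer layout and the name mix_grains are assumptions for illustration only. The inner loop is an element-wise add over two arrays, which is the kind of work a vector add function can take over:

```c
#include <stddef.h>

/* Sum n_grains buffers of n_frames samples each into out.
   Assumed layout: grain g starts at grains[g * n_frames]. */
static void mix_grains(float *out, const float *grains,
                       size_t n_grains, size_t n_frames)
{
    for (size_t f = 0; f < n_frames; f++)
        out[f] = 0.0f;                          /* start from silence */

    for (size_t g = 0; g < n_grains; g++) {
        const float *grain = grains + g * n_frames;
        for (size_t f = 0; f < n_frames; f++)   /* the vectorizable part */
            out[f] += grain[f];
    }
}
```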

As I said at the beginning of the article, it all comes at the expense of legibility. After you've worked with vDSP for a while, it does start to become familiar. If you're programming a game, or something else that requires every bit of performance from the processor, and you haven't used the Accelerate framework, hopefully this will inspire you to give it a try.
