Without fully understanding what is happening under the hood, parallelizing your code can lead to slower performance.

Context

I am working at the moment on evolutionary optimization for neural networks, to train agents to play games. To assign a fitness to a neural network, the agent needs to play N games (each starting from a random position), and the fitness is the average score over those N games. These N games are played in parallel (using the multiprocessing library in Python).

In this case, I have 32 cores available, and I am running 32 games in parallel.

The stack here was simple: python for the code, python-arcade for the game design, pytorch for the neural network, and pygmo for the evolutionary optimization.
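
To make the setup concrete, here is a minimal sketch of the evaluation scheme. The names (PolicyNet, play_one_game, N_GAMES) are illustrative stand-ins rather than my actual code, and the game loop is stubbed out:

```python
import multiprocessing as mp

import numpy as np
import torch
import torch.nn as nn

N_GAMES = 32  # one game per available core


class PolicyNet(nn.Module):
    """Hypothetical stand-in for the agent's network."""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))

    def forward(self, x):
        return self.net(x)


def play_one_game(weights):
    """Rebuild the network from a flat weight vector, play one game from a
    random starting position, and return the final score."""
    model = PolicyNet()
    torch.nn.utils.vector_to_parameters(
        torch.as_tensor(weights, dtype=torch.float32), model.parameters()
    )
    state = np.random.rand(16).astype(np.float32)  # random starting position
    score = 0.0
    for _ in range(200):  # placeholder game loop
        with torch.no_grad():
            action = model(torch.from_numpy(state)).argmax().item()
        # ... advance the game with `action`, update `state` and `score` ...
    return score


def fitness(weights):
    """Average score over N_GAMES games, played in parallel."""
    with mp.Pool(processes=N_GAMES) as pool:
        scores = pool.map(play_one_game, [weights] * N_GAMES)
    return float(np.mean(scores))
```

An individual in the evolutionary search is just the flat weight vector, so pygmo only ever sees fitness(weights); everything underneath is supposed to be embarrassingly parallel.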

The problem

The problem was the speed of execution. It was abysmal. Evaluating a single individual (i.e., 32 games played in parallel) took ~5 seconds. This is a very long time for a super simple game. I was expecting something on the order of milliseconds.

In general, especially early on in the experimentation process, a long optimization time is always a sign of bad design and a signal that I am doing something wrong.

The investigation

My first thought was that python-arcade was the problem. I went deep into the package and had conversations with some of the developers on the Discord channel. My suspicion centered on the game clock. Each game needs a ticking clock, and the clock speed determines the number of frames per second. My hunch was that, even in headless mode, the same clock was being used, hence the bad performance (basically, in headless mode I want the clock to tick AS FAST AS POSSIBLE).

I was wrong: there is no clock in headless mode. The problem was somewhere else.

My next thought was: it has to be that python-arcade itself is slow. This was a half-assed guess, but as part of the elimination process I needed to rule it out. So I built a game engine from scratch (one that operates on a matrix of ASCII characters, as sketched below). Things can’t get any faster than this.
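
For reference, the throwaway engine was conceptually no more involved than the toy grid-world below (the names and the movement rule are illustrative, not my actual game):

```python
import numpy as np


class AsciiGame:
    """Toy grid-world on a character matrix: the agent chases a '*' cell."""

    def __init__(self, size=8, seed=None):
        rng = np.random.default_rng(seed)
        self.grid = np.full((size, size), ".", dtype="<U1")
        self.pos = tuple(rng.integers(0, size, 2))   # random starting position
        self.goal = tuple(rng.integers(0, size, 2))
        self.grid[self.goal] = "*"
        self.score = 0

    def step(self, action):
        """Apply one of four moves; score a point when the goal is reached."""
        dr, dc = [(-1, 0), (1, 0), (0, -1), (0, 1)][action]
        r = int(np.clip(self.pos[0] + dr, 0, self.grid.shape[0] - 1))
        c = int(np.clip(self.pos[1] + dc, 0, self.grid.shape[1] - 1))
        self.pos = (r, c)
        if self.pos == self.goal:
            self.score += 1
        return self.score
```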

I ran the optimization and, lo and behold… there was almost no gain whatsoever!

Almost ready to surrender, I then converted the multiprocessing evaluation into a sequential evaluation, just to set a benchmark and record my observations for later. To my sheer surprise, there was a phenomenal gain in speed!

I need to dig a bit deeper into this (profile the code properly), but I am now fairly confident it is because of how pytorch works: the library is already very efficient at using all the available cores for its own internal parallelism. Running many pytorch instances at once most likely oversubscribes the CPU, with the processes fighting over the same cores and causing these excessive delays.
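
If that hypothesis holds, the remedy should be to keep each worker process single-threaded, so that 32 processes share the 32 cores instead of each one trying to grab all of them. Below is a sketch of what I would try (untested so far; it reuses play_one_game from the earlier sketch, and the environment variables only help if they take effect before pytorch spins up its thread pools, so ideally they would go at the very top of the script):

```python
import multiprocessing as mp
import os


def _init_worker():
    # Limit the OpenMP / MKL backends to one thread per worker process.
    os.environ["OMP_NUM_THREADS"] = "1"
    os.environ["MKL_NUM_THREADS"] = "1"
    import torch
    torch.set_num_threads(1)  # one intra-op thread per worker


def fitness(weights, n_games=32):
    """Same parallel evaluation as before, but with single-threaded workers."""
    with mp.Pool(processes=n_games, initializer=_init_worker) as pool:
        scores = pool.map(play_one_game, [weights] * n_games)
    return sum(scores) / len(scores)
```

Whether this fully recovers the expected speed-up is exactly what a proper profiling session needs to confirm.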

I probably need to invest more in my debugging / profiling skills: the elimination process is cool and effective, yet bloody expensive.

….hard lesson learned.