Concurrency the little bitch

Hey folks, today's post will feature some ideas, advice and conclusions about task-based concurrent programming (mostly in C++).

First of all, what actually is task-based concurrency? When it comes to threading, there are basically two and a half approaches to choose from: multithreading, multitasking and multiprocessing.

Multithreading: This approach relies on multiple threads across which the developer spreads the workload to reach even resource usage and optimal performance. It is very low-level and requires a lot of knowledge and care from the person coordinating the threads. They always have to know: how many threads do I have, which threads are doing what and, ideally, where the threads are running (on which CPU core, for instance).

Multitasking: This one is a more high-level approach, and there are a few ways to get things done here. The first way: thread pooling. The developer initializes a pool of threads which stay active after they finish their work, so tasks can be handed to them without creating new threads over and over. The developer (usually) doesn't know which thread executes which task (and doesn't care about it either), since tasks are simply queued up in the pool and the pool decides where they go. The second way: wrappers like C++'s std::async. Asyncs are used (kind of) like a thread pool, with one difference: not the developer but the std::async implementation decides where and how the tasks are executed. If you hand a task to std::async, it could even be executed on the same thread where the result is required, unless you explicitly tell the async object to behave another way.

Multiprocessing: You spawn multiple processes and let them communicate with each other. You are (kind of) running multiple programs wich do multiple tasks.

I will focus on multithreading/tasking in this post since I have little to no experience with multiprocessing.

Okay, this might sound like there are very few differences between multithreading and multitasking, except that multithreading seems to be more work. And in some way that is actually not too far from reality. But, and here is the point, there are reasons for both to exist.

A real-world example for multithreading:

MessageLoop and GUI/game thread. Let's say we have the following code:

msgloop_m

We have our main thread, which is basically our main function, and we have our game thread, which is spawned and runs until the main thread tells it to exit. The only communication between the two threads happens through the function calls on the game object and the join on the game thread.

And an example for multitasking:

update_m

We have our game thread, which does some logic and contains the game loop ( run() ), and a thread pool which executes little independent tasks (in this case updating actors) that won't report back to the game thread.

Alright, let's see: in our msgloop/game-thread example we know exactly which thread is dealing with which piece of work (main() with the message loop, gameThread with the game loop). But how does it look for the update example? Well, unless we call some kind of thread identification from inside our tasks, we simply don't know who is who.

And these two examples are good for two things: giving you an idea where each approach may fit, and showing you how these approaches work.

In example one, each thread executes its work completely independently and sometimes they communicate to trigger actions on each other, which is basically a simple event-handling pattern.

In example two, one thread creates tasks and the others just "consume" the tasks the game thread generates. That's why it's called the producer-consumer pattern.

Both of these patterns are suited for different use cases, as you might recognize. The first pattern is very handy when it comes to things like controls, long-term work which can run in parallel, or any kind of input loop or audio output. The producer-consumer pattern, on the other hand, is better suited if you need many small pieces of independent work done in parallel to your long-running threads, like drawing, animation updates or simple outputs.

The producer-consumer pattern benefits a lot from lock-free tasks but suffers even more if you need to synchronize the tasks you are pooling: if you block a pool thread, you also block some future tasks from being executed, while with manually managed threads you can easily create a new worker when one gets stuck.

The best performance you can reach with a thread pool depends on a continuous, spike-free stream of tasks being assigned to your pool. Let's illustrate this with a little image:

queue_opt

In this image we have 4 pool threads which are executing work and 4 more tasks which are queued up. In the next step each thread would grab one task and the queue would be empty again (or 4 more tasks could be queued). Pretty optimal, isn't it? All work gets done fast and clean.

And this is the way you won't like it:

queue_erro

Again, 4 threads, but this time there are 6 tasks scheduled. Now let's say 4 are executed, 2 remain in the queue and 5 more are added. As you might recognize (easy calculation: 6 - 4 + 5 = 7), we now have 7 tasks in the queue. And that's the danger of pooling threads and using task queues: if you throw too much work at a pool that is too small, you will generate a growing backlog which will hurt your performance pretty badly.

That's also the reason why std::async has the option to spawn additional threads; basically it's a hybrid between pool and thread. I personally don't like it too much, since the default invocation does not guarantee parallel execution. Nevertheless, with some tricks (like the launch flag) it can get pretty handy, since (at least the MSVC++ version) it internally relies on a thread pool with the option to spawn a new thread if the pool is under heavy load. I have to admit, I did what every professional programmer would call bad style and implemented my own with my own rules, so I basically know how it will behave, since I feel pretty confident with C++ as such. But well, that's nothing I can recommend to others…

Okay, that was tasking. Threading, on the other hand, needs a lot of synchronization to work well, and sometimes it might be better for your performance since you don't have to rely on high-level constructs like async or pools. But well… it's a pain in the ass to get it working as intended. Your parallelization is good as long as you don't need to access resources from multiple threads. You always have to remember: if you try to access a locked item, your program effectively becomes serial instead of parallel. The trick is to get your work lock-free and well timed. Split your tasks thoughtfully across threads to minimize data sharing between them and you will see a huge gain.

Usually I would hand you a picture about regular threading now, but this time I decided to simply leave it be, since in my opinion you can imagine it pretty well yourself.

Try this: you have a railway with a crossing in the middle where the rails change sides. (Don't ask why, they simply do.) Now you have a train on each of the rails. Well, one of them will have to wait until the other one has passed, since they don't want to crash. And that's exactly what your locks do. Now imagine the rails changed sides very often. The trains would take a lot longer to reach the end of the track going in parallel than they would going one after another, since then both could simply go and neither had to wait.

A super beautiful metaphor of mine with one simple conclusion: sometimes it's better to go for a single-threaded application instead of forcing threading into a super complicated program with a lot of locking work to do.

 

I hope this post was helpful to some of you. If you have any ideas, suggestions or criticism, feel free to comment.


It is something

Today's post won't feature a lot of wise words about technical stuff. Today I am just going to line up some "game" projects I made, to give you some hints where to start. (Of course that's not all the stuff I made, but it covers the different difficulty levels of development I passed through.)

The sprites (and meshes) you see are free ones from different websites; please don't ask me where I downloaded them, I can't remember anyway (meshes aside… they are from mixamo.com).

The first game project I started was my biggest mistake in this area… starting with a 3D game while having no idea of DirectX/OpenGL or any other 3D API. For this project I started to learn how to use DirectX with C++, and this almost stopped me from continuing my efforts to get into game development. I never finished it.

3D game

The second game I started went better straight through, but it was still not the best choice for a beginner, since I again used plain DirectX and C++. The game made it to an actually pretty well-working state (for the beginning).

invaders.gif

After this game I decided to go back to more basic games which are a bit less complicated, so I could focus on more important things, like how to build an easy-to-use gamedev framework for minor tasks. That's where this attempt came to life:

snake

This very basic snake clone is made with GDI+ and runs in the Windows console. I decided to step away from plain DirectX since I got the concept, I got the basics, and I wanted to see more progress instead of pumping a lot of time into understanding the API. This snake game later became my multithreading test playground. Actually, snake is too simple to use concurrency in the implementation, but it was a good starting point to learn which kinds of workloads you can best spread over multiple threads.

Later on I continued with this one:

tron

That one was my first step back to actual sprites instead of just drawing pixels. Not much to tell about it; tron is very simple, so I was able to concentrate on implementing a sprite system in my little framework and start using frame animations.

After tron, my next project was a little tetris clone:

tetris

Actually I made this just for fun.

And my latest project in progress is a 2D shooter I am currently working on. At the moment I have dummy sprites which look like shit but serve their purpose, and I am working on a level-from-file system so I can load complete levels from files.

scroller

Well, that's all I made so far. After the sidescroller attempt I am going to start over with Unity3D for some more serious stuff. Until then I am going to practice with mini games and things like my little physics framework.

I can only show you the mistakes I made in the past. Don't start too complicated, I can't tell you often enough. Starting with DirectX and 3D was just too much. Start simple, and once you have a feeling for how stuff works you can move on to more complicated projects.

If you feel lost, it’s ok to take one step back to get another 10 steps forward in the future. Just don’t stop trying as long as you have fun doing what you do.

That’s all for now guys.

Of threads and games

Hey guys, today’s post will address threading in video games.

In times where multicore CPUs are pretty common, multithreading becomes a really obvious choice for performance improvements in video games. But multithreading isn't always the answer. This post deals with two common issues I have noticed again and again over the last few years.

The first one: increasing (render) thread counts in .ini files of video games to tweak performance.

Well, that one might be a bit complicated, so I decided to address it first. Sometimes when I read forum entries where people complain about badly performing video games, someone comes up with a lot of .ini tweaks to improve performance. Basically there is nothing wrong with that, but as you might guess already, not all of these tweaks are really helpful. Increasing the "render thread count", for instance, is not worth it (in most cases). More threads do not automatically mean better performance. Quite the contrary: often this will make things worse.

DirectX up to 11 and OpenGL aren't capable of multithreaded GPU/CPU communication (or at least not very good at it). And that's one of the core issues. Yes, multithreading in games makes total sense, but only at some very specific points and in a "very limited" range. Throwing 32 threads at a game and thinking it will work better is a wrong approach which usually comes from people who have little or no experience with multithreading and/or game development.

Okay, back to our DirectX/OpenGL description. I will stick with DirectX in this post since I am more familiar with it, but most of the points apply to OpenGL in a very similar manner.

As I already mentioned: You won’t be able to pass render tasks to your graphics card from multiple threads at the same time. This leads us to a more or less annoying issue.

In a beautiful world, filled with rainbows, unicorns and a multithreaded D3D(11)DeviceContext, rendering would work like this:

ParallelDraw

Sadly, our context isn't thread-safe. What does that mean? Well, it's actually not very complex. As you can see in the picture above, if the world were a better place we could draw completely in parallel, but in reality things behave a bit differently. If we want to access the D3D11DeviceContext, we have to hide it behind a lock to serialize the access; otherwise our game would crash or at least begin to behave in strange ways. That means our rendering would look like this:

Draw

You might recognize that this prevents us from taking advantage of our multithreading capabilities. Even worse, we are facing the threading overhead (yes, spawning tasks and assigning them to threads also takes time) without getting any performance benefit –> our performance decreases. And for the DirectX pros who are laughing at me right now: yes, I am aware of deferred contexts and command lists, but this image is meant to simplify the whole thing for those who have no experience with DirectX. Handing tasks to a thread-safe container and executing them all in one thread on an immediate context follows basically the same principle as serializing the draw calls with a mutex: the immediate context will execute them one by one anyway. (Remember? We are still in DirectX 11.)

This might sound like: multithreading in games is bullshit… It isn't. There are other ways to bypass this issue.

Things like animation updates, collision, position updates, sound, physics or even preparing draw calls can be parallelized very well (that's why people might have multiple threads for rendering). This is an example of how a parallelized frame could be prepared/executed:

frame

This pretty much shows how I am processing my frames (sometimes with more, sometimes with fewer threads… it depends on the complexity of my update/collision functions). And yes, sometimes I do my collision check twice: once at the beginning of my ->Move() function and once after it. (One may argue about efficiency, but that's not the topic of this post.)

Okay, I think this gave you a little insight into how multithreading can be implemented in a video game and which problems you might face, but that's not all, folks.

As a little conclusion you might take the following: there are developers who use multiple threads for graphics tasks (as I already mentioned, to prepare their draw calls), but that's not the rule (at least as far as I know). And even if there is a render thread count inside an .ini file, leave it how it is. In the best case you gain 1-5 FPS. In the worst case, the engine can't handle the increased parallelization workload and your game loses FPS. Developers do not choose these numbers for fun. They know how their engine works (at least I hope so), so they probably know better how many render threads are appropriate.

And this leads me to the second issue (I promised you two issues 😉 ):

It's not always a good idea to throw threads at a game. There are beginners out there who feel forced to use multithreading in games just because multicore CPUs exist. That's not always a good idea. If you have a game with simple physics, simple collision and low graphics (that does not mean bad graphics), you don't want to throw multiple threads at it. Take a very basic tetris game, for instance. It could look like this one:

tetris

Why the hell should you throw multithreading at this? You would actually put a lot of work into getting the same, if not worse, performance out of your engine just to say: well, it's multithreaded. I know this is a really trivial example, but it suits my needs. Most likely you won't need multiple threads in your game until physics enters the field. And that's another point. I can't say it often enough… don't use too many threads! Maybe you remember the little physics framework I introduced in my last post. It's back.

less threads

In the upper right corner you see the FPS. (Yes, it's running at a higher FPS than the gif I uploaded does.) At the moment the engine is running on 4 threads and everything is fine. And this happens if you add another 2 threads to the pool:

more threads

Again, FPS counter in the upper corner. You might recognize that the FPS decreased a little bit (basically 1 FPS, and the FPS average sits at a lower point). One FPS might sound like absolutely no problem (and that's actually right), but it is performance you could get for free. And if I increased the thread count even further, the FPS would decrease even more. The impact in this example might seem negligible, but don't forget that this is a really simple example. The framework is very basic and there have never been performance issues so far (unless you triple the number of squares). If the workload increased, the FPS loss might increase as well. That's the reason I recommend finding the sweet spot where you get the best FPS with as few threads as possible.

That’s all for now, I hope some of you may be smarter than you were before reading this post.

(C++) An interesting Container design

Okay, this will be one of my first programming posts. I won't feature a lot of code since I don't want this to be a how-to tutorial. Basically, I will present you an idea and give you some thoughts about it (and I will show a little test).

Some time ago I designed a container class (I named it mango::vector) since std::vector makes a trade-off I was not willing to accept, so I built a little container to suit my needs.

Well, here is the situation. A vector is a dynamic array. That means you have a pointer to an array of elements. If the array capacity is reached, the vector allocates a new array which is larger than the original one and copies/moves all the content into it. And there is the problem: a vector is fast as long as you don't bust its initial (or reserved) capacity, since allocating new memory and moving all the content takes "a lot" of time.

But that's only one of a few problems. Another flaw comes with erasing or inserting values from/into the vector. If you erase an element, every element after it has to be moved/copied one slot towards the erased element's index, which also takes "a lot" of time. (Inserting works the other way around –> everything moves one index away from the insert index.)

Okay, where does that information lead us, you may ask? Well, that's a damn good question. Some of you will now say: "Why don't you just use a std::list?" Well, that's also a damn good question.

For those who don’t know: If a vector in your memory looks like this:

vector_memory

Then a list can be imagined like this:

list_memory

Okay, now you have seen the memory models, but why?

It's easy: if you want to access a vector element, you just offset a pointer from the beginning of the array by index * elementsize. One operation. If you want to add an element at the end (and we assume there is enough space in our vector so we don't have to reallocate memory), we just add it. Also one operation.

Let's look at the list. If we want to add an element at the end, we have to allocate memory for a new element and place the element there. Allocation isn't the fastest thing on earth; that takes some time. If we want to access, for example, the 4th element, we have to drag ourselves along 4 pointers. That won't take too long, but it's still slower than the vector. Inserting into a list is really interesting: if we want to insert an element, we allocate a new one and link it into our list. That's a lot faster than inserting into a vector, as you may remember (no elements to move).

So basically we can break our conclusion into two points. Inserting, deleting or appending elements to/from a list works at a constant speed.

Inserting or deleting elements in a vector is slow; appending elements to the end, on the other hand, is really fast.

Each of the above-mentioned containers has its ups and downs, so why not a container which is good at both? And that's exactly the conclusion I came to when I started to think about this topic. This wouldn't be an interesting post if I didn't have a wonderfully stupid idea to solve that particular problem (kind of).

semivector

That's the idea I came up with. Simply put, I made a "vector" with 2 arrays: one with elements, one with pointers to the elements.

What is the advantage of that? Well, that's easy. If you want to access an element, you access it through the array of element pointers. That's not significantly slower than accessing it the regular way, since it is just one extra hop through the array of pointers.

Here comes the interesting part. If you want to delete an element, for example (let's take the second one), you only have to shift the pointer to the second element out to the end of the pointer array.

semivector_deleted

Now if you want to access the "second" element in your vector, you're actually accessing the third element in the array, but through the second pointer (you won't spot the difference).

(That's how the "vector" would look after an insert:)

semivector_inserted

So what is the benefit of this? Also easy: if you do an insert/erase, you don't have to shift the actual elements; instead you shift the pointers to the elements. If you have a custom::vector<int>, you will see no improvement in performance compared to a std::vector<int> (the performance will even get worse), since shifting an int is as fast as shifting a pointer. On the other hand, if you have a custom::vector<std::string>, your performance will increase significantly, since shifting pointers is a lot faster than shifting std::strings.

Most likely you will now ask: if this is the holy grail of containers, why isn't std::vector implemented that way?

A legitimate question. The answer is simple: this container has other flaws. Yes, it is fast, but you pay for the speed benefit with more memory. For each element you add, there is an additional pointer to that element. Let's consider a 32-bit system: if you have 200 elements of 8 bytes each, that's 1600 bytes (+ ca. 16 bytes for the actual std::vector object) for our std::vector. The same 200 elements of 8 bytes in a custom::vector take 2400 bytes of memory (given that a 32-bit pointer is 4 bytes), so the pointers make up a third of the total. I still like to use my custom container, since in my opinion in most cases the size difference won't have too much of an impact; memory isn't always the limiting factor. The standard containers are built to be flexible, with things like embedded systems in mind. It's not always the raw execution performance that matters.

Well, so far I have thrown around a load of claims and proved nothing. Funny, eh? But this wouldn't be fun if I hadn't tested it at least a bit. (Just for those of you who are curious: no, I won't provide the code for my vector, since I did some ugly things down there which I want nobody to know about, sorry. But you are clever guys and I am pretty sure you will be able to implement this yourself.)

This test only shows the insert capabilities of the container; erasing won't perform much differently. The test is not perfect, but it will give you a relation of numbers which will at least give you a feeling for the speed improvement.

Here is the test code:

vector_test_code

and here comes the result:

vector_test_result

These results are in seconds; maybe I should have written that. The custom::vector (in my case mango::vector) inserts 100x faster (if you believe this test). That's a number, isn't it? Well, reality is a bit different. The number will change, in both directions: the bigger the elements in your vector are, the faster your custom::vector gets relative to the std::vector. The smaller the elements are… well, you get it, I think.

The test was built in release mode with optimizations enabled.

If you actually write an implementation of this custom container which is clean enough to share with the world, feel free to link it in a comment.