### VERTEX THREAD AND FRAGMENT THREADS
I said in the previous section that I use one vertex processing thread and multiple fragment processing threads, and I gave my reason. If that does not ring a bell, please go back and read it again.
The vertex processing phase starts from the PuresoftPipeline::drawVAO() function and runs in the caller’s thread. After it gets all the scanline information from the rasteriser and has prepared the interpolation start and step data for every scanline, it makes one fragment processing task per scanline and dispatches the tasks into the task queues of the fragment threads. The following diagram shows this relationship.
One fragment processing task contains drawing information for only one scanline. For example, if a triangle is rasterised into 10 scanlines, the vertex thread will send 10 fragment tasks to the fragment threads.
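Conceptually, one such task carries the start and step data mentioned above. Here is a hypothetical sketch of what it holds (the names and layout are assumptions, not the actual structure in the source):

```cpp
// hypothetical per-scanline task; the real definition in the repo may differ
struct FragmentTask
{
    int   y;               // the scanline's y coordinate in screen space
    int   xLeft;           // first covered pixel on the scanline
    int   xRight;          // last covered pixel on the scanline
    float interpStart[16]; // interpolated vertex outputs at xLeft
    float interpStep[16];  // per-pixel increment of each output along x
};
```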
To maximize parallelism, I had three considerations in mind when designing the thread model (or at least three that I can remember at this moment).
#### MINIMUM SYNCHRONIZATION
No two fragment threads may work on the same fragment (pixel) at the same time. If they did, the fragment would have to be exclusively locked by one thread at a time, or the depth test would certainly not work correctly. But if the threads are synchronized, at least at the fragment-updating point, they have to work in turn, and parallelism is hurt.
#### AVOID FALSE SHARING
No two fragment threads should update two fragments that are too close to each other. In other words, two fragments being updated at the same time must be at least 64 bytes apart (the typical cache-line size). Otherwise, false sharing would probably happen, hurting parallelism.
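To make the 64-byte figure concrete, here is a minimal illustration of false sharing, independent of the renderer itself:

```cpp
#include <cstdint>

// Two counters written by two different threads. In this layout the two
// fields almost certainly share one 64-byte cache line, so every write by
// one thread invalidates the line in the other thread's cache.
struct SharedLine
{
    uint32_t a; // written by thread 1
    uint32_t b; // written by thread 2 -- false sharing with 'a'
};

// Padding each field out to its own cache line removes the contention.
struct SeparateLines
{
    alignas(64) uint32_t a; // written by thread 1
    alignas(64) uint32_t b; // written by thread 2 -- no false sharing
};
```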
#### MINIMUM KERNEL-MODE SLEEPING
Although it is not very likely to happen during an ongoing draw-call, when a fragment thread has no task to do because its queue is empty, it should avoid kernel-mode sleeping as much as possible, at least during a draw-call. You must know the reason --- switching into a kernel-mode sleep costs thousands of CPU cycles, which is too expensive. (I am not sure how it is on Linux, but it is so on Windows.)
The first two concerns are quite easy to resolve. The tasks are dispatched according to their scanline number --- their y coordinate in screen space. Which thread a task is sent to is determined by:
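Presumably this is a simple modulo mapping; a minimal sketch with hypothetical names:

```cpp
// hypothetical names; the real dispatch code in PuresoftPipeline may differ
int targetThread(int y, int numFragmentThreads)
{
    // scanline y always maps to the same fragment thread, and adjacent
    // scanlines map to different threads
    return y % numFragmentThreads;
}
```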
Thus, tasks on a certain scanline will only ever be sent to one certain fragment thread. In other words, different fragment threads are guaranteed to be working on different scanlines. As a result, the fragment threads never need to be synchronized, and the chance that two threads update two fragments that are too close together is very low. (It could still happen when one thread is updating a fragment near the right end of scanline y while another is updating a fragment near the left end of scanline y+1, since the end of one row and the start of the next are adjacent in a row-major frame buffer.)
For the third concern, I guess you know how I would do it --- using a spin-wait instead of a synchronization object. The queue’s pop code is like below.
```cpp
// consumer side of the ring queue: only one fragment thread ever pops
// from a given queue, so m_out itself needs no locking
T* beginPop(void)
{
    while(0 == m_len) // spin-wait: busy-loop until a task arrives,
    {                 // never yielding the core to the kernel
    }
    return m_queue + m_out;
}

void endPop(void)
{
    if(LEN == ++m_out) // it's a ring queue: wrap the read index around
    {
        m_out = 0;
    }
    InterlockedDecrement(&m_len); // m_len is shared with the producer
                                  // (the vertex thread), so update it atomically
}
```
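The corresponding push side is not shown above; assuming the same single-producer, single-consumer ring queue, it would presumably mirror the pop side (the names here are guesses):

```cpp
// sketch of the producer side: only the vertex thread pushes,
// so m_in itself needs no locking either
T* beginPush(void)
{
    while(LEN == m_len) // spin until the consumer frees a slot
    {
    }
    return m_queue + m_in;
}

void endPush(void)
{
    if(LEN == ++m_in)   // wrap the write index around
    {
        m_in = 0;
    }
    InterlockedIncrement(&m_len); // publish the new task atomically
}
```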
Please notice that the spin-wait is also the reason why I have to leave one CPU core exclusively to the vertex thread instead of sharing it --- a spin-wait never yields its core, making any other thread that tends to share that core hang forever (well, the truth is it is not hanging, but it has a very small chance of getting a time slice to run).
However, reality is always complicated. There are cases where the vertex thread wastes a lot of CPU resource while it is exclusively occupying a core. It is not difficult to think of such a case --- suppose you are drawing a rectangle that fills a large portion of the scene, for example a skybox. Such a big rectangle is usually formed by only two big triangles, which lets the vertex thread finish its work too early, after which it sits idle while the fragment threads are still grinding through the scanlines. See the following chart.
To resolve this problem, I designed an optional work mode for the vertex processing thread. When the vertex thread is dispatching fragment tasks, it takes itself into account as well and sends an equal share of tasks into a task queue of its own. After the vertex thread finishes its regular job, it continues with the tasks in its own queue --- re-dispatching each one to a free fragment thread, or processing the task by itself. The behaviour is like the following pseudo code.
```
Vertex_thread()
{
    For each scanline
    {
        Dispatch a fragment task to the queue of a fragment thread,
        including the extra queue of the vertex thread itself.
    }
    For each task in the extra queue
    {
        Dispatched = false
        For each fragment thread
        {
            If the thread is free
            {
                Dispatch the task to that thread
                Dispatched = true
                Break
            }
        }
        If(not Dispatched) // all fragment threads are still busy
        {
            Process the task in the vertex thread itself
        }
    }
}
```
I tested several scenes and found out that this work mode does benefit the ‘big-rectangle’ case, but in some other cases it introduces overhead. So I finally added an option argument to the draw-call and let the client programme’s developer make the choice, either by workload analysis or by experiment.
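At the call site this could look like the following (the flag name and the exact signature are assumptions; check drawVAO() in the source):

```cpp
// hypothetical option argument; the real parameter of drawVAO() may differ
pipeline.drawVAO(&vao, /* vertexThreadHelps = */ true);  // few big triangles, e.g. a skybox
pipeline.drawVAO(&vao, /* vertexThreadHelps = */ false); // many small triangles
```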
RETURN TO LETS-SET-OFF
RETURN TO INDEX
You have reached the last page of this WIKI. Got something to say to me? Feel free to write to me at: [email protected], or [email protected]