Discussion:
pthread-win32 pthread_cond_timedwait is SLOW?
Jan Wielemaker
2004-08-17 10:08:28 UTC
Hi,

I've got a message queue implemented with pthread_cond_timedwait() which
runs both on Linux and MS-Windows (and a lot more :-). In a particular
setup there are two threads exchanging messages through the queue. The
task is purely CPU bound, aimed at testing the queue overhead.

Although I'm not 100% sure yet, I think there is an *enormous*
difference in performance between pthread_cond_timedwait() on MS-Windows
(pthread-win32) and the Linux version. Roughly, the Windows version
processes 100 messages per second, while the Linux version does
25,000 of them.

Does this sound familiar? Should I guard the queue using native
Win32 primitives? Which?

Thanks for any advice

Cheers --- Jan
Joe Seigh
2004-08-17 10:39:29 UTC
Post by Jan Wielemaker
Hi,
I've got a message queue implemented with pthread_cond_timedwait() which
runs both on Linux and MS-Windows (and a lot more :-). In a particular
setup there are two threads exchanging messages through the queue. The
task is purely CPU bound, aimed at testing the queue overhead.
Although I'm not 100% sure yet, I think there is an *enormous*
difference in performance between pthread_cond_timedwait() on MS-Windows
(pthread-win32) and the Linux version. Roughly, the Windows version
processes 100 messages per second, while the Linux version does
25,000 of them.
Does this sound familiar? Should I guard the queue using native
Win32 primitives? Which?
Yeah, try the win32 primitives and see what you get. Pthread-win32 is
designed to emulate Posix semantics; performance is secondary.

If you don't get dramatically better performance then you may need to
use some lock-free or fast-pathed stuff. Linux mutexes are fast-pathed
like win32 critical sections, but the latter don't work with the win32
WaitFor...() calls. There's stuff to get around this but it's not part
of win32 or pthread-win32.

Joe Seigh
Jan Wielemaker
2004-08-17 14:11:45 UTC
Hi Joe,
Post by Joe Seigh
Post by Jan Wielemaker
Although I'm not 100% sure yet, I think there is an *enormous*
difference in performance between pthread_cond_timedwait() on MS-Windows
(pthread-win32) and the Linux version. Roughly, the Windows version
processes 100 messages per second, while the Linux version does
25,000 of them.
Does this sound familiar? Should I guard the queue using native
Win32 primitives? Which?
Yeah, try the win32 primitives and see what you get. Pthread-win32 is
designed to emulate Posix semantics; performance is secondary.
Fair enough.
Post by Joe Seigh
If you don't get dramatically better performance then you may need to
use some lock-free or fast pathed stuff. Linux mutexes are fast pathed
like win32 critical sections, but the latter don't work in win32 WaitFor...()
calls. There's stuff to get around this but it's not part of win32 or
pthread-win32.
The trick is that my queues can have multiple readers to facilitate
the worker-pool model, so condition variables have exactly the right
and clean semantics. Being a lousy Win32 programmer, I hoped to get
away with using pthread-win32 :-) Anyway, I found this article:

Strategies for Implementing POSIX Condition Variables on Win32
Douglas C. Schmidt and Irfan Pyarali
Department of Computer Science
Washington University, St. Louis, Missouri

http://www.cs.wustl.edu/~schmidt/win32-cv-1.html

Does anyone know whether this is (still) a good state of the art
summary of the issues? Other suggestions?

Thanks --- Jan
Joe Seigh
2004-08-17 14:53:35 UTC
Post by Jan Wielemaker
The trick is that my queues can have multiple readers to facilitate
the worker pool model, so condition variables have exactly the right
and clean semantics. Being a lousy Win32 programmer I hoped to get
Strategies for Implementing POSIX Condition Variables on Win32
Douglas C. Schmidt and Irfan Pyarali
Department of Computer Science
Washington University, St. Louis, Missouri
http://www.cs.wustl.edu/~schmidt/win32-cv-1.html
Does anyone know whether this is (still) a good state of the art
summary of the issues? Other suggestions?
The SignalObjectAndWait solution won't work because auto-reset Events
are broken AFAIK, unless Microsoft fixed it, which I somehow doubt
since fixing things has never been a high priority with Microsoft.
The fix is fairly easy and straightforward, but we will have to wait
until the solution occurs to the Microsoft architects themselves, as
they (like any other group of software architects) would never
accept an outside solution. So we could be in for a long wait.

What the article doesn't go into is that part of the problem is the
Posix semantics for condition variables as they are defined. If
you go for strict semantics you will be forced into a somewhat suboptimal
solution.

As a practical matter you won't need strict Posix semantics in your
application. But if you implement a condition variable you will need
strict Posix semantics just to shut up all the critics even if you
don't actually claim Posix semantics. Once you go that route you need
to create a complete Posix runtime environment. So it's a little more
work than you might think.

Joe Seigh
Ronald Landheer-Cieslak
2004-08-17 15:54:27 UTC
Post by Joe Seigh
As a practical matter you won't need strict Posix semantics in your
application. But if you implement a condition variable you will need
strict Posix semantics just to shut up all the critics even if you
don't actually claim Posix semantics. Once you go that route you need
to create a complete Posix runtime environment. So it's a little more
work than you might think.
Creating a complete POSIX runtime environment would equate more or less
to writing Cygwin. Doing that from scratch would be a *lot* of work.
Note, though, that Cygwin doesn't quite live up to POSIX where threading
is concerned.

rlc
--
Cygwin: http://cygwin.com
Jan Wielemaker
2004-08-17 18:54:04 UTC
Post by Joe Seigh
Post by Jan Wielemaker
Strategies for Implementing POSIX Condition Variables on Win32
Douglas C. Schmidt and Irfan Pyarali
Department of Computer Science
Washington University, St. Louis, Missouri
http://www.cs.wustl.edu/~schmidt/win32-cv-1.html
Does anyone know whether this is (still) a good state of the art
summary of the issues? Other suggestions?
The SignalObjectAndWait solution won't work because auto-reset Events
are broken AFAIK, unless Microsoft fixed it, which I somehow doubt
In what sense? I know there is no point in waiting for M$. I do not
need strict POSIX condition variables, just to guard a FIFO queue
with multiple readers and writers. Condition variables are ideal,
but I can probably be a bit more relaxed about which threads are woken
when a new message arrives on the queue.

Cheers --- Jan
red floyd
2004-08-17 19:35:11 UTC
Post by Jan Wielemaker
Post by Joe Seigh
Post by Jan Wielemaker
Strategies for Implementing POSIX Condition Variables on Win32
Douglas C. Schmidt and Irfan Pyarali
Department of Computer Science
Washington University, St. Louis, Missouri
http://www.cs.wustl.edu/~schmidt/win32-cv-1.html
Does anyone know whether this is (still) a good state of the art
summary of the issues? Other suggestions?
The SignalObjectAndWait solution won't work because auto-reset Events
are broken AFAIK, unless Microsoft fixed it, which I somehow doubt
In what sense? I know there is no point in waiting for M$. I do not
need strict POSIX condition variables, just to guard a FIFO queue
with multiple readers and writers. Condition variables are ideal,
but I can probably be a bit more relaxed about which threads are woken
when a new message arrives on the queue.
Cheers --- Jan
If you're going to end up using the MS Win32 API, then just use a
CRITICAL_SECTION. It's much lighter weight than a full mutex.
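
A minimal sketch of that suggestion, guarding a FIFO with a
CRITICAL_SECTION; the msg_t type and the intrusive list are illustrative,
and note that a CRITICAL_SECTION alone only gives mutual exclusion:
blocking until the queue is non-empty needs something extra, which is
what the condvar discussion below is about.

#include <windows.h>

typedef struct msg { struct msg *next; /* payload ... */ } msg_t;

typedef struct {
    CRITICAL_SECTION lock;   // user-mode fast path; kernel only on contention
    msg_t *head, *tail;
} cs_queue_t;

void csq_init(cs_queue_t *q) {
    InitializeCriticalSection(&q->lock);
    q->head = q->tail = NULL;
}

void csq_put(cs_queue_t *q, msg_t *m) {
    m->next = NULL;
    EnterCriticalSection(&q->lock);
    if (q->tail) q->tail->next = m; else q->head = m;
    q->tail = m;
    LeaveCriticalSection(&q->lock);
}

msg_t *csq_tryget(cs_queue_t *q) {   // returns NULL if empty; cannot block
    EnterCriticalSection(&q->lock);
    msg_t *m = q->head;
    if (m && (q->head = m->next) == NULL) q->tail = NULL;
    LeaveCriticalSection(&q->lock);
    return m;
}
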
Joe Seigh
2004-08-17 20:06:08 UTC
Post by Jan Wielemaker
Post by Joe Seigh
Post by Jan Wielemaker
Strategies for Implementing POSIX Condition Variables on Win32
Douglas C. Schmidt and Irfan Pyarali
Department of Computer Science
Washington University, St. Louis, Missouri
http://www.cs.wustl.edu/~schmidt/win32-cv-1.html
Does anyone know whether this is (still) a good state of the art
summary of the issues? Other suggestions?
The SignalObjectAndWait solution won't work because auto-reset Events
are broken AFAIK, unless Microsoft fixed it, which I somehow doubt
In what sense? I know there is no point in waiting for M$. I do not
need strict POSIX condition variables, just to guard a FIFO queue
with multiple readers and writers. Condition variables are ideal,
but I can probably be a bit more relaxed about which threads are woken
when a new message arrives on the queue.
There's an early version of a win32 condvar I did, posted in several
messages starting here:
http://groups.google.com/groups?selm=3E4CE1EB.2D6C6A86%40xemaps.com

Regarding the comment about HANDLE and PVOID: I found out later that
HANDLE will fit in a PVOID. It's documented somewhere.

The above is part of a long thread on win32 condvars. There may be some other
win32 condvars being flogged as well.

I have something that's much faster than that which will work with lock-free
data structures as well but it's not been made public.

Joe Seigh
Jan Wielemaker
2004-08-17 20:27:55 UTC
Post by Joe Seigh
There's an early version of a win32 condvar I did, posted in several
messages starting here:
http://groups.google.com/groups?selm=3E4CE1EB.2D6C6A86%40xemaps.com
Regarding the comment about HANDLE and PVOID: I found out later that
HANDLE will fit in a PVOID. It's documented somewhere.
The above is part of a long thread on win32 condvars. There may be some other
win32 condvars being flogged as well.
Thanks. Interesting thread. Together with the mentioned article this
should give me a good start.
Post by Joe Seigh
I have something that's much faster than that which will work with lock-free
data structures as well but it's not been made public.
There are more things slowing down the system. Let's first see whether
this bottleneck disappears using one of the outlined approaches. There
is a big gap between 100 (M$) and 25,000 (Linux) exchanges per second.

Thanks for your and the others' input!

--- Jan
Stewart Adeyemo
2004-08-17 22:13:33 UTC
Jan,

If performance is an issue, and you're prepared to code natively for Win32,
then IOCompletion ports are the way to go (see, for example,
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dndotnet/html/progthrepool.asp
- although C# based, the API is standard Win32 and is available to C/C++).
This is "the" way to implement thread pools (or event / message queues) for
optimal performance on Win32.


Regards

Stewart
Stewart Adeyemo
2004-08-17 22:37:23 UTC
Jan,

Just realised the .NET example / documentation may not be as illustrative as
I had initially thought, so here are some C++ code snippets that should be
more helpful.


void Queue::initialise()
{
    // ...

    // NB. the completion key can be any arbitrary value (I use zero)
    const ULONG_PTR COMPLETION_KEY = 0;

    // create completion port
    mCompletion = CreateIoCompletionPort(INVALID_HANDLE_VALUE, 0,
                                         COMPLETION_KEY, ACTIVE_THREADS);

    // ...
}

void Queue::terminate()
{
    // ...

    // free completion port (NB. any unprocessed tasks are lost / leaked)
    CloseHandle(mCompletion);

    // ...
}

void Queue::add(Task *task)
{
    // ...

    // post the task to our completion port, smuggled in the completion key
    // (ULONG_PTR rather than DWORD, so the pointer also survives Win64)
    ULONG_PTR key = (ULONG_PTR)task;
    DWORD bytes = sizeof(key);
    BOOL ok = PostQueuedCompletionStatus(mCompletion, bytes, key, 0);

    // ...
}

Task * Queue::get()
{
    // ...

    DWORD bytes = 0;
    ULONG_PTR key = 0;
    LPOVERLAPPED overlapped = 0;

    // retrieve a pending task from the completion port
    BOOL ok = GetQueuedCompletionStatus(mCompletion, &bytes, &key,
                                        &overlapped, INFINITE);

    // ...

    return (Task *)key;
}
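
For what it's worth, a hypothetical pool worker built on the snippets
above might look like the following; Task::run() and the NULL-task
shutdown convention are assumptions, not part of Stewart's code.

void workerLoop(Queue &queue)
{
    for (;;) {
        Task *task = queue.get();   // blocks in GetQueuedCompletionStatus
        if (!task)
            break;                  // assumed convention: NULL task = shut down
        task->run();                // hypothetical Task interface
        delete task;
    }
}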



Regards

Stewart
Jan Wielemaker
2004-08-18 08:24:34 UTC
Stewart,
Post by Stewart Adeyemo
Jan,
Just realised the .NET example / documentation may not be as illustrative as
I had initially thought, so here are some C++ code snippets that should be
more helpful.
Thanks for the hint. I do not think this is usable for me though.
The system under consideration is SWI-Prolog. Besides multiple readers
and writers, extraction by readers is based on Prolog unification. For
non-Prolog people: the readers (may) specify a pattern. If a new
message (= a Prolog term) arrives on the queue, one of the readers
whose pattern matches gets the message. The others remain
waiting.

In many practical situations there is no pattern involved, or there is
only one waiter. Therefore, with the queue I record whether or not
there is a waiter with a pattern. If not, I use pthread_cond_signal()
to wake up one waiter. If there is a pattern, I use
pthread_cond_broadcast() to wake up all of them and let them fight for
the message.
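
A minimal sketch of that wake-up policy (not SWI-Prolog's actual code;
msg_t and the match callback stand in for Prolog terms and unification):

#include <pthread.h>
#include <stddef.h>

typedef struct msg { struct msg *next; int tag; } msg_t;
typedef int (*match_fn)(const msg_t *);

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    msg_t          *head;
    int             pattern_waiters;   // waiters that supplied a pattern
} pl_queue_t;

void queue_send(pl_queue_t *q, msg_t *m) {
    msg_t **pp;
    pthread_mutex_lock(&q->lock);
    m->next = NULL;
    for (pp = &q->head; *pp; pp = &(*pp)->next)   // append; O(n) sketch only
        ;
    *pp = m;
    if (q->pattern_waiters == 0)
        pthread_cond_signal(&q->cond);     // any single waiter can take it
    else
        pthread_cond_broadcast(&q->cond);  // all must re-test their pattern
    pthread_mutex_unlock(&q->lock);
}

msg_t *queue_get(pl_queue_t *q, match_fn match) {  // match == NULL: take any
    msg_t *m, **pp;
    pthread_mutex_lock(&q->lock);
    if (match) q->pattern_waiters++;
    for (;;) {
        for (pp = &q->head; (m = *pp) != NULL; pp = &m->next) {
            if (!match || match(m)) {      // first acceptable message
                *pp = m->next;             // unlink it
                if (match) q->pattern_waiters--;
                pthread_mutex_unlock(&q->lock);
                return m;
            }
        }
        pthread_cond_wait(&q->cond, &q->lock);
    }
}

A queue would be initialised along the lines of
pl_queue_t q = { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, NULL, 0 };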

The IoCompletionPort stuff doesn't seem to support this type of queue
easily.

Cheers --- Jan
Stewart Adeyemo
2004-08-18 12:15:17 UTC
Jan,

You're correct, IOCompletionPorts do not readily support your requirements.
While you could build such support on top of them, that half defeats the
purpose of using them in the first place :-) I've done something similar in
the past, where I really had to know just what tasks remained unprocessed at
shutdown. Still, and as SenderX pointed out elsewhere, IOCompletionPorts
have other benefits (inc. the ability to cooperate with the scheduler to
manage workload).

Beware, though, of having threads contend unnecessarily for messages
following a broadcast; you become susceptible to lock convoys on the mutex,
which can severely limit scalability.

I know next to nothing about Prolog or its patterns, so bear that in mind
when considering the following comments / thoughts:

a) Can the "patterns" be considered keys (or hashed to a key), and are the
possible patterns bounded?
b) Your recording of whether or not waiters with a pattern exist suggests
your condition var is being used for multiple predicates; I suspect your
patterns are, in fact, your predicates. This implies that, at the very
least, you might want to consider using two condition vars (i.e. one for
those messages where no pattern is involved, the other if there is a
pattern). If the range of possible patterns is sufficiently bounded (or can
be adequately hashed), consider using a table of condition vars, with the
pattern (or its hash) as the index into the table (NB. plus one entry for
patternless messages).
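
A rough sketch of the table idea in b), assuming a fixed bucket count and
that a message and the patterns it can match hash to the same bucket
(i.e. equality-style matching); all names here are illustrative:

#include <pthread.h>

#define NBUCKETS 16    // hypothetical bound on distinct pattern hashes

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  cond[NBUCKETS + 1];  // slot NBUCKETS = patternless waiters
    /* per-queue message storage ... */
} hashed_queue_t;

static unsigned bucket_of(unsigned pattern_hash) {
    return pattern_hash % NBUCKETS;
}

// producer: wake only the waiters that could possibly accept this message
void hq_send(hashed_queue_t *q, unsigned msg_hash /*, message payload */) {
    pthread_mutex_lock(&q->lock);
    /* enqueue the message ... */
    pthread_cond_broadcast(&q->cond[bucket_of(msg_hash)]); // pattern waiters
    pthread_cond_signal(&q->cond[NBUCKETS]);   // plus one patternless waiter
    pthread_mutex_unlock(&q->lock);
}

// consumer side: called with q->lock held, inside the usual
// while (!predicate) re-check loop; only this bucket's waiters are woken
void hq_wait(hashed_queue_t *q, unsigned pattern_hash, int has_pattern) {
    unsigned b = has_pattern ? bucket_of(pattern_hash) : NBUCKETS;
    pthread_cond_wait(&q->cond[b], &q->lock);
}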


Regards

Stewart
Joe Seigh
2004-08-18 13:12:17 UTC
Post by Stewart Adeyemo
Beware, though, of having threads contend unnecessarily for messages
following a broadcast; you become susceptible to lock convoys on the mutex
and can severely limit scalability.
Wait morphing partly addresses that problem. Actually, there can be a problem
in not waking up enough threads. I've seen solutions using pthread_cond_signal()
that effectively single-thread something that could be done with concurrent
threads, just because they were afraid of waking up too many threads. Too
little can be worse than too much sometimes.

Joe Seigh
Jan Wielemaker
2004-08-18 14:04:58 UTC
Stewart,
Post by Stewart Adeyemo
You're correct, IOCompletionPorts do not readily support your requirements.
While you could build such support on top of them, that half defeats the
purpose of using them in the first place :-) I've done something similar in
the past, where I really had to know just what tasks remained unprocessed at
shutdown. Still, and as SenderX pointed out elsewhere, IOCompletionPorts
have other benefits (inc. the ability to cooperate with the scheduler to
manage workload).
Beware, though, of having threads contend unnecessarily for messages
following a broadcast; you become susceptible to lock convoys on the mutex,
which can severely limit scalability.
I know next to nothing about Prolog or its patterns, so bear that in mind
when considering the following comments / thoughts:
a) Can the "patterns" be considered keys (or hashed to a key), and are the
possible patterns bounded?
b) Your recording of whether or not waiters with a pattern exist suggests
your condition var is being used for multiple predicates; I suspect your
patterns are, in fact, your predicates. This implies that, at the very
least, you might want to consider using two condition vars (i.e. one for
those messages where no pattern is involved, the other if there is a
pattern). If the range of possible patterns is sufficiently bounded (or can
be adequately hashed), consider using a table of condition vars, with the
pattern (or its hash) as the index into the table (NB. plus one entry for
patternless messages).
All this might be worth considering in the future, but not really
at this very moment. The problem under consideration roughly consists
of two threads and two queues that are sending messages `around':
T1 sends a message M1 to Q1, which is read by T2, which sends M2 to
Q2. T1 reads M2 from Q2, sends M3 to Q1, etc. This little loop
goes 100 times per second on Windows using pthread-win32 and 25,000
times on Linux. Prolog being a programming language though :-), in
addition to making this case run faster, the original semantics of the
queues may not change (so we keep the patterns, multiple writers and
multiple readers).

I'm first going to try some of the native Win32 alternatives without
worrying too much about correctness to see whether things speed up
enough.

Thanks for all comments. I'll get back with results
somewhere next week (got some other things waiting first).


Cheers --- Jan
Joe Seigh
2004-08-18 14:27:00 UTC
You also might try this. It's a fast semaphore I've posted before,
with the bug in processCancels() hopefully fixed. You can try
a straightforward producers/consumers solution with it and see
how it works out.

Joe Seigh

---------------------------------------------------------

#ifndef QSYNC_H
#define QSYNC_H

#ifdef __cplusplus
extern "C" {
#endif

//----------------------------------------------------------------------------
// QSem
//----------------------------------------------------------------------------

typedef struct {
    LONG count;
    LONG cancel;    // deferred cancel count
    HANDLE hSem;
} qsem_t;

extern void InitializeQSem(qsem_t * qs, int z);
extern void DeleteQSem(qsem_t * qs);
extern DWORD WaitForQSem(qsem_t * qs, DWORD dwTimeout);
extern void ReleaseQSem(qsem_t * qs);


#ifdef __cplusplus
}
#endif

#endif // QSYNC_H

/*-*/
---------------------------------------------------------
#define _WIN32_WINNT 0x0500
#include <windows.h>
#include <winbase.h>
#include <qsync.h>

//----------------------------------------------------------------------------
//
//
// increment count by min(cancelCount, -(count)) iff count < 0
//----------------------------------------------------------------------------
void processCancels(qsem_t * qs, LONG cancelCount) {
    LONG temp2, temp3, temp4, temp5;

    if (cancelCount > 0) {
        temp2 = qs->count;
        while (temp2 < 0) {
            temp3 = 0;                      // unprocessed cancels
            temp4 = cancelCount + temp2;    // new count

            if (temp4 > 0) {
                temp3 = temp4;              // unprocessed cancels
                temp4 = 0;                  // new count
            }

            temp5 = temp2;
            if ((temp2 = InterlockedCompareExchange(&(qs->count), temp4, temp2)) == temp5)
                break;
        }

        // NB. temp3 is assigned only inside the loop; if count >= 0 on
        // entry, the test below reads it uninitialized (see the fix
        // discussed downthread)
        if (temp3 > 0)
            InterlockedExchangeAdd(&(qs->cancel), temp3);
    }
}


//----------------------------------------------------------------------------
//
//----------------------------------------------------------------------------
void InitializeQSem(qsem_t * qs, int z) {
    qs->count = z;
    qs->cancel = 0;
    // the kernel semaphore starts at 0; qs->count carries the logical value
    qs->hSem = CreateSemaphore(NULL, 0, 999999, NULL);
}


//----------------------------------------------------------------------------
//
//----------------------------------------------------------------------------
void DeleteQSem(qsem_t * qs) {
    CloseHandle(qs->hSem);
}


//----------------------------------------------------------------------------
//
//----------------------------------------------------------------------------
DWORD WaitForQSem(qsem_t * qs, DWORD dwTimeout) {
    DWORD rc = WAIT_OBJECT_0;

    if (InterlockedDecrement(&(qs->count)) < 0) {   // slow path: must block
        if ((rc = WaitForSingleObject(qs->hSem, dwTimeout)) != WAIT_OBJECT_0) {
            // use one and only one of the next 2 lines
            InterlockedIncrement(&(qs->cancel));
            //processCancels(qs, 1);
        }
    }

    return rc;
}


//----------------------------------------------------------------------------
//
//----------------------------------------------------------------------------
void ReleaseQSem(qsem_t * qs) {
    if (qs->cancel > 0 && qs->count < 0)
        processCancels(qs, InterlockedExchange(&(qs->cancel), 0));

    if (InterlockedIncrement(&(qs->count)) <= 0)    // a waiter is parked: wake one
        ReleaseSemaphore(qs->hSem, 1, NULL);
}


/*-*/
---------------------------------------------------------
SenderX
2004-08-18 22:12:54 UTC
Post by Joe Seigh
You also might try this. It's a fast semaphore I've posted before,
with the bug in processCancels() hopefully fixed. You can try
a straightforward producers/consumers solution with it and see
how it works out.
I second this, because Joe's semaphore performance shines when the count
is > 0. A loaded queue that makes a sem_post on every message will keep
the count > 0 most of the time:

Produce
--------------

1. Queue message - mutex access
2. Post to semaphore


Consume
---------------
1. Wait on semaphore
2. Dequeue message - mutex access


However, if you are using a mutex for the actual queue after you wait on the
semaphore, the lock-free (fast-path) nature of the semaphore could increase
contention on the mutex a great deal...
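
A minimal sketch of that produce/consume outline around Joe's qsem_t,
with a CRITICAL_SECTION as the queue mutex; the qmsg_t list and the
"qsync.h" file name are assumptions:

#include <windows.h>
#include "qsync.h"    // Joe's fast semaphore, posted above

typedef struct qmsg { struct qmsg *next; /* payload ... */ } qmsg_t;

typedef struct {
    CRITICAL_SECTION lock;   // guards the list
    qsem_t           sem;    // counts queued messages
    qmsg_t          *head, *tail;
} sx_queue_t;

void sxq_init(sx_queue_t *q) {
    InitializeCriticalSection(&q->lock);
    InitializeQSem(&q->sem, 0);
    q->head = q->tail = NULL;
}

// produce: 1. queue message (mutex)  2. post to semaphore
void sxq_put(sx_queue_t *q, qmsg_t *m) {
    m->next = NULL;
    EnterCriticalSection(&q->lock);
    if (q->tail) q->tail->next = m; else q->head = m;
    q->tail = m;
    LeaveCriticalSection(&q->lock);
    ReleaseQSem(&q->sem);
}

// consume: 1. wait on semaphore  2. dequeue message (mutex)
qmsg_t *sxq_get(sx_queue_t *q) {
    qmsg_t *m;
    WaitForQSem(&q->sem, INFINITE);   // one wait per posted message, so a
                                      // message is guaranteed to be queued
    EnterCriticalSection(&q->lock);
    m = q->head;
    if ((q->head = m->next) == NULL) q->tail = NULL;
    LeaveCriticalSection(&q->lock);
    return m;
}
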
SenderX
2004-08-26 03:07:24 UTC
Post by Joe Seigh
    LONG temp2, temp3, temp4, temp5;

    if (cancelCount > 0) {
        temp2 = qs->count;
1.      while (temp2 < 0) {
            temp3 = 0;                      // unprocessed cancels
            temp4 = cancelCount + temp2;    // new count
            if (temp4 > 0) {
                temp3 = temp4;              // unprocessed cancels
                temp4 = 0;                  // new count
            }
            temp5 = temp2;
            if ((temp2 = InterlockedCompareExchange(&(qs->count), temp4, temp2)) == temp5)
                break;
        }
2.      if (temp3 > 0)
            InterlockedExchangeAdd(&(qs->cancel), temp3);
    }

temp3 can still xadd a bad value: if the while at 1. never runs and the
if at 2. succeeds, temp3 is read uninitialized...

It should be trivial to correct. Change this:

    if (cancelCount > 0) {
        temp2 = qs->count;

to this:

    if (cancelCount > 0) {
        temp3 = cancelCount;
        temp2 = qs->count;
Joe Seigh
2004-08-26 11:36:12 UTC
Post by SenderX
Temp3 can still xadd a bad value... If the if at 1 fails and the if at 2 is
a success...
It should be trivial to correct...
Yeah, I think I'm going to rework the logic some. Loops with 2 exit conditions
are tricky to begin with.

Joe Seigh
Joe Seigh
2004-08-26 12:51:17 UTC
Probably something like

void processCancels(qsem_t * qs, LONG cancelCount) {
    LONG oldCount, newCount, cmpCount, unprocessedCancels;

    if (cancelCount > 0) {
        unprocessedCancels = cancelCount;
        oldCount = qs->count;

        while (oldCount < 0) {
            // increment old count by # cancels
            if ((newCount = oldCount + cancelCount) > 0)
                newCount = 0;   // but not greater than zero

            cmpCount = oldCount;
            oldCount = InterlockedCompareExchange(&(qs->count), newCount, cmpCount);
            if (oldCount == cmpCount) {
                unprocessedCancels = cancelCount - (newCount - oldCount);
                break;
            }
        }

        if (unprocessedCancels > 0)
            InterlockedExchangeAdd(&(qs->cancel), unprocessedCancels);
    }
}

though this should work also

void processCancels(qsem_t * qs, LONG cancelCount) {
    LONG oldCount, newCount, cmpCount;

    oldCount = qs->count;

    while (oldCount < 0 && cancelCount > 0) {
        // increment old count by # cancels
        if ((newCount = oldCount + cancelCount) > 0)
            newCount = 0;   // but not greater than zero

        cmpCount = oldCount;
        oldCount = InterlockedCompareExchange(&(qs->count), newCount, cmpCount);
        if (oldCount == cmpCount) {
            cancelCount -= (newCount - oldCount);
            break;
        }
    }

    if (cancelCount > 0)
        InterlockedExchangeAdd(&(qs->cancel), cancelCount);
}


The break could even be dropped: if the CAS works then either newCount == 0
or cancelCount == 0, so the loop still terminates.
processCancels isn't a performance-sensitive path, so that might be acceptable.

Looks better without all the temp names.

I haven't tested this yet. I broke something with my last set of atomic_ptr changes and I have
to update some unrelated library routines at some point so my library will build.

Joe Seigh
Joe Seigh
2004-08-27 16:12:40 UTC
I think I'll go with this (reposting the entire program).

Joe Seigh


---
#define _WIN32_WINNT 0x0500
#include <windows.h>
#include <winbase.h>
#include <qsync.h>

//----------------------------------------------------------------------------
//
//
// increment count by min(cancelCount, -(count)) iff count < 0
//----------------------------------------------------------------------------
void processCancels(qsem_t * qs, LONG cancelCount) {
    LONG oldCount, newCount, cmpCount;

    if (cancelCount > 0) {
        oldCount = qs->count;

        while (oldCount < 0) {
            // increment old count by # cancels
            if ((newCount = oldCount + cancelCount) > 0)
                newCount = 0;   // but not greater than zero

            cmpCount = oldCount;
            oldCount = InterlockedCompareExchange(&(qs->count), newCount, cmpCount);
            if (oldCount == cmpCount) {
                cancelCount -= (newCount - oldCount);   // deduct cancels just applied
                break;
            }
        }

        if (cancelCount > 0)
            InterlockedExchangeAdd(&(qs->cancel), cancelCount);
    }
}


//----------------------------------------------------------------------------
//
//----------------------------------------------------------------------------
void InitializeQSem(qsem_t * qs, int z) {
    qs->count = z;
    qs->cancel = 0;
    // the kernel semaphore starts at 0; qs->count carries the logical value
    qs->hSem = CreateSemaphore(NULL, 0, 999999, NULL);
}


//----------------------------------------------------------------------------
//
//----------------------------------------------------------------------------
void DeleteQSem(qsem_t * qs) {
    CloseHandle(qs->hSem);
}


//----------------------------------------------------------------------------
//
//----------------------------------------------------------------------------
DWORD WaitForQSem(qsem_t * qs, DWORD dwTimeout) {
    DWORD rc = WAIT_OBJECT_0;

    if (InterlockedDecrement(&(qs->count)) < 0) {   // slow path: must block
        if ((rc = WaitForSingleObject(qs->hSem, dwTimeout)) != WAIT_OBJECT_0) {
            // use one and only one of the next 2 lines
            InterlockedIncrement(&(qs->cancel));
            //processCancels(qs, 1);
        }
    }

    return rc;
}


//----------------------------------------------------------------------------
//
//----------------------------------------------------------------------------
void ReleaseQSem(qsem_t * qs) {
    if (qs->cancel > 0 && qs->count < 0)
        processCancels(qs, InterlockedExchange(&(qs->cancel), 0));

    if (InterlockedIncrement(&(qs->count)) <= 0)    // a waiter is parked: wake one
        ReleaseSemaphore(qs->hSem, 1, NULL);
}


/*-*/

David Schwartz
2004-08-18 18:26:26 UTC
Post by Jan Wielemaker
All this might be worth considering in the future, but not really
at this very moment. The problem under consideration roughly consists
of two threads and two queues that are sending messages `around':
T1 sends a message M1 to Q1, which is read by T2, which sends M2 to
Q2. T1 reads M2 from Q2, sends M3 to Q1, etc. This little loop
goes 100 times per second on Windows using pthread-win32 and 25,000
times on Linux. Prolog being a programming language though :-), in
addition to making this case run faster, the original semantics of the
queues may not change (so we keep the patterns, multiple writers and
multiple readers).
Measuring performance in such artificial 'tight loops' tends to give
very, *very* deceptive results that have little applicability to real-world
applications.

DS
Jan Wielemaker
2004-08-19 09:44:27 UTC
Post by David Schwartz
Post by Jan Wielemaker
All this might be worth considering in the future, but not really
at this very moment. The problem under consideration roughly consists
of two threads and two queues that are sending messages `around':
T1 sends a message M1 to Q1, which is read by T2, which sends M2 to
Q2. T1 reads M2 from Q2, sends M3 to Q1, etc. This little loop
goes 100 times per second on Windows using pthread-win32 and 25,000
times on Linux. Prolog being a programming language though :-), in
addition to making this case run faster, the original semantics of the
queues may not change (so we keep the patterns, multiple writers and
multiple readers).
Measuring performance in such artificial 'tight loops' tends to give
very, *very* deceptive results that have little applicability to real-world
applications.
Usually, yes. This particular tight loop however comes from a real
application. One could start a discussion about whether the application
could be redesigned to avoid this trouble. Given the queue
performance seen on Linux, however, the implementation is an adequate
solution to a problem. As a language implementor, all I want to
achieve is acceptable and preferably cross-platform comparable
performance for the runtime support library. A 250-times difference
on comparable hardware is clearly too much.

Cheers --- Jan
David Schwartz
2004-08-19 18:13:09 UTC
Post by Jan Wielemaker
Post by David Schwartz
Measuring performance in such artificial 'tight loops' tends to
give very, *very* deceptive results that have little applicability
to real-world applications.
Let me just start by saying that I'm making quite a few assumptions
about what's going on here, and you really should profile your application.
Post by Jan Wielemaker
Usually, yes. This particular tight loop however comes from a real
application. One could start a discussion about whether the application
could be redesigned to avoid this trouble. Given the queue
performance seen on Linux, however, the implementation is an adequate
solution to a problem.
This is not true. It is probably *not* an adequate solution, just one
that happens to work acceptably on that hardware with that operating system
and that threads library for that workload. Given the new evidence you have,
it should be obvious that the implementation does not provide adequate
performance.
Post by Jan Wielemaker
As a language implementor, all I want to
achieve is acceptable and preferably cross-platform comparable
performance for the runtime support library. A 250-times difference
on comparable hardware is clearly too much.
Right, so the implementation is definitely not adequate. It's too
dependent upon the platform happening to be optimized for some
particular thing that clearly not all platforms are optimized for.

Maybe it's context switches? Maybe it's a weird case in which one thread
does pre-empt another on one platform and not on the other (and it could be
either way around).

You have judged your code adequate based upon experience on only one
platform where it could be adequate only because of some quirk of that
platform. This is an extremely common phenomenon.

I could go on with lots of examples, if you'd like, where a seriously
deficient implementation happened to work stellar on one platform and
terribly on another, even though the implementations made equally good
decisions about how to handle it. One just happened to be what worked for
that application and one happened not to be.

DS
Jan Wielemaker
2004-08-20 08:12:41 UTC
Post by David Schwartz
This is not true. It is probably *not* an adequate solution, just one
that happens to work acceptably on that hardware with that operating system
and that threads library for that workload. Given the new evidence you have,
it should be obvious that the implementation does not provide adequate
performance.
Post by Jan Wielemaker
As a language implementor, all I want to
achieve is acceptable and preferably cross-platform comparable
performance for the runtime support library. A 250-times difference
on comparable hardware is clearly too much.
Right, so the implementation is definitely not adequate. It's too
dependent upon the platform happening to be optimized for some
particular thing that clearly not all platforms are optimized for.
I can follow your purist style of reasoning, but this isn't going to
help much in practice. As a language implementor I try to provide a
system that is portable, both in terms of its semantics and (roughly)
in terms of performance. I do not believe there is a way to satisfy
these two constraints 100%, as both hardware and operating systems make
choices that render both 100% semantic equivalence and comparable
performance very hard. The combination is even worse. pthread-win32
tries to be an implementation of pthreads. It falls short at various
places (pthread_kill, cancellation). I guess that cannot be avoided.
It also falls short in providing roughly comparable performance. I've
done some preliminary tests using native Win32 synchronisation, and
these indicate I get very comparable results between Linux and
Windows (same hardware). Some work needs to be done to verify that the
new implementation is correct with regard to the queue behaviour I
promise in the manual, but I'm pretty sure now it will work out.

From this (and earlier) experience I conclude that the use of low-level
compatibility libraries is often a bad choice. Grounding your work on
a standard (de-facto or ISO/ANSI/...) and then hoping for compatibility
(emulation?) layers to provide portability often leads to disappointing
results. Looking at the problem from a somewhat higher level and making
native implementations for different OSes gives much better results.

Anyway, thanks for all the input. Some remarks have directly contributed
to my current implementation. If performance of this part ever becomes
a bottleneck, I've got plenty of clues about what to do next :-)

Cheers --- Jan
SenderX
2004-08-17 23:35:43 UTC
Post by Stewart Adeyemo
If performance is an issue, and you're prepared to code natively for Win32,
then IOCompletion ports are the way to go (see, for example,
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dndotnet/html/progthrepool.asp
- although C# based, the API is standard Win32 and is available to C/C++).
This is "the" way to implement thread pools (or event / message queues) for
optimal performance on Win32.
IOCP is the way to go if you're dealing with IO, IMHO...

A "*batched" lock-free queue will blow IOCP away wrt raw produce-to-consume
throughput; IOCP calls are too expensive to compare against a simple cmpxchg
loop.

However, IOCP is optimized to consume on specific threads: it will use
thread A over and over even though threads B-G are waiting, because the box
only has 1 cpu. If you run the same program on a box with 4 cpus, threads
B-D will be used as well. IOCP also has a mechanism to detect whether a
worker thread is blocking, so it will "sometimes" use more threads than
there are processors. It's very nice; it also achieves good cache
performance because it uses the same threads over and over again (threads
would be bound to specific cpus).

You can engineer your lock-free queues to run like this, but it would
probably have to be a specialized job.


* a batched queue means that you can push a batch of nodes onto the queue,
and you can pop batches of nodes from the queue.
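
For illustration, the simplest "batched" lock-free structure is a LIFO
where one CAS pushes a whole pre-linked chain and one XCHG pops
everything; this is a sketch, not SenderX's library, and consumers must
reverse the popped chain if they want FIFO order:

#include <windows.h>

typedef struct bnode { struct bnode *next; /* payload ... */ } bnode_t;

typedef struct { bnode_t * volatile head; } bstack_t;

// push a pre-linked chain first..last with a single CAS
void bstack_push_batch(bstack_t *s, bnode_t *first, bnode_t *last) {
    bnode_t *old;
    do {
        old = s->head;
        last->next = old;
    } while (InterlockedCompareExchangePointer(
                 (PVOID volatile *)&s->head, first, old) != old);
}

// pop the entire chain at once; safe against ABA since nothing is compared
bnode_t *bstack_pop_all(bstack_t *s) {
    return (bnode_t *)InterlockedExchangePointer((PVOID volatile *)&s->head, 0);
}
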
Stewart Adeyemo
2004-08-18 12:13:02 UTC
SenderX,

Interesting. If, perchance, you have any numbers to support this I would be
most interested in seeing them. In any case, when I have a moment or two,
I'll be doing some investigating / testing of my own.

BTW, I would also love to see / try out your lock free library when it's
done; I trust you'll be notifying us here.


Regards

Stewart
Alexander Terekhov
2004-08-18 10:48:26 UTC
Jan Wielemaker wrote:
[...]
Post by Jan Wielemaker
http://www.cs.wustl.edu/~schmidt/win32-cv-1.html
http://groups.google.com/groups?selm=3AEAC433.1595FF4%40web.de

regards,
alexander.
SenderX
2004-08-17 21:34:12 UTC
Post by Jan Wielemaker
Although I'm not 100% sure yet, I think there is an *enormous*
difference in performance between pthread_cond_timedwait() on MS-Windows
(pthread-win32) and the Linux version. Roughly, the Windows version
processes 100 messages per second, while the Linux version does
25,000 of them.
Win32 mutexes seem to hand off ownership across a context switch. That's
very slow for the contended case.

http://lists.boost.org/MailArchives/boost/msg64648.php
( Alex's mutex blows CRITICAL_SECTION away )

Use Joe's fast condvar logic with Alex's fast mutexes, and you will get
NPTL-like performance on win32.

If you are still not happy with performance, you can start researching
lock-free algorithms.




P.S.

I am going to put out a production quality lock-free library that is able to
run on Win32/64, WinCE, Linux, and Solaris under i586 processors.

It also looks like my stuff can easily port to Macs using ll/sc. I will
need some help on that part, but that's for later. x86 first...

I need to finish up testing my RCU implementation, and then license the damn
thing.

;(
SenderX
2004-08-17 21:39:25 UTC
Post by SenderX
I am going to put out a production quality lock-free library that is able to
run on Win32/64, WinCE, Linux, and Solaris under i586 processors.
DOH! I meant:

Win32/64, Linux, and Solaris under i586 processors. It will also run on
WinCE.
Alexander Terekhov
2004-08-18 10:47:56 UTC
Jan Wielemaker wrote:
[...]
Post by Jan Wielemaker
Although I'm not 100% sure yet, I think there is an *enormous*
difference in performance between pthread_cond_timedwait() on MS-Windows
(pthread-win32) and the Linux version. Roughly, the Windows version
processes 100 messages per second, while the Linux version does
25,000 of them.
You shouldn't wait if you have data to process. Your problem is
elsewhere (not related to "performance" of pthread_cond_*wait()),
I suppose.

regards,
alexander.
Joe Seigh
2004-08-18 11:32:23 UTC
Post by Alexander Terekhov
[...]
Post by Jan Wielemaker
Although I'm not 100% sure yet, I think there is an *enormous*
difference in performance between pthread_cond_timedwait() on MS-Windows
(pthread-win32) and the Linux version. Roughly, the Windows version
processes 100 messages per second, while the Linux version does
25,000 of them.
You shouldn't wait if you have data to process. Your problem is
elsewhere (not related to "performance" of pthread_cond_*wait()),
I suppose.
The internal locking of some of the win32 condvar implementations can
cause threads to block even when they have work to do. Some implementations
are worse than others.

Joe Seigh