Discussion:
Confused over the use of threads in multi-client programming
anastasiA
2004-02-19 15:34:56 UTC
Permalink
I'm still confused over the use of threads for multi-client programming.

Books on network programming are generally sympathetic to the use of
threads in the context of multi-client programming, but tend to defer the
issue to more specialized books. Books on thread programming show little
or no interest in multi-client servers that handle more than a couple of
connections at a time for short, specific tasks: the "business" model
always prevails, and I'm always stuck with examples of how to create ATM
servers. :-)

Well, I want to create a chat server. Should every connection have its own
thread? I've read Bil LambdaCS's excellent Hodge-Podge Thread FAQ, and
Q285: I need to create about 5000 threads?
I need to create about 5000 threads simultaneously (it'll be a powerful
server on an NT machine; each client invokes its own thread)
Did anybody write programs like this? Is it a good idea to invoke a thread
for each connection?
No, it's really bad.
Is it bad because of the NT architecture? I am using Linux, so does this
warning apply to any OS?

Would threads be of any benefit for a massive number of connections?
Or simply allow one thread for connection handling and message polling and
subdivide the other tasks among other threads?

Thanks.

--
a.
David Butenhof
2004-02-19 16:38:24 UTC
Permalink
Post by anastasiA
Well, I want to create a chat server. Should every connection have its
own thread? [...]
Is it bad because of the NT architecture? I am using Linux, so does this
warning apply to any OS?
Would threads be of any benefit for a massive number of connections?
Or simply allow one thread for connection handling and message polling
and subdivide the other tasks among other threads?
On a multiprocessor system, adding COMPUTE BOUND threads is good up to the
point where you've saturated your quota of CPUs (which may be the entire
system, or some subset, and may or may not vary with time). That is, if
you've got 2 CPUs, creating 3 compute-bound threads is worse than
pointless; you'll slow them all down rather than improving anything.

I/O-bound servers are often better served by using asynchronous I/O
mechanisms (available on Windows and recent POSIX systems, among others)
rather than threads, because each async I/O may be (perhaps substantially)
cheaper, and can be more directly managed by the system. (It knows exactly
what they're "up to", because the interface is specific to the task rather
than being extremely general, as threads are.)
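Concretely, the POSIX flavor looks something like this (a minimal sketch of
aio_read()/aio_error()/aio_return(); error handling is abbreviated, the
descriptor and buffer are whatever the caller owns, and on Linux you link
with -lrt):

#include <aio.h>
#include <errno.h>
#include <string.h>
#include <sys/types.h>

/* Issue a read without dedicating a thread to it; the kernel tracks the
   operation. Returns 0 if the request was queued. */
int start_async_read(int fd, char *buf, size_t len, struct aiocb *cb)
{
    memset(cb, 0, sizeof *cb);
    cb->aio_fildes = fd;
    cb->aio_buf    = buf;
    cb->aio_nbytes = len;
    return aio_read(cb);              /* returns immediately */
}

/* Poll for completion later (aio_suspend() or signal delivery also work). */
ssize_t check_async_read(struct aiocb *cb)
{
    if (aio_error(cb) == EINPROGRESS)
        return -1;                    /* still in flight */
    return aio_return(cb);            /* bytes read, or -1 on error */
}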

If you're using synchronous I/O, either because you must or because you want
to, then you CAN benefit by adding threads, up to a certain point, because
multiple threads can make your synchronous I/O operations "pseudo
asynchronous". But adding threads for I/O doesn't scale as well, or as
predictably, as asynchronous I/O. You will saturate the library and/or
kernel's ability to manage them at some point, and that point is not easily
predictable (and not remotely portable). "5000" is almost certainly orders
of magnitude too high, for example.

If you are using threads to achieve asynchronous I/O operations, especially
on behalf of multiple peer clients, 'thread per connection' models just
don't scale. Furthermore, thread scheduling isn't designed to support the
sort of "fairness" guarantees that you need to make it hold together; some
clients will get terrible response while others monopolize your I/O and
compute resources. Instead, you should build some variant of internal
scheduling more appropriate to your throughput needs. You generally do that
by creating a small pool of worker threads that will handle queued
operations on behalf of a set of clients. YOU control the scheduling of the
client requests to that pool, and let the OS worry about allocating CPU
time and I/O bandwidth to that small set of workers.
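In pthreads terms, the shape of that pool is roughly the following (a sketch
only: the request type and handle_request() are stand-ins I've invented, and
shutdown handling is omitted):

#include <pthread.h>
#include <stdlib.h>

struct request {
    struct request *next;
    int             client_fd;        /* whatever identifies the work */
};

static struct request *queue_head, *queue_tail;
static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  queue_cond = PTHREAD_COND_INITIALIZER;

extern void handle_request(struct request *r);   /* hypothetical */

/* YOU control the scheduling: requests are queued in whatever order you
   deem fair, and a small, fixed set of workers drains the queue. */
static void enqueue_request(struct request *r)
{
    r->next = NULL;
    pthread_mutex_lock(&queue_lock);
    if (queue_tail) queue_tail->next = r; else queue_head = r;
    queue_tail = r;
    pthread_cond_signal(&queue_cond);            /* wake one worker */
    pthread_mutex_unlock(&queue_lock);
}

static void *worker(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&queue_lock);
        while (queue_head == NULL)
            pthread_cond_wait(&queue_cond, &queue_lock);
        struct request *r = queue_head;
        queue_head = r->next;
        if (queue_head == NULL) queue_tail = NULL;
        pthread_mutex_unlock(&queue_lock);
        handle_request(r);                   /* no lock held while working */
        free(r);
    }
    return NULL;
}

static void start_pool(int nworkers)         /* a handful, not thousands */
{
    for (int i = 0; i < nworkers; i++) {
        pthread_t tid;
        pthread_create(&tid, NULL, worker, NULL);
        pthread_detach(tid);
    }
}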

Often even if you're using asynchronous I/O, you might benefit from a thread
pool if there's any non-trivial computation involved in setting up or
finalizing the I/O, because then your workers can deal with all that in
parallel on a multiprocessor.
--
/--------------------[ ***@hp.com ]--------------------\
| Hewlett-Packard Company Tru64 UNIX & VMS Thread Architect |
| My book: http://www.awl.com/cseng/titles/0-201-63392-2/ |
\----[ http://homepage.mac.com/dbutenhof/Threads/Threads.html ]---/
Patrick TJ McPhee
2004-02-19 20:44:00 UTC
Permalink
In article <***@devnull.net>,
anastasiA <***@devnull.net> wrote:

% > Did anybody write programs like this? Is it a good idea to invoke a thread
% > for each connection?
% > No, it's really bad.

% Is it bad because of the NT architecture? I am using Linux, so does
% this warning apply to any OS?

This is generally good advice regardless of the system in use. One point
of using multiple threads in a process is to make the best use of resources,
in particular CPU resources. Once you get past a certain number of threads,
you're not doing that any more -- your threads are contending with each other
for resources, and you can easily reach the point where multi-threading adds
overhead and gives nothing in return. On many machines, using the 5,000
threads of bil's example will end up being much slower than having a
single thread handling all connections serially.

% Would threads be of any benefit for a massive number of connections?

Sure, but the key is to have the correct number of threads. Tie the
number of threads to the number of CPUs, network connections, and whatever
else you have that needs to be shared, then arrange your code so the
work is shared amongst all the threads. The problem with the scenario in
the example is using `a thread for each connection', not having many
connections per se.
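One way to derive that number at startup (a sketch; _SC_NPROCESSORS_ONLN is
a common glibc/Solaris extension rather than strict POSIX, and the small
additive fudge for threads blocked in I/O is my own assumption):

#include <unistd.h>

/* Size the pool from the machine, not from the connection count. */
long pool_size(void)
{
    long ncpu = sysconf(_SC_NPROCESSORS_ONLN);
    if (ncpu < 1)
        ncpu = 1;                /* sysconf can fail; fall back to 1 */
    return ncpu + 4;             /* a few extra for threads blocked in I/O */
}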
--
Patrick TJ McPhee
East York Canada
***@interlog.com
David Schwartz
2004-02-19 21:38:13 UTC
Permalink
Post by anastasiA
Is it bad because of the NT architecture? I am using Linux, so does this
warning apply to any OS?
Would threads be of any benefit for a massive number of connections?
Or simply allow one thread for connection handling and message polling
and subdivide the other tasks among other threads?
It's bad because you want threads for things you are going to do
concurrently. You cannot usefully do 5,000 things concurrently, so you
cannot usefully have 5,000 threads.

You can employ all the physical processors present. You can usefully do
perhaps a dozen I/O operations at a time. You may need a few threads for
things that require their own threads for implementation reasons. So you
typically wind up with 100 threads as a reasonable maximum.

It is bad design to connect totally dissimilar objects one-to-one. A
thread is an execution vehicle, not a connection.

DS
Robert Wessel
2004-02-20 00:01:43 UTC
Permalink
Post by anastasiA
I need to create about 5000 threads simultaneously (it'll be a powerful
server on an NT machine; each client invokes its own thread)
Is it bad because of the NT architecture? I am using Linux, so does this
warning apply to any OS?
In addition to what David said: large numbers of threads tend to
create large amounts of kernel overhead, although on 32-bit systems
that's almost a secondary consideration. Where exactly are you going
to put the stacks for 5000 threads? Win32 gives you a 2GB user
address space (3GB in *some* cases) and a 1MB default stack size
(allocated address space, not actually committed) -- so 5000 default
stacks would want roughly 5GB of address space, which simply doesn't fit.
Giancarlo Niccolai
2004-02-21 02:51:12 UTC
Permalink
Post by anastasiA
Would threads be of any benefit for a massive number of connections?
Or simply allow one thread for connection handling and message polling
and subdivide the other tasks among other threads?
And in addition to what David and Robert said, I add that CONTENTION goes
up EXPONENTIALLY with the number of threads. Your threads will have to
cooperate at some point, and this means that they will have to share at
least a mutex somewhere. Raise the thread count and you raise the
likelihood that a thread will find that mutex unavailable for locking when
needed. This means that the thread will be scheduled out and later
rescheduled; if the time for which the thread is scheduled out is long
enough, it may even get its memory swapped out and re-swapped in as soon as
it is rescheduled for execution (and mutex locking). This may bog down your
application so much that 99% of the time it will be scheduling threads around
and swapping memory, and 1% of the time it will be running your code...
pretty undesirable.

My advice is to use more threads only 1) as long as you have physical CPUs
and 2) as long as they have to wait for relatively long times. When you
have to compute, use sequential programming, or split the sequential
work across the available processors, but as David said, avoid putting
computational tasks in different threads.

In my opinion, 5-10 threads + CPU count is a fair limit, provided those
5-10 threads are meant to run only "sometimes", so that their waits can be
exploited by the other threads. This is what threading is good for:
exploiting one thread's dead time in other threads... with more than 10
threads it's unlikely you'll have any dead time left to exploit.

Giancarlo.
Robert Wessel
2004-02-21 08:41:24 UTC
Permalink
Post by Giancarlo Niccolai
And in addition to what David and Robert said, I add that CONTENTION goes
up EXPONENTIALLY with the number of threads. Your threads will have to
cooperate at some point, and this means that they will have to share at
least a mutex somewhere. [...]
This is not exactly true. An increase in the number of threads
certainly increases the *potential* amount of contention for any
shared resource. If the actual use of the shared resources grows in
proportion to the number of threads, then you can see this effect;
otherwise you won't.

Contention is best understood in terms of queuing theory. Assume that
you have a request arrival rate of r (per unit time) and a service
time of t (as a fraction of unit time). As r*t approaches 1, queue
depth grows without bound. In a real system, the queue depth is
ultimately limited by the number of active requestors (threads in this
case). Let's say you've got a thousand threads and a service time of
.1. As the request rate approaches 10, the queue depth will tend to
approach 1000, leading to latencies on the order of 100. If that
happens, you've got a serious bottleneck, being made worse by all the
non-productive overhead associated with the contention.

For many applications t tends to remain fairly constant, so we can
mostly ignore it here.

The real question is how the request rate, r, relates to the number
of threads. Fortunately, in many cases the answer is weakly, if at
all. Consider, for example, a web server designed to serve 1000
simultaneous users, each clicking on a link once every other second.
It matters little how many threads are created; the users will still
be generating 500 requests per second, and so long as the service time
for the shared resource is well below 2ms, there will be minimal
contention no matter how many threads are running.
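In symbols, that is the standard M/M/1 relation (the model is my assumption;
the post doesn't name one):

\rho = r\,t, \qquad L = \frac{\rho}{1-\rho}

With r = 500 requests/second and t = 1ms, \rho = 0.5 and the average queue
holds a single request. Only as t creeps toward 2ms does \rho approach 1 and
the queue depth blow up toward the thousand-thread ceiling.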
Post by Giancarlo Niccolai
My advice is to use more threads only 1) as long as you have phisical CPU
and 2) as long as they have to wait for relatively long times. When you
have to compute, use sequential programming, or split sequential
programming on available processors, but as David sayed, avoid to put
computational tasks in different threads.
On my opinion, 5-10 threads + CPU count are a fair limit, provided those
5-10 threads are meant to run only "sometimes", so that their waits can be
exploited by the other threads. This is what threading is good for: exploit
thread dead times in other threads...
I'm pretty sure we're agreeing here.
Post by Giancarlo Niccolai
with more than 10 threads it's
unlikely you'll have any deadtime to exploit.
This is a generalization that's impossible to make, and probably one
that's quite wrong in the context of this thread. The OP was asking
about a multi-client server using a thread per client. If the sessions
exist for an extended time, they'll probably spend most of their time
blocked, waiting for the clients.

Also, I/O of all sorts can block, and, as an example, on large systems
the number of queued requests to a large disk array can be quite large
before latencies become objectionable.

And many threads exist to provide services to other threads, and these
often run synchronously with the requestor, and then don't really
count.

But you're quite right that you want to generally avoid having many
more runnable threads in the system than you have CPUs.

The real difficulty, of course, is how you deal with protocols that
are not neatly stateless like HTTP. With long-running stateful
sessions, the attraction of the thread-per-client model is that it lets
you build the service process in the most simple and natural way.
Unwinding even a moderately complex service process into a state
machine can be very painful (not to mention ugly). Once you've got
state to maintain, you have storage constraining the number of sessions
you can support, especially if the state is not small and all the
sessions are in a single process on a small-address-space machine.
How well the OS deals with large numbers of (hopefully blocked)
threads and with virtual storage constraints are then the prime issues.
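For a flavor of that unwinding: the per-session control flow that a
dedicated thread would keep on its stack becomes an explicit record driven
from an event loop (purely illustrative; the states and field names are
invented, not anything from the thread):

#include <stddef.h>

enum session_state { READING_HEADER, READING_BODY, WRITING_REPLY };

struct session {
    int                fd;          /* the client connection */
    enum session_state state;       /* where the "thread" would have been */
    size_t             bytes_done;  /* progress within the current state */
    char               buf[4096];   /* this is the storage that adds up
                                       across thousands of sessions */
};

void on_readable(struct session *s)
{
    switch (s->state) {
    case READING_HEADER: /* parse; maybe advance to READING_BODY */  break;
    case READING_BODY:   /* accumulate; maybe start WRITING_REPLY */ break;
    case WRITING_REPLY:  /* flush; then reset to READING_HEADER */   break;
    }
}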
Giancarlo Niccolai
2004-02-21 09:25:41 UTC
Permalink
Post by Robert Wessel
This is not exactly true. An increase in the number of threads
certainly increases the *potential* amount of contention for any
shared resource. If the amount of actual use of the shared resources
is proportional to the number of threads, then you can have this
effect, else not.
Yes, I just wanted to point out the worst case, which, by Murphy's law,
will certainly be YOUR case. Anyhow, I am pretty sure I stated that it is
the LIKELIHOOD of contention that rises exponentially, not necessarily the
actual contention.

Giancarlo
Patrick TJ McPhee
2004-02-21 21:08:12 UTC
Permalink
In article <***@posting.google.com>,
Robert Wessel <***@yahoo.com> wrote:

% are not neatly stateless like HTTP. With long running stateful
% sessions the attraction of the thread-per-client model is that it lets
% you build the service process in the most simple and natural way.

But the question becomes, what are the threads giving you that you wouldn't
have, for instance, using multiple processes?
--
Patrick TJ McPhee
East York Canada
***@interlog.com
Robert Wessel
2004-02-24 03:36:42 UTC
Permalink
Post by Patrick TJ McPhee
% are not neatly stateless like HTTP. With long running stateful
% sessions the attraction of the thread-per-client model is that it lets
% you build the service process in the most simple and natural way.
But the question becomes, what are the threads giving you that you wouldn't
have, for instance, using multiple processes?
Well, I think we're assuming there's some reason to service these all
from the same process. Some underlying shared resource, for example.
SenderX
2004-02-22 00:22:46 UTC
Permalink
Post by Giancarlo Niccolai
And in addition to what David and Robert said, I add that CONTENTION goes
up EXPONENTIALLY with the number of threads. Your threads will have to
cooperate at some point, and this means that they will have to share at
least a mutex somewhere.
Ahhh. Lock-Free algos go faster when the number of threads increases.

:)


A P4 single-CPU HyperThreaded processor gets decreasing run times
( GetTickCount() ) when threads are added...


I had to say something...


;)
David Schwartz
2004-02-22 01:07:38 UTC
Permalink
Post by SenderX
Post by Giancarlo Niccolai
And in addition to what David and Robert said, I add that CONTENTION goes
up EXPONENTIALLY with the number of threads. Your threads will have to
cooperate at some point, and this means that they will have to share at
least a mutex somewhere.
Ahhh. Lock-Free algos go faster when the number of threads increases.
Nonsense. If there are more threads than CPUs, they stay the same speed.
(Why should ten unscheduled threads be better or worse than five unscheduled
threads?) If there are fewer threads than CPUs, they get slower. (The more
CPUs contending for the same cache line, the higher the cost of cache
contention. Also, with many lock-free algos, the more CPUs running threads,
the more 'no forward progress' operations per 'forward progress' operation.)

DS
SenderX
2004-02-22 02:31:15 UTC
Permalink
Post by David Schwartz
Nonsense. If there are more threads than CPUs, they stay the same speed.
Starting at 1 thread: The performance of lock-free queues on single-CPU P4
hyperthreaded processors, under extremely high, focused contention, gets
faster when 2-3 threads are added.

If you go beyond 3-4 threads the lock-free performance starts to softly
degrade.

Each queue should be bound to a specific logical processor.

The more processors, the more queues. Scales great.
Patrick TJ McPhee
2004-02-23 04:14:48 UTC
Permalink
In article <SvUZb.240463$U%***@attbi_s03>, SenderX <***@xxx.com> wrote:
% > Nonsense. If there are more threads than CPUs, they stay the same
% speed.
%
% Starting at 1 thread: The performance of lock-free queues on single-CPU P4
% hyperthreaded processors, under extremely high, focused contention, gets
% faster when 2-3 threads are added.
%
% If you go beyond 3-4 threads the lock-free performance starts to softly
% degrade.

OK, so what happens when you get to the 5000 under discussion?
--
Patrick TJ McPhee
East York Canada
***@interlog.com
SenderX
2004-02-24 05:24:23 UTC
Permalink
Post by Patrick TJ McPhee
% If you go beyond 3-4 threads the lock-free performance starts to softly
% degrade.
OK, so what happens when you get to the 5000 under discussion?
The same as when you get to thousands of threads under lock-based... It
starts to degrade...

But, a single lock-free queue softly degrades under more and more threads.
Lock-based queues will be blocking ALL concurrent threads that access the
queue.
David Schwartz
2004-02-24 08:31:14 UTC
Permalink
Post by SenderX
But, a single lock-free queue softly degrades under more and more threads.
Lock-based queues will be blocking ALL concurrent threads that access the
queue.
Exactly. That's why lock-based algorithms are so much better than
lock-free ones. Why should I have one thread running on each CPU fighting
for the same contended resource and causing massive FSB traffic when I could
just block the contending threads and allow tasks that won't contend to run
instead?

Consider a case where I have two CPUs and four threads. Two of the four
threads want to access collection A (call them A1 and A2) and the other two
of the four threads want to access collection B (call them B1 and B2).

If B1 and B2 run concurrently, performance will suffer and FSB traffic
will skyrocket. However, if A1 and B1 run concurrently, there will be no
adverse effects and both threads will run at full speed.

A lock-based algorithm guarantees that when A1 is running, the other CPU
will rapidly schedule B1. A1 and A2 will not be permitted to run
concurrently. Thus both CPUs will be fully employed, FSB traffic will be
minimized and performance will be excellent.

With a lock-free algorithm, about half the time A1 and A2 will run
concurrently or B1 and B2 will run concurrently. So about half the time, the
FSB will be trashed and both CPUs will be running code at dramatically
reduced speed. Any other CPUs in the system will also be negatively affected
by the FSB traffic.

Locks are not bad, contention is bad. Lock-free algorithms, by design,
maximize contention to maximize concurrency. This is only acceptable when
you really need that concurrency. In most realistic, user-space
applications, you don't need that concurrency because there are enough other
things to do. This is why contention avoidance (locks) is the norm in
user-space.

DS
SenderX
2004-02-24 10:43:47 UTC
Permalink
Post by David Schwartz
Locks are not bad, contention is bad.
Locks in bad places are bad...

;)
Post by David Schwartz
Lock-free algorithms, by design,
maximize contention to maximize concurrency. This is only acceptable when
you really need that concurrency. In most realistic, user-space
applications, you don't need that concurrency because there are enough other
things to do. This is why contention avoidance (locks) is the norm in
user-space.
I consider the time it takes for a node to get pushed and popped "node
throughput". I am concerned with the overall node throughput of the
queue/collection, David.

I have run extensive tests that measure the time it takes for a node to get
pushed and popped with different types of lock-based and lock-free queues
under contention, and lock-free wins all the time. Sometimes it's a small
win, and sometimes it's a big win. It depends on the test's configured
cpu/thread-count.

If you can pop your nodes faster under contention, then your CPUs will get
their jobs faster and have more of a chance to run them in parallel instead
of blocking on a lock-based queue...
Örjan Petersson
2004-02-24 15:14:58 UTC
Permalink
Post by SenderX
I have run extensive tests that measure the time it takes for a node to get
pushed and popped with different types of lock-based and lock-free queues
under contention, and lock-free wins all the time. Sometimes it's a small
win, and sometimes it's a big win. It depends on the test's configured
cpu/thread-count.
Do you have any public reports of those benchmarks?
--
Orjan Petersson, Logcode SARL
The email address in the From: header is valid
SenderX
2004-02-24 17:49:27 UTC
Permalink
Post by Örjan Petersson
Post by SenderX
I have run extensive tests that measure the time it takes for a node to get
pushed and popped with different types of lock-based and lock-free queues
under contention, and lock-free wins all the time. Sometimes it's a small
win, and sometimes it's a big win. It depends on the test's configured
cpu/thread-count.
Do you have any public reports of those benchmarks?
Not currently. I am compiling my data.

In all the lock-free papers I have looked at, I have yet to see data on
node-throughput...
SenderX
2004-02-24 20:47:20 UTC
Permalink
Post by Örjan Petersson
Do you have any public reports of those benchmarks?
My testing has been against the Windows 2000/XP CRITICAL_SECTION,
CreateSemaphore, & CreateMutex.

Now that I have a working Linux i686 version of AppCore's lock-free system
(thanks to gcc and gas), I have to test against NPTL. This will be
interesting because I believe that futexes are used in many of the sync
primitives the library provides. I think a sem_t should behave like a
lock-free semaphore? I have to look at the source...

If a futex provides lock-free access to a sem_t, it should boost
performance far beyond its lock-based cousins.

My library provides several different lock-free semaphore implementations,
and I need to compare and contrast each one against NPTL's sem_t...

:O
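For reference, the futex "fast path" in question is roughly this (a
benaphore-style sketch of the idea, written with C11 atomics for brevity;
it is not glibc's actual sem_wait):

#include <stdatomic.h>

typedef struct { atomic_long count; } my_sem_t;  /* hypothetical type */

/* Uncontended case: a plain user-space CAS, no syscall at all. Only when
   the count is exhausted would the caller enter the kernel via the futex
   syscall (FUTEX_WAIT) to sleep, then retry. */
int my_sem_trywait(my_sem_t *s)
{
    long c = atomic_load(&s->count);
    while (c > 0) {
        if (atomic_compare_exchange_weak(&s->count, &c, c - 1))
            return 0;                /* acquired without kernel help */
    }
    return -1;                       /* contended: futex-wait and retry */
}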
David Schwartz
2004-02-24 21:03:58 UTC
Permalink
Post by SenderX
I consider the time it takes for a node to get pushed and popped "node
throughput". I am concerned with the overall node throughput of the
queue/collection, David.
You are measuring the wrong thing. If all you measure is how much money
you make, then being a thief looks like a very attractive profession.

Do you understand my criticism? For example, do you understand why a
4-CPU system with two threads running on a lock-free collection will slow
the other two CPUs down much more than two threads running on a locked
collection will? (And no, it's not because only one thread will be running.
I said "two threads running" and I meant it.)
Post by SenderX
I have run extensive tests that measure the time it takes for a node to get
pushed and popped with different types of lock-based and lock-free queues
under contention, and lock-free wins all the time. Sometimes it's a small
win, and sometimes it's a big win. It depends on the test's configured
cpu/thread-count.
Again, you are measuring the wrong thing. How fast are the *other*
threads in the system running? Or, to put it another way, are you making
efficient use of the CPU or are you just monopolizing and squandering it?
Post by SenderX
If you can pop your nodes faster under contention, then your CPUs will get
their jobs faster and have more of a chance to run them in parallel instead
of blocking on a lock-based queue...
No, they'll get their jobs done slower because of the cache coherency
traffic. You'd rather the CPU run a thread that won't encounter contention.

I hope you won't take this the wrong way, but your comments suggest you
do not understand what I'm saying. If that's true, please go back a few
posts and go over what I'm saying again (the bit where I called threads A1,
A2, B1, and B2 and discussed which get scheduled).

DS
SenderX
2004-02-24 22:49:53 UTC
Permalink
Post by David Schwartz
Do you understand my criticism? For example, do you understand why a
4-CPU system with two threads running on a lock-free collection will slow
the other two CPUs down much more than two threads running on a locked
collection will? (And no, it's not because only one thread will be running.
I said "two threads running" and I meant it.)
Bzzzt...

You can actually "prove yourself wrong"!

Here is a set of "pre-coded", easy to read, lock-free garbage collected and
lock-based linked-lists:

http://groups.google.com/groups?selm=3EF64B1F.8693F354%40xemaps.com

It uses the "pre-coded", all-mighty, atomic_ptr:

http://groups.google.com/groups?selm=3FDE3890.79C24B88%40xemaps.com
Post by David Schwartz
For example, do you understand why a
4-CPU system with two threads running on a lock-free collection will slow
the other two CPUs down much more than two threads running on a locked
collection will?
Just imagine if one of your two threads (Thread1) decided to iterate the
collection...


Lock-Free:

Hummm. Do we have to lock... NOPE!

That's good because we have to iterate 500,000 ITEMS!

Oh sh^#!!!

Thread2 wants to write! No worry...

Thread2 writes to the collection while Thread1 iterates.

Thread1 writes while Thread2 iterates

Lock-Free makes this possible.



Lock-Based:


Now... Think if we had locks...

CRAP!

Thread2 would block and wait for Thread1 to finish processing 500,000 items
just to make a single write to the collection!

Thread2 waits 2 or 3 seconds for Thread1 to finish iterating.

Thread2 makes its write.

Thread2 now iterates 500,000.

Thread1 wants to write, but has to wait another 2 or 3 seconds for
Thread2...



Lock-based will really kill a collection's performance under iteration mixed
with concurrent pushes and pops.

Real-world collections experience iteration, writes and reads. Perfect for
lock-free.


P.S.

Please test some lock-free lists out for yourself...

Then post some test code, just like I did.
SenderX
2004-02-24 22:52:46 UTC
Permalink
Post by SenderX
Here is a set of "pre-coded", easy to read, lock-free garbage collected and
http://groups.google.com/groups?selm=3EF64B1F.8693F354%40xemaps.com
DOH!

That was the test code.

http://groups.google.com/groups?selm=3EF64AA8.180483F2%40xemaps.com

That was the collector and list.

This is some great experimental code!

Many thanks to Joe Seigh.
Peter Lee
2004-02-25 16:47:06 UTC
Permalink
SenderX> Lock-Free:
SenderX> Hummm. Do we have to lock... NOPE!
SenderX> That's good because we have to iterate 500,000 ITEMS!
SenderX> Oh sh^#!!!
SenderX> Thread2 wants to write! No worry...
SenderX> Thread2 writes to the collection while Thread1 iterates.
SenderX> Thread1 writes while Thread2 iterates
SenderX> Lock-Free makes this possible.

SenderX> Lock-Based:
SenderX> Now... Think if we had locks...
SenderX> CRAP!
SenderX> Thread2 would block and wait for Thread1 to finish
SenderX> processing 500,000 items just to make a single write to
SenderX> the collection!
SenderX> Thread2 waits 2 or 3 seconds for Thread1 to finish
SenderX> iterating.
SenderX> Thread2 makes its write.
SenderX> Thread2 now iterates 500,000.
SenderX> Thread1 wants to write, but has to wait another 2 or 3
SenderX> seconds for Thread2...

I have a question about a "lock-free" paper I read yesterday. It
seems to me that lock-free is something of a misnomer. It does spin,
after all, in certain cases; it's just a bit more optimized. Anyway,
here's what I have a question about. Given this:

structure pointer_t {ptr: pointer to node_t, count: unsigned integer}
structure node_t {value: data type, next: pointer_t}
structure queue_t {Head: pointer_t, Tail: pointer_t}

initialize(Q: pointer to queue_t)
node = new node() # Allocate a free node
node->next.ptr = NULL # Make it the only node in the linked list
Q->Head = Q->Tail = node # Both Head and Tail point to it

enqueue(Q: pointer to queue_t, value: data type)
E1: node = new node() # Allocate a new node from the free list
E2: node->value = value # Copy enqueued value into node
E3: node->next.ptr = NULL # Set next pointer of node to NULL
E4: loop # Keep trying until Enqueue is done
E5: tail = Q->Tail # Read Tail.ptr and Tail.count together
E6: next = tail.ptr->next # Read next ptr and count fields together
E7: if tail == Q->Tail # Are tail and next consistent?
E8: if next.ptr == NULL # Was Tail pointing to the last node?
E9: if CAS(&tail.ptr->next, next, <node, next.count+1>) # Try to link node at the end of the linked list
E10: break # Enqueue is done. Exit loop
E11: endif
E12: else # Tail was not pointing to the last node
E13: CAS(&Q->Tail, tail, <next.ptr, tail.count+1>) # Try to swing Tail to the next node
E14: endif
E15: endif
E16: endloop
E17: CAS(&Q->Tail, tail, <node, tail.count+1>) # Enqueue is done. Try to swing Tail to the inserted node

The above is from http://www.cs.rochester.edu/u/scott/papers/1996_PODC_queues.pdf

Is it not possible that between E8 and E9 a new tail could have been
added by another producer thread such that the subsequent CAS would
result in a node being lost?
SenderX
2004-02-26 01:31:18 UTC
Permalink
Post by Peter Lee
I have a question about a "lock-free" paper I read yesterday. It
seems to me that lock-free is something of a misnomer. It does spin
after all in certain cases, it's just a bit more optimized. Anyway
Lock-free algos do not spin "while waiting for another thread to set a
certain value".

Do not confuse lock-free with spin-locks. In lock-free, a thread reads a
value and checks whether the value is still the same. If the CAS fails, it
means the state has already been FULLY modified by some other thread. The
next CAS cannot fail on the same state change that caused it to fail in the
first place. This is what we call Constant Forward Progress. Every CAS
failure means another thread was in and out of the critical section, and its
update is committed.

Spin-locks simply cannot guarantee constant forward progress. In a normal
spin-lock, an atomic failure means that another thread may be working on an
item; it may be in a context switch, page fault, etc... The same thread can
spin MANY times WAITING for another thread to finish the exact critical
section that caused it to spin in the first place. This is called waiting,
not forward progress. That is 100% impossible with lock-free. Lock-free
loops overcome all of the baggage that comes with lock-based.
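The shape of such a loop in C, for the record (C11 atomics used purely for
brevity; 2004-era code would reach for cmpxchg via inline assembly or
platform intrinsics):

#include <stdatomic.h>

static _Atomic long shared_word;     /* stands in for any shared state */

void lock_free_increment(void)
{
    long seen = atomic_load(&shared_word);
    /* A failed CAS refreshes 'seen' with the current value. The failure
       itself proves some other thread's update committed, so we retry
       against the new state instead of waiting for anyone. */
    while (!atomic_compare_exchange_weak(&shared_word, &seen, seen + 1))
        ;
}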
Post by Peter Lee
[Michael & Scott enqueue pseudocode snipped; see upthread]
Is it not possible that between E8 and E9 a new tail could have been
added by another producer thread such that the subsequent CAS would
result in a node being lost?
1. E6: Reads NEXT.iNode = 0, NEXT.iAba = 0;

OK?

2. E8: if next.ptr == NULL *** Tail might == Back, empty queue

Ok we did local sync with E6.

3. ROGUE1: Tail now points to a new node, Tail might != Back

A rogue thread changed Tail

The value that E6 read is out-of-sync with E8 && iABA = 0;

4. E9: if CAS(&tail.ptr->next, next, <node, next.count+1>) # Try to link
node at the end of the linked list

This will fail because NEXT.iNode != 0 && Next.iAba != 0



P.S.

If you try this out:

Do not delete nodes unless you have atomic_ptr or a proxy garbage
collector.

Use the IBM FreeList or the Windows SList API for a node cache.
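In C terms, the "counted pointer" that makes E9 safe looks like this (my
sketch, following the paper's pointer_t and SenderX's iNode/iAba naming;
not code from either):

struct node_t;                 /* as in the paper's pseudocode */

/* Pointer and version tag live in one double-word, so a single wide CAS
   (cmpxchg8b on 32-bit x86) compares and swaps both at once. */
struct pointer_t {
    struct node_t *ptr;        /* the link itself (iNode) */
    unsigned long  count;      /* version tag (iAba), bumped on every update */
};

/* E9 succeeds only if BOTH fields still match the snapshot taken at E6; a
   producer swinging the tail between E8 and E9 changes 'count', so the
   stale CAS fails and the loop retries instead of losing a node. */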
SenderX
2004-02-26 01:41:39 UTC
Permalink
Post by SenderX
A rogue thread changed Tail
Doh!

The E6 snapshot that is linked to the E9 compare has been changed by the
rogue thread.

This is what will cause E9 to fail.
David Schwartz
2004-02-26 00:01:36 UTC
Permalink
Post by SenderX
Post by David Schwartz
For example, do you undestand why a
4-CPU system with two threads running on a lock-free collection will slow
the other two CPUs down much more than two threads running a locked
collection will?
Just imagine if one of your two threads (Thread1) decided to iterate the
collection...
Okay.
Post by SenderX
Hummm. Do we have to lock... NOPE!
That's good because we have to iterate 500,000 ITEMS!
Oh sh^#!!!
Thread2 wants to write! No worry...
Thread2 writes to the collection while Thread1 iterates.
Thread1 writes while Thread2 iterates
Lock-Free makes this possible.
Right, with each iteration requiring at least two FSB transactions just
for a single interlocked operation. Good luck to any other CPU trying to run
code through that FSB saturation.
Post by SenderX
Now... Think if we had locks...
CRAP!
Thread2 would block and wait for Thread1 to finish processing 500,000 items
just to make a single write to the collection!
Actually, it would yield the CPU to another thread which wasn't trying
to access that collection, allowing both threads to run at full speed. Any
other thread running doesn't see any contention on the FSB because there
isn't any.
Post by SenderX
Thread2 waits 2 or 3 seconds for Thread1 to finish iterating.
Thread2 waits, but the CPU goes on doing useful work.
Post by SenderX
Thread2 makes its write.
Thread2 now iterates 500,000.
Thread1 wants to write, but has to wait another 2 or 3 seconds for
Thread2...
I'm sorry, do you know of a system where timeslices are "2 or 3
seconds"?

I give up, no matter how hard I try, I have yet to see one shred of
evidence that you *understand* what I'm saying.

DS
SenderX
2004-02-26 01:04:08 UTC
Permalink
Post by David Schwartz
Right, with each iteration requiring at least two FSB transactions just
for a single interlocked operation. Good luck to any other CPU trying to run
code through that FSB saturation.
Wrong!

Each iteration can be the following code:

GarbageAddRef();    /* enter the collected region: one atomic op */

pNode = pList->pFront;

while( pNode )
{
    pNode = pNode->pNext;
}

GarbageRelease();   /* leave the region: one atomic op */

So, David, it is 2 atomic operations to iterate 500,000 nodes.

And 1 atomic operation to collect garbage.

My proxy garbage collector, and Joe's atomic_ptr, can do this.
Post by David Schwartz
I'm sorry, do you know of a system where timeslices are "2 or 3
seconds"?
WTF?

Thread 1 OWNS READ ACCESS.

Thread 2 tries to WRITE and BLOCKS, David! SEE!

Thread 1 iterates and processes 500,000 items

The application takes 2-3 seconds to process all those items.

Thread 1 unlocks

Thread 2 unblocks and gains WRITE access



Thread 2 waited for 500,000 items to be processed by the application. 2 - 3
seconds can be expected.
Post by David Schwartz
I give up, no matter how hard I try, I have yet to see one shred of
evidence that you *understand* what I'm saying.
You are so ignorant when it comes to lock-free collections, it is painful.

Thread 1 locks for read

Thread 2 concurrently tries to lock for write and is FORCED to BLOCK.

Thread 1 iterates and does 2-3 seconds of work

Thread 2 is SCREWED!

Get it???????

:)
SenderX
2004-02-26 01:36:20 UTC
Permalink
Post by SenderX
My proxy garbage collector, and Joe's atomic_ptr, can do this.
Joe's atomic_ptr does a few more atomic operations, but the job that they do
negates the cost.

Joe's atomic_ptr mixed with my ABA-safe code can actually perform garbage
collection in the very CASs that make up the lock-free algo you're
using atomic_ptr for!

This is amazing...

Joe is really nice for posting atomic_ptr on this group! Thanx again!
David Schwartz
2004-02-26 02:40:02 UTC
Permalink
Post by SenderX
GarbageAddRef();
pNode = pList->pFront;
while( pNode )
{
pNode = pNode->pNext;
}
GarbageRelease();
Except this is not lock-free. The 'GarbageAddRef' has locked the
collection such that it cannot be changed.
Post by SenderX
So, David, it is 2 atomic operations to iterate 500,000 nodes.
And 1 atomic operation to collect garbage.
My proxy garbage collector, and Joe's atomic_ptr, can do this.
I'm not sure what you're talking about exactly, but this is not a
lock-free algorithm.
Post by SenderX
Post by David Schwartz
I'm sorry, do you know of a system where timeslices are "2 or 3
seconds"?
WTF?
Thread 1 OWNS READ ACCESS.
Thread 2 tries to WRITE and BLOCKS, David! SEE!
Thread 1 iterates and process 500,000 items
Application takes 2-3 seconds to process all those items.
Thread 1 unlocks
Thread 2 unblocks and gains WRITE access
Thread 2 waited for 500,000 items to be processed by the application. 2 - 3
seconds can be expected.
Except that lock operations are nearly free when you already hold the
lock. So you can trivially unlock and relock the lock each iteration.
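The pattern being described looks roughly like this (a sketch; it assumes
writers only insert and never free nodes out from under a paused reader, a
detail the argument glosses over):

#include <pthread.h>

struct node { struct node *next; /* ... payload ... */ };

static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;

void iterate(struct node *head)
{
    pthread_mutex_lock(&list_lock);
    for (struct node *n = head; n != NULL; n = n->next) {
        /* ... examine n ... */
        pthread_mutex_unlock(&list_lock);  /* give a blocked writer a window */
        pthread_mutex_lock(&list_lock);    /* nearly free when uncontended */
    }
    pthread_mutex_unlock(&list_lock);
}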
Post by SenderX
Post by David Schwartz
I give up, no matter how hard I try, I have yet to see one shred of
evidence that you *understand* what I'm saying.
You are so ignorant when it comes to lock-free collections, it is painful.
Are you seriously arguing that your example garbage collection above has
anything to do with "lock-free collections"? You can't even talk about one
thing at a time.
Post by SenderX
Thread 1 locks for read
Thread 2 concurrently tries to lock for write and is FORCED to BLOCK.
Thread 1 iterates and does 2-3 seconds of work
Thread 2 is SCREWED!
Get it???????
Except that thread 1 would release and reacquire the lock every time
because reacquiring an uncontested lock is *cheaper* than a lock-free
operation.

I still see no indication that you have any clue what I'm saying, but
I'll try it one more time just in case anyone really thinks you have any
clue.

A lock-free collection requires at least one interlocked operation for
each operation on the collection. If two threads are accessing the
collection, each such access will require at least one FSB transaction as
threads alternate ownership of the cache line.

A corresponding locked collection will quickly deschedule conflicting
threads such that you will almost never have two running threads competing
for the same collection. So when a thread goes to re-lock a collection in a
tight loop, odds are it had the lock last and so its lock operation will not
result in any FSB traffic at all because its processor will own the cache
line.

SenderX, please give me some indication that you understand the two
paragraphs above. Anything. How about this: restate my argument in different
words. That would at least prove you understand my argument.

DS
SenderX
2004-02-27 00:16:34 UTC
Permalink
Post by David Schwartz
Post by SenderX
GarbageRelease();
Except this is not lock-free. The 'GarbageAddRef' has locked the
collection such that it cannot be changed.
110% WRONG!!!!!!!!!!

Where in the WORLD did you get this load of CRAP from?

There are no OS mutexes or spinlocks of any kind here!!

I posted lock-free test code that has hundreds of threads concurrently
iterating and pushing and popping using my lock-free garbage collector. You
did not even look at it.

;(
Post by David Schwartz
Post by SenderX
So, David, it is 2 atomic operations to iterate 500,000 nodes.
And 1 atomic operation to collect garbage.
My proxy garbage collector, and Joe's atomic_ptr, can do this.
I'm not sure what you're talking about exactly, but this is not a
lock-free algorithm.
WOAH!

110% WRONG!!!!!!!!!!

GarbageAddRef()
{
// Read garbage

do
{
// Increment garbage

} while ( ! CAS64( ... ) );
}

// Now we are in a garbage collected, lock-free, environment.

GarbageRelease()
{
// Read garbage

do
{
// Decrement garbage

} while ( ! CAS64( ... ) );
}

LONG_MAX threads can concurrently enter the lock-free garbage-collected
region.

You do not have a CLUE how my lock-free proxy garbage collectors work. I
suggest you study up, David.

Here:

http://www.research.ibm.com/K42/

My proxy collector is similar to this one, except it doesn't have to use a
separate thread for a zero count marshal.

Please, David! Inform IBM. Tell them that the K42 collector is lock-based.
Post by David Schwartz
Are you seriously arguing that your example garbage collection above has
anything to do with "lock-free collections"?
HELL YES I AM.

You can't seem to understand proxy garbage collection, atomic_ptr, RCU, SMR,
or any other lock-free garbage collector out there.

Joe Seigh and I have swapped collector code and discussed their
inner workings many times. You should have been reading up.
Post by David Schwartz
Except that thread 1 would release and reacquire the lock every time
because reacquiring an uncontested lock is *cheaper* than a lock-free
operation.
That would starve Thread2 of writing!

How could Thread2 write if Thread1 keeps reacquiring read access???
Post by David Schwartz
A lock-free collection requires at least one interlocked operation for
each operation on the collection.
WRONG AGAIN!
Post by David Schwartz
SenderX, please give me some indication that you understand the two
paragraphs above. Anything. How about this: restate my argument in different
words. That would at least prove you understand my argument.
But:

Your understanding of lock-free algos, and the way they interface with the
CPU, is totally bogus. I have posted several examples, lock-free test code,
and some lock-free garbage collectors that do not behave like you think they
do!

You have yet to post any code using the almighty atomic_ptr or my proxy
collector to prove your invalid points. I even gave you working skeleton
code to experiment with, damn it!
Post by David Schwartz
Post by SenderX
My proxy garbage collector, and Joe's atomic_ptr, can do this.
I'm not sure what you're talking about exactly, but this is not a
lock-free algorithm.
rofl! This is so funny!

I am going to post this in the lab where some of my friends are helping me
tweak and test the collector.


My proxy garbage collector, and Joe's great atomic_ptr, ARE 100% LOCK-FREE,
David.

http://www.research.ibm.com/K42/
SenderX
2004-02-27 00:25:35 UTC
Permalink
Post by SenderX
http://www.research.ibm.com/K42/
My proxy collector is similar to this one, except it doesn't have to use a
separate thread for a zero count marshal.
http://groups.google.com/groups?selm=3F2E4165.75ED3D33%40xemaps.com&rnum=1
( look at citeseer tornado/k42 reference )

The paper should explain their proxy GC.
SenderX
2004-02-27 02:38:00 UTC
Permalink
For some reason I think you never like to read the links I post.

The Tornado system does use a semi-automatic garbage collector with a
highly granular "lock"-per-object setup. My collector uses the same basic
idea, but it removes the need for "locks" on the garbage collector
internals. They (IBM K42/Tornado) have a separate thread that collects
garbage. It checks the sum of per-CPU references for zero.

My collector checks the sum of per-thread references for zero. Garbage
collection is performed by application threads, just like atomic_ptr. There
is no need for a collector thread. My lock-free collector can be
concurrently referenced on the fly by LONG_MAX threads. This allows
threads to enter the garbage-collection regions without blocking each other.
Threads toss garbage into the collector with CAS64 and an acquire barrier.
The threads then release_dumpgc_and_check_for_zero with CAS64 and a release
barrier. The GC system is very fast, and handles chunks of objects with a
single reference count.

AppCore Linux is currently under heavy testing, and its lock-based algos are
outperforming Windows. NPTL seems to make lock-based go faster than Windows.
I think the futex's lock-free nature is responsible for that...

;)
SenderX
2004-02-26 01:08:21 UTC
Permalink
Post by David Schwartz
Right, with each iteration requiring at least two FSB transactions just
for a single interlocked operation. Good luck to any other CPU trying to run
code through that FSB saturation.
Remember that lock-free test code I posted?

If you had actually looked at it, you would know that this statement is

bullsh%$!!!
SenderX
2004-02-26 01:16:12 UTC
Permalink
Post by David Schwartz
Right, with each iteration requiring at least two FSB transactions just
for a single interlocked operation. Good luck to any other CPU trying to run
code through that FSB saturation.
Please David.

Post some working code proving this.
Patrick TJ McPhee
2004-02-24 17:18:30 UTC
Permalink
In article <beB_b.386518$***@attbi_s02>, SenderX <***@xxx.com> wrote:
% > % If you go beyond 3-4 threads the lock-free performance starts to softly
% > % degrade.
% >
% > OK, so what happens when you get to the 5000 under discussion?
%
% The same as when you get to thousands of threads under lock-based... It
% starts to degrade...

So, why did you say `Lock-Free algos go faster when the number of threads
increases.' in this context?
--
Patrick TJ McPhee
East York Canada
***@interlog.com
SenderX
2004-02-24 17:45:08 UTC
Permalink
Post by Patrick TJ McPhee
% > % If you go beyond 3-4 threads the lock-free performance starts to softly
% > % degrade.
% >
% > OK, so what happens when you get to the 5000 under discussion?
%
% The same as when you get to thousands of threads under lock-based... It
% starts to degrade...
So, why did you say `Lock-Free algos go faster when the number of threads
increases.' in this context?
Starting at 1 thread per hyperthreaded CPU, lock-free algo speed
increases when 2-3 threads are added. Any more per processor, and it
will slowly degrade.

I should have clarified...
Randy Howard
2004-03-01 18:06:40 UTC
Permalink
Post by SenderX
Post by Patrick TJ McPhee
% > % If you go beyond 3-4 threads the lock-free performance starts to softly
% > % degrade.
% >
% > OK, so what happens when you get to the 5000 under discussion?
%
% The same as when you get to thousands of threads under lock-based... It
% starts to degrade...
So, why did you say `Lock-Free algos go faster when the number of threads
increases.' in this context?
Starting at 1 thread per hyperthreaded CPU, lock-free algo speed
increases when 2-3 threads are added. Any more per processor, and it
will slowly degrade.
Strange; threaded code (I/O bound) will actually run faster on a single HT
CPU by turning HT off. :-) HT is a marketing dream, but not worth much
in practice, it would seem, for *real* I/O. Similarly, it seems lock-free
is more trouble than it is worth, particularly if you are interested in
supporting multiple platforms. The code you talk about is painfully linked
to Intel CPUs from what I've seen so far.

I understand fully that you are intensely interested in lock-free research,
but I think it is at least suspect, if not downright disingenuous to
recommend it to everyone that asks a question about threads. When you have
a portable library that compiles cleanly on about 15 different platforms
implementing lock-free threading, be sure and let us know. In the mean
time, pthreads is the way to go if portability is a major concern.
--
Randy Howard
2reply remove FOOBAR
SenderX
2004-03-02 00:06:15 UTC
Permalink
Post by Randy Howard
The code you talk about is painfully linked
to Intel CPUs from what I've seen so far.
Yes. The cmpxchg8b opcode is the culprit. Lock-free algos have to be
specialized for the hardware.

For instance, cmpxchg8b on x86-64 will not work for my collector. I am going
to have to use offsets for x86-64.

PPC has ll/sc. This will work perfectly, but I have to recode the
collector algo to use it.

Itanium has compare 8 exchange 16, with acquire and release semantics. I
have to recode the collector algo for this opcode.

Alpha has load-locked/store-conditional and cmpeq, I can recode the
collector for this as well.

I have rough drafts ready for most of those processors.

If all else fails, the algos can simply be lock-based...

;)
Post by Randy Howard
I understand fully that you are intensely interested in lock-free research,
but I think it is at least suspect, if not downright disingenuous to
recommend it to everyone that asks a question about threads.
The library will be suitable for production code.

I also post some stuff in the hope that others will test the crap out of it,
trying to crash it.

:O
Post by Randy Howard
When you have
a portable library that compiles cleanly on about 15 different platforms
implementing lock-free threading, be sure and let us know.
15 different CPUs or OSs?

:P
Randy Howard
2004-03-04 05:21:24 UTC
Permalink
Post by SenderX
Post by Randy Howard
The code you talk about is painfully linked
to Intel CPUs from what I've seen so far.
Yes. The cmpxchg8b opcode is the culprit. Lock-free algos have to be
specialized for the hardware.
Which means they're a flash in the pan, at best. Have fun with them.
Post by SenderX
If all else fails, the algos can simply be lock-based...
Of course.
Post by SenderX
Post by Randy Howard
When you have
a portable library that compiles cleanly on about 15 different platforms
implementing lock-free threading, be sure and let us know.
15 different CPUs or OSs?
Both.

Try for starters,
Intel32, AMD32, Intel Itanium, Opteron and Intel Yamhill/CT CPUs:
Windows 98, 2000, ME, XP, 2003 Server, etc.
Linux (all common distributions)
Novell Netware 5.1, 6.0 and 6.5
FreeBSD, BSDI, etc.
Sun Solaris
(I would say SCO, but nobody should ever write code for any
SCO platforms again, IMO.)

Xscale, ARM, etc. for Pocket PC/CE variants.

PowerMac on OS X.y for Macs

Sparc for Sun

Tru64, HP-UX, Cray, Silicon Graphics, ....

That should keep you busy for a while...

Or, you could use something portable instead to begin with.
--
Randy Howard
2reply remove FOOBAR
David Butenhof
2004-03-04 12:42:47 UTC
Permalink
Post by Randy Howard
Post by SenderX
Post by Randy Howard
The code you talk about is painfully linked
to Intel CPUs from what I've seen so far.
Yes. The cmpxchg8b opcode is the culprit. Lock-free algos have to be
specialized for the hardware.
Which means they're a flash in the pan, at best. Have fun with them.
I get as annoyed by SenderX's nagging "lock-free is everything" tune as
anyone, and especially annoyed that despite several promises to change his
tune it just keeps playing.

Even so, this isn't fair criticism.

[...]
Post by Randy Howard
Or, you could use something portable instead to begin with.
Not that long ago, threads weren't portable. So we had those saying nobody
can/should use them, and others who worked to define and evangelize a
usable portable specification -- which became POSIX threads, and is now
available "everywhere that matters". (And even a few places that don't
matter. ;-) ) So we can now talk about portable synchronization mechanisms.

And while the POSIX thread APIs are portable, remember, the IMPLEMENTATION
isn't; for exactly the same reasons that SenderX's lock-free packages
aren't portable. (The same, for that matter, is true of any OS kernel.)
They use the same concepts, the same instructions, for the same reasons.
There's no reason that SenderX's APIs (or something similar) can't be just
as portable -- even though the implementation will always differ across
platforms.

SenderX is trying to push into a "second stage" area. While lock-free
algorithms have been subject to research and development for decades, the
pervasive use of threads means we've reached the point where there ought to
be more interest in learning about them and how to apply them to real
tasks. They're NOT standard, or fully portable, YET; but they can be, and
should be. And just because he's there a little before nearly everyone else
doesn't mean what he's doing isn't worthwhile.

I don't know yet how his work can or ought to feed into standard interfaces,
or whether perhaps such packages (his or others) might simply become an API
specification as part of ANSI C++ 0x, for example, with lock-free
implementation available as one reference library -- just as many
implementation of STL trace their lineage back to one or two historical
packages.

The problem with SenderX is that he still wastes too much time and energy
trying to convince people who know better that lock-free is everything.
That he doesn't yet have a time-tested and bullet-proof implementation for
every possible system, however, is a red herring at best. He's got some
good ideas, and I have no reason to suspect that his implementation is
inherently bad. (Though I'm not convinced they're quite as solid as he
boasts, simply because his testing methodology appears to amount to little
more than firing off test cases and crossing his fingers.)

Even so, if you're trying to do scalable and/or high performance server work
with threads, and you're running into contention problems, you should be
carefully examining the concepts and algorithms in SenderX's package and
others like it; even if you don't want to trust the actual code. This IS
good stuff, even if it's not quite the answer to "life, the universe, and
everything". Combined judiciously with locking and scheduling controls,
lock-free mechanisms really can often help you to improve concurrency and
resource utilization. (You'll find some use of lock-free mechanisms in any
thread library or OS kernel.)

And if you need lock-free, and the code you want to use doesn't already run
where you want it, perhaps it's worth your time to look it over and decide
whether to help with porting and testing.
--
/--------------------[ ***@hp.com ]--------------------\
| Hewlett-Packard Company Tru64 UNIX & VMS Thread Architect |
| My book: http://www.awl.com/cseng/titles/0-201-63392-2/ |
\----[ http://homepage.mac.com/dbutenhof/Threads/Threads.html ]---/
SenderX
2004-03-04 12:49:20 UTC
Permalink
I have working assembler libraries for the first 4, but you forgot PPC!

I have to admit that the Itanium's weird cmp8xchg16 opcode was a bit hard
to get used to.
Post by SenderX
Yes. The cmpxchg8b opcode is the culprit. Lock-free algos have to be
specialized for the hardware.
Which means they're a flash in the pan, at best. Have fun with them.
Interesting.

You must think coding to the x86-32/64 instruction-set is a waste of time?

I think not.

You think coding to the PPC instruction-set is a waste of time?

I don't think so!

You must not be an assembler person!

:)
Post by Randy Howard
That should keep you busy for a while...
Or, you could use something portable instead to begin with.
I can most likely support all of the OSs that have Pthreads or Windows
threads and gcc+gas, on any CPU that I have a working assembler library for.
It's not that difficult to get the lock-free assembler code fleshed out for a
particular instruction set. It's the same algo for every CPU; you just have
to use different opcodes.


P.S.

Windows already has portable APIs for all types of lock-free programming.
MS also has portable acquire/release barriers.
Alexander Terekhov
2004-03-04 14:20:03 UTC
Permalink
SenderX wrote:
[...]
Post by SenderX
Windows already has portable APIs for all types of lock-free programming.
MS also has portable acquire/release barriers.
Not quite/yet. But given http://www.theinquirer.net/?article=14407
they ought to come up with something atomic<>-ish (plain old C
incarnations included) in the "near future".

regards,
alexander.
David Butenhof
2004-03-04 18:04:19 UTC
Permalink
Post by SenderX
Windows already has portable APIs for all types of lock-free programming.
MS also has portable acquire/release barriers.
This may work for YOUR definition of "portable" (e.g., proprietary feature
of a proprietary OS that happens to have monopolistic market penetration),
but it doesn't fit anywhere within the useful spectrum of any more
objective definitions of "portable".

Nevertheless, in the larger sense, your APIs (or other APIs of similar
nature and scope) can and eventually will BECOME the portable API for doing
stuff like this. (De facto if not de jure.) The fact that the
IMPLEMENTATION of that portable API is OS/hardware specific will be no more
relevant than the fact that the Linux kernel contains variant files and
compile conditionals for different hardware. That stuff matters only when
you have to take the API implementation with you.
--
/--------------------[ ***@hp.com ]--------------------\
| Hewlett-Packard Company Tru64 UNIX & VMS Thread Architect |
| My book: http://www.awl.com/cseng/titles/0-201-63392-2/ |
\----[ http://homepage.mac.com/dbutenhof/Threads/Threads.html ]---/
SenderX
2004-03-07 08:39:04 UTC
Permalink
Post by David Butenhof
This may work for YOUR definition of "portable" (e.g., proprietary feature
of a proprietary OS that happens to have monopolistic market penetration),
but it doesn't fit anywhere within the useful spectrum of any more
objective definitions of "portable".
It's very portable to different CPUs, not to different OSs.

SenderX
2004-03-02 13:58:58 UTC
Permalink
Post by Randy Howard
When you have a portable library that compiles cleanly on about 15
different platforms implementing lock-free threading, be sure and let us
know. In the mean time, pthreads is the way to go if portability is a
major concern.
One nice thing about my library is that you are not required to use the
threading functions it exposes.

You have to add a call to a startup and a shutdown function in your
program's entry point, and then you can use the library's lock-free systems
with pthreads or Windows threads. Or any other threading library...