Discussion: C++ multithreading: mem_pool
Sergey P. Derevyago
2008-08-06 15:31:44 UTC
* I've written a C++ multithreading tutorial. Here it is if you can read
Russian: http://ders.stml.net/cpp/mtprog/mtprog.html.
* The source code is http://ders.stml.net/cpp/mtprog/code.zip and it has
English comments.
* Doxygen is http://ders.stml.net/cpp/mtprog/doc/index.html

I have no time to create a full English translation, so I'm going to
describe only a few important issues.

First of all: Use thread-private memory allocators wherever possible,
because the typical C++ operators new/delete ARE TERRIBLY SLOW!
New/delete don't allow for concurrent or simultaneous execution, so they
should not be blindly used in well-designed MT programs!

Please take a look at example1/main.cpp living at
http://ders.stml.net/cpp/mtprog/mtprog.html#3.1
-----------------------------------8<-----------------------------------
void start_std(void*)
{
    list<int> lst;

    for (int i=0; i<N; i++) {
        for (int j=0; j<M; j++) lst.push_back(j);
        for (int j=0; j<M; j++) lst.pop_front();
    }
}

void start_ders(void*)
{
    mem_pool mp;
    stl_allocator<int> alloc(mp);

    list<int, stl_allocator<int> > lst(alloc);

    for (int i=0; i<N; i++) {
        for (int j=0; j<M; j++) lst.push_back(j);
        for (int j=0; j<M; j++) lst.pop_front();
    }
}
-----------------------------------8<-----------------------------------
The table at http://ders.stml.net/cpp/mtprog/mtprog.html#3.1.1 shows
that the start_ders() function is tens or even hundreds (!!!) of times
faster (the numbers are speedup factors):
1 CPU : 38.1
1 CPU,HT : 42.1
2 CPU,SMP: 321.6 (!!!)
2 CPU,HT : 37.0

Please don't say IMPOSSIBLE! ;) Just download code.zip and try
mtprog\examples\example1\main.cpp yourself: with your compiler and box.
--
With all respect, Sergey. http://ders.stml.net/
mailto : ders at skeptik.net
Dmitriy V'jukov
2008-08-06 16:51:51 UTC
Post by Sergey P. Derevyago
* I've written a C++ multithreading tutorial. Here it is if you can read
Russian: http://ders.stml.net/cpp/mtprog/mtprog.html.
* The source code is http://ders.stml.net/cpp/mtprog/code.zip and it has
English comments.
* Doxygen is http://ders.stml.net/cpp/mtprog/doc/index.html
You conclude that a Hyper-Threaded processor is worse than a
non-Hyper-Threaded processor. Have you applied the optimization
techniques described in the "Intel 64 and IA-32 Architectures
Optimization Reference Manual"? Most notably:
- Eliminate 64-KByte Aliased Data Accesses
- Preventing Excessive Evictions in First-Level Data Cache
- Using Shared Execution Resources in a Processor Core
I think those can have a great impact.


Dmitriy V'jukov
Sergey P. Derevyago
2008-08-06 17:08:27 UTC
Post by Dmitriy V'jukov
You conclude that a Hyper-Threaded processor is worse than a
non-Hyper-Threaded processor.
Not exactly.
From the _scalability_ POV, one HT CPU is worse than one non-HT CPU for
_this kind of task._
Post by Dmitriy V'jukov
Have you applied optimization techniques described
in "Intel 64 and IA-32 Architectures Optimization Reference Manual"?
There was no point in explicitly coding for HT processors;
HT quirks are out of scope.
Nevertheless, please feel free to apply any optimizations you want and
report the results...
--
With all respect, Sergey. http://ders.stml.net/
mailto : ders at skeptik.net
Dmitriy V'jukov
2008-08-12 10:54:42 UTC
Post by Sergey P. Derevyago
Post by Dmitriy V'jukov
You conclude that a Hyper-Threaded processor is worse than a
non-Hyper-Threaded processor.
Not exactly.
From the _scalability_ POV, one HT CPU is worse than one non-HT CPU for
_this kind of task._
Post by Dmitriy V'jukov
Have you applied optimization techniques described
in "Intel 64 and IA-32 Architectures Optimization Reference Manual"?
There was no point in explicitly coding for HT processors;
HT quirks are out of scope.
Nevertheless, please feel free to apply any optimizations you want and
report the results...
You benchmark on an SMP system, and you specifically optimize for the
SMP system (this is the main purpose of the allocator). But you also
benchmark on an HT system that you didn't optimize for. Well, to me that
looks quite dishonest. To get honest results and draw honest
conclusions, you must either optimize for HT too or exclude it from the
test. Imho.

Dmitriy V'jukov
Sergey P. Derevyago
2008-08-12 11:11:00 UTC
Post by Dmitriy V'jukov
You benchmark on SMP system, and you especially optimize for SMP
system (this is the main purpose of the allocator). But you also
benchmark on HT system, but didn't optimize for it.
It's not the case, sorry. I've written a C++ multithreading TUTORIAL.

A tutorial with a "one size fits all" example that allows for
simultaneous execution.

That's it! There was no point in explicitly optimizing for HT...
--
With all respect, Sergey. http://ders.stml.net/
mailto : ders at skeptik.net
Chris M. Thomasson
2008-08-12 11:24:43 UTC
Post by Sergey P. Derevyago
Post by Dmitriy V'jukov
You benchmark on SMP system, and you especially optimize for SMP
system (this is the main purpose of the allocator). But you also
benchmark on HT system, but didn't optimize for it.
It's not the case, sorry. I've written a C++ multithreading TUTORIAL.
A tutorial with a "one size fits all" example that allows for simultaneous
execution.
That's it! There was no point to explicitly optimize for HT...
Here is something that could help improve HT performance in your benchmarks:

http://softwarecommunity.intel.com/Wiki/Multi-threadappsforMulti-core/487.htm

Intel had a nasty bug in early HT processors. You can work around it by
assigning each thread a monotonically increasing id, multiplying that by
64, and then using the result in a call to alloca. Here is some more on that:

http://groups.google.com/group/comp.programming.threads/msg/b1a5cdeb5ea55a05
Chris M. Thomasson
2008-08-07 01:22:13 UTC
Post by Sergey P. Derevyago
* I've written a C++ multithreading tutorial. Here it is if you can read
Russian: http://ders.stml.net/cpp/mtprog/mtprog.html.
* The source code is http://ders.stml.net/cpp/mtprog/code.zip and it has
English comments.
* Doxygen is http://ders.stml.net/cpp/mtprog/doc/index.html
I have no time to create a full English translation so I'm going to
describe only several important issues.
First of all: Use thread-private memory allocators wherever possible
because typical C++ operators new/delete ARE TERRIBLY SLOW! New/delete
don't allow for concurrent or simultaneous execution so they should not be
blindly used in well designed MT programs!
100% totally agreed. However, new/delete can be hooked up to a memory
allocator that is based on nothing but per-thread heaps. Indeed we have
discussed this before:

http://groups.google.com/group/comp.programming.threads/msg/bc019dc04362be41

http://groups.google.com/group/comp.programming.threads/msg/1d3aeaa6ebf5181f

http://groups.google.com/group/comp.programming.threads/msg/907573c0e81dba56

http://groups.google.com/group/comp.programming.threads/msg/f5ea5a977a156a03
(I finally got you to agree that my design makes sense... ;^)

http://groups.google.com/group/comp.arch/browse_frm/thread/24c40d42a04ee855

Anyway, your design makes sense. There is one question... Can one thread
allocate a memory block and pass it to another which ultimately frees it?
If not, then how are you going to handle producer/consumer?

[...]
Sergey P. Derevyago
2008-08-07 08:28:50 UTC
Post by Chris M. Thomasson
Can one thread allocate a memory block and pass it to another which
ultimately frees it? If not, then how are you going to handle
producer/consumer?
1. A thread-private allocator
http://ders.stml.net/cpp/mtprog/doc/classders_1_1mem__pool.html means
that its memory blocks don't cross thread boundaries. I.e. you can't
get a block from Alloc1 in Thr1 and free it in Thr2.
2. The common solution is serialization through synchronized queues
http://ders.stml.net/cpp/mtprog/doc/classders_1_1data__queue.html :
serialize your data in Thr1 and reconstruct it back in Thr2 using its
Alloc2.
3. Yes, I understand that there can be really big data chunks to pass
between threads. The obvious solution is to use the usual new/delete to
allocate this big data and pass a pointer rather than serialize the
whole memory block.

BTW I've written several real-world applications using derslib. Just
download http://ders.stml.net/cpp/mtprog/code.zip and look at (sorted by
complexity)
mtprog\src\mtprog\mtftext : just a tutorial sample of trivial grep
mtprog\src\mtprog\mtcksrc : checks source code files
mtprog\src\mtprog\mtdel : deletes files by mask[s]
mtprog\src\mtprog\mtcnvsrc : converts source code files
mtprog\src\mtprog\mtdirdiff: compares directories

The sources are well-written, so you'll have no problem seeing what's
going on.
--
With all respect, Sergey. http://ders.stml.net/
mailto : ders at skeptik.net
Chris M. Thomasson
2008-08-07 10:30:54 UTC
Post by Sergey P. Derevyago
Can one thread allocate a memory block and pass it to another which
ultimately frees it? If not, then how are you going to handle
producer/consumer?
1. Thread-private allocator
http://ders.stml.net/cpp/mtprog/doc/classders_1_1mem__pool.html means
that its memory blocks don't cross thread boundaries. I.e. you can't get
a block from Alloc1 in Thr1 and free it in Thr2.
2. The common solution is serialization through synchronized queues
serialize your data in Thr1 and reconstruct it back in Thr2 using its
Alloc2.
Okay. Can I use my own queue algorithm to create an efficient
producer/consumer pattern using your mem-pool? Something like:

thread-1
-----------------------------------
void* mem = your_malloc(...);
my_queue_push(..., mem);


thread-2
-----------------------------------
void* mem = my_queue_pop_wait(...);
your_free(mem);

?





P.S.

Try not to flame me if I am wrong. I have not downloaded your source code
yet. I have limited time right now.

;^(...

[...]
Sergey P. Derevyago
2008-08-07 13:36:19 UTC
Post by Chris M. Thomasson
Okay. Can I use my own queue algorithm to create an efficient
thread-1
-----------------------------------
void* mem = your_malloc(...);
my_queue_push(..., mem);
thread-2
-----------------------------------
void* mem = my_queue_pop_wait(...);
your_free(mem);
?
No, you can't.
The point is that there is no your_malloc() ;)

Look,
1. I have a class: ders::mem_pool
2. You can create an instance: mem_pool mp;
3. And use it to get some memory: void* ptr=mp.allocate(123);

Class mem_pool is _intentionally_ non-synchronized; it is supposed to be
used as a thread-private mem_pool!

Well, you can say that ubiquitous mem_pool& mp references are really
boring. Yes, you are right: thread-private mem_pools have their price.
And the price is: your applications will run tens and even hundreds of
times faster!
--
With all respect, Sergey. http://ders.stml.net/
mailto : ders at skeptik.net
Chris M. Thomasson
2008-08-08 01:24:13 UTC
Post by Sergey P. Derevyago
Post by Chris M. Thomasson
Okay. Can I use my own queue algorithm to create an efficient
thread-1
-----------------------------------
void* mem = your_malloc(...);
my_queue_push(..., mem);
thread-2
-----------------------------------
void* mem = my_queue_pop_wait(...);
your_free(mem);
?
No, you can't.
The point is that there is no your_malloc() ;)
Look,
1. I have a class: ders::mem_pool
2. You can create an instance: mem_pool mp;
3. And use it to get some memory: void* ptr=mp.allocate(123);
Class mem_pool is _intentionally_ non-synchronized; it is supposed to be
used as a thread-private mem_pool!
Fine. But, as we have discussed before, there is an algorithm in which
memory from private thread pool A can be freed on private thread pool B
while preserving all the performance benefits of local unsynchronized
allocation. Basically, there is no synchronization for thread local
allocations and deallocations. However, for remote deallocations there is
lock-free operation.
Post by Sergey P. Derevyago
Well, you can say that ubiquitous mem_pool& mp references are really
boring. Yes, you are right: thread-private mem_pools have their price. And
the price is: your applications will run tens and even hundreds of times
faster!
I was thinking along the lines of a general-purpose allocator (e.g.,
malloc/free) which is completely based on thread-private memory pools.
Sergey P. Derevyago
2008-08-08 06:47:06 UTC
Post by Chris M. Thomasson
Fine. But, as we have discussed before, there is an algorithm in which
memory from private thread pool A can be freed on private thread pool B
while preserving all the performance benefits of local unsynchronized
allocation. Basically, there is no synchronization for thread local
allocations and deallocations. However, for remote deallocations there
is lock-free operation.
Lock-free doesn't ALWAYS mean fast, you know.
Could you please give a direct link to "preserving all the performance
benefits of local unsynchronized allocation"? I have some doubts...
--
With all respect, Sergey. http://ders.stml.net/
mailto : ders at skeptik.net
Chris M. Thomasson
2008-08-09 00:08:39 UTC
Post by Sergey P. Derevyago
Post by Chris M. Thomasson
Fine. But, as we have discussed before, there is an algorithm in which
memory from private thread pool A can be freed on private thread pool B
while preserving all the performance benefits of local unsynchronized
allocation. Basically, there is no synchronization for thread local
allocations and deallocations. However, for remote deallocations there is
lock-free operation.
Lock-free doesn't ALWAYS mean fast, you know.
Sure.
Post by Sergey P. Derevyago
Could you please give a direct link to "preserving all the performance
benefits of local unsynchronized allocation"? I have some doubts...
Since I don't think you trust me or Dmitriy:

http://people.cs.vt.edu/~scschnei/papers/ismm06.pdf

Please don't say that smart mem pools are impossible. You just end up making
it seem like you don't have any idea on how state-of-the-art memory
allocators actually work!
Sergey P. Derevyago
2008-08-11 08:35:39 UTC
Post by Chris M. Thomasson
http://people.cs.vt.edu/~scschnei/papers/ismm06.pdf
I need some time to review this paper. Wait please...
Post by Chris M. Thomasson
Please don't say that smart mem pools are impossible.
I'm just asking you to plug a _real_ "smart mem pool" into my example1
and show the results.
Is it impossible? Why?
Post by Chris M. Thomasson
You just end up
making it seem like you don't have any idea on how state-of-the-art
memory allocators actually work!
Does ismm06.pdf contain the necessary description?
Something else to review?
--
With all respect, Sergey. http://ders.stml.net/
mailto : ders at skeptik.net
Chris M. Thomasson
2008-08-12 03:39:11 UTC
Post by Sergey P. Derevyago
Post by Chris M. Thomasson
http://people.cs.vt.edu/~scschnei/papers/ismm06.pdf
I need some time to review this paper. Wait please...
Post by Chris M. Thomasson
Please don't say that smart mem pools are impossible.
I'm just asking you to plug a _real_ "smart mem pool" into my example1
and show the results.
Is it impossible? Why?
You can plug the following allocator into your algorithms:

http://webpages.charter.net/appcore/vzoom/malloc/vzmalloc_v000_cpp.html

because it can overload the global new/delete operators. When I get some
free time, I will go ahead and do so. I think the numbers will be more
reliable if multiple parties join in.
Post by Sergey P. Derevyago
Post by Chris M. Thomasson
You just end up making it seem like you don't have any idea on how
state-of-the-art memory allocators actually work!
Does ismm06.pdf contain the necessary description?
Something else to review?
AFAICT, the paper explains things quite well indeed.
Sergey P. Derevyago
2008-08-12 08:24:52 UTC
Post by Chris M. Thomasson
http://webpages.charter.net/appcore/vzoom/malloc/vzmalloc_v000_cpp.html
Unfortunately it requires i686-specific compilers.
--
With all respect, Sergey. http://ders.stml.net/
mailto : ders at skeptik.net
Chris M. Thomasson
2008-08-12 09:52:27 UTC
Post by Sergey P. Derevyago
Post by Chris M. Thomasson
http://webpages.charter.net/appcore/vzoom/malloc/vzmalloc_v000_cpp.html
Unfortunately it requires i686-specific compilers.
Yup. That's a major caveat indeed!
Sergey P. Derevyago
2008-08-20 10:46:26 UTC
Post by Sergey P. Derevyago
Post by Chris M. Thomasson
http://people.cs.vt.edu/~scschnei/papers/ismm06.pdf
I need some time to review this paper. Wait please...
Well, the paper has been reviewed. The main impression is that it's of
average quality, nothing interesting.
The presented Streamflow allocator has no chance to serve as a _general
purpose_ C++ allocator: it
"is implemented on top of Linux 2.6, which provides support for
superpages via a virtual filesystem. Streamflow allocates superpages by
creating files in the virtual filesystem, and mapping these files in
whole or in part to virtual memory."

On the contrary, my mem_pool doesn't rely on any OS-specific features
and can be used in an ordinary single-threaded environment without any
penalty. There is no thread-specific overhead in the code!

So the main points are:
1. Design your application to allow for simultaneous execution. I.e.
don't allocate/free memory across thread boundaries (unless you can
prove that it really makes sense _for this particular task_).
2. Then use mem_pool as a thread-private (sometimes data-private)
allocator to greatly improve performance.

In essence, the non-synchronized mem_pool allocator is built on top of
the ordinary synchronized new/delete operators, so the combination
mem_pool+new is what is really used. On the contrary, Streamflow tries
to replace not just the mem_pool allocator itself but the whole
mem_pool+new combination!
So if you're going to compare mem_pool to some other allocator, please
make sure that it uses the same synchronized "new part" of the
mem_pool+new combination. I.e. it's not quite fair to compare
mem_pool+"old bad new" with a heavily optimized SuperPuperAllocator: you
should cut out the appropriate synchronized part of SuperPuperAllocator
and plug it into mem_pool instead of "old bad new" to see the difference
mem_pool can give.
--
With all respect, Sergey. http://ders.stml.net/
mailto : ders at skeptik.net
Chris M. Thomasson
2008-08-21 11:10:59 UTC
Post by Sergey P. Derevyago
Post by Sergey P. Derevyago
Post by Chris M. Thomasson
http://people.cs.vt.edu/~scschnei/papers/ismm06.pdf
I need some time to review this paper. Wait please...
Well, the paper has been reviewed. The main impression is that it's of
average quality, nothing interesting.
The presented Streamflow allocator has no chance to serve as a _general
ROFL!

Ummm, you can use StreamFlow to replace new/delete, just like the vzmem
region allocator. You need to actually understand the paper. Your
statement seems to suggest that you don't know what the heck you're
talking about! You can't seem to wrap your mind around the fact that
general-purpose allocators can be built with per-thread memory pools...

I have had extensive conversations with the implementors of StreamFlow,
and can tell you to either talk to them, learn about it, or stop spewing
nonsense! Learn something for once! Your allocator is fine. However,
there are general-purpose allocators that are just as fast and OH SO
MUCH MORE FLEXIBLE!

If you actually read the paper, you would realize that StreamFlow works
on platforms with a certain version of CAS... What type of CAS does it
need? Don't you dare think it needs pointers in that CAS. Give me an
answer and prove you actually understand what you read!!!!!!!

;^/

[...]
Sergey P. Derevyago
2008-08-21 11:44:14 UTC
Post by Chris M. Thomasson
Post by Sergey P. Derevyago
Post by Sergey P. Derevyago
Post by Chris M. Thomasson
http://people.cs.vt.edu/~scschnei/papers/ismm06.pdf
I need some time to review this paper. Wait please...
Well, the paper has been reviewed. The main impression is that it's of
average quality, nothing interesting.
The presented Streamflow allocator has no chance to serve as a
ROFL!
Ummm, you can use StreamFlow to replace new/delete, just like the vzmem
region allocator. You need to actually understand the paper. Your
statement seems to suggest that you don't know what the heck you're
talking about! You can't seem to wrap your mind around the fact that
general-purpose allocators can be built with per-thread memory pools...
I have had extensive conversations with the implementors of StreamFlow,
and can tell you to either talk to them, learn about it, or stop spewing
nonsense! Learn something for once! Your allocator is fine. However,
there are general-purpose allocators that are just as fast and OH SO
MUCH MORE FLEXIBLE!
If you actually read the paper, you would realize that StreamFlow works
on platforms with a certain version of CAS... What type of CAS does it
need? Don't you dare think it needs pointers in that CAS. Give me an
answer and prove you actually understand what you read!!!!!!!
No communication after this point!

--
Chris M. Thomasson
2008-08-21 11:59:42 UTC
Post by Sergey P. Derevyago
Post by Chris M. Thomasson
Post by Sergey P. Derevyago
Post by Sergey P. Derevyago
Post by Chris M. Thomasson
http://people.cs.vt.edu/~scschnei/papers/ismm06.pdf
I need some time to review this paper. Wait please...
Well, the paper has been reviewed. The main impression is that it's of
average quality, nothing interesting.
The presented Streamflow allocator has no chance to serve as a _general
ROFL!
Ummm, you can use StreamFlow to replace new/delete, just like the vzmem
region allocator. You need to actually understand the paper. Your
statement seems to suggest that you don't know what the heck you're
talking about! You can't seem to wrap your mind around the fact that
general-purpose allocators can be built with per-thread memory pools...
I have had extensive conversations with the implementors of StreamFlow,
and can tell you to either talk to them, learn about it, or stop spewing
nonsense! Learn something for once! Your allocator is fine. However,
there are general-purpose allocators that are just as fast and OH SO
MUCH MORE FLEXIBLE!
If you actually read the paper, you would realize that StreamFlow works
on platforms with a certain version of CAS... What type of CAS does it
need? Don't you dare think it needs pointers in that CAS. Give me an
answer and prove you actually understand what you read!!!!!!!
No communication after this point!
You make a totally ridiculous claim about StreamFlow. Period.
Chris M. Thomasson
2008-08-21 12:38:39 UTC
Post by Chris M. Thomasson
Post by Sergey P. Derevyago
Post by Chris M. Thomasson
Post by Sergey P. Derevyago
Post by Sergey P. Derevyago
Post by Chris M. Thomasson
http://people.cs.vt.edu/~scschnei/papers/ismm06.pdf
I need some time to review this paper. Wait please...
Well, the paper has been reviewed. The main impression is that it's of
average quality, nothing interesting.
The presented Streamflow allocator has no chance to serve as a _general
ROFL!
Ummm, you can use StreamFlow to replace new/delete, just like the vzmem
region allocator. You need to actually understand the paper. Your
statement seems to suggest that you don't know what the heck you're
talking about! You can't seem to wrap your mind around the fact that
general-purpose allocators can be built with per-thread memory pools...
I have had extensive conversations with the implementors of StreamFlow,
and can tell you to either talk to them, learn about it, or stop spewing
nonsense! Learn something for once! Your allocator is fine. However,
there are general-purpose allocators that are just as fast and OH SO
MUCH MORE FLEXIBLE!
If you actually read the paper, you would realize that StreamFlow works
on platforms with a certain version of CAS... What type of CAS does it
need? Don't you dare think it needs pointers in that CAS. Give me an
answer and prove you actually understand what you read!!!!!!!
No communication after this point!
You make a totally ridiculous claim about StreamFlow. Period.
Your allocator works fine, and is a wonderful item for any programmer to
have in their tool box indeed. The only thing that irks me is when you
continue to claim that general-purpose allocation is dog slow.
Chris M. Thomasson
2008-08-21 11:14:47 UTC
Post by Sergey P. Derevyago
Post by Sergey P. Derevyago
Post by Chris M. Thomasson
http://people.cs.vt.edu/~scschnei/papers/ismm06.pdf
I need some time to review this paper. Wait please...
Well, the paper has been reviewed. The main impression is that it's of
average quality, nothing interesting.
The presented Streamflow allocator has no chance to serve as a _general
purpose_ C++ allocator: it
"is implemented on top of Linux 2.6, which provides support for superpages
via a virtual filesystem. Streamflow allocates superpages by creating
files in the virtual filesystem, and mapping these files in whole or in
part to virtual memory."
[...]

I have implemented the StreamFlow algorithm for fun under Windows and
Solaris (x86 and SPARC, in x86-32 and SPARC 32/64 versions). You don't
know what you're writing about. If I could read your mind, I'd think
you're under the impression that you can't represent a superblock in
their lock-free algorithm? On the contrary, it makes it far easier
because you can use the alignment trick.

Sorry for being so harsh.
Chris M. Thomasson
2008-08-21 11:29:01 UTC
Post by Chris M. Thomasson
Post by Sergey P. Derevyago
Post by Sergey P. Derevyago
Post by Chris M. Thomasson
http://people.cs.vt.edu/~scschnei/papers/ismm06.pdf
I need some time to review this paper. Wait please...
Well, the paper has been reviewed. The main impression is that it's of
average quality, nothing interesting.
The presented Streamflow allocator has no chance to serve as a _general
purpose_ C++ allocator: it
"is implemented on top of Linux 2.6, which provides support for
superpages via a virtual filesystem. Streamflow allocates superpages by
creating files in the virtual filesystem, and mapping these files in
whole or in part to virtual memory."
[...]
I have implemented the StreamFlow algorithm for fun under Windows and
Solaris (x86 and SPARC, in x86-32 and SPARC 32/64 versions). You don't
know what you're writing about. If I could read your mind, I'd think
you're under the impression that you can't represent a superblock in
their lock-free algorithm? On the contrary, it makes it far easier
because you can use the alignment trick.
Sorry for being so harsh.
Well, the fact that several general-purpose allocators (Hoard,
StreamFlow, vzoom slab, vzoom region, etc.) exist which are based on
per-thread memory allocation schemes seems to blow your entire theory
out of the water; right? I am really sick and tired of hearing that
general-purpose allocation is non-scalable crap, because it's simply NOT
TRUE! You are totally misleading everybody who reads your writings; that
is NOT cool in any way, shape or form. You fail to point to any
state-of-the-art memory allocation algorithms even though you knew about
them as far back as 2006! That is not very professional. Read here;
indeed we discussed this before a couple of years ago:

http://groups.google.com/group/comp.programming.threads/msg/bc019dc04362be41

http://groups.google.com/group/comp.programming.threads/msg/1d3aeaa6ebf5181f

http://groups.google.com/group/comp.programming.threads/msg/907573c0e81dba56

http://groups.google.com/group/comp.programming.threads/msg/f5ea5a977a156a03
(I finally got you to agree that my design makes sense... ;^)

http://groups.google.com/group/comp.arch/browse_frm/thread/24c40d42a04ee855


How does your theory that general-purpose memory allocation is crap and
not really useful in multi-threaded applications square with the
contents of the links above? Also, ever heard of Hoard?

:^|
Sergey P. Derevyago
2008-08-21 11:53:07 UTC
Post by Sergey P. Derevyago
Well, the paper has been reviewed. The main impression is that it's
of average quality, nothing interesting.
One can download the paper and look at the graphs and tables on pp.
9-10: Streamflow doesn't show exceptional quality.
Post by Sergey P. Derevyago
The presented Streamflow allocator has no chance to serve as a
_general purpose_ C++ allocator: it
The point is that a well-designed MT application using the combination
"mem_pool+new" will always beat "just new" for any possible "new".
--
With all respect, Sergey. http://ders.stml.net/
mailto : ders at skeptik.net
Chris M. Thomasson
2008-08-21 12:14:04 UTC
Post by Sergey P. Derevyago
Post by Sergey P. Derevyago
Well, the paper has been reviewed. The main impression is that it's
of average quality, nothing interesting.
Streamflow doesn't show exceptional quality.
You don't like its performance; fine. Well, what about Hoard? It uses
per-thread heaps as well. The fact is that your statement is totally
bogus wrt comparing it to your strictly single-threaded scheme, because
its benchmarks compare it to other multi-threaded allocators. You cannot
even begin to think about comparing your allocator to the StreamFlow
benchmarks: it's flexible enough to be used as a replacement for
malloc/free/new/delete and yours simply is not. Yet if you used
StreamFlow, Hoard, vzoom slab or vzoom region in a strictly
single-threaded application, well, the performance between the
general-purpose allocators and yours is going to be equal, or perhaps
yours might be 50-100 milliseconds faster. Big deal. We live in an MT
world today, and single-thread-only allocators are fine, but they are
not going to take off. How can I plug your allocator into an existing
application? Sorry, I cannot, because it assumes the code using the
allocator knows about the existence of threading.
Post by Sergey P. Derevyago
Post by Sergey P. Derevyago
The presented Streamflow allocator has no chance to serve as a
_general purpose_ C++ allocator: it
The point is that well-designed MT application with the combination of
"mem_pool+new" will always beat "just new" for any possible "new".
You totally miss the point! Let me correct your misleading statement:



Sergey simply seems to refuse to admit that a "well-designed" MT application
used in clever conjunction with a general-purpose multi-threaded allocator
which is based on per-thread memory pools will usually beat one that is
using a poorly designed multi-threaded allocator.


Also, you consistently fail to mention that your scheme has several
caveats when you compare it to a multi-threaded allocator... Humm, if a
well-designed MT application uses a producer/consumer pattern, well,
it's going to have to serialize every single production/consumption
pair. This can be a serious performance penalty indeed.



I advise you to quit misleading people who don't know any better. You're
not being intellectually honest, to say the least. I think you know better!

:^|
Chris M. Thomasson
2008-08-09 00:11:41 UTC
Post by Sergey P. Derevyago
Post by Chris M. Thomasson
Fine. But, as we have discussed before, there is an algorithm in which
memory from private thread pool A can be freed on private thread pool B
while preserving all the performance benefits of local unsynchronized
allocation. Basically, there is no synchronization for thread local
allocations and deallocations. However, for remote deallocations there is
lock-free operation.
Lock-free doesn't ALWAYS mean fast, you know.
Could you please give a direct link to "preserving all the performance
benefits of local unsynchronized allocation"? I have some doubts...
http://groups.google.com/group/comp.programming/msg/bb3c14740615e266
Sergey P. Derevyago
2008-08-11 08:38:42 UTC
Post by Chris M. Thomasson
http://groups.google.com/group/comp.programming/msg/bb3c14740615e266
Could you please plug your algorithm into my example1 and show the results?
--
With all respect, Sergey. http://ders.stml.net/
mailto : ders at skeptik.net
Chris M. Thomasson
2008-08-12 03:40:07 UTC
Post by Sergey P. Derevyago
Post by Chris M. Thomasson
http://groups.google.com/group/comp.programming/msg/bb3c14740615e266
Could you please plug your algorithm to my example1 and show the results?
I will when I get some time. Hopefully, you can expect some results within a
week.
Chris M. Thomasson
2008-08-12 04:02:36 UTC
Permalink
Post by Sergey P. Derevyago
Post by Chris M. Thomasson
http://groups.google.com/group/comp.programming/msg/bb3c14740615e266
Could you please plug your algorithm to my example1 and show the results?
Please excuse my ignorance wrt your library, but how do I build example1 for
a windows platform with mingw?
Chris M. Thomasson
2008-08-12 04:23:28 UTC
Permalink
Post by Chris M. Thomasson
Post by Sergey P. Derevyago
Post by Chris M. Thomasson
http://groups.google.com/group/comp.programming/msg/bb3c14740615e266
Could you please plug your algorithm to my example1 and show the results?
Please excuse my ignorance wrt your library, but how do I build example1
for a windows platform with mingw?
I run `mtprog/examples/example1/mogcc.bat' and get the following files:

`mtprog/examples/example1/out.gcc/main1'
`mtprog/examples/example1/out.gcc/main2'


I run:

./main1

and get no output. No feedback. Nothing.



I run:

./main2

and get shi%load of errors.



What do I do with those on a windows mingw/cygwin platform? How do I
generate your lib*.a or lib*.so files for `derslib'?

I have cygwin special gcc 3.4.4

and mingw special 3.4.5



Should I just do this on my Solaris or Linux Box?



My main question:

I need the .so and/or .a files for your derslib library. I don't want to
waste any more time.
Sergey P. Derevyago
2008-08-12 08:32:42 UTC
Permalink
Post by Chris M. Thomasson
I need the .so and/or .a files for your derslib library. I don't want to
waste any more time.
There is no .so and/or .a files, sorry.
The point is that it's a TUTORIAL, so the reader is expected just to
run a simple .bat file rather than install and tune specific build systems.
--
With all respect, Sergey. http://ders.stml.net/
mailto : ders at skeptik.net
Chris M. Thomasson
2008-08-12 04:44:57 UTC
Permalink
Post by Chris M. Thomasson
Post by Sergey P. Derevyago
Post by Chris M. Thomasson
http://groups.google.com/group/comp.programming/msg/bb3c14740615e266
Could you please plug your algorithm to my example1 and show the results?
Please excuse my ignorance wrt your library, but how do I build example1
for a windows platform with mingw?
Never mind, I got it to compile and link. Anyway, I already found a bug in
your main.cpp file of example1...

You include stdio.h in a C++ program. This is wrong; you need

<cstdio>

Also, AFAICT the command line goes like this, right:


number_of_threads std_or_ders

where std == 0, and ders == 1


right?
Chris M. Thomasson
2008-08-12 04:48:00 UTC
Permalink
Post by Chris M. Thomasson
Post by Chris M. Thomasson
Post by Sergey P. Derevyago
Post by Chris M. Thomasson
http://groups.google.com/group/comp.programming/msg/bb3c14740615e266
Could you please plug your algorithm to my example1 and show the results?
Please excuse my ignorance wrt your library, but how do I build example1
for a windows platform with mingw?
Never mind, I got it to compile and link. Anyway, I already found a bug in
your main.cpp file of example1...
You include stdio.h in a cpp program. this is wrong; you need
<cstdio>
also, afaict the command line goes like this right;
number_of_threads std_or_ders
where std == 0, and ders == 1
Okay, again, please excuse my total ignorance, but valid command lines
are as follows:


4 std
4 ders

7 std
7 ders



first param is thread count, second is library.


I got it now.
Sergey P. Derevyago
2008-08-12 08:39:32 UTC
Permalink
Post by Chris M. Thomasson
Never mind, I got it to compile and link. Anyway, I already found a bug
in your main.cpp file of example1...
Great! :)
Post by Chris M. Thomasson
You include stdio.h in a cpp program. this is wrong; you need
<cstdio>
Not really.
<stdio.h> is perfectly legal: no one is going to ban POSIX.
Post by Chris M. Thomasson
also, afaict the command line goes like this right;
number_of_threads std_or_ders
where std == 0, and ders == 1
@out.gcc\main 1 std
@out.gcc\main 2 std
@out.gcc\main 4 std
@out.gcc\main 8 std
@out.gcc\main 16 std
@out.gcc\main 32 std
@out.gcc\main 64 std
@out.gcc\main 1 ders
@out.gcc\main 2 ders
@out.gcc\main 4 ders
@out.gcc\main 8 ders
@out.gcc\main 16 ders
@out.gcc\main 32 ders
@out.gcc\main 64 ders

--
With all respect, Sergey. http://ders.stml.net/
mailto : ders at skeptik.net
David Thompson
2008-08-25 03:01:28 UTC
Permalink
On Tue, 12 Aug 2008 11:39:32 +0300, "Sergey P. Derevyago"
Post by Sergey P. Derevyago
Post by Chris M. Thomasson
You include stdio.h in a cpp program. this is wrong; you need
<cstdio>
Not, really.
<stdio.h> is really legal: no one is going to ban POSIX.
POSIX isn't the issue; stdio is standard-C and standard-C++.
Things like <fcntl.h> are (only) POSIX.

For new-in-C++ features standard-C++ has only the namespace-std::
versions e.g. <vector> not the pre-standard <vector.h>. But for the
parts adopted from C it has both the namespace-std:: version e.g.
<cstdio> and the 'global' version <stdio.h>. The latter, because they
don't provide the benefits of namespace segregation, are in general
considered poorer style, but they are absolutely standard.

Of course this isn't really relevant to threading.

- formerly david.thompson1 || achar(64) || worldnet.att.net

Sergey P. Derevyago
2008-08-12 08:26:49 UTC
Permalink
Post by Chris M. Thomasson
Please excuse my ignorance wrt your library, but how do I build example1
for a windows platform with mingw?
Please look at mdgcc.bat and mogcc.bat files.
--
With all respect, Sergey. http://ders.stml.net/
mailto : ders at skeptik.net
Chris M. Thomasson
2008-08-12 05:59:49 UTC
Permalink
Post by Sergey P. Derevyago
Post by Chris M. Thomasson
http://groups.google.com/group/comp.programming/msg/bb3c14740615e266
Could you please plug your algorithm to my example1 and show the results?
You're not going to like them... Here is the code, compilable with MSVC 7.1 or
higher:

http://webpages.charter.net/appcore/vzoom/malloc/sergey_vzmem_thread.html

I plugged my multi-threading allocator into your test code as:

const int N=1000;
const int M=10000;

class sergey_vzmem_thread : public thread_base {
void on_entry() {
std::list<int> lst;
for (int i=0; i<N; i++) {
for (int j=0; j<M; j++) lst.push_back(j);
for (int j=0; j<M; j++) lst.pop_front();
}
}
};



On a release build, for 2 threads, I get an output of:


(1) - test running
(1) - test finished


test time: 2875 ms
--------------------






With your code on mingw, for 2 threads I get an output of:

2 8282 std






With my allocator and 10 threads I get an output of:

(1) - test running
(1) - test finished


test time: 14563 ms
--------------------




your code with 10 threads is:

10 39213 ders






My smart multi-threading region allocator kills yours dead. Sorry.
Chris M. Thomasson
2008-08-12 06:04:50 UTC
Permalink
Post by Chris M. Thomasson
Post by Sergey P. Derevyago
Post by Chris M. Thomasson
http://groups.google.com/group/comp.programming/msg/bb3c14740615e266
Could you please plug your algorithm to my example1 and show the results?
Your not going to like them... Here is the code compliable with MSVC 7.1
http://webpages.charter.net/appcore/vzoom/malloc/sergey_vzmem_thread.html
const int N=1000;
const int M=10000;
class sergey_vzmem_thread : public thread_base {
void on_entry() {
std::list<int> lst;
for (int i=0; i<N; i++) {
for (int j=0; j<M; j++) lst.push_back(j);
for (int j=0; j<M; j++) lst.pop_front();
}
}
};
(1) - test running
(1) - test finished
test time: 2875 ms
--------------------
2 8282 std
Well, I made a cut-and-paste mistake. Okay, I need to create a fresh thread
and post the results.


The mistake I made was on the 2-thread version of your run. I accidentally
ran the STL version (i.e., a command line of "2 std" instead of "2 ders").
The output was actually:

2 9656 ders




I will entitle the thread:


Sergey P. Derevyago -VS- vzmem...
Post by Chris M. Thomasson
(1) - test running
(1) - test finished
test time: 14563 ms
--------------------
10 39213 ders
My smart multi-threading region allocator kills yours dead; Sorry.
Sergey P. Derevyago
2008-08-12 08:43:40 UTC
Permalink
Post by Chris M. Thomasson
My smart multi-threading region allocator kills yours dead; Sorry.
Very good :)
But the test isn't quite fair, I'll send the details later...
--
With all respect, Sergey. http://ders.stml.net/
mailto : ders at skeptik.net
Chris M. Thomasson
2008-08-12 09:55:49 UTC
Permalink
Post by Sergey P. Derevyago
Post by Chris M. Thomasson
My smart multi-threading region allocator kills yours dead; Sorry.
Very good :)
But the test isn't quite fair, I'll send the details later...
Well, my allocator seems to beat yours on my specific platform, a P4 3.06GHz
with 512MB. That's not conclusive. What I need to do is get my allocator running
under a portable atomic operations API. I think I will go with the HP
atomic_ops package:

http://www.hpl.hp.com/research/linux/atomic_ops/

It's portable to various "popular" platforms, and will allow others who do
not have x86 to run my allocator.
Chris M. Thomasson
2008-08-08 01:36:11 UTC
Permalink
Post by Sergey P. Derevyago
Can one thread allocate memory block, pass it to another which ultimately
frees it? If not, then how are you going to handle producer/consumer?
1. Thread-private allocator
http://ders.stml.net/cpp/mtprog/doc/classders_1_1mem__pool.html means that
its memory blocks don't cross thread boundaries. I.e. you can't get a
block from Alloc1 in Thr1 and free it in Thr2.
2. The common solution is serialization through synchronized queues:
serialize your data in Thr1 and reconstruct it back in Thr2 using its
Alloc2.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

IMHO, the serialization has "unnecessary" overhead when compared to simply
passing a pointer through a queue. BTW, just to clear things up: if thread A
passes object1 over to thread B using your scheme, thread B would not
contain the actual object1, but a brand new copy of it instead, right? How do
you handle scenarios in which a _single_ object needs to be maintained
between a producer and a consumer? For instance, how do I pass objects between
threads which cannot be copied? E.g., perhaps it has a private copy
constructor or something along those lines...
Post by Sergey P. Derevyago
3. Yes, I understand that there can be really big data chunks to pass
between threads. The obvious solution is to use usual new/delete to
allocate this big data and pass a pointer rather than serialize the whole
memory block.
BTW I've written several real-world application using derslib. Just
download http://ders.stml.net/cpp/mtprog/code.zip and look at (sorted by
complexity)
mtprog\src\mtprog\mtftext : just a tutorial sample of trivial grep
mtprog\src\mtprog\mtcksrc : checks source code files
mtprog\src\mtprog\mtdel : deletes files by mask[s]
mtprog\src\mtprog\mtcnvsrc : converts source code files
mtprog\src\mtprog\mtdirdiff: compares directories
The sources are well-written so you'll get no problems to see what's going
on.
Agreed. Unfortunately, I have only had time to briefly skim over the
code...

;^(
Sergey P. Derevyago
2008-08-08 07:00:23 UTC
Permalink
Post by Chris M. Thomasson
IMHO, the serialization has "unnecessary" overhead when compared to
simply passing a pointer through a queue.
As a rule, _synchronized_ mem_pool-s are order(s) of magnitude slower.
Also, there is no need to pass (i.e. serialize/restore) a message to
another worker thread if it can be processed "in place" faster.
Post by Chris M. Thomasson
BTW, just to clear things up,
if thread A passes object1 over to thread B, using your scheme, thread B
would not contain the actual object1, but a brand new copy of it instead
right?
Sure.
Post by Chris M. Thomasson
How do you handle scenarios in which a _single_ object needs to
be maintained between a producer and consumer? For instance, how do I
pass objects between threads which cannot be copied? e.g, perhaps it has
a private copy constructor or something along those lines...
In essence, your statement is "but if I need X then Y..."
But why do you need X? Could you please show me some _realworld_
example where you have to pass non-copyable objects between threads?
--
With all respect, Sergey. http://ders.stml.net/
mailto : ders at skeptik.net
Dmitriy V'jukov
2008-08-08 07:41:44 UTC
Permalink
Post by Sergey P. Derevyago
Post by Chris M. Thomasson
IMHO, the serialization has "unnecessary" overhead when compared to
simply passing a pointer through a queue.
As a rule, the need to use _synchronized_ mem_pool-s is order(s) of
magnitude slower.
On local operations a synchronized mempool has precisely the same
overheads as a single-threaded mempool. Well, OK, a SMART synchronized
mempool.
Post by Sergey P. Derevyago
Also there is no need to pass (i.e. serialize/restore) some message to
another worker thread if it can be processed "in place" faster.
But consider the following situation:
serialize_restore_overhead == processing_time
time_to_pass_object_by_pointer << processing_time

Dmitriy V'jukov
Sergey P. Derevyago
2008-08-08 09:08:04 UTC
Permalink
Post by Dmitriy V'jukov
On local operations synchronized mempool has precisely the same
overheads as single-threaded mempool. Well, Ok, SMART synchronized
mempool.
Impossible.
This time I don't want to cite the theoretical reasons why synchronization
is ALWAYS worse than no synchronization.

I'd better suggest the following experiment: please take my
mtprog\examples\example1\main.cpp example and plug your "SMART
synchronized mempool" to see the difference...
--
With all respect, Sergey. http://ders.stml.net/
mailto : ders at skeptik.net
Dmitriy V'jukov
2008-08-08 10:00:37 UTC
Permalink
Post by Sergey P. Derevyago
Post by Dmitriy V'jukov
On local operations synchronized mempool has precisely the same
overheads as single-threaded mempool. Well, Ok, SMART synchronized
mempool.
Impossible.
This time I don't want to cite theoretical stuff why synchronization is
ALWAYS worse than no synchronization.
I'd better suggest the following experiment: please take my
mtprog\examples\example1\main.cpp example and plug your "SMART
synchronized mempool" to see the difference...
There will be no difference, because the method alloc() will be the same,
except for one additional 'if' on the *slow* path; the method free() will be
exactly the same; plus one additional method - remote_free(). That is all.
Exactly the same performance as your single-threaded mempool.

Dmitriy V'jukov
Sergey P. Derevyago
2008-08-08 10:12:40 UTC
Permalink
Post by Dmitriy V'jukov
Exactly the same performance as your single-threaded mempool.
Nevertheless, please perform _real_ measurements.
"In theory, there is no difference between practice and theory. In
practice, there is."
--
With all respect, Sergey. http://ders.stml.net/
mailto : ders at skeptik.net
Dmitriy V'jukov
2008-08-08 10:57:55 UTC
Permalink
Post by Sergey P. Derevyago
Post by Dmitriy V'jukov
Exactly the same performance as your single-threaded mempool.
Nevertheless, please perform _real_ measurements.
"In theory, there is no difference between practice and theory. In
practice, there is."
I don't want to waste my time... at least not this way. Please, can you
give me a hint about where I can expect performance degradation? The
only thing I see is one additional 'if' on the *slow path* in alloc()...

Dmitriy V'jukov
Sergey P. Derevyago
2008-08-08 11:29:21 UTC
Permalink
Post by Dmitriy V'jukov
I don't want to waste my time... at least this way.
So please don't make unverified statements!
Post by Dmitriy V'jukov
Please, can you
make some hint about where I can expect performance degradation? The
only thing I see is one additional 'if' on *slow-path* in alloc()...
1. I published my code and the performance/scalability results.
2. There is no code from you, only claims.

So what do you expect?! Please don't make questionable statements!
--
With all respect, Sergey. http://ders.stml.net/
mailto : ders at skeptik.net
Chris M. Thomasson
2008-08-08 19:27:51 UTC
Permalink
Post by Sergey P. Derevyago
Post by Dmitriy V'jukov
I don't want to waste my time... at least this way.
So please don't make unverified statements!
Post by Dmitriy V'jukov
Please, can you
make some hint about where I can expect performance degradation? The
only thing I see is one additional 'if' on *slow-path* in alloc()...
1. I published my code and the performance/scalability results.
2. There is no your code, the claims only.
So what do you expect?! Please don't make questionable statements!
You don't seem to get it. The "SMART" memory pool allocator has the exact
same performance as yours, except it's general-purpose (e.g., it can be used
to implement malloc/free) and far more flexible. For instance, you can check
the performance results of the StreamFlow allocator:

http://people.cs.vt.edu/~scschnei/papers/ismm06.pdf

Also, you can test one of my multi-threaded region allocator algorithms:

http://webpages.charter.net/appcore/vzoom/malloc/vzmalloc_v000_cpp.html

This is general-purpose such that it can be used to implement the global
new/delete operators. If you step through the code you will quickly notice
that there is absolutely NO synchronization in local allocations or local
deallocations. The only time any sync is used is when a thread frees memory
that it did not allocate, or when a local allocation hits an exhausted
thread-private pool.
Sergey P. Derevyago
2008-08-11 08:51:01 UTC
Permalink
Post by Chris M. Thomasson
You don't seem to get it. The "SMART" memory pool allocator has the
exact same performance as yours.
Have you really tested?
Post by Chris M. Thomasson
http://webpages.charter.net/appcore/vzoom/malloc/vzmalloc_v000_cpp.html
I only skimmed the code, and I see that your allocate_local() under
some circumstances calls allocate_remote(sz), which imposes
synchronization overhead.
So the question remains: have you _really_ tested?
--
With all respect, Sergey. http://ders.stml.net/
mailto : ders at skeptik.net
Chris M. Thomasson
2008-08-12 03:33:37 UTC
Permalink
Post by Sergey P. Derevyago
Post by Chris M. Thomasson
You don't seem to get it. The "SMART" memory pool allocator has the exact
same performance as yours.
Have you really tested?
Post by Chris M. Thomasson
http://webpages.charter.net/appcore/vzoom/malloc/vzmalloc_v000_cpp.html
I only skimmed over the code and see that your allocate_local() under
some circumstances calls allocate_remote(sz) which imposes synchronization
overhead.
So the question remains: have you _really_ tested?
Yes. When you get some more free time, go ahead and look at the actual
_reason_ why the local allocation function defers to the remote one: it's
only because the local thread-private pool is totally exhausted. Therefore,
when a thread-private pool is exhausted, my allocator uses
synchronization, which does have fairly significant overhead. IMVHO,
an atomic operation is expensive; period. That's why I admire local
allocation algorithms such as yours. Mine is totally general-purpose, and
will use sync ops only on extreme boundary conditions, such as the
pool-exhausted condition.

One point: if an application constantly allocates in one thread and frees in
another, the sync cost will be bounded by the depth of the thread-private
memory pool itself.
Dmitriy V'jukov
2008-08-11 18:59:07 UTC
Permalink
Post by Sergey P. Derevyago
Post by Dmitriy V'jukov
I don't want to waste my time... at least this way.
So please don't make unverified statements!
Post by Dmitriy V'jukov
Please, can you
make some hint about where I can expect performance degradation? The
only thing I see is one additional 'if' on *slow-path* in alloc()...
1. I published my code and the performance/scalability results.
2. There is no your code, the claims only.
So what do you expect?! Please don't make questionable statements!
There is really no need for 'my code'. If you provide a hook for
slow-path allocation in your allocator, then you can consider 'your
code' to be 'my code'.


Dmitriy V'jukov
Chris M. Thomasson
2008-08-12 03:37:03 UTC
Permalink
Post by Dmitriy V'jukov
Post by Sergey P. Derevyago
Post by Dmitriy V'jukov
I don't want to waste my time... at least this way.
So please don't make unverified statements!
Post by Dmitriy V'jukov
Please, can you
make some hint about where I can expect performance degradation? The
only thing I see is one additional 'if' on *slow-path* in alloc()...
1. I published my code and the performance/scalability results.
2. There is no your code, the claims only.
So what do you expect?! Please don't make questionable statements!
There is really no need for 'my code'. If you will provide hook for
slow-path allocation in your allocator, then you can consider 'your
code' as 'my code'.
RIGHT! Great point. I totally forgot to mention:

http://groups.google.com/group/comp.arch/browse_frm/thread/6dc825ec9999d3a8

Again, Sergey, when you get free time, please consider the thread linked to
above. Your mem_pool work can be automatically turned into something that is
directly analogous to the `ismm06.pdf' paper.
Sergey P. Derevyago
2008-08-12 08:53:48 UTC
Permalink
Post by Chris M. Thomasson
Again, Sergey, when you get free time, please consider the thread linked
to above. Your mem_pool work can be automatically turned into something
that is directly analogous to the `ismm06.pdf' paper.
Yes, this is the root of the confusion!

My mem_pool is indeed built on top of standard new/delete:

    if (size<=mp_impl::MOS1) {
        void*& head=IMPL->heads1[(size-1)/mp_impl::SZVP];
        if (!head) IMPL->format_new_chunk(head, size, mp_impl::SZVP);

        ret=head;
        head=*reinterpret_cast<void**>(head);
    }
    else if (size<=mp_impl::MOS2) {
        void*& head=IMPL->heads2[(size-mp_impl::MOS1-1)/mp_impl::SZVP8];
        if (!head) IMPL->format_new_chunk(head, size, mp_impl::SZVP8);

        ret=head;
        head=*reinterpret_cast<void**>(head);
    }
    else ret=operator new(size); // <-------------------------------------

But your code replaces new/delete rather than mem_pool. So you should
compare
   mem_pool+your_pool with your_pool
rather than
   mem_pool+new/delete with your_pool.

What I wanted to say is: you should use a thread-private mem_pool on top of
some other general-purpose memory allocator, because a bare
general-purpose memory allocator is always slower!
--
With all respect, Sergey. http://ders.stml.net/
mailto : ders at skeptik.net
Chris M. Thomasson
2008-08-12 10:18:12 UTC
Permalink
Post by Sergey P. Derevyago
Post by Chris M. Thomasson
Again, Sergey, when you get free time, please consider the thread linked
to above. Your mem_pool work can be automatically turned into something
that is directly analogous to the `ismm06.pdf' paper.
Yes, this is the root of confusion!
if (size<=mp_impl::MOS1) {
void*& head=IMPL->heads1[(size-1)/mp_impl::SZVP];
if (!head) IMPL->format_new_chunk(head, size, mp_impl::SZVP);
ret=head;
head=*reinterpret_cast<void**>(head);
}
else if (size<=mp_impl::MOS2) {
void*& head=IMPL->heads2[(size-mp_impl::MOS1-1)/mp_impl::SZVP8];
if (!head) IMPL->format_new_chunk(head, size, mp_impl::SZVP8);
ret=head;
head=*reinterpret_cast<void**>(head);
}
else ret=operator new(size); // <-------------------------------------
But your code replaces new/delete rather then mem_pool. So you should
compare
mem_pool+your_pool with your_pool
rather than
mem_pool+new/delete with your_pool.
You mean allow your mem_pool to use my region allocator as a backing store?
Post by Sergey P. Derevyago
What I wanted to say is: you should use thread-private mem_pool over some
other general-purpose memory allocator because a bare general-purpose
memory allocator is always slower!
Anyway, on Dmitriy's box, my general-purpose algorithm is around 100ms
slower for 4 threads, and about 1 second slower for 16 threads. Pretty good
for a bare general-purpose vs. a specialized algorithm:

http://groups.google.com/group/comp.programming.threads/msg/cdb7a19bea94c3e9
Chris M. Thomasson
2008-08-08 19:11:22 UTC
Permalink
Post by Sergey P. Derevyago
Post by Dmitriy V'jukov
On local operations synchronized mempool has precisely the same
overheads as single-threaded mempool. Well, Ok, SMART synchronized
mempool.
Impossible.
It is possible. I have done it. Others have as well. BTW, did you know that
Hoard also uses thread-private pools? I would suggest doing some research
into state-of-the-art memory allocation algorithms. Have you ever heard of
StreamFlow? Read this paper:

http://people.cs.vt.edu/~scschnei/papers/ismm06.pdf

Their allocator is just as fast as yours, except it's a heck of a lot more
flexible because memory can be freed in any thread. Keep in mind that
local allocations and deallocations are 100% synchronization-free. Unlike
yours, you can actually use a producer/consumer pattern which does not
require the overhead of serialization.
Post by Sergey P. Derevyago
This time I don't want to cite theoretical stuff why synchronization is
ALWAYS worse than no synchronization.
You seem to be forgetting the conversation we had a couple of years ago. You
agreed that my allocator design makes sense:

http://groups.google.com/group/comp.programming.threads/msg/f5ea5a977a156a03

You already know that it's not impossible to create a thread-private
allocator in which memory can be freed in any thread.
Post by Sergey P. Derevyago
I'd better suggest the following experiment: please take my
mtprog\examples\example1\main.cpp example and plug your "SMART
synchronized mempool" to see the difference...
You want to compare one thread-private allocator to another. What's that
going to prove?
Sergey P. Derevyago
2008-08-11 08:58:32 UTC
Permalink
Post by Chris M. Thomasson
You seem to be forgetting the conversation we had a couple of years ago.
http://groups.google.com/group/comp.programming.threads/msg/f5ea5a977a156a03
You already know that its not impossible to create a thread-private
allocator in which memory can be freed into any thread.
I need a timeout :)
I'm going to review ismm06.pdf and return to this question, OK?
Post by Chris M. Thomasson
Post by Sergey P. Derevyago
I'd better suggest the following experiment: please take my
mtprog\examples\example1\main.cpp example and plug your "SMART
synchronized mempool" to see the difference...
You want to compare one thread-private allocator to another. What's that
going to prove?
It does make sense to compare map to hash_map performance,
doesn't it?
There are a lot of different allocator implementations, and some of them
are inherently slower than others. It always makes sense to perform
_real_ measurements.
--
With all respect, Sergey. http://ders.stml.net/
mailto : ders at skeptik.net
Chris M. Thomasson
2008-08-12 10:09:13 UTC
Permalink
Post by Sergey P. Derevyago
Post by Chris M. Thomasson
You seem to be forgetting the conversation we had a couple of years ago.
http://groups.google.com/group/comp.programming.threads/msg/f5ea5a977a156a03
You already know that its not impossible to create a thread-private
allocator in which memory can be freed into any thread.
I need a timeout :)
I'm going to review ismm06.pdf and return to this question, OK?
Post by Chris M. Thomasson
Post by Sergey P. Derevyago
I'd better suggest the following experiment: please take my
mtprog\examples\example1\main.cpp example and plug your "SMART
synchronized mempool" to see the difference...
You want to compare one thread-private allocator to another. What's that
going to prove?
It does really make sense to compare the map to hash_map performance,
doesn't it?
There are a lot of different allocator implementations, some of them are
inherently slower then others. It always makes sense to perform _real_
measurements.
I am interested in measuring my general-purpose allocator against your
special-purpose algorithm.
Giancarlo Niccolai
2008-08-08 14:01:02 UTC
Permalink
Post by Sergey P. Derevyago
Post by Chris M. Thomasson
IMHO, the serialization has "unnecessary" overhead when compared to
simply passing a pointer through a queue.
As a rule, the need to use _synchronized_ mem_pool-s is order(s) of
magnitude slower.
Also there is no need to pass (i.e. serialize/restore) some message
to another worker thread if it can be processed "in place" faster.
Post by Chris M. Thomasson
BTW, just to clear things up, if thread A passes object1 over to
thread B, using your scheme, thread B would not contain the actual
object1, but a brand new copy of it instead right?
Sure.
Post by Chris M. Thomasson
How do you handle scenarios in which a _single_ object needs to be
maintained between a producer and consumer? For instance, how do I
pass objects between threads which cannot be copied? e.g, perhaps it
has a private copy constructor or something along those lines...
In essence, your statement is "but if I need X then Y..."
But why do you need X? Could you please show me some _realworld_
example where you have to copy non-copyable objects between threads?
The local allocation/memory pool/garbage collector, with object
serialization to pass objects between threads, is the main model I used
to write the multithreading module for the Falcon programming language.
I also provide some objects, not needing memory pool management, to share
memory and simple data (i.e. numbers, fixed-size strings etc.) directly
between threads.

Unless there is a need to share very complex objects (and usually it
is possible to ask the thread owning the object to update it via
efficient remote-thread calls), the higher efficiency and performance of
the memory allocation and garbage management seem to counterbalance the
higher complexity of data sharing. Of course, some burden is put on the
language user, who needs to know that objects are NOT shared, at best
they are copied, and to work around this limit.

However, for a program written from the ground up and addressing real-world
problems, it seems not to add too much complexity, and it enforces an
agent-based MT programming model, where every agent is responsible for
keeping its data and sharing INFORMATION about its STATE, rather than the
data itself, which helps to keep a cleaner MT environment. I have
written and seen many C/C++ programs that sooner or later ended up
pursuing this model and these goals as the complexity of the MT
interactions rose.

Ok, long words to say that there are full _realworld_ examples in which
you don't need to copy or share much between threads, so I lean toward Sergey.

---

OTOH, I've been loosely working on a shared-pool model where data is
created in a thread-specific pool but can be sent around to any other
thread in the program, and is then re-collected, after every other thread is
done, directly by the pool (and thread) where the object was created.

As soon as I have a bit of free time after our next release, I will try to
go deeper into that, so as to have the best of both worlds and be able to
share big, complex objects directly across threads as well. If someone is
interested, I'd be pleased to go into the details; and we always need
hands, so if someone wishes to give this a try starting from an already
working base, he/she is welcome...

Bests,
Giancarlo Niccolai.
Sergey P. Derevyago
2008-08-08 14:26:24 UTC
Permalink
Post by Giancarlo Niccolai
every agent is responsible for
keeping its data and sharing INFORMATION about its STATE, rather than the
data itself, which helps to keep a cleaner MT environment
Concise and clear!
I see we have (almost) the same vision, thank you.
--
With all respect, Sergey. http://ders.stml.net/
mailto : ders at skeptik.net
Chris M. Thomasson
2008-08-08 19:03:34 UTC
Permalink
"Giancarlo Niccolai" <gc-***@removeme-niccolai.cc> wrote in message news:489c519f$0$11385$***@news.tiscali.it...
[...]
Post by Giancarlo Niccolai
OTOH, I've been loosely working to a shared pool model where data is
created in a thread-specific pool, but can be sent around to any other
thread in the program, and then re-collected after each other thread is
done, directly by the pool (and thread) where the object was created.
As soon as I have a bit of free time from our next release I will try to
do go deep in that, so to have the best of both worlds and being able to
share also big, complex objects directly across threads. If someone is
interested, I'd be pleased to enter the details; and we always need
hands, so if someone wish to give a try at this starting from an already
working base, he/she is welcome...
Here is pseudo-code for a memory-allocator that is based on 100%
thread-private pools, however, memory can be freed by any thread:

http://groups.google.com/group/comp.programming.threads/browse_frm/thread/8245f4b48591fc69

Here is some crude, but fully compilable code for a general-purpose region
allocator that is based on a similar algorithm:

http://webpages.charter.net/appcore/vzoom/malloc/vzmalloc_v000_cpp.html

It uses some x86 assembly which compiles under MSVC++. If you study the code
you will see that local allocations and deallocations are synchronization-free.
Remote deallocations use a lock-free operation, and attempts to allocate
from an exhausted local pool use a wait-free operation.
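The scheme described above (a private local free list per thread, plus a lock-free list for blocks freed by other threads) can be sketched roughly as follows. This is an illustrative fixed-size-block sketch with made-up names, not the actual vzmalloc code:

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <new>

// Sketch: each pool is owned by one thread; only the owner allocates
// from it. Other threads return blocks by pushing them onto a
// lock-free LIFO, which the owner drains when its local list runs dry.
class private_pool {
    struct node { node* next; };

    node* local_ = nullptr;              // touched only by the owner thread
    std::atomic<node*> remote_{nullptr}; // pushed to by any thread

public:
    static constexpr std::size_t block_size = 64; // fixed-size blocks for simplicity

    void* alloc() {
        if (!local_)  // local list exhausted: grab everything remote threads freed
            local_ = remote_.exchange(nullptr, std::memory_order_acquire);
        if (!local_)
            return ::operator new(block_size);    // fall back to the global heap
        node* n = local_;
        local_ = n->next;
        return n;
    }

    void free_local(void* p) {           // owner thread: no synchronization at all
        node* n = static_cast<node*>(p);
        n->next = local_;
        local_ = n;
    }

    void free_remote(void* p) {          // any other thread: lock-free push
        node* n = static_cast<node*>(p);
        node* head = remote_.load(std::memory_order_relaxed);
        do {
            n->next = head;
        } while (!remote_.compare_exchange_weak(head, n,
                     std::memory_order_release, std::memory_order_relaxed));
    }
};
```

In the common case (owner allocates, owner frees) no atomic operations run at all; only the cross-thread free and the refill-from-remote path pay for synchronization.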
Giancarlo Niccolai
2008-08-09 13:57:32 UTC
Permalink
Post by Chris M. Thomasson
[...]
Here is pseudo-code for a memory-allocator that is based on 100%
http://groups.google.com/group/comp.programming.threads/browse_frm/thread/8245f4b48591fc69
Here some crude, but fully compliable code for a general-purpose region
http://webpages.charter.net/appcore/vzoom/malloc/vzmalloc_v000_cpp.html
Thanks! I will dig into that.

Gian.
Chris M. Thomasson
2008-08-08 19:21:19 UTC
Permalink
Post by Sergey P. Derevyago
Post by Chris M. Thomasson
IMHO, the serialization has "unnecessary" overhead when compared to
simply passing a pointer through a queue.
As a rule, using _synchronized_ mem_pool-s is order(s) of
magnitude slower.
Also, there is no need to pass (i.e. serialize/restore) a message to
another worker thread if it can be processed "in place" faster.
Post by Chris M. Thomasson
BTW, just to clear things up, if thread A passes object1 over to thread
B, using your scheme, thread B would not contain the actual object1, but
a brand new copy of it instead right?
Sure.
Post by Chris M. Thomasson
How do you handle scenarios in which a _single_ object needs to be
maintained between a producer and consumer? For instance, how do I pass
objects between threads which cannot be copied? e.g, perhaps it has a
private copy constructor or something along those lines...
In essence, your statement is "but if I need X then Y..."
But why do you need X? Could you please show me some _realworld_ example
where you have to copy non-copyable objects between threads?
The point is that I cannot use your system with objects that have a private
copy-ctor in a producer/consumer pattern.

Let's say that I have a server connection object that has buffers, a socket,
and pointers to other active connections. When data comes in on the
connection's socket I want to pass a pointer to the object to a pool of
consumer threads which act on the received data. If I use your system, each
time data comes in and I need to hand a pointer to the object off to a
consumer, the whole connection object will be serialized into another copy.
So, before you know it, there are multiple copies of the connection object
floating around. I only want ONE connection object per socket, therefore its
copy-ctor and assignment operators would be private, which means I cannot
use your allocator.

I don't see how you could do an efficient producer/consumer pattern if every
production results in a full-blown copy.
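The pattern Chris has in mind can be sketched like this (all names are illustrative, not from anyone's actual code): a single non-copyable connection object whose address is handed to consumers through a synchronized queue, so no copies are ever made:

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>
#include <thread>

// A single, non-copyable connection object: exactly one instance per socket.
struct connection {
    int socket_fd;
    std::string buffer;
    explicit connection(int fd) : socket_fd(fd) {}
    connection(const connection&) = delete;
    connection& operator=(const connection&) = delete;
};

// Minimal blocking queue: producers push pointers, consumers pop them.
template<class T>
class sync_queue {
    std::queue<T> q_;
    std::mutex m_;
    std::condition_variable cv_;
public:
    void push(T v) {
        { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(v)); }
        cv_.notify_one();
    }
    T pop() {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&]{ return !q_.empty(); });
        T v = std::move(q_.front());
        q_.pop();
        return v;
    }
};
```

A producer thread pushes `&conn` and a consumer pops it; both threads touch the one and only connection instance, and the queue's mutex provides the necessary happens-before ordering for the data written before the push.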
Giancarlo Niccolai
2008-08-09 13:56:56 UTC
Permalink
Post by Chris M. Thomasson
Post by Sergey P. Derevyago
Post by Chris M. Thomasson
IMHO, the serialization has "unnecessary" overhead when compared to
simply passing a pointer through a queue.
As a rule, the need to use _synchronized_ mem_pool-s is order(s) of
magnitude slower.
Also there is no need to pass (i.e. serialize/restore) some message to
another worker thread if it can be processed "in place" faster.
Post by Chris M. Thomasson
BTW, just to clear things up, if thread A passes object1 over to
thread B, using your scheme, thread B would not contain the actual
object1, but a brand new copy of it instead right?
Sure.
Post by Chris M. Thomasson
How do you handle scenarios in which a _single_ object needs to be
maintained between a producer and consumer? For instance, how do I
pass objects between threads which cannot be copied? e.g, perhaps it
has a private copy constructor or something along those lines...
In essence, your statement is "but if I need X then Y..."
But why do you need X? Could you please show me some _realworld_
example where you have to copy non-copyable objects between threads?
The point is that I cannot use your system to use objects with a private
copy-ctor in a producer/consumer pattern.
Lets say that I have a server connection object that has buffers and
socket, and pointers to other active connections. When data comes in on
the connections socket I want to pass a pointer of the object to a pool
of consumer threads which act on the received data. If I use your
system, each time data comes in and I need to hand a pointer to the
object off to a consumer, the whole connection object will be serialized
into another copy. So, before you know it, there are multiple copies of
the connection object floating around. I only want ONE connection object
per-socket, therefore its copy-ctor and assignment operators would be
private, which means I cannot use your allocator.
I don't see how you could do efficient producer/consumer pattern if
every production results in a full-blown copy.
I solved this problem with an object implementing a shared memory
abstraction.

The thread receiving the data creates a thread-local instance of a "shared
memory" object, which is more or less a reference counter and a pointer
to the shared memory, the latter being separately allocated on the global heap.

The receiver thread then fills the shared memory area with the incoming
data. When the object (counter + pointer) is "serialized" and sent to
another thread, the counter is atomically incremented; when it is
"deserialized" and a local instance of the shared memory object is
created, nothing is done. When a local GC frees an instance, the counter
is decremented. When the counter hits 0, the shared memory is freed from
the heap. (Needless to say, the counter is part of the heap/shared memory).

In short, only the "shell" of the data needs to be copied across
threads, while the "core" of the data can be shared.

I even have a facility in the serialization code which detects whether the
serialization is "live" or not. Live serialization is meant to be
short-lived and performed between threads of the same process, so there is
no need to, e.g., unroll pointers into streams. It is highly optimized (it
resolves more or less to a flat memcpy of the data).

Indeed, there is an overhead even in this case, as sending a pointer
around is better than sending a copy of a small structure around, and
you need a bit of extra memory, but the simplicity of the allocation and
of the GC process counterbalances this cost.
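The counter-plus-pointer scheme described above could be sketched like this in modern C++ (illustrative names only; the actual implementation lives in Giancarlo's own codebase and hooks into its GC):

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <new>

// The "core": refcount + payload in one heap allocation, so the counter
// is part of the shared memory itself, as described in the post.
struct shared_block {
    std::atomic<int> refs{1};
    std::size_t size;
    // payload bytes follow the header in the same allocation

    static shared_block* create(std::size_t n) {
        void* mem = ::operator new(sizeof(shared_block) + n);
        shared_block* b = new (mem) shared_block;
        b->size = n;
        return b;
    }
    char* data() { return reinterpret_cast<char*>(this + 1); }
};

// The "shell": a thread-local handle that is cheap to copy across threads.
class shared_mem {
    shared_block* b_;
public:
    explicit shared_mem(std::size_t n) : b_(shared_block::create(n)) {}

    // "Serializing" the shell to another thread just bumps the counter.
    shared_mem(const shared_mem& o) : b_(o.b_) {
        b_->refs.fetch_add(1, std::memory_order_relaxed);
    }
    shared_mem& operator=(const shared_mem&) = delete;

    // Releasing a shell decrements; the last release frees the core.
    ~shared_mem() {
        if (b_->refs.fetch_sub(1, std::memory_order_acq_rel) == 1) {
            b_->~shared_block();
            ::operator delete(b_);
        }
    }
    char* data() { return b_->data(); }
    int refs() const { return b_->refs.load(std::memory_order_relaxed); }
};
```

Only the shell (pointer plus a counter bump) crosses thread boundaries; the payload itself is never copied.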

GN.
Sergey P. Derevyago
2008-08-11 09:09:02 UTC
Permalink
Post by Chris M. Thomasson
Lets say that I have a server connection object that has buffers and
socket, and pointers to other active connections. When data comes in on
the connections socket I want to pass a pointer of the object to a pool
of consumer threads which act on the received data. If I use your
system, each time data comes in and I need to hand a pointer to the
object off to a consumer, the whole connection object will be serialized
into another copy.
Not really.
It doesn't make sense to copy if you can make sure that the connection
object is used by only one thread at a time.

If you look at mtprog\mtdirdiff\maintask.hpp you'll see that DirCont
has its own mem_pool:

struct DirCont {
    mem_pool ownmp;
    sh_text name;
    hash_vec<sh_text, char> dirs;
    hash_vec<sh_text, unsigned long> files;
    // ...
};

The point is that it's used only by one thread at a time but the thread
isn't always the same.
So we see not a thread-private mem_pool but a DATA-private one.
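The data-private pool idea can be approximated with standard C++17 facilities. The sketch below uses std::pmr in place of Sergey's mem_pool/hash_vec types (the layout is illustrative, not his actual code): the object owns its pool, so whichever single thread currently holds the object allocates from that pool without any locking.

```cpp
#include <cassert>
#include <memory>
#include <memory_resource>
#include <string>
#include <vector>

// Data-private pool sketch: the pool travels with the data it serves.
// Only one thread at a time owns the object, so an *unsynchronized*
// pool resource is safe, and allocation needs no locks.
struct dir_cont {
    std::pmr::unsynchronized_pool_resource ownmp;   // the data-private pool
    std::pmr::vector<std::pmr::string> dirs{&ownmp};
    std::pmr::vector<std::pmr::string> files{&ownmp};
};
```

Moving a `std::unique_ptr<dir_cont>` through a queue hands the whole object, pool included, to the next owning thread, which matches the "one thread at a time, but not always the same thread" usage described above.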
--
With all respect, Sergey. http://ders.stml.net/
mailto : ders at skeptik.net
Chris M. Thomasson
2008-08-12 04:55:53 UTC
Permalink
Post by Sergey P. Derevyago
* I've written a C++ multithreading tutorial. Here it is if you can read
Russian: http://ders.stml.net/cpp/mtprog/mtprog.html.
* The source code is http://ders.stml.net/cpp/mtprog/code.zip and it has
English comments.
* Doxygen is http://ders.stml.net/cpp/mtprog/doc/index.html
I have no time to create a full English translation so I'm going to
describe only several important issues.
[...]

what's going on, Sergey? On my P4 3.06GHz single-CPU hyper-threaded box I get:

an output of

2 8282 std

using a command line of:

2 std

------------------------------------------------------------------------

I get an output of:

2 9656 ders

using a command line of:

2 ders



Yours is definitely slower than the standard on my box.



Will somebody else post their results on similar iron?




I am going to post the results using one of my region allocators here:

http://webpages.charter.net/appcore/vzoom/malloc/vzmalloc_v000_cpp.html

within a day or two.
Chris M. Thomasson
2008-08-12 05:33:43 UTC
Permalink
Post by Chris M. Thomasson
Post by Sergey P. Derevyago
* I've written a C++ multithreading tutorial. Here it is if you can read
Russian: http://ders.stml.net/cpp/mtprog/mtprog.html.
* The source code is http://ders.stml.net/cpp/mtprog/code.zip and it has
English comments.
* Doxygen is http://ders.stml.net/cpp/mtprog/doc/index.html
I have no time to create a full English translation so I'm going to
describe only several important issues.
[...]
an output of
2 8282 std
2 std
------------------------------------------------------------------------
2 9656 ders
2 ders
Yours is definitely slower than the standard on my box.
[...]

Yours is on average around 3-5 seconds faster than standard when I crank up
the threads to over 8. I get:


10 44124 std
10 39213 std
Chris M. Thomasson
2008-08-12 06:00:58 UTC
Permalink
Post by Chris M. Thomasson
Post by Chris M. Thomasson
Post by Sergey P. Derevyago
* I've written a C++ multithreading tutorial. Here it is if you can read
Russian: http://ders.stml.net/cpp/mtprog/mtprog.html.
* The source code is http://ders.stml.net/cpp/mtprog/code.zip and it has
English comments.
* Doxygen is http://ders.stml.net/cpp/mtprog/doc/index.html
I have no time to create a full English translation so I'm going to
describe only several important issues.
[...]
an output of
2 8282 std
2 std
------------------------------------------------------------------------
2 9656 ders
2 ders
Yours is definitely slower than the standard on my box.
[...]
Yours is on average around 3-5 seconds faster than standard when I crank
10 44124 std
10 39213 std
OOPS! The last one was supposed to read as:

10 39213 ders
Chris M. Thomasson
2008-08-12 05:42:07 UTC
Permalink
Post by Chris M. Thomasson
Post by Sergey P. Derevyago
* I've written a C++ multithreading tutorial. Here it is if you can read
Russian: http://ders.stml.net/cpp/mtprog/mtprog.html.
* The source code is http://ders.stml.net/cpp/mtprog/code.zip and it has
English comments.
* Doxygen is http://ders.stml.net/cpp/mtprog/doc/index.html
I have no time to create a full English translation so I'm going to
describe only several important issues.
[...]
[...]
Post by Chris M. Thomasson
http://webpages.charter.net/appcore/vzoom/malloc/vzmalloc_v000_cpp.html
within a day or two.
Your example1 test application for the normal standard list is:

const int N=1000;
const int M=10000;

void start_std(void*)
{
    list<int> lst;
    for (int i=0; i<N; i++) {
        for (int j=0; j<M; j++) lst.push_back(j);
        for (int j=0; j<M; j++) lst.pop_front();
    }
}

where start_std is the thread entry function, right?
Sergey P. Derevyago
2008-08-12 08:56:06 UTC
Permalink
Post by Chris M. Thomasson
an output of
2 8282 std
2 std
------------------------------------------------------------------------
2 9656 ders
2 ders
Yours is definitely slower than the standard on my box.
Have you disabled assertions with -DNDEBUG?

--
With all respect, Sergey. http://ders.stml.net/
mailto : ders at skeptik.net
Chris M. Thomasson
2008-08-12 09:58:32 UTC
Permalink
Post by Sergey P. Derevyago
Post by Chris M. Thomasson
an output of
2 8282 std
2 std
------------------------------------------------------------------------
2 9656 ders
2 ders
Yours is definitely slower than the standard on my box.
Have you disabled assertions with -DNDEBUG?
http://groups.google.com/group/comp.programming.threads/msg/ad1b3f703c748892
Chris M. Thomasson
2008-08-12 11:11:09 UTC
Permalink
Post by Sergey P. Derevyago
* I've written a C++ multithreading tutorial. Here it is if you can read
Russian: http://ders.stml.net/cpp/mtprog/mtprog.html.
* The source code is http://ders.stml.net/cpp/mtprog/code.zip and it has
English comments.
* Doxygen is http://ders.stml.net/cpp/mtprog/doc/index.html
I have no time to create a full English translation so I'm going to
describe only several important issues.
[...]

Do you have an alignment issue with your code? Apparently, you do:

http://groups.google.com/group/comp.programming.threads/msg/2143c3a4344f25a7

I have not confirmed this yet. Actually, I only very briefly scanned your
code, and did not bother to look at alignment because I figured that was a
given, such that any allocator code would explicitly handle it, because it is
such an important issue indeed. If your code does align memory correctly,
please point that out. Otherwise, you really should fix this error. It will
slow performance down just a little bit, but unfortunately that's the price
to pay for correctness...
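For reference, the usual way a pool or region allocator guarantees alignment is to round each requested size up to a multiple of `alignof(std::max_align_t)`, so the internal pointer arithmetic never yields a misaligned address. A minimal sketch (not Sergey's actual code):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>

// Round n up to a multiple of a (a must be a power of two).
inline std::size_t align_up(std::size_t n,
                            std::size_t a = alignof(std::max_align_t)) {
    return (n + a - 1) & ~(a - 1);
}

// A bump allocator whose base comes from ::operator new (already
// maximally aligned) and whose offsets are always aligned multiples,
// so every returned pointer is suitably aligned for any object type.
class arena {
    char* base_;
    std::size_t cap_;
    std::size_t used_ = 0;
public:
    explicit arena(std::size_t cap)
        : base_(static_cast<char*>(::operator new(cap))), cap_(cap) {}
    ~arena() { ::operator delete(base_); }
    arena(const arena&) = delete;
    arena& operator=(const arena&) = delete;

    void* alloc(std::size_t n) {
        n = align_up(n);                 // the small per-block cost mentioned above
        if (used_ + n > cap_) return nullptr;
        void* p = base_ + used_;
        used_ += n;
        return p;
    }
};
```

The rounding wastes a few bytes per block, which is the small performance/space price Chris refers to.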