Memory usage and tail calls #534

yorickhardy · 2024-03-25T19:10:00Z

Hello,

I am trying out (srfi 18) in cyclone and noticed that memory was being "consumed" very quickly by a simple monitoring thread, the following small example seems to exhibit this behaviour:

(define (busy x) (busy (+ x 1)))
(busy 0)

Similarly, the cyclone compiler can be encouraged to use excessive memory (and compile time) by:

(define (compile-forever x) x (compile-forever x))
(compile-forever 0)

Both examples work as expected in icyc (with regards to memory use).

Am I doing something unreasonable here? I am using cyclone-0.36.0 on NetBSD 10.99.10 amd64 (current-ish).

The text was updated successfully, but these errors were encountered:

justinethier · 2024-03-26T02:31:31Z

Hi @yorickhardy. First of all, thanks for the report! These both seem like real issues, let me find some time this week to take a closer look.

justinethier · 2024-04-02T02:35:22Z

Regarding:

(define (busy x) (busy (+ x 1)))
(busy 0)

Suspect this is related to the amount of memory added to a thread after major GC. Need to look into this more.

Regarding the compile-forever example, there is a problem with the CPS optimization beta expansion phase where it is failing to detect a recursive call, and executing forever. Looks like the problem (or at least part of it) is that analyze:find-recursive-calls does not properly handle edge cases where an ast:lambda call is made, or later on when the optimizer changes such calls to use Cyc-seq, EG:

(analyze:find-recursive-calls                                                    
  scan                                                                           
  app                                                                            
  (Cyc-seq x$1$2 (compile-forever k$5 x$1$2)))

Perform full scanning of function application list to ensure self-recursive calls are found. This prevents infinite loops in the beta expansion code when compiling simple recursive calls.

yorickhardy · 2024-04-03T19:20:37Z

Thanks!

I am sorry that I have not contributed any fixes, I am going to (eventually) try to put more time into understanding cyclone and help where/if I can.

justinethier · 2024-04-04T00:47:13Z

No worries, and bug reports are always appreciated! I would welcome fixes as well, however, these two issues in particular are in areas that would be difficult to track down... and I still need to investigate the first one :)

justinethier · 2024-04-09T02:34:25Z

As this runs, + will eventually cause a conversion from fixnum integer to a bignum. I suspect that cyclone is inefficiently allocating memory for bignums in this use case and that is the cause of the high memory usage.

Consider the memory usage when using fixnums or doubles:

(import (srfi 143))                                                             
(define (busy x) (busy (fx+ x 1)))                                              
(busy 0)

(define (busy x) (busy (+ x 1.0)))                                              
(busy 0.0)

That said, I suspect we can do better here, especially since the interpreters can. Will need to spend time looking into this further.

justinethier · 2024-04-10T02:28:04Z

Snippet of code from Cyc_sum:

    } else if (tx == bignum_tag && ty == -1) { \                                 
        BIGNUM_CALL(mp_init(&bn_tmp2)); \                                        
        Cyc_int2bignum(obj_obj2int(y), &bn_tmp2); \                               
        BIGNUM_CALL(BN_OP(&(x->bignum_t.bn), &bn_tmp2, &(x->bignum_t.bn))); \    
        mp_clear(&bn_tmp2); \

Compare with code from Cyc_fast_sum used in compiled code:

  if (is_object_type(x) && type_of(x) == bignum_tag) {                           
    if (obj_is_int(y)) {                                                         
      mp_int bny;                                                                
      BIGNUM_CALL(mp_init(&bny));                                                
      Cyc_int2bignum(obj_obj2int(y), &bny);                                      
      alloc_bignum(data, bn);                                                    
      BIGNUM_CALL(mp_add(&bignum_value(x), &bny, &bignum_value(bn)));            
      mp_clear(&bny);                                                            
      return bn;

Questions: Why are we doing an allocation here but not above, and can we safely speed up / optimize the latter code?

yorickhardy · 2024-04-10T23:44:31Z

I did not realize that it had switched over to bignum! Thanks. I have more or less isolated the memory usage problems that I was encountering (firstly, I was using many short lived threads instead of a thread pool and assumed that thread-join would (eventually) garbage collect the thread: that assumption is false as far as I can tell). Here is another example with bounded(?) but large memory use:

;  PID USERNAME PRI NICE   SIZE   RES STATE       TIME   WCPU    CPU COMMAND
; 9949 yorick    25    0   556M  425M CPU/1       0:17 97.91% 56.10% test
(define (loop-test)
  (let ( (o (open-output-string)) )
    (display "abc" o)
    (close-output-port o)
    (loop-test) ) )

(loop-test)

and similarly with constantly growing memory use:

;  PID USERNAME PRI NICE   SIZE   RES STATE       TIME   WCPU    CPU COMMAND
; 3563 yorick    26    0    13G   10G CPU/0       0:49   164%   128% test

(define output-port (make-parameter #f))

(define (loop-test)
  (parameterize ( (output-port (open-output-string)) )
    (display "abc" (output-port))
    (close-output-port (output-port))
    (loop-test) ) )

(loop-test)

which surprised me! This simple example grows a bit slower but in the same way:

(define a-string (make-parameter #f))

(define (loop-test)
 (parameterize ( (a-string "123") )
  (loop-test) ) )

(loop-test)

Strangely, icyc manages fine. I guess the use of parameterize here is not great since one can easily re-write this example to avoid the nested parameterizations.

I am not sure whether these examples are helpful, or just examples of poorly written scheme!

At this point I am not convinced that the remaining part is a valid issue, or poor programming on my part - so please close the issue if you are satisfied.

yorickhardy · 2024-04-11T21:54:10Z

Hello again,

The memory consumption that motivated this issue is all due to the use of many threads, via srfi-18 and Cyc_scm_call. I have switched to thread pools and Cyc_scm_call_no_gc and the memory consumption is now normal.

I was very surprised that terminated threads consume memory. Nevertheless, I don't think the issue is entirely valid as reported, since the underlying problem was threads (not GC and tail calls). Apologies if I have wasted too much of your time.

(I do appreciate that you have looked into the reported examples and that you are willing to investigate improvements.)

Thanks!

justinethier · 2024-04-15T01:07:24Z

Glad you got it working @yorickhardy!

I appreciate your feedback and think there are genuine issues that are being raised here, though I have not dug into your latest loop-test examples. Let me spend some time looking into everything you brought up to see what can be improved.

justinethier · 2024-04-23T01:36:41Z

@yorickhardy Do you have an example of thread-join not freeing up memory?

yorickhardy · 2024-04-23T21:02:41Z

Sure! I hope I have not missed anything obvious ...

(import (scheme base) (scheme write) (srfi 18))

(define (monitor n)
  (thread-yield!)
  n)

(define (wait-for-monitor next)
  (let ( (t (make-thread (lambda () (monitor next)))) )
    (thread-start! t)
    (thread-yield!)
    (thread-join! t)
    (wait-for-monitor (+ next 1.0)) ) )

(wait-for-monitor 0.0)

yorickhardy · 2024-09-05T21:41:04Z

Hello again,

I am not sure if this is the whole story, but after adding a bit of debug output it seems that the memory allocated in %alloc-thread-data is never freed (in the example "monitor" program above). I am also not sure when Cyc-end-thread! is called.

I am not yet able to make a more meaningful contribution, but I am trying to work towards it!

justinethier · 2024-09-07T02:11:04Z

Hey @yorickhardy, appreciate the update!

That's interesting.... we do malloc a thread data object in %alloc-thread-data. What is supposed to happen is that when the thread ends we call gc_remove_mutator to place that data on a list of old mutators (application threads) to be cleaned up. Eventually that list gets cleaned up by gc_free_old_thread_data() which is called on the main thread after a major GC trace is complete.

I wonder, could it be that major GC is not being triggered by the example program?

yorickhardy · 2024-09-08T16:40:37Z

Yes, that seems to be the beginning of the issue. Am I correct in saying that the collector only starts (sometimes) when allocating scheme objects? In my debugging output, gc_free_old_thread_data() was never called, because the collector remained in the STAGE_RESTING state.

I hacked together a workaround to force the collector out of the STAGE_RESTING state, then the free happens (the workaround is very ugly, I will try to think of a better way to do this); but memory usage still increases -- I still need to check if the vector with the thread information and result is ever freed, that is possibly the second reason for the increasing memory usage.

justinethier · 2024-09-09T01:46:06Z

Correct, the collector will only trigger a major GC when allocating objects and the runtime detects a need to start that collector. For example, a percentage of memory being used up.

yorickhardy · 2024-09-09T22:09:24Z

Thanks.

In gc_thread_data_free, gc_merge_all_heaps adds the terminated thread's heap(s) to the primordial thread. Could this be the largest contribution to the increased usage of memory?

I still need to track down how the heap allocations are eventually freed, and then I can try to force the freeing of memory to see if that shows better memory use for the example program.

This is a first attempt to improve the memory usage reported in issue justinethier#534.

This ensures that the collector has a chance to run whenever a thread exits. Attempts to partially address issue justinethier#534.

When a thread exits, the heap is merged into the main thread. Before doing so, free any unused parts of the heap to reduce memory usage. Attempts to partially address issue justinethier#534.

yorickhardy · 2024-11-15T21:29:11Z

I have started to try to work through this, but I look at it infrequently so I am sure that I have completely missed the mark!
No pull request yet: I am not sure I have understood everything correctly.

The branch is here: https://github.com/yorickhardy/cyclone/commits/threads-gc-work/

The total size is still huge, but the resident memory has improved:

  PID USERNAME PRI NICE   SIZE   RES STATE       TIME   WCPU    CPU COMMAND
 8023 yorick    85    0    19G   72M nanosl/1    0:02  0.00%  0.00% monit

justinethier · 2024-11-18T03:09:30Z

Thanks @yorickhardy this looks promising!

I'm wondering if STAGE_FORCING is necessary; can't we go directly from resting to collection? For example via gc_start_major_collection().

Another thing I'm wondering about is if the pages on the thread heaps are empty or if there are a couple live objects that are causing the memory usage to grow over time.

Use gc_start_major_collection() instead. Partial work towards addressing issue justinethier#534.

yorickhardy · 2024-11-18T21:48:40Z

Thanks!

I added some debugging output and the pages which are merged have 3145536 or 3145600 (of 3145728) bytes remaining (so 192 bytes and 128 bytes used respectively). I would guess fragmentation is becoming a problem here?

The objects are (repeating for each thread created/destroyed):

GC HEAP MERGE: [0] 3145536/3145728
((2 . <port 0x79992f7e0bf0>) (1 . <port 0x79992f7e0ac0>) (0 . <port 0x79992f7e0b58>))

GC HEAP MERGE: [1] 3145600/3145728
<procedure 0x410072>

justinethier · 2024-11-19T03:10:06Z

This is interesting:

GC HEAP MERGE: [0] 3145536/3145728
((2 . <port 0x79992f7e0bf0>) (1 . <port 0x79992f7e0ac0>) (0 . <port 0x79992f7e0b58>))

I would think these would be parameter objects:

    (define current-output-port
      (make-parameter (Cyc-stdout)))
    (define current-input-port
      (make-parameter (Cyc-stdin)))
    (define current-error-port
      (make-parameter (Cyc-stderr)))

Hmm. I was thinking these would eventually be freed because the param_objs of the calling thread no longer exist. However, the gc uses lazy sweeping, so it could be that we never get around to lazy sweeping them and they stick around indefinitely.

Moving the code from gc_merge_all_heaps to gc_heap_merge removes special handling of the start of the list and is (hopefully) easier to read. Partial work towards addressing issue justinethier#534.

Partial work towards addressing issue justinethier#534.

This ensures that any objects which are part of the thread context are transferred to the heap. Partial work towards addressing issue justinethier#534.

This will be used to create the thread context. Partial work towards addressing issue justinethier#534.

Also introduce a global variable to track whether merged heaps need to be swept. Partial work towards addressing issue justinethier#534.

The context ensures that parametrised objects, continuations and exception handlers can still be traced but are no longer root objects (after thread terminations) and can be GCd eventually. Partial work towards addressing issue justinethier#534.

The primordial thread may not have an opportunity to sweep heap pages which have been merged from terminated threads. So sweep any unswept pages during the cooperation phase. Partial work towards addressing issue justinethier#534.

yorickhardy · 2025-01-02T23:09:59Z

A slightly late happy new year!

I have attempted to address this issue, but I am still quite unsure about the correctness of the code (in particular: converting a heap page to free list seems like a bad idea?). The proposed solution is ugly, I hope a better solution can be found.

On my very simple test program above I see improved behaviour. Sometimes the memory usage is a bit high, but it eventually reduces again. I thought it would follow a pattern, but I don't seem to observe one (by eye). The test programs also all pass, but I am not sure that says much!

Any suggestions will be greatly appreciated. Unfortunately I will be quite busy again soon, so I will probably take very long to get around to looking at the issue again (sorry).

justinethier · 2025-01-04T23:34:08Z

Thank you so much for your work on this @yorickhardy! Let me spend time looking this over, maybe tonight but if not definitely tomorrow. I would like to get a PR together if it looks ready, but if not we can see what that will take. I remember when looking at your fork previously there were good improvements.

justinethier · 2025-01-05T20:35:18Z

@yorickhardy After a first pass through everything I think these changes are looking good!

Usually heap pages are free lists, there is an optimization where the page is initially organized into a contiguous block of memory. This allows for faster initial allocations but we need to revert back to a free list for sweeps and longer-term maintenance of the page. Long way of saying, I think what you are doing there is fine.

I was wondering about the extra overhead of sweeping on the main thread. We are already merging everything to that thread, though. And if the extra overhead ever affected program performance, the application could be modified to transition the impacted work to another thread.

The context logic is a good catch however it likely needs to be tweaked a bit. We want to clear it in gc_free_old_thread_data (or such), then allow adding to the context for each merged thread.

I'm inclined to create a PR and work through integrating this into Cyclone.
Any thoughts/concerns from your end?

yorickhardy · 2025-01-10T03:52:34Z

Thanks! I will try to get to it this weekend.

I thought that maybe the heap pages should be owned by the thread which calls thread-join! Perhaps this thread can also take responsibility for freeing the thread data.

* gc: add a function to force the collector to run This requires adding a "forced" stage for the collector, which is the initial stage for a forced collection. Thereafter, the collector continues to the usual stages of collection. * runtime: force the garbage collector to run when a thread exits This is a first attempt to improve the memory usage reported in issue #534. * srfi-18: call Cyc_end_thread on thread exits This ensures that the collector has a chance to run whenever a thread exits. Attempts to partially address issue #534. * gc: free unused parts of the heap before merging When a thread exits, the heap is merged into the main thread. Before doing so, free any unused parts of the heap to reduce memory usage. Attempts to partially address issue #534. * srfi-18: thread-terminate! takes a thread as argument * gc: revert adding STAGE_FORCING Use gc_start_major_collection() instead. Partial work towards addressing issue #534. * gc: free empty pages in gc_heap_merge() Moving the code from gc_merge_all_heaps to gc_heap_merge removes special handling of the start of the list and is (hopefully) easier to read. Partial work towards addressing issue #534. * gc: oops, forgot the "freed" count Partial work towards addressing issue #534. * gc: oops, forgot the "freed" count (again) Partial work towards addressing issue #534. * types: update forward declaration of gc_heap_merge() Partial work towards addressing issue #534. * gc: remove accidental double counting * runtime: small (cosmetic) simplification * srfi-18: add a slot for thread context in the thread object Partial work towards addressing issue #534. * srfi-18: do a minor gc when terminating a thread This ensures that any objects which are part of the thread context are transferred to the heap. Partial work towards addressing issue #534. * types.h: make gc_alloc_pair public This will be used to create the thread context. Partial work towards addressing issue #534. * gc: prepare heap objects for sweeping Also introduce a global variable to track whether merged heaps need to be swept. Partial work towards addressing issue #534. * gc: create a context for terminated thread objects The context ensures that parametrised objects, continuations and exception handlers can still be traced but are no longer root objects (after thread terminations) and can be GCd eventually. Partial work towards addressing issue #534. * gc: sweep and free empty heaps for the primordial thread The primordial thread may not have an opportunity to sweep heap pages which have been merged from terminated threads. So sweep any unswept pages during the cooperation phase. Partial work towards addressing issue #534. * srfi-18: revert thread-terminate! changes These changes need to be revisited, and are not suitable for the threads garbage collection pull request.

justinethier self-assigned this Apr 2, 2024

yorickhardy added a commit to yorickhardy/cyclone that referenced this issue Nov 15, 2024

runtime: force the garbage collector to run when a thread exits

955695f

This is a first attempt to improve the memory usage reported in issue justinethier#534.

yorickhardy added a commit to yorickhardy/cyclone that referenced this issue Nov 15, 2024

srfi-18: call Cyc_end_thread on thread exits

bf9dda2

This ensures that the collector has a chance to run whenever a thread exits. Attempts to partially address issue justinethier#534.

yorickhardy added a commit to yorickhardy/cyclone that referenced this issue Nov 18, 2024

gc: revert adding STAGE_FORCING

48b95db

Use gc_start_major_collection() instead. Partial work towards addressing issue justinethier#534.

yorickhardy added a commit to yorickhardy/cyclone that referenced this issue Dec 31, 2024

gc: oops, forgot the "freed" count

fb334d1

Partial work towards addressing issue justinethier#534.

yorickhardy added a commit to yorickhardy/cyclone that referenced this issue Dec 31, 2024

gc: oops, forgot the "freed" count (again)

20e1a14

Partial work towards addressing issue justinethier#534.

yorickhardy added a commit to yorickhardy/cyclone that referenced this issue Dec 31, 2024

types: update forward declaration of gc_heap_merge()

b755b40

Partial work towards addressing issue justinethier#534.

yorickhardy added a commit to yorickhardy/cyclone that referenced this issue Jan 2, 2025

srfi-18: add a slot for thread context in the thread object

dd2b5c1

Partial work towards addressing issue justinethier#534.

yorickhardy added a commit to yorickhardy/cyclone that referenced this issue Jan 2, 2025

types.h: make gc_alloc_pair public

cd4125b

This will be used to create the thread context. Partial work towards addressing issue justinethier#534.

yorickhardy added a commit to yorickhardy/cyclone that referenced this issue Jan 2, 2025

gc: prepare heap objects for sweeping

028f52c

Also introduce a global variable to track whether merged heaps need to be swept. Partial work towards addressing issue justinethier#534.

yorickhardy mentioned this issue Jan 12, 2025

Improve garbage collection for terminated threads #550

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Memory usage and tail calls #534

Memory usage and tail calls #534

yorickhardy commented Mar 25, 2024

justinethier commented Mar 26, 2024

justinethier commented Apr 2, 2024

yorickhardy commented Apr 3, 2024

justinethier commented Apr 4, 2024

justinethier commented Apr 9, 2024

justinethier commented Apr 10, 2024

yorickhardy commented Apr 10, 2024

yorickhardy commented Apr 11, 2024

justinethier commented Apr 15, 2024

justinethier commented Apr 23, 2024

yorickhardy commented Apr 23, 2024

yorickhardy commented Sep 5, 2024

justinethier commented Sep 7, 2024

yorickhardy commented Sep 8, 2024

justinethier commented Sep 9, 2024

yorickhardy commented Sep 9, 2024

yorickhardy commented Nov 15, 2024

justinethier commented Nov 18, 2024

yorickhardy commented Nov 18, 2024

justinethier commented Nov 19, 2024

yorickhardy commented Jan 2, 2025

justinethier commented Jan 4, 2025 •

edited

Loading

justinethier commented Jan 5, 2025

yorickhardy commented Jan 10, 2025

Memory usage and tail calls #534

Memory usage and tail calls #534

Comments

yorickhardy commented Mar 25, 2024

justinethier commented Mar 26, 2024

justinethier commented Apr 2, 2024

yorickhardy commented Apr 3, 2024

justinethier commented Apr 4, 2024

justinethier commented Apr 9, 2024

justinethier commented Apr 10, 2024

yorickhardy commented Apr 10, 2024

yorickhardy commented Apr 11, 2024

justinethier commented Apr 15, 2024

justinethier commented Apr 23, 2024

yorickhardy commented Apr 23, 2024

yorickhardy commented Sep 5, 2024

justinethier commented Sep 7, 2024

yorickhardy commented Sep 8, 2024

justinethier commented Sep 9, 2024

yorickhardy commented Sep 9, 2024

yorickhardy commented Nov 15, 2024

justinethier commented Nov 18, 2024

yorickhardy commented Nov 18, 2024

justinethier commented Nov 19, 2024

yorickhardy commented Jan 2, 2025

justinethier commented Jan 4, 2025 • edited Loading

justinethier commented Jan 5, 2025

yorickhardy commented Jan 10, 2025

justinethier commented Jan 4, 2025 •

edited

Loading