
Memory usage and tail calls #534

Open
yorickhardy opened this issue Mar 25, 2024 · 20 comments

@yorickhardy
Contributor

Hello,

I am trying out (srfi 18) in cyclone and noticed that memory was being "consumed" very quickly by a simple monitoring thread. The following small example seems to exhibit this behaviour:

(define (busy x) (busy (+ x 1)))
(busy 0)

Similarly, the cyclone compiler can be encouraged to use excessive memory (and compile time) by:

(define (compile-forever x) x (compile-forever x))
(compile-forever 0)

Both examples work as expected in icyc (with regards to memory use).

Am I doing something unreasonable here? I am using cyclone-0.36.0 on NetBSD 10.99.10 amd64 (current-ish).

@justinethier
Owner

Hi @yorickhardy. First of all, thanks for the report! These both seem like real issues, let me find some time this week to take a closer look.

@justinethier
Owner

Regarding:

(define (busy x) (busy (+ x 1)))
(busy 0)

Suspect this is related to the amount of memory added to a thread after major GC. Need to look into this more.

Regarding the compile-forever example, there is a problem in the beta expansion phase of the CPS optimization: it fails to detect a recursive call and executes forever. It looks like the problem (or at least part of it) is that analyze:find-recursive-calls does not properly handle edge cases where an ast:lambda call is made, or, later on, when the optimizer changes such calls to use Cyc-seq, e.g.:

(analyze:find-recursive-calls                                                    
  scan                                                                           
  app                                                                            
  (Cyc-seq x$1$2 (compile-forever k$5 x$1$2))) 

@justinethier justinethier self-assigned this Apr 2, 2024
justinethier added a commit that referenced this issue Apr 3, 2024
Perform full scanning of function application list to ensure self-recursive calls are found. This prevents infinite loops in the beta expansion code when compiling simple recursive calls.
@yorickhardy
Contributor Author

Thanks!

I am sorry that I have not contributed any fixes; I am going to (eventually) try to put more time into understanding cyclone and help where/if I can.

@justinethier
Owner

No worries, and bug reports are always appreciated! I would welcome fixes as well, however, these two issues in particular are in areas that would be difficult to track down... and I still need to investigate the first one :)

@justinethier
Owner

As this runs, + will eventually cause a conversion from fixnum to bignum. I suspect that cyclone is inefficiently allocating memory for bignums in this use case and that this is the cause of the high memory usage.

Consider the memory usage when using fixnums or doubles:

(import (srfi 143))                                                             
(define (busy x) (busy (fx+ x 1)))                                              
(busy 0)    
(define (busy x) (busy (+ x 1.0)))                                              
(busy 0.0)    

That said, I suspect we can do better here, especially since the interpreters can. Will need to spend time looking into this further.
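A minimal model of why only the bignum version of the loop grows may help here. Assuming (this is an illustrative sketch, not Cyclone's actual object representation) that each bignum addition allocates a fresh result object, a non-stop loop produces one garbage object per iteration until a collection runs, while fixnum and flonum addition can stay unboxed:

```c
#include <assert.h>
#include <stdlib.h>

/* Illustrative single-limb "bignum": each addition allocates a fresh
 * result, so the old accumulator becomes garbage every iteration. */
typedef struct bignum { unsigned long limb0; } bignum;

static size_t live_bignums = 0;   /* stand-in for heap pressure */

static bignum *bn_add_small(const bignum *x, unsigned long n) {
    bignum *r = malloc(sizeof *r);    /* one allocation per '+' */
    r->limb0 = x->limb0 + n;          /* single-limb sum; no carry handling */
    live_bignums++;
    return r;
}

/* Run the (busy ...) loop body k times over bignums. */
static size_t busy_allocs(size_t k) {
    bignum *acc = bn_add_small(&(bignum){0}, 0);
    for (size_t i = 0; i < k; i++)
        acc = bn_add_small(acc, 1);   /* old acc is now unreachable */
    return live_bignums;
}
```

Between collections the number of dead accumulators grows linearly with the iteration count, which matches the observed memory behaviour.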

@justinethier
Owner

Snippet of code from Cyc_sum:

    } else if (tx == bignum_tag && ty == -1) { \                                 
        BIGNUM_CALL(mp_init(&bn_tmp2)); \                                        
        Cyc_int2bignum(obj_obj2int(y), &bn_tmp2); \                               
        BIGNUM_CALL(BN_OP(&(x->bignum_t.bn), &bn_tmp2, &(x->bignum_t.bn))); \    
        mp_clear(&bn_tmp2); \                                                    

Compare with code from Cyc_fast_sum used in compiled code:

  if (is_object_type(x) && type_of(x) == bignum_tag) {                           
    if (obj_is_int(y)) {                                                         
      mp_int bny;                                                                
      BIGNUM_CALL(mp_init(&bny));                                                
      Cyc_int2bignum(obj_obj2int(y), &bny);                                      
      alloc_bignum(data, bn);                                                    
      BIGNUM_CALL(mp_add(&bignum_value(x), &bny, &bignum_value(bn)));            
      mp_clear(&bny);                                                            
      return bn;

Questions: Why are we doing an allocation here but not above, and can we safely speed up / optimize the latter code?
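One possible direction for the second question: in-place reuse of the left operand (as Cyc_sum does) is only safe if nothing else can observe that operand. A sketch of guarding the fast path behind an ownership check — the `owned` flag is hypothetical, not Cyclone's API; the runtime would need some real uniqueness guarantee (e.g. the operand is a runtime temporary) to apply this:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical numeric cell: 'owned' marks a value the caller holds
 * exclusively, so its storage may be reused for the result. */
typedef struct { long value; int owned; } bn;

static size_t allocs = 0;

static bn *bn_add(bn *x, long y) {
    if (x->owned) {            /* safe to clobber: nobody else sees x */
        x->value += y;
        return x;
    }
    bn *r = malloc(sizeof *r); /* shared operand: must allocate */
    allocs++;
    r->value = x->value + y;
    r->owned = 1;
    return r;
}
```

With this shape, a tight accumulator loop allocates once and then mutates in place, instead of allocating on every addition.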

@yorickhardy
Contributor Author

I did not realize that it had switched over to bignums, thanks! I have more or less isolated the memory usage problems that I was encountering. (Firstly, I was using many short-lived threads instead of a thread pool, and assumed that thread-join! would eventually garbage collect the thread; as far as I can tell, that assumption is false.) Here is another example with bounded(?) but large memory use:

;  PID USERNAME PRI NICE   SIZE   RES STATE       TIME   WCPU    CPU COMMAND
; 9949 yorick    25    0   556M  425M CPU/1       0:17 97.91% 56.10% test
(define (loop-test)
  (let ( (o (open-output-string)) )
    (display "abc" o)
    (close-output-port o)
    (loop-test) ) )

(loop-test)

and similarly with constantly growing memory use:

;  PID USERNAME PRI NICE   SIZE   RES STATE       TIME   WCPU    CPU COMMAND
; 3563 yorick    26    0    13G   10G CPU/0       0:49   164%   128% test

(define output-port (make-parameter #f))

(define (loop-test)
  (parameterize ( (output-port (open-output-string)) )
    (display "abc" (output-port))
    (close-output-port (output-port))
    (loop-test) ) )

(loop-test)

which surprised me! This simple example grows a bit slower but in the same way:

(define a-string (make-parameter #f))

(define (loop-test)
 (parameterize ( (a-string "123") )
  (loop-test) ) )

(loop-test)

Strangely, icyc manages fine. I guess the use of parameterize here is not great, since one can easily re-write this example to avoid the nested parameterizations.

I am not sure whether these examples are helpful, or just examples of poorly written scheme!

At this point I am not sure whether the remaining part is a valid issue or just poor programming on my part, so please close the issue if you are satisfied.

@yorickhardy
Contributor Author

Hello again,

The memory consumption that motivated this issue is all due to the use of many threads, via srfi-18 and Cyc_scm_call. I have switched to thread pools and Cyc_scm_call_no_gc and the memory consumption is now normal.

I was very surprised that terminated threads consume memory. Nevertheless, I don't think the issue is entirely valid as reported, since the underlying problem was threads (not GC and tail calls). Apologies if I have wasted too much of your time.

(I do appreciate that you have looked into the reported examples and that you are willing to investigate improvements.)

Thanks!

@justinethier
Owner

Glad you got it working @yorickhardy!

I appreciate your feedback and think there are genuine issues that are being raised here, though I have not dug into your latest loop-test examples. Let me spend some time looking into everything you brought up to see what can be improved.

@justinethier
Owner

@yorickhardy Do you have an example of thread-join not freeing up memory?

@yorickhardy
Contributor Author

Sure! I hope I have not missed anything obvious ...

(import (scheme base) (scheme write) (srfi 18))

(define (monitor n)
  (thread-yield!)
  n)

(define (wait-for-monitor next)
  (let ( (t (make-thread (lambda () (monitor next)))) )
    (thread-start! t)
    (thread-yield!)
    (thread-join! t)
    (wait-for-monitor (+ next 1.0)) ) )

(wait-for-monitor 0.0)

@yorickhardy
Contributor Author

Hello again,

I am not sure if this is the whole story, but after adding a bit of debug output it seems that the memory allocated in %alloc-thread-data is never freed (in the example "monitor" program above). I am also not sure when Cyc-end-thread! is called.

I am not yet able to make a more meaningful contribution, but I am trying to work towards it!

@justinethier
Owner

Hey @yorickhardy, appreciate the update!

That's interesting... we do malloc a thread data object in %alloc-thread-data. What is supposed to happen is that when the thread ends we call gc_remove_mutator to place that data on a list of old mutators (application threads) to be cleaned up. Eventually that list gets cleaned up by gc_free_old_thread_data(), which is called on the main thread after a major GC trace is complete.

I wonder, could it be that major GC is not being triggered by the example program?
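The lifecycle just described can be sketched as a list that only shrinks when a major collection completes (illustrative names and structure, not Cyclone's actual gc_remove_mutator machinery):

```c
#include <assert.h>
#include <stdlib.h>

/* A finished thread's malloc'd data is parked on an 'old mutators'
 * list at exit; the list is only drained after a major GC.  If major
 * GC never triggers, the list grows without bound. */
typedef struct thread_data { struct thread_data *next; } thread_data;

static thread_data *old_mutators = NULL;
static size_t old_count = 0;

static void retire_thread(thread_data *td) {  /* at thread exit */
    td->next = old_mutators;
    old_mutators = td;
    old_count++;
}

static size_t free_old_thread_data(void) {    /* after major GC */
    size_t freed = 0;
    while (old_mutators) {
        thread_data *next = old_mutators->next;
        free(old_mutators);
        old_mutators = next;
        freed++;
    }
    old_count = 0;
    return freed;
}
```

In the monitor example, thousands of threads would retire but the drain step would never run, matching the observed growth.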

@yorickhardy
Contributor Author

Yes, that seems to be the beginning of the issue. Am I correct in saying that the collector only starts (sometimes) when allocating scheme objects? In my debugging output, gc_free_old_thread_data() was never called, because the collector remained in the STAGE_RESTING state.

I hacked together a workaround to force the collector out of the STAGE_RESTING state, and then the free happens. (The workaround is very ugly; I will try to think of a better way to do this.) But memory usage still increases: I still need to check whether the vector with the thread information and result is ever freed, which is possibly the second reason for the increasing memory usage.

@justinethier
Owner

Correct, the collector will only trigger a major GC when allocating objects and the runtime detects a need to start the collector, for example when a certain percentage of memory has been used.
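An allocation-driven trigger like the one described can be sketched as below; the 70% figure is made up for illustration, as Cyclone's actual policy lives in its runtime:

```c
#include <assert.h>
#include <stddef.h>

/* A major collection starts only from the allocation path, once heap
 * usage crosses a threshold.  A thread that allocates little (or not
 * at all) never trips the check, so major GC never runs. */
static int should_start_major_gc(size_t used, size_t heap_size) {
    return heap_size > 0 && used * 100 >= heap_size * 70;
}
```

This is why a loop that mostly creates and joins threads, without much Scheme-object allocation, can leave the collector in STAGE_RESTING indefinitely.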

@yorickhardy
Contributor Author

Thanks.

In gc_thread_data_free, gc_merge_all_heaps adds the terminated thread's heap(s) to the primordial thread. Could this be the largest contributor to the increased memory usage?

I still need to track down how the heap allocations are eventually freed, and then I can try to force the freeing of memory to see if that shows better memory use for the example program.

yorickhardy added a commit to yorickhardy/cyclone that referenced this issue Nov 15, 2024
This is a first attempt to improve the memory usage reported in
issue justinethier#534.
yorickhardy added a commit to yorickhardy/cyclone that referenced this issue Nov 15, 2024
This ensures that the collector has a chance to run whenever
a thread exits. Attempts to partially address issue justinethier#534.
yorickhardy added a commit to yorickhardy/cyclone that referenced this issue Nov 15, 2024
When a thread exits, the heap is merged into the main thread.
Before doing so, free any unused parts of the heap to reduce
memory usage. Attempts to partially address issue justinethier#534.
@yorickhardy
Contributor Author

I have started to try to work through this, but I look at it infrequently so I am sure that I have completely missed the mark!
No pull request yet: I am not sure I have understood everything correctly.

The branch is here: https://github.com/yorickhardy/cyclone/commits/threads-gc-work/

The total size is still huge, but the resident memory has improved:

  PID USERNAME PRI NICE   SIZE   RES STATE       TIME   WCPU    CPU COMMAND
 8023 yorick    85    0    19G   72M nanosl/1    0:02  0.00%  0.00% monit

@justinethier
Owner

Thanks @yorickhardy this looks promising!

I'm wondering if STAGE_FORCING is necessary; can't we go directly from resting to collection? For example via gc_start_major_collection().

Another thing I'm wondering about is if the pages on the thread heaps are empty or if there are a couple live objects that are causing the memory usage to grow over time.

yorickhardy added a commit to yorickhardy/cyclone that referenced this issue Nov 18, 2024
Use gc_start_major_collection() instead. Partial work towards
addressing issue justinethier#534.
@yorickhardy
Contributor Author

Thanks!

I added some debugging output and the pages which are merged have 3145536 or 3145600 (of 3145728) bytes remaining (so 192 bytes and 128 bytes used respectively). I would guess fragmentation is becoming a problem here?

The objects are (repeating for each thread created/destroyed):

GC HEAP MERGE: [0] 3145536/3145728
((2 . <port 0x79992f7e0bf0>) (1 . <port 0x79992f7e0ac0>) (0 . <port 0x79992f7e0b58>))

GC HEAP MERGE: [1] 3145600/3145728
<procedure 0x410072>
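The numbers above mean each merged 3 MB page carries only one or two live objects (128-192 bytes of 3145728). A merge step could use a simple occupancy check to decide whether a page is worth carrying over at all; the 1% threshold here is illustrative, not anything from Cyclone's heap code:

```c
#include <assert.h>
#include <stddef.h>

/* Keep a merged page only if more than 1% of it is occupied;
 * otherwise it is a candidate for release back to the OS. */
static int page_worth_keeping(size_t used, size_t total) {
    return total > 0 && used * 100 > total;
}
```

By this measure both merged pages in the debug output are >99.99% free, so fragmentation (a few long-lived objects pinning large pages) does look like the problem.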

@justinethier
Owner

This is interesting:

GC HEAP MERGE: [0] 3145536/3145728
((2 . <port 0x79992f7e0bf0>) (1 . <port 0x79992f7e0ac0>) (0 . <port 0x79992f7e0b58>))

I would think these would be parameter objects:

    (define current-output-port
      (make-parameter (Cyc-stdout)))
    (define current-input-port
      (make-parameter (Cyc-stdin)))
    (define current-error-port
      (make-parameter (Cyc-stderr)))

Hmm. I was thinking these would eventually be freed because the param_objs of the calling thread no longer exist. However, the GC uses lazy sweeping, so it could be that we never get around to sweeping them and they stick around indefinitely.
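The lazy-sweeping behaviour just described can be sketched as follows (an illustrative model, not Cyclone's heap layout): after a mark phase, unmarked slots are garbage, but they are only reclaimed when the allocator next sweeps that page looking for free space. A page the allocator never revisits, such as one merged in from a dead thread, keeps its garbage indefinitely:

```c
#include <assert.h>

/* One heap page with a fixed number of object slots. */
enum { SLOTS = 8 };
typedef struct { int marked[SLOTS]; int free_slot[SLOTS]; } page;

/* Sweep a page on demand: turn every unmarked, still-occupied slot
 * into a free slot and report how many were reclaimed.  If this is
 * never called for a page, its dead objects are never recovered. */
static int sweep_page(page *p) {
    int reclaimed = 0;
    for (int i = 0; i < SLOTS; i++) {
        if (!p->marked[i] && !p->free_slot[i]) {
            p->free_slot[i] = 1;
            reclaimed++;
        }
    }
    return reclaimed;
}
```

Under this model the dead parameter objects survive not because they are reachable, but because their page is never the one the allocator chooses to sweep.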
