
[tests] 24.10 Stage Two tests failing for logs #3853

Open
arif-ali opened this issue Nov 26, 2024 · 9 comments
@arif-ali
Member

There have been a few occurrences now where the test below fails, and on restarting the job it works as normal:

066-tests_report_tests_plugin_tests_logs.py_JournalSizeLimitTest.test_journal_tailed_and_linked

debug.log from avocado:

[stdlog] 2024-11-26 08:43:06,719 avocado.test test             L0313 INFO | INIT 1-tests/report_tests/plugin_tests/logs.py:JournalSizeLimitTest.test_journal_tailed_and_linked
[stdlog] 2024-11-26 08:43:06,721 avocado.test parameters       L0141 DEBUG| PARAMS (key=timeout, path=*, default=None) => None
[stdlog] 2024-11-26 08:43:06,721 avocado.test parameters       L0141 DEBUG| PARAMS (key=timeout_factor, path=*, default=1.0) => 1.0
[stdlog] 2024-11-26 08:43:06,721 avocado.test test             L0345 DEBUG| Test metadata:
[stdlog] 2024-11-26 08:43:06,721 avocado.test test             L0347 DEBUG|   filename: /tmp/cirrus-ci-build/tests/report_tests/plugin_tests/logs.py
[stdlog] 2024-11-26 08:43:06,721 avocado.test test             L0353 DEBUG|   teststmpdir: /var/tmp/avocado_30pafaq7
[stdlog] 2024-11-26 08:43:06,721 avocado.test test             L0354 DEBUG|   original timeout: None
[stdlog] 2024-11-26 08:43:06,721 avocado.test test             L0355 DEBUG|   timeout factor: 1.0
[stdlog] 2024-11-26 08:43:06,721 avocado.test test             L0356 DEBUG|   actual timeout: None
[stdlog] 2024-11-26 08:43:06,722 avocado.test test             L0548 INFO | START 1-tests/report_tests/plugin_tests/logs.py:JournalSizeLimitTest.test_journal_tailed_and_linked
[stdlog] 2024-11-26 08:43:06,724 avocado.test stacktrace       L0040 ERROR| 
[stdlog] 2024-11-26 08:43:06,725 avocado.test stacktrace       L0042 ERROR| Reproduced traceback from: /usr/local/lib/python3.12/dist-packages/avocado/core/test.py:638
[stdlog] 2024-11-26 08:43:06,726 avocado.test stacktrace       L0049 ERROR| Traceback (most recent call last):
[stdlog] 2024-11-26 08:43:06,726 avocado.test stacktrace       L0049 ERROR|   File "/tmp/cirrus-ci-build/tests/report_tests/plugin_tests/logs.py", line 85, in test_journal_tailed_and_linked
[stdlog] 2024-11-26 08:43:06,726 avocado.test stacktrace       L0049 ERROR|     self.assertFileExists(tailed)
[stdlog] 2024-11-26 08:43:06,726 avocado.test stacktrace       L0049 ERROR|   File "/tmp/cirrus-ci-build/tests/sos_tests.py", line 355, in assertFileExists
[stdlog] 2024-11-26 08:43:06,726 avocado.test stacktrace       L0049 ERROR|     assert os.path.exists(fname), f"{fname} does not exist"
[stdlog] 2024-11-26 08:43:06,726 avocado.test stacktrace       L0049 ERROR|            ^^^^^^^^^^^^^^^^^^^^^
[stdlog] 2024-11-26 08:43:06,727 avocado.test stacktrace       L0049 ERROR| AssertionError: /var/tmp/avocado_30pafaq7sos_tests.py.JournalSizeLimitTest/sosreport-JournalSizeLimitTest/sosreport-cirrus-task-5545356358844416-2024-11-26-tsuiwhi/sos_strings/logs/journalctl_--no-pager.tailed does not exist
[stdlog] 2024-11-26 08:43:06,727 avocado.test stacktrace       L0050 ERROR| 
[stdlog] 2024-11-26 08:43:06,727 avocado.test test             L0642 DEBUG| Local variables:
[stdlog] 2024-11-26 08:43:06,746 avocado.test test             L0645 DEBUG|  -> self <class 'logs.JournalSizeLimitTest'>: 1-tests/report_tests/plugin_tests/logs.py:JournalSizeLimitTest.test_journal_tailed_and_linked
[stdlog] 2024-11-26 08:43:06,746 avocado.test test             L0645 DEBUG|  -> tailed <class 'str'>: /var/tmp/avocado_30pafaq7sos_tests.py.JournalSizeLimitTest/sosreport-JournalSizeLimitTest/sosreport-cirrus-task-5545356358844416-2024-11-26-tsuiwhi/sos_strings/logs/journalctl_--no-pager.tailed
[stdlog] 2024-11-26 08:43:06,746 avocado.test test             L0740 ERROR| FAIL 1-tests/report_tests/plugin_tests/logs.py:JournalSizeLimitTest.test_journal_tailed_and_linked -> AssertionError: /var/tmp/avocado_30pafaq7sos_tests.py.JournalSizeLimitTest/sosreport-JournalSizeLimitTest/sosreport-cirrus-task-5545356358844416-2024-11-26-tsuiwhi/sos_strings/logs/journalctl_--no-pager.tailed does not exist
[stdlog] 2024-11-26 08:43:06,746 avocado.test test             L0733 INFO | 

The contents of the logs folder:

total 1336
drwx------ 2 arif arif   4096 Nov 26 08:43 .
drwxr-xr-x 3 arif arif   4096 Nov 26 08:43 ..
-rw-r--r-- 1 arif arif     61 Nov 26 08:43 journalctl_--disk-usage
-rw-r--r-- 1 arif arif 668618 Nov 26 08:43 journalctl_--no-pager
-rw-r--r-- 1 arif arif 602485 Nov 26 08:43 journalctl_--no-pager_--boot
-rw-r--r-- 1 arif arif  66089 Nov 26 08:43 journalctl_--no-pager_--boot_-1
-rw-r--r-- 1 arif arif   4280 Nov 26 08:43 ls_-alZR_.var.log

We can see that the file was never created, hence the test failed. This needs further investigation to determine what is going on here.

@pmoravec
Contributor

I can reproduce it on upstream sos:

# PYTHONPATH=tests/ avocado run -p TESTLOCAL=true --test-runner=runner -t stagetwo tests/report_tests/plugin_tests/logs.py
JOB ID     : a6e312d650f35bed6cc6bb6110c893c60c92b956
JOB LOG    : /root/avocado/job-results/job-2024-11-26T11.19-a6e312d/job.log
 (1/3) tests/report_tests/plugin_tests/logs.py:JournalSizeLimitTest.test_journal_size_limit: PASS (5.63 s)
 (2/3) tests/report_tests/plugin_tests/logs.py:JournalSizeLimitTest.test_journal_tailed_and_linked: FAIL: /var/tmp/avocado_af3k3wvgsos_tests.py.JournalSizeLimitTest/sosreport-JournalSizeLimitTest/sosreport-pmoravec-rhel9-2024-11-26-ckscbzp/sos_strings/logs/journalctl_--no-pager.tailed does not exist (0.02 s)
 (3/3) tests/report_tests/plugin_tests/logs.py:JournalSizeLimitTest.test_archive_created: PASS (0.01 s)
RESULTS    : PASS 2 | ERROR 0 | FAIL 1 | SKIP 0 | WARN 0 | INTERRUPT 0 | CANCEL 0
JOB TIME   : 6.24 s
#

I will try to dig into it.

@jcastill
Member

I think the problem may be that the log is not always big enough to be tailed. I ran the tests on a RHEL9 machine and they succeeded, with a tailed file of 20M:

$ ll sos_strings/logs/ 
total 40960
-rw-r--r--. 1 root root 20969983 Nov 26 11:14 journalctl_--no-pager_--boot.tailed
-rw-r--r--. 1 root root 20969983 Nov 26 11:14 journalctl_--no-pager.tailed

But in the logs that @arif-ali pasted, the files seem to be smaller than that.
@pmoravec how big were the files on your side?

@arif-ali
Member Author

Maybe the runs on GCE are so quick that we're no longer catching it. I've run the same stagetwo tests locally 6 times, on a fresh VM every single time, and I can't reproduce the problem either.

So maybe we need some kind of wait in this particular test for it to be successful?

In my testing I am using some automation that creates a fresh new VM every time, so that's 6 fresh VMs.

@pmoravec
Contributor

pmoravec commented Nov 26, 2024

Indeed, in my case the journal is small:

# journalctl --no-pager | wc
  34466  339490 4699809
# journalctl --disk-usage
Archived and active journals take up 24.1M in the file system.
# 

(it's RHEL9)

Also, the varying journal size indirectly explains the randomness in job success/failure.

@jcastill
Member

I cannot reproduce this on 24.10 or Fedora 41 either.
@pmoravec what OS did you reproduce this on? And what do you get with 'journalctl --disk-usage'?
The odd thing is that if your journal was already small, the test should have attempted to make it bigger via:

https://github.com/sosreport/sos/blob/main/tests/report_tests/plugin_tests/logs.py#L62

        sosfd = journal.stream('sos-testing')
        rsize = 10 * 1048576
        for _ in range(2):
            # generate 10MB, write it, then write it in reverse.
            # Spend less time generating new strings
            rand = ''.join(
                random.choice(ascii_uppercase + digits) for _ in range(rsize))
            sosfd.write(rand + '\n')
            # sleep to avoid burst rate-limiting
            sleep(5)
            sosfd.write(rand[::-1] + '\n')

@pmoravec
Contributor

I see where the problem is, but I don't have an idea how to reliably reproduce it: after rebooting my system, the test succeeds, once it has generated journal logs (which didn't happen before the reboot!). The problem is/was here:

# journalctl --no-pager | wc
  34466  339490 4699809
# journalctl --disk-usage
Archived and active journals take up 24.1M in the file system.
# 

The latter is tested by sos:

    from systemd import journal  # pylint: disable=import-error
    _reader = journal.Reader()
    _size = _reader.get_usage() / 1024 / 1024
    if _size > 20:
        return

As the size was >20MB, we did not generate the extra logs, while the collected journalctl --no-pager output was shorter.

But how is/was that possible?

@pmoravec
Contributor

I don't know why or how, but after playing with journalctl --vacuum-size=10M and similar just to purge some logs, and rebooting the RHEL9 system (no persistent journal storage), I now see:

# journalctl --disk-usage ; journalctl | wc
Archived and active journals take up 8.0M in the file system.
   1966   23768  288005
# 

The journal logs contain regular boot logs and almost nothing else. I think something similar happens in the CI tests.

@pmoravec
Copy link
Contributor

While I am unable to come up with a reliable reproducer, it seems there are ways for journal.Reader().get_usage() to report over 20MB while the journal logs themselves are smaller. Is it worth replacing the if _size > 20: test with a journalctl --no-pager | wc -c-like test?
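A minimal sketch of what such a check could look like (the helper names are hypothetical, not from the sos code; it assumes journalctl is on PATH and that the same `size <= 20MB` threshold as the existing `if _size > 20: return` early exit should trigger log generation):

```python
import subprocess


def journal_text_mb():
    """Measure the rendered journal text in MiB, roughly equivalent to
    `journalctl --no-pager | wc -c`."""
    out = subprocess.run(
        ["journalctl", "--no-pager"],
        capture_output=True, check=True,
    ).stdout
    return len(out) / 1024 / 1024


def needs_more_journal(size_mb, threshold_mb=20):
    """Decide whether the test should generate extra journal entries.

    Mirrors the existing logic (skip generation when size > 20MB), but fed
    from the rendered text size rather than journal.Reader().get_usage().
    """
    return size_mb <= threshold_mb
```

This would make the pre-check measure the same thing the test later asserts on (the collected `journalctl --no-pager` output), rather than the on-disk journal usage, which can include archived files and pre-allocated space.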

@jcastill
Member

I think that may work. The problem may be that Reader() gets the size of all journal files on disk (maybe even rotated ones?), which I imagine includes some metadata and perhaps pre-allocated space for the journal, while 'journalctl' provides just text formatted for human consumption. But I'll need to investigate this more to see whether that's the case.
