Watchdog timeout on large point count N4 Supervisors w/ read op #14
Comments
Hi @tblong, that is a lot of points! I haven't tested this before with a station of that size, and no one else has reported using nHaystack in an application with that many points. I'll need to have a look at the code and see what we can do.
Hi @tblong, I have done an initial investigation into this situation, and making a change to the threading arrangement for the servlet isn't as simple as I first thought. I am going to try to set up a test station with the number of components you have, make some changes, and see what happens. This is quite a significant change, so I want to proceed carefully.
@tblong I have tried a couple of different setups today. The first setup I built had 250,000 points. I didn't get a watchdog timeout, but I did hit out-of-memory issues. For the second setup I lowered the station to 150,000 points. The 'read' query with your filter worked over the REST API, however it is holding on to a lot of memory. I'm doing all this on a Windows Virtual Machine on my Mac, with 16GB RAM and 4 cores allocated and the default memory settings for the station JVM. Can you provide more details on your Supervisor configuration? I think there is a problem, but it's more around memory management at this scale. I'm not seeing watchdog timeouts and station restarts as you describe. I am using the latest code, though I don't think that should make much of a difference.
@tblong also I just tested the use of the
@ci-richard-mcelhinney Thanks very much for the help digging in here. So it seems this might just be a max-heap setting issue? Is there a possibility for memory improvements in how nHaystack crawls through the station during a read op as it gathers the response data? I will be on holiday from 6/30 to 7/7 but will work on gathering all the station metrics and config settings I can on my return.
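For context on the kind of memory improvement being asked about here, below is a minimal, hypothetical sketch (not nHaystack's actual code; `ReadOpSketch`, `readAllRows`, `streamRows`, and `encodeRow` are made-up names) contrasting a read op that accumulates the whole response in memory with one that streams rows out as the station is crawled.

```java
// Minimal sketch (not nHaystack's actual code) contrasting two ways a read op
// could emit its result rows. All names here are hypothetical stand-ins.
import java.io.IOException;
import java.io.Writer;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class ReadOpSketch {

    // Accumulating approach: the whole result set lives on the heap at once,
    // which is what appears to hurt at 150k+ points.
    static List<String> readAllRows(Iterator<String> pointIds) {
        List<String> rows = new ArrayList<>();
        while (pointIds.hasNext()) {
            rows.add(encodeRow(pointIds.next()));
        }
        return rows;
    }

    // Streaming approach: each row is encoded and written immediately,
    // so memory use stays roughly constant regardless of point count.
    static void streamRows(Iterator<String> pointIds, Writer out) throws IOException {
        while (pointIds.hasNext()) {
            out.write(encodeRow(pointIds.next()));
            out.write('\n');
        }
        out.flush();
    }

    // Placeholder for turning a point into a Zinc/JSON row.
    static String encodeRow(String pointId) {
        return "@" + pointId + ", point, cur";
    }
}
```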
@tblong I've also determined that the REST API requests are not serviced on the Engine Thread in the latest code. I'm not sure about the version you are using. So if you can upgrade, you should get similar results to me and hopefully won't see watchdog timeouts.
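To illustrate what servicing requests off the engine thread means in general terms, here is a hedged sketch (not the actual nHaystack servlet; `OffEngineThreadSketch`, `handleReadRequest`, and `runQuery` are illustrative names) of handing a slow query to a dedicated executor so it cannot block a shared control/engine thread.

```java
// Illustrative sketch only: one way a long-running REST request can be
// serviced on a dedicated thread pool instead of a shared "engine" thread.
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class OffEngineThreadSketch {

    // Dedicated pool for REST requests so slow queries cannot starve the
    // thread that runs station control logic.
    private static final ExecutorService REST_POOL = Executors.newFixedThreadPool(4);

    public static String handleReadRequest(String filter) throws Exception {
        Future<String> result = REST_POOL.submit(() -> runQuery(filter));
        return result.get(); // the calling HTTP thread waits, not the engine thread
    }

    // Placeholder for the expensive station crawl.
    private static String runQuery(String filter) {
        return "grid for filter: " + filter;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(handleReadRequest("point and cur"));
        REST_POOL.shutdown();
    }
}
```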
@ci-richard-mcelhinney I gathered the station metrics below today. We only had browser access for this session, so we were not able to confirm the actual max-heap setting, but we were still able to capture the station's memory metrics. The nHaystack version is v3.2.0. [station memory metrics omitted] Let me know if there is any other metric I can grab that would help further.
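If the max-heap setting cannot be read from the platform directly, one generic JVM approach (not Niagara-specific; whether it can be run inside the station depends on your access) is to query the memory MXBean:

```java
// Generic JVM approach to capturing the max-heap (-Xmx) and current heap
// usage when the platform configuration is not directly accessible.
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

public class HeapMetrics {
    public static void main(String[] args) {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        long mb = 1024L * 1024L;
        System.out.println("max heap (-Xmx): " + heap.getMax() / mb + " MB");
        System.out.println("committed heap : " + heap.getCommitted() / mb + " MB");
        System.out.println("used heap      : " + heap.getUsed() / mb + " MB");
    }
}
```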
Related to nHaystack v3.0.1+. When performing a read operation such as

`read?filter=point+and+cur`

against a large point count N4 Supervisor (as above), we have seen the watchdog timeout get triggered and the station restart. The watchdog event occurs even when adding the optional `limit` parameter with a low value.
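For reproduction purposes, a minimal client request might look like the sketch below. The host name and the `/haystack/` servlet path are assumptions that depend on the station's service configuration, and authentication is omitted.

```java
// Hedged example of issuing the read op over the REST API with a low limit.
// Host and servlet path are assumptions; authentication is omitted.
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class ReadRequestExample {
    public static void main(String[] args) throws Exception {
        String filter = URLEncoder.encode("point and cur", StandardCharsets.UTF_8);
        URI uri = URI.create("http://supervisor.example.com/haystack/read?filter="
                + filter + "&limit=10");

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(uri).GET().build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());

        System.out.println(response.statusCode());
        System.out.println(response.body());
    }
}
```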
Questions:

1. Is the `read` operation currently executed on the main engine thread within Niagara?
2. Is there an opportunity to optimize the `read` op here, or to ensure the op performs its work off the main engine thread?