How to handle Node-red crashes?
Forum: Conduit: AEP Model · 6 replies, 4 voices · last updated 6 years, 2 months ago by Jeff Hatch
January 31, 2019 at 6:36 am #27143
Bob Doiron, Participant

Hi,

We're running a few AEP Conduits (FW 1.6.2), and one of them occasionally has issues where our node-red app stops sending messages to our servers. It continues to check in to DeviceHQ, and we're able to schedule a reset there to get data flowing again. As far as I can tell from the logs, node-red either crashes or hangs.

I've looked at setting up monit, or some custom script/cron job, to monitor node-red, but it sounds like any Linux modifications outside of the config/app would get blown away by an AEP firmware upgrade.
As is, our system will go down for 4 to 8 hours minimum due to the latency of getting a reset request through devicehq.
Any suggestions?
January 31, 2019 at 2:52 pm #27150
Jason Reiss, Keymaster

You could create a custom app that performs the monitoring. It would be reinstalled after the firmware upgrade.

January 31, 2019 at 3:27 pm #27153
Bob Doiron, Participant

Thanks! That looks promising.
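A custom monitoring app along the lines Jason suggests could be sketched roughly as below. This is only an illustration: the init-script path, the thresholds, and the `run` argument are assumptions for the sketch, not MultiTech-documented interfaces.

```shell
#!/bin/sh
# Hypothetical custom-app monitor for node-red on an AEP Conduit.
# Paths, intervals, and the restart mechanism are illustrative
# assumptions, not MultiTech documentation.

CHECK_INTERVAL=60   # seconds between liveness checks
MAX_FAILURES=3      # consecutive misses before restarting node-red

# Return success if a process whose command line matches $1 is running.
proc_alive() {
    pgrep -f "$1" >/dev/null 2>&1
}

# Restart node-red; assumes it is managed by an init script (assumption).
restart_node_red() {
    /etc/init.d/node-red restart
}

monitor_loop() {
    failures=0
    while true; do
        if proc_alive node-red; then
            failures=0
        else
            failures=$((failures + 1))
            if [ "$failures" -ge "$MAX_FAILURES" ]; then
                restart_node_red
                failures=0
            fi
        fi
        sleep "$CHECK_INTERVAL"
    done
}

# Only start the loop when invoked as a daemon, e.g. from the custom
# app's Start script, so the functions can be sourced or tested alone.
if [ "${1:-}" = "run" ]; then
    monitor_loop
fi
```

Packaged as a DeviceHQ custom app with a Start script that launches it in the background, something like this would be reinstalled along with the app after a firmware upgrade, which is the point of Jason's suggestion. Note that a pure liveness check like `pgrep` only catches a dead process, not a hung one; a hung node-red would need an application-level heartbeat check instead.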
February 4, 2019 at 3:02 am #27158
Lawrence Griffiths, Participant

Bob, do your servers send a response code for updates from Node-Red?

If they do, you could count the number of missed responses and use an exec node to trigger a Linux reboot, since it might be more of a connection issue. Also, on the latest version of AEP the node-red logs are not rotated; I would have a look at those.
February 4, 2019 at 5:33 am #27160
Bob Doiron, Participant

Yes, our servers send a response code. I already have a watchdog of sorts built into my node-red flow that will reboot the box if the flow is unable to deliver messages for 4 hours. It also delivers LoRa stats every 5 minutes as a heartbeat.
Unfortunately, on several occasions we’ve found that the conduit continued to update devicehq, but was no longer delivering data to our api and my node-red watchdog didn’t activate. Nothing was written to the node-red logs either, so I concluded that node-red itself either hung up or crashed.
We’re using 1.6.2 which does have log rotation for node-red. I haven’t tried 1.6.4 yet, but I hope they didn’t un-fix the node-red log rotation. We had previous issues with the node-red log getting too big.
February 4, 2019 at 6:10 am #27161
Bob Doiron, Participant

Has anyone seen documentation for this process?

    admin@mtcdt:~# ps -eaf | grep watchdog
    admin     3437     1  0 Feb01 ?      00:00:10 watchdog --device /dev/watchdog --ppp
    admin    18138 13069  0 12:08 pts/0  00:00:00 grep watchdog
    admin@mtcdt:~# which watchdog
    /sbin/watchdog
    admin@mtcdt:~# watchdog --help
    Usage: watchdog
        --api (a)      : watches and restarts api process
        --ddns (i)     : watches and restarts ddns process
        --ppp (p)      : watches and restarts ppp process
        --lora (l)     : watches and restarts lora process
        --node-red (n) : watches and restarts node-red process
        --device (d)   : path to hardware watchdog device
        --help (h)     : prints this message

February 4, 2019 at 8:30 am #27163
Jeff Hatch, Keymaster

Bob,
The watchdog process is not documented. Even if you add the --node-red argument, I don't think it will provide what you need: if the node-red process is just hung and hasn't actually disappeared, this watchdog will not restart it.

BTW, there is a simple process called angel (a link to it called node-angel is used for node-red) that restarts the node-red process when it terminates. As you have noted, I think something else is going on and the node-red process is getting into some kind of "hung" state.
A couple of things to look at when node-red gets into this state:
1) How much memory it is using: the output of "ps auxww | grep node-angel" should tell you this.
2) Run top and see if it is consuming lots of CPU.

Are you using SSL in node-red, and therefore in node? I have seen node use a lot of memory (~150 MB) when doing SSL, for some reason.
Jeff
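Jeff's memory check can also be scripted so it could run from a custom app. A minimal sketch follows; it relies only on standard Linux `pgrep` and the `/proc/<pid>/status` VmRSS field, and the `check` argument and 150 MB threshold are illustrative assumptions, not anything AEP-specific.

```shell
#!/bin/sh
# Sketch of Jeff's memory check: report the resident memory of the
# node-red process via /proc. Standard Linux interfaces only; the
# "check" argument and the threshold are illustrative assumptions.

# Print resident memory (VmRSS) in kB for PID $1; prints nothing if
# no such process exists.
rss_kb() {
    awk '/^VmRSS:/ {print $2}' "/proc/$1/status" 2>/dev/null
}

report_node_red() {
    pid=$(pgrep -f node-red | head -n 1)
    if [ -n "$pid" ]; then
        echo "node-red pid $pid is using $(rss_kb "$pid") kB resident"
    else
        echo "node-red is not running"
    fi
}

# A custom app's Start script could invoke this periodically and log,
# alert, or restart node-red when memory climbs past a threshold
# (e.g. the ~150 MB Jeff mentions seeing with SSL).
if [ "${1:-}" = "check" ]; then
    report_node_red
fi
```

Logging this alongside Bob's existing node-red heartbeat would show whether memory growth precedes the hangs, which would support the SSL/memory theory.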