Permanent communication loss after gateway restarts


Viewing 4 posts - 1 through 4 (of 4 total)
  • Author
    Posts
  • #23831
    Piotr Diop
    Participant

    Hi,

    I’m seeing inconsistent communication after the gateway has rebooted, which most often results in permanent communication loss.
    By that I mean that 9 times out of 10, the LoRa network server refuses to forward newly received packets to the MQTT broker/my Node-RED application.

    Here’s my setup:
    – MTCAP AEP (Firmware 1.4.16, Network Server 2.0.19).
    – End device is connected and joined with OTAA.
    – End device transmits small unconfirmed packets roughly every minute.

    I’m assessing how communication behaves after the gateway has been power cycled and left off for some time.
    On very (very) few occasions, after the GW has rebooted and reloaded all the necessary software, communication resumes seamlessly.
    But most of the time communication is lost and cannot be recovered unless I force a join request on the end-device side.

    I enabled trace logs on the network server. Here is what happens when a new unconfirmed packet is sent to it:

    9:36:2:540|TRACE| TX-ACK|127.0.0.1:XXXX 4 bytes 023b7001
    9:36:2:541|TRACE| GW:00-80-00-xx-xx-xx-xx-xx|SEEN|PUSH-DATA|127.0.0.1:XXXX
    9:36:2:541|INFO| GW:00-80-00-xx-xx-xx-xx-xx|FRAME-RX|Parsing 1 packets
    9:36:2:542|DEBUG| GW:00-80-00-xx-xx-xx-xx-xx|FRAME-RX|DATA: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
    9:36:2:543|DEBUG| GW:00-80-00-xx-xx-xx-xx-xx|FRAME-RX|FREQ: 868.100000 MHz DR5 RSSI: -43 dB SNR: 92 cB
    9:36:2:543|DEBUG| GW:00-80-00-xx-xx-xx-xx-xx|FRAME-RX|TYPE: Unconfirmed Up
    9:36:2:544|DEBUG| GW:00-80-00-xx-xx-xx-xx-xx|PACKET-RX|ADDR: ZZ-ZZ-ZZ-ZZ FCnt:001a
    9:36:2:544|TRACE| AUTH KEY: xx.xx.xx.xx.xx.xx.xx.xx.xx
    9:36:2:545|TRACE| PMIC: -------- CMIC: --------
    9:36:2:545|WARNING| ED:YY-YY-YY-YY-YY-YY-YY-YY|CHECK-PKT|MIC Check Failed 0x0000001a
    9:36:2:545|INFO| ED:YY-YY-YY-YY-YY-YY-YY-YY|CHECK-PKT|FCNT: 0000001a LAST-FCNT: 00000006 Duplicate: no
    9:36:2:546|INFO| ED:YY-YY-YY-YY-YY-YY-YY-YY|CHECK-MIC|ADDR: ZZ-ZZ-ZZ-ZZ failed
    9:36:2:546|WARNING| ED:YY-YY-YY-YY-YY-YY-YY-YY|DROP-PKT|Addr: ZZ:ZZ:ZZ:ZZ Duplicate: no
    9:36:2:546|DEBUG| GW:00-80-00-xx-xx-xx-xx-xx|FRAME-RX|JSON: {"tmst":418086131,"chan":0,"rfch":0,"freq":868.1,"stat":1,"modu":"LORA","datr":"SF7BW125","codr":"4/5","lsnr":9.2,"rssi":-43,"size":132,"data":"qFDv11YnKbW1LIR..."}
    9:36:4:508|TRACE| TX-ACK|127.0.0.1:TTTT 4 bytes 02642404
    9:36:4:509|TRACE| GW:00-80-00-xx-xx-xx-xx-xx|SEEN|PULL-DATA|127.0.0.1:TTTT
    9:36:6:902|TRACE| TX-ACK|127.0.0.1:XXXX 4 bytes 02119e01
    9:36:6:904|TRACE| GW:00-80-00-xx-xx-xx-xx-xx|SEEN|PUSH-DATA|127.0.0.1:XXXX
    9:36:6:904|TRACE| GW:00-80-00-xx-xx-xx-xx-xx|PUSH-DATA|{"stat":{"time":"2018-06-14 09:36:06 GMT","rxnb":1,"rxok":1,"rxfw":1,"ackr":0.0,"dwnb":0,"txnb":0}}

    I notice that in all failing cases the FCNT keeps incrementing (0000001a..0000001b..0000001c..) forever while LAST-FCNT remains 00000006, and I get the message “MIC Check Failed”.

    Three questions:
    1) Why is the behavior not deterministic?
    FYI, I have observed working cases when the GW was power cycled after a few hours (100+ packets sent by the end device in the meantime) as well as after a few minutes (only 1-2 packets sent).

    2) What is the exact expected behavior here?

    3) In mass production with 100+ nodes, how do you ensure that communication resumes when a gateway is powered on again after a shutdown (e.g. a temporary power loss)?

    Thanks for your insight

    #23832
    Piotr Diop
    Participant

    Just to clarify, I power cycle the gateway by pulling the power cord out, waiting some time, then plugging it back in.

    #23833
    Jason Reiss
    Keymaster

    See LoRaWAN > Network Settings > Database > Backup Interval in the GUI.
    By default, the database is backed up to flash every hour.

    The working database is held in RAM to allow faster reads/writes and reduce wear on the flash. It is only backed up to flash periodically, per Backup Interval, or when the system is shut down gracefully.
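To make the timing dependence concrete, here is a minimal simulation of the backup behavior described above. It is only a sketch under stated assumptions: the class `NetworkServerDb`, the integer "session" standing in for the keys derived at join time, and the 60-second time step are all hypothetical and are not taken from the actual network server. It also assumes the MIC failures in the log come from the server restoring an older session than the one the device is using.

```python
import copy


class NetworkServerDb:
    """Toy model (assumption): working copy in RAM, periodic snapshot in flash."""

    def __init__(self, backup_interval_s=3600):
        self.backup_interval_s = backup_interval_s
        self.ram = {}    # dev_eui -> {"session": int, "last_fcnt": int}
        self.flash = {}  # last snapshot written to flash
        self.last_backup_s = 0

    def join(self, dev_eui, session_id):
        # A join creates a new session (new keys) and resets the frame counter.
        self.ram[dev_eui] = {"session": session_id, "last_fcnt": 0}

    def uplink(self, dev_eui, session_id, fcnt):
        # Stand-in for the MIC check: it only passes if the server holds
        # the same session as the device.
        dev = self.ram.get(dev_eui)
        if dev is None or dev["session"] != session_id:
            return False
        dev["last_fcnt"] = fcnt
        return True

    def backup(self, now_s):
        self.flash = copy.deepcopy(self.ram)
        self.last_backup_s = now_s

    def tick(self, now_s):
        # Periodic backup, per the configured interval.
        if now_s - self.last_backup_s >= self.backup_interval_s:
            self.backup(now_s)

    def power_cut(self):
        # RAM content is lost; the server restarts from the flash snapshot.
        self.ram = copy.deepcopy(self.flash)


def survives_power_cut(join_at_s, cut_at_s, backup_interval_s=3600):
    """Return True if uplinks are still accepted after an abrupt power cut."""
    db = NetworkServerDb(backup_interval_s)
    db.join("dev-1", session_id=1)  # older session, already in flash
    db.backup(0)
    rejoined = False
    for now_s in range(0, cut_at_s, 60):
        if not rejoined and now_s >= join_at_s:
            db.join("dev-1", session_id=2)  # device rejoins: new session in RAM only
            rejoined = True
        db.tick(now_s)
    db.power_cut()
    return db.uplink("dev-1", session_id=2, fcnt=1)


if __name__ == "__main__":
    print(survives_power_cut(join_at_s=600, cut_at_s=1800))  # cut before next backup
    print(survives_power_cut(join_at_s=600, cut_at_s=4000))  # cut after next backup
```

In this model the outcome depends only on whether a backup ran between the last join and the power cut, which would be one explanation for results that look random from the outside.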

    The end-devices can check that the network is available by sending confirmed uplinks periodically. If the network does not respond, the end-device can attempt a join to recover.

    A database backup can be forced with this command:
    lora-query -x database backup

    #23836
    Piotr Diop
    Participant

    See LoRaWAN > Network Settings > Database > Backup Interval in the GUI.
    By default, the database is backed up to flash every hour.

    The working database is held in RAM to allow faster reads/writes and reduce wear on the flash. It is only backed up to flash periodically, per Backup Interval, or when the system is shut down gracefully.

    Well, that explains why the issue was not reproducible when performing a software reboot or when restarting only the LoRa server itself.
    Are there any benchmarking stats on flash wear/performance for different backup intervals?

    The end-devices can check that the network is available by sending confirmed uplinks periodically. If the network does not respond, the end-device can attempt a join to recover.

    While that would work well in a small network, having hundreds of devices send confirmed messages at an unpredictable rate would become problematic in the long term.
    If I understand correctly, the gateway is also subject to duty cycle restrictions (using sliding windows), so the dummy downlinks that merely confirm the gateway is still alive (which would be the case 99% of the time) risk using up the airtime needed for real downlink messages.
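A rough back-of-the-envelope calculation helps size this concern. The sketch below uses the LoRa time-on-air formula from the Semtech transceiver datasheets; everything else is an assumption for illustration: a 12-byte empty ACK frame (MHDR + FHDR + MIC), no PHY CRC on downlinks, 125 kHz bandwidth, coding rate 4/5, and a 1% duty cycle on the downlink sub-band. The function names are hypothetical.

```python
import math


def lora_airtime_s(payload_bytes, sf, bw_hz=125_000, coding_rate=1,
                   preamble_symbols=8, explicit_header=True, crc=True):
    """Time on air of one LoRa frame, in seconds (coding_rate=1 means 4/5)."""
    t_sym = (2 ** sf) / bw_hz
    # Low data rate optimisation is used when a symbol lasts longer than 16 ms
    # (SF11 and SF12 at 125 kHz).
    low_dr_opt = 1 if t_sym > 0.016 else 0
    implicit_header = 0 if explicit_header else 1
    numerator = 8 * payload_bytes - 4 * sf + 28 + 16 * int(crc) - 20 * implicit_header
    denominator = 4 * (sf - 2 * low_dr_opt)
    payload_symbols = 8 + max(math.ceil(numerator / denominator) * (coding_rate + 4), 0)
    return (preamble_symbols + 4.25 + payload_symbols) * t_sym


def max_acks_per_hour(sf, duty_cycle=0.01, ack_bytes=12):
    """How many empty ACK downlinks fit in one hour of duty-cycle budget.

    Assumptions: downlinks carry no PHY CRC, and the whole budget of one
    sub-band is spent on ACKs.
    """
    budget_s = 3600 * duty_cycle
    return int(budget_s // lora_airtime_s(ack_bytes, sf, crc=False))


if __name__ == "__main__":
    for sf in (7, 9, 12):
        airtime_ms = lora_airtime_s(12, sf, crc=False) * 1000
        print(f"SF{sf}: {airtime_ms:.1f} ms per ACK, "
              f"{max_acks_per_hour(sf)} ACKs/hour at 1% duty cycle")
```

Under these assumptions an ACK costs about 41 ms at SF7 but close to one second at SF12, so the budget ranges from several hundred ACKs per hour down to a few dozen. That suggests the confirmation rate matters most for devices that sit at high spreading factors.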

    But that probably means that, ultimately, “periodically” has to be assessed per device, depending on how often the device already sends messages and how critical it is to lose a few chunks of data.

    I also noted in the LoRaWAN spec that there are link check commands, which follow roughly the same idea. I guess I’ll try to combine them with your suggestion to solve my issue.
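For reference, the device-side logic could look roughly like the sketch below. It is a minimal illustration, not tied to any particular LoRaWAN stack: the class `LinkMonitor` and its parameters `confirm_every` and `max_missed_acks` are hypothetical names, and the actual send/join calls are left to whatever stack the device uses. The same counter would apply whether the probe is a confirmed uplink or a LinkCheckReq, since both are answered with a downlink.

```python
class LinkMonitor:
    """Decides when to probe the network and when to give up and rejoin."""

    def __init__(self, confirm_every=20, max_missed_acks=3):
        self.confirm_every = confirm_every      # probe once per N uplinks
        self.max_missed_acks = max_missed_acks  # consecutive misses before rejoin
        self.uplinks_since_probe = 0
        self.missed_acks = 0

    def next_uplink_is_probe(self):
        """Call before each uplink; True means send it as a confirmed uplink."""
        self.uplinks_since_probe += 1
        # After a miss, keep probing on every uplink until the link is
        # confirmed again or the rejoin threshold is reached.
        if self.missed_acks > 0 or self.uplinks_since_probe >= self.confirm_every:
            self.uplinks_since_probe = 0
            return True
        return False

    def on_probe_result(self, ack_received):
        """Call after each probe; True means the device should rejoin."""
        if ack_received:
            self.missed_acks = 0
            return False
        self.missed_acks += 1
        if self.missed_acks >= self.max_missed_acks:
            self.missed_acks = 0
            return True
        return False


if __name__ == "__main__":
    monitor = LinkMonitor(confirm_every=3, max_missed_acks=2)
    for n in range(1, 6):
        if monitor.next_uplink_is_probe():
            # Simulate a network server that never answers.
            if monitor.on_probe_result(ack_received=False):
                print(f"uplink {n}: no ACK, rejoin")
            else:
                print(f"uplink {n}: no ACK, probe again")
        else:
            print(f"uplink {n}: unconfirmed")
```

With one uplink per minute and the default values, a lost session would be detected within roughly `confirm_every + max_missed_acks` minutes, while a healthy network only sees one confirmed uplink out of every `confirm_every`.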

    Thanks for your quick answer!
