libp2p_bitswap::behaviour: bitswap inbound failure

Hi,
our Actyx nodes have been hitting this error more often lately, and afterwards nothing works anymore:

Oct 28 16:49:53.253 ERROR libp2p_bitswap::behaviour: bitswap inbound failure 12D3KooWSwZabKjVucEBmKoTNwwpUzqPQKptq6NUoUjQf6QmUadh 23632 Timeout
Oct 28 16:49:53.253 ERROR libp2p_bitswap::behaviour: bitswap inbound failure 12D3KooWSwZabKjVucEBmKoTNwwpUzqPQKptq6NUoUjQf6QmUadh 23635 ConnectionClosed
Oct 28 16:49:53.253 ERROR libp2p_bitswap::behaviour: bitswap inbound failure 12D3KooWSwZabKjVucEBmKoTNwwpUzqPQKptq6NUoUjQf6QmUadh 23636 ConnectionClosed
Oct 28 16:49:53.253 ERROR libp2p_bitswap::behaviour: bitswap inbound failure 12D3KooWSwZabKjVucEBmKoTNwwpUzqPQKptq6NUoUjQf6QmUadh 23634 ConnectionClosed
Oct 28 16:49:53.253 ERROR libp2p_bitswap::behaviour: bitswap inbound failure 12D3KooWSwZabKjVucEBmKoTNwwpUzqPQKptq6NUoUjQf6QmUadh 23633 ConnectionClosed
Oct 28 16:49:53.253 ERROR libp2p_bitswap::behaviour: bitswap inbound failure 12D3KooWSwZabKjVucEBmKoTNwwpUzqPQKptq6NUoUjQf6QmUadh 23631 Timeout
Oct 28 16:49:53.253 ERROR libp2p_bitswap::behaviour: bitswap inbound failure 12D3KooWSwZabKjVucEBmKoTNwwpUzqPQKptq6NUoUjQf6QmUadh 23637 Timeout
Oct 28 16:49:53.253 ERROR libp2p_bitswap::behaviour: bitswap inbound failure 12D3KooWHazVC5Lc3wnkF8mMVHJkrGB6fNWpTxcXWLAyLhuMkVgo 23638 Timeout

In some cases the only thing that helps is deleting the actyx-data folder completely. The Actyx version in use is 2.8.1.

This looks like a node not being able to sync a data block from another node due to network failures or some overload problem on the other node. Are all nodes of the same version?

One consequence of this will be that data flow from the other node to this node will not work while the problem persists. Communication with other nodes should keep working. Therefore it would be valuable to know what exactly “afterwards nothing works anymore” means.
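To see which peers are affected and how the failures split between Timeout and ConnectionClosed, the log lines can be tallied per peer. This is only a minimal sketch based on the line format pasted above; the function name `summarize_failures` is made up for illustration, and treating the number after the peer ID as a request ID is an assumption.

```python
import re
from collections import Counter

# Matches the tail of lines such as:
#   ... ERROR libp2p_bitswap::behaviour: bitswap inbound failure <peer> <number> <reason>
# Assumption: the number is a request ID; it is captured but not used for counting.
LINE_RE = re.compile(
    r"bitswap inbound failure\s+(?P<peer>\S+)\s+(?P<request>\d+)\s+(?P<reason>\w+)"
)


def summarize_failures(lines):
    """Count bitswap inbound failures per (peer, reason) pair."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            counts[(match["peer"], match["reason"])] += 1
    return counts
```

Feeding it an open log file (any iterable of lines works) returns a `Counter` keyed by peer ID and failure reason. If only one peer ID shows up, that points at a problem with that particular node rather than with the local one.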

Whoops sorry! Thursday was a long day.
By ‘nothing works after that error’ I mean that no further events get forwarded to the connected apps.
All nodes are running 2.8.1. Most of them are Linux based, 2 are Windows based. One node is running the Android version (patched for Android 11).

I was hoping for some more feedback on what kind of debug output you guys might need from me.

Thanks for the details, Frank! It would be great if you could upload the full logs around one or two such incidents, ideally with DEBUG level (in /admin/logLevels/node) — which raises the question of how reproducible this issue is. What is the total swarm size?

It would also be very helpful to get the output of ax nodes inspect for the node having that problem as well as some others — in particular those that are not listed as connections at the troubled one.
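If several nodes need to be checked, the two steps above can be scripted. The following is a rough sketch only: it assumes the `ax` CLI is on the PATH and that the argument order is `ax settings set <scope> <value> <node>` and `ax nodes inspect <node>`, which should be checked against `ax --help` for the installed version. The helper names are hypothetical.

```python
import shutil
import subprocess


def log_level_command(node, level="DEBUG"):
    """Build the command that sets /admin/logLevels/node (argument order assumed)."""
    return ["ax", "settings", "set", "/admin/logLevels/node", level, node]


def inspect_command(node):
    """Build the `ax nodes inspect` command for one node (argument order assumed)."""
    return ["ax", "nodes", "inspect", node]


def run(cmd):
    """Run one command and return its exit code and combined output."""
    if shutil.which(cmd[0]) is None:
        raise RuntimeError(f"{cmd[0]} not found on PATH")
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return proc.returncode, proc.stdout + proc.stderr
```

For example, `run(inspect_command("localhost"))` would capture the inspect output of a local node, and the same call in a loop over the other node addresses collects the rest.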

After some checking, our current guess is the following:

One of our nodes had run out of disk space. This could be the cause of the problem. Perhaps you can try to reproduce that?

Ah yes, that condition can cause any number and kind of unforeseeable issues. After fixing that, does it work again? I’m not sure we can ever make Actyx work reliably after encountering the “no disk space” error; there are just too many things inside Linux that don’t tolerate this condition.
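Since a full disk is hard to recover from, it may be worth checking free space on each node before it runs out. A minimal sketch using Python's standard library follows; the function name, the 1 GiB default threshold, and whatever path you pass in (for example the location of the actyx-data folder) are assumptions for illustration, not Actyx settings.

```python
import shutil


def free_space_report(path, min_free_bytes=1 * 1024**3):
    """Report free space on the filesystem holding `path` and flag low space."""
    usage = shutil.disk_usage(path)
    return {
        "path": path,
        "free_bytes": usage.free,
        "total_bytes": usage.total,
        # True when less than the (assumed) threshold is left.
        "low": usage.free < min_free_bytes,
    }
```

Running this periodically against the directory that holds the node's data and alerting when `low` is true would catch the condition before the node gets stuck.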

Well, we are still working on getting the customer to increase the HDD size. I’ll get back to you.

Mind you: the errors shown in the first post were on a different node that was connected to the one with the full HDD.

Yes, that’s somewhat expected: one node has trouble and gets into a weird state, all other nodes now can’t get data from the stuck node anymore.