Check if Actyx Node is reachable before emitting an event

Hi,
We have a handshake between a machine (ActyxOS node) and our UI (ActyxOS node). With this handshake, the UI only registers an order once the handshake is completed. This can cause weird behaviour for the end user when the machine node is not reachable (e.g. an order suddenly appears when the machine node becomes reachable again). Therefore I want to check whether the machine node is reachable before emitting the event that starts the handshake.

How can I access all nodes to check if my machine node is reachable?
I’m not able to use http://localhost:4457/_internal/swarm/state anymore since I updated to Pond v3 and ActyxOS v2, so I assume there is another way now?

Best regards
Sebastian Stumm

Hi Sebastian,

Relying on the reachability of another node is very fragile. There may be a more idiomatic approach to implementing the logic you envision.

Could you describe a little what the goal of the handshake and the connectivity check is?

Thanks,
Oliver

Hi Oliver,

The handshake ensures that an order can’t be registered twice due to connection issues (even if it’s just locally on a device). The machine node is the one that actually registers the order and checks the consistency of the operation. Its local state is the state that will be transferred to our database and the customer’s ERP system. So if multiple terminals (tablets) try to register the same order, the machine node rejects one of them based on its local consistency check. This works as intended but lacks usability when the machine is not reachable.

When the machine is not reachable, the order remains in a “requestedToStart” state. As soon as the machine node is back online, it gets the “requestedToStart” event synced and either accepts or declines the order. If it’s accepted, the order will pop up in the terminal. That is okay if the connection was only briefly interrupted, but after a longer disconnect this behaviour is really odd. Therefore the idea was to check whether the node is reachable before emitting the request event, basically refusing to register an order if the machine is not in reach.

Hi Sebastian,

thanks for the additional details! Just to make sure we’re not talking past each other:

  • UI nodes allow humans to say “that machine is working on order X”
  • machine twin has some validation logic that decides on the effect of these assignments (may decline them)
  • ERP exporter (same as DB exporter) takes the machine’s decisions and updates external system

From the UI perspective the user wants some feedback, so once the machine “answers” you want to show the response. As long as the machine doesn’t answer, the user won’t get a response (no matter how you do it).

Finding out whether the machine can currently answer is the same as making a request, so adding another mechanism doesn’t really change the situation: even if the pre-flight check came back okay, the real request may still not work due to some network issue. My recommendation therefore is to send the assignment event in any case, transition the UI into a “pending” state and perhaps allow the user to send a cancellation event to get out of that state. Technically speaking, once the network works again, both events (assignment and cancellation) will very likely be sent to the machine together, so they will be processed one right after the other.
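To make the “pending” state a bit more tangible, here is a minimal sketch of how the UI side could fold the events into a display state. It is plain TypeScript and does not use the Pond API; the event names (`assignmentRequested`, `assignmentCancelled`, and so on) and the `UiState` values are made up for illustration.

```typescript
// Hypothetical event and state names; none of these come from the Actyx SDK.
type OrderEvent =
  | { type: 'assignmentRequested'; orderId: string }
  | { type: 'assignmentCancelled'; orderId: string }
  | { type: 'assignmentConfirmed'; orderId: string }
  | { type: 'assignmentRejected'; orderId: string }

type UiState = 'idle' | 'pending' | 'cancelled' | 'confirmed' | 'rejected'

// Fold one event into the state shown to the user.
// Only a pending request can be cancelled, confirmed or rejected.
const onUiEvent = (state: UiState, event: OrderEvent): UiState => {
  switch (event.type) {
    case 'assignmentRequested':
      return state === 'idle' ? 'pending' : state
    case 'assignmentCancelled':
      return state === 'pending' ? 'cancelled' : state
    case 'assignmentConfirmed':
      return state === 'pending' ? 'confirmed' : state
    case 'assignmentRejected':
      return state === 'pending' ? 'rejected' : state
  }
}
```

In this sketch a confirmation that arrives after a cancellation is ignored by the UI; whether that is the right rule depends on your requirements.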

From the machine’s perspective, you just need to code your validation logic, which will observe all assignment requests from UI nodes and emit confirmation or rejection events accordingly. The ERP exporter then books the confirmations into the external system.
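As a rough illustration of such validation logic, the sketch below confirms the first request for each order and rejects later ones. It assumes the requests arrive in the order in which they should be decided; the `Request` and `Decision` types and the `decide` function are hypothetical and not part of any Actyx API.

```typescript
// Hypothetical shapes for requests from UI nodes and the machine's decisions.
type Request = { orderId: string; terminal: string }
type Decision = { type: 'confirmed' | 'rejected'; orderId: string; terminal: string }

// Confirm the first request seen for an order, reject every later one.
const decide = (requests: ReadonlyArray<Request>): Decision[] => {
  const assigned = new Set<string>()
  return requests.map((r): Decision => {
    if (assigned.has(r.orderId)) {
      return { type: 'rejected', orderId: r.orderId, terminal: r.terminal }
    }
    assigned.add(r.orderId)
    return { type: 'confirmed', orderId: r.orderId, terminal: r.terminal }
  })
}
```

An exporter would then only need to look at the `confirmed` decisions.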


In general, the idea of Actyx is that you never check whether something else is reachable before doing what you locally decided is the right thing to do. You record your decisions and let others react to them once they receive the information.

Thanks, @roland!

@SebastianS,

What you are building is completely right. If the machine is the entity that makes the decision, you are rightly processing the respective events in the machine.

Two follow-up questions from the requirements side to make sure I understand you correctly:

  1. Are you trying to avoid people being shown the pending state of an assignment for too long?
  2. Would you prefer people not be able to assign orders to a machine that hasn’t been online for X minutes?

I am asking as there may be a simple solution to achieve this.

Thanks,
Oliver

Hi @SebastianS,

after thinking about the issue a bit more, especially point 2 from Oli above, there is another aspect that I didn’t cover previously — and that you’re possibly more interested in. The human in front of the UI should be presented with a reasonable choice, in this case a reasonably fruitful selection of machines that are likely to respond. As in any distributed system, such a selection can only be done based on past information received from the machines, for example by emitting heartbeat events and filtering based on the youngest heartbeat timestamp that the UI node has seen from each of them.

Concretely, if each machine regularly emitted events, the UI could filter the available machines by checking the machine twin’s “latest seen” timestamp. The twin would need to update this whenever it processes an event coming from the machine gateway.
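A minimal sketch of that filter could look like the following. It assumes each machine twin exposes a `latestSeenMs` field (milliseconds since the epoch); that field, the `MachineTwin` type and the `likelyReachable` function are hypothetical names, not something the Pond provides.

```typescript
// Hypothetical twin state: the timestamp of the youngest event seen from the machine.
type MachineTwin = { machineId: string; latestSeenMs: number }

// Keep only machines whose latest event is at most maxAgeMs old.
const likelyReachable = (
  twins: ReadonlyArray<MachineTwin>,
  nowMs: number,
  maxAgeMs: number,
): MachineTwin[] => twins.filter((twin) => nowMs - twin.latestSeenMs <= maxAgeMs)
```

The UI could call this with something like `likelyReachable(twins, Date.now(), 5 * 60 * 1000)` to offer only machines heard from in the last five minutes. This is still based on past information, so a listed machine may have gone offline in the meantime.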

If there is no other reason for regularly emitting events, then dedicated heartbeat events could be added. But this feels more like a missing Actyx feature: what if you had an API that would let you assess when the local node has last heard from a given node or twin? We haven’t fleshed out this feature yet, but we’re currently thinking about how the new and better version of the old connectivity status API might look and feel. We’d very much welcome your input!

Regards,

Roland

@SebastianS here is an example of a pure heartbeat implementation for the machine fish in the advanced example: Comparing master...ost/machine-heartbeat-example · Actyx/advanced-example · GitHub.

Depending on the number of machines and the frequency of the heartbeat this could generate a lot of events. That’s not necessarily a problem, but worth considering.

As @roland noted, we will think about implementing this inside Actyx — we can do this much more efficiently internally. Any input or wishes from your side are appreciated and can help us build something that perfectly solves your problem.

This is indeed a very important feature and it would be great to have it as a standard API call. I think that in many use cases it is important to know which machine is on, which forklift is in service and so on.

Noted @Mugdin, thanks!

Hi @roland,

I agree that in a pure Actyx environment sending a cancellation event would be the correct way to handle this.

But in case of a cancellation the request was still emitted.
If the request passes, the logic that triggers the bookings for the ERP system and the database would not know that this request was cancelled by the following event and would export the data. In that case we would have to revert our changes in the database and in the customer’s ERP system. This is not feasible for every cancelled request.
That’s the main reason why we wanted to check the connectivity first.

Hi @oli,

  1. Yes I’m trying to avoid a long pending state, e.g. by a timeout for the request
  2. If I can timeout the request, no.

Hi @SebastianS,

yes, I agree that checking first is a good way of reducing the frequency of weird things happening. Just note that no matter which technology you use, if you don’t hear back from a transaction you can never be sure what its outcome was (e.g. database applies transaction but someone pulls the network plug before you get the confirmation, which then dies in some TCP buffer; or the same with an HTTP response that got lost, possibly due to a timeout).

Regarding the cancellation’s effect: if you implemented the ERP connector as a fish, then the most likely scenario is that you get both the request and the cancellation before the next observable Fish state is published, so no booking would be generated. This is because newly received events are most likely applied en bloc.
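To illustrate the idea (not the actual Pond machinery), here is a sketch in plain TypeScript where bookings are derived only from the state after a whole batch of events has been applied. All names (`ExporterEvent`, `applyBatch`, `bookingsFor`) are invented for this example, and it assumes the confirmation and the cancellation really do arrive in the same batch, which is likely but not guaranteed.

```typescript
// Hypothetical events seen by the exporter.
type ExporterEvent =
  | { type: 'assignmentConfirmed'; orderId: string }
  | { type: 'assignmentCancelled'; orderId: string }

type ExporterState = ReadonlyMap<string, 'toBook' | 'cancelled'>

// Fold a single event; a cancellation always wins over a confirmation.
const onExporterEvent = (state: ExporterState, event: ExporterEvent): ExporterState => {
  const next = new Map(state)
  if (event.type === 'assignmentCancelled') {
    next.set(event.orderId, 'cancelled')
  } else if (next.get(event.orderId) !== 'cancelled') {
    next.set(event.orderId, 'toBook')
  }
  return next
}

// Apply all newly received events before anyone looks at the state.
const applyBatch = (state: ExporterState, batch: ReadonlyArray<ExporterEvent>): ExporterState =>
  batch.reduce<ExporterState>(onExporterEvent, state)

// Bookings are computed from the state after the batch, not per event.
const bookingsFor = (state: ExporterState): string[] =>
  Array.from(state.entries())
    .filter(([, status]) => status === 'toBook')
    .map(([orderId]) => orderId)
```

If the cancellation arrives in a later batch than the confirmation, the booking would already have been made, so this reduces the number of reverts rather than eliminating them.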

Regarding timing out the request: you can expose the emission timestamp of the request’s event in your machine fish and not turn the request into an assignment event if it is too old. In that case I’d write a rejection event so that the UI node can inform the human operator that the request was not successful. With this approach you’ll want to ensure proper NTP synchronisation of all devices, though, because otherwise clock skew can make it impossible to send successful requests (you probably already do this, just mentioning for completeness).
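The timeout check itself can be very small. The sketch below assumes the request carries its emission time as `emittedAtMs` and that the machine compares it against its own clock when processing; the types, field names and the `decideWithTimeout` function are hypothetical.

```typescript
// Hypothetical request as seen by the machine, with its emission timestamp.
type AssignmentRequest = { orderId: string; emittedAtMs: number }
type MachineDecision = { type: 'assignment' | 'rejection'; orderId: string; reason?: string }

// Reject requests that are older than maxAgeMs when the machine processes them.
// A negative age (emitter clock ahead of the machine clock) is treated as fresh here.
const decideWithTimeout = (
  request: AssignmentRequest,
  processedAtMs: number,
  maxAgeMs: number,
): MachineDecision => {
  const ageMs = processedAtMs - request.emittedAtMs
  return ageMs > maxAgeMs
    ? { type: 'rejection', orderId: request.orderId, reason: 'request too old' }
    : { type: 'assignment', orderId: request.orderId }
}
```

Because the comparison mixes the clocks of two devices, the result is only as good as their time synchronisation.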

Regards,

Roland