The RunOnlyOnIP functionality still works fine.
But a ticket introduced the behavior that, in a load-balanced environment, if RunOnlyOnIP is null the scheduler is started on just one server (provided the servers are connected via hazelcast).
My test case involved the following:
started two servers linked by hazelcast at the same time
checked in idempiereMonitor: everything working fine, schedulers running on just one server
restarted the server where the schedulers were not running
all the schedulers are now running twice
The probable cause is that the schedulers are started before the restarted server joins the other hazelcast node.
Delay the start of scheduler server to give time to hazelcast to start and join other nodes
Is it possible to know the status of hazelcast from the org.compiere.server.Scheduler or org.adempiere.server.AdempiereServerActivator?
If yes, I think a wait can be implemented until we know that the hazelcast plugin has started and has joined the hazelcast network (or is standalone).
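A minimal sketch of such a bounded wait, assuming some way exists to ask whether the cluster service is ready. `ClusterStartGate` and the `clusterReady` supplier are hypothetical names for illustration, not existing iDempiere or hazelcast API; how the status would actually be obtained is exactly the open question above.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

/** Hypothetical helper (not part of iDempiere): wait for a readiness signal with a timeout. */
final class ClusterStartGate {
    private ClusterStartGate() {}

    /**
     * Polls clusterReady until it reports true or timeoutMillis has elapsed.
     * clusterReady is an assumption: it stands for "hazelcast has started and
     * has joined the cluster, or is known to be standalone".
     *
     * @return true if readiness was reported in time, false on timeout or interrupt
     */
    static boolean awaitReady(BooleanSupplier clusterReady, long timeoutMillis, long pollMillis) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (!clusterReady.getAsBoolean()) {
            if (System.nanoTime() - deadline >= 0) {
                return false; // timed out; the caller decides whether to start anyway
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                return false;
            }
        }
        return true;
    }
}
```

The scheduler start would then be gated on something like `ClusterStartGate.awaitReady(status, 180_000, 1_000)`, so a standalone server is not blocked forever if the cluster never reports ready.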
If that's not possible, maybe we can implement some kind of semaphore (perhaps something like org.adempiere.plugin.utils.AbstractActivator.getLockPO).
Alternatively, add an isStarted/isScheduled flag to the AD_Scheduler table and use it to prevent more than one server from starting a particular scheduler record.
The problem I see with this alternative is that a crashed server will leave the flag set permanently and would require manual intervention to make it work again.
I’ve pushed a pull request that adds a 1 to 3 minute delay to wait for the hazelcast service.
Thanks, this was tested in the normal single-server scenario. Testing with a configuration of hazelcast nodes is still pending.
Hi, I found one server where the AdempiereServerMgr didn't start, without leaving any clue in the log as to the reason.
It seems this code at AdempiereMonitor.init:1285:
is swallowing the exception, so the failure is not noticeable to the user.
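As an illustration of the kind of change meant here (the actual code at that location is not reproduced in this thread), a minimal sketch of a startup step that logs the exception with its stack trace instead of discarding it. `StartupLogging` and `runStep` are hypothetical names, and plain `java.util.logging` is used as a stand-in for whatever logger the class uses.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical wrapper (not the AdempiereMonitor code): never swallow a startup failure silently. */
final class StartupLogging {
    private static final Logger log = Logger.getLogger(StartupLogging.class.getName());

    private StartupLogging() {}

    /**
     * Runs one startup step. A failure is logged at SEVERE with the full
     * stack trace, so the reason is visible in the server log.
     *
     * @return true if the step completed, false if it threw
     */
    static boolean runStep(String name, Runnable step) {
        try {
            step.run();
            return true;
        } catch (Exception e) {
            log.log(Level.SEVERE, "Startup step failed: " + name, e);
            return false;
        }
    }
}
```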
Debugging I found the cause to be this exception:
I'm trying to find the root cause in order to fix it, but in any case I think it is important that the log shows the error.
Going deeper into the root cause of the failure: the creation of the MSession record is reached without a context. The validation showed that
getCtx().isEmpty() -> true
getCtx().getProperty("#AD_Client_ID") == null -> true
I don't know why this happens on one specific server but not on others. However, I think there may be something wrong with the context of the thread opened with the Adempiere.getThreadPoolExecutor().schedule(() ?
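To make that suspicion concrete, here is a self-contained sketch of how a per-thread context can be empty inside a scheduled task. It assumes the context is held per thread; the `CTX` thread-local below is a stand-in for illustration, not the real iDempiere context class, and `withContext` is a hypothetical helper.

```java
import java.util.Properties;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical demo: a thread-local context is not inherited by pool threads. */
final class ContextPropagationSketch {
    // Stand-in for a per-thread context (assumption, not the real implementation).
    static final ThreadLocal<Properties> CTX = ThreadLocal.withInitial(Properties::new);

    private ContextPropagationSketch() {}

    /** Wraps task so it runs with a copy of the captured context, restoring the old one after. */
    static Runnable withContext(Properties captured, Runnable task) {
        Properties copy = new Properties();
        copy.putAll(captured); // snapshot taken on the submitting thread
        return () -> {
            Properties previous = CTX.get();
            CTX.set(copy);
            try {
                task.run();
            } finally {
                CTX.set(previous);
            }
        };
    }

    /** Returns #AD_Client_ID as seen by a scheduled pool thread, with or without propagation. */
    static String clientSeenByPool(boolean propagate) {
        CTX.get().setProperty("#AD_Client_ID", "11"); // set on the calling thread only
        ScheduledExecutorService pool = Executors.newSingleThreadScheduledExecutor();
        try {
            String[] seen = new String[1];
            Runnable task = () -> seen[0] = CTX.get().getProperty("#AD_Client_ID");
            Runnable toRun = propagate ? withContext(CTX.get(), task) : task;
            pool.schedule(toRun, 10, TimeUnit.MILLISECONDS).get(); // get() waits for the task
            return seen[0];
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdownNow();
        }
    }
}
```

Without propagation the pool thread sees an empty context, which would match the `getCtx().isEmpty()` observation; whether this is what happens on the affected server still needs to be confirmed.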