Schedulers launched twice - problem with RunOnlyOnIP null running on load balancer

Description

The RunOnlyOnIP functionality still works fine.

However, an earlier ticket introduced the behavior that, in a load-balanced environment, if RunOnlyOnIP is null the scheduler is started on just one server (provided the servers are connected via hazelcast).

My test case involved the following:

  • started two servers linked by hazelcast at the same time

  • checked in idempiereMonitor: everything working fine, schedulers running on just one server

  • restarted the server where the schedulers were not running

  • all the schedulers are now running twice

Probably the cause is that the schedulers are being started before the server joins the other hazelcast node.

Environment

None

Activity

Heng Sin Low
October 1, 2020, 9:36 AM

Hi,

Thinking of 2 possible solutions below:

  • Delay the start of scheduler server to give time to hazelcast to start and join other nodes

  • Alternatively, add an isStarted/isScheduled flag to the AD_Scheduler table and use it to prevent more than one server from starting a particular scheduler record.

WDYT ?

Regards,

Low

Carlos Ruiz
October 1, 2020, 9:57 AM

Hi,

> Delay the start of scheduler server to give time to hazelcast to start and join other nodes

Is it possible to know the status of hazelcast from the org.compiere.server.Scheduler or org.adempiere.server.AdempiereServerActivator?

If yes, I think a wait can be implemented until we know that the hazelcast plugin has started and has joined the hazelcast network (or is standalone).
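A minimal sketch of such a wait, assuming the readiness check can be expressed as a simple boolean probe. The class name `ClusterReadyWait`, the method `awaitReady`, and the `clusterReady` probe are hypothetical and stand in for whatever status hazelcast actually exposes to the server bundle; no real hazelcast or iDempiere API is used here.

```java
import java.util.function.BooleanSupplier;

// Hypothetical helper: polls a readiness probe until it succeeds or a timeout elapses.
class ClusterReadyWait {

    /**
     * @param clusterReady  assumed probe: true once hazelcast has joined (or is standalone)
     * @param timeoutMillis maximum time to wait before giving up
     * @param pollMillis    pause between two probe calls
     * @return true if the probe reported ready, false on timeout or interruption
     */
    static boolean awaitReady(BooleanSupplier clusterReady, long timeoutMillis, long pollMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            if (clusterReady.getAsBoolean()) {
                return true;
            }
            long remainingMillis = (deadline - System.nanoTime()) / 1_000_000L;
            if (remainingMillis <= 0) {
                return false;
            }
            try {
                // Never sleep past the deadline.
                Thread.sleep(Math.max(1, Math.min(pollMillis, remainingMillis)));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                return false;
            }
        }
    }
}
```

The scheduler start would then be gated on the return value; on timeout the caller still has to decide whether to start anyway (standalone case) or to keep waiting.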

If that's not possible, maybe we can implement some kind of semaphore (perhaps something like org.adempiere.plugin.utils.AbstractActivator.getLockPO).

> Alternatively, add isStarted/isScheduled flag to AD_Scheduler table and use that to avoid more than 1 server start a particular scheduler record.

The problem I see with this alternative is that a crashed server will leave the flag set permanently and would require manual intervention to make it work again.
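To illustrate the stale-flag concern, here is a sketch of one common mitigation that was not proposed in this ticket: replacing the permanent flag with a lease that expires unless the owning server renews it. Everything here is hypothetical (the class `SchedulerLeaseTable`, the method `tryClaim`, the in-memory map standing in for AD_Scheduler rows); a real version would need an atomic database update.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory stand-in for a lease column on AD_Scheduler.
class SchedulerLeaseTable {

    private static final class Lease {
        final String owner;
        final long expiresAtMillis;

        Lease(String owner, long expiresAtMillis) {
            this.owner = owner;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final Map<Integer, Lease> leases = new HashMap<>();
    private final long leaseMillis;

    SchedulerLeaseTable(long leaseMillis) {
        this.leaseMillis = leaseMillis;
    }

    /**
     * Claims or renews the lease for a scheduler record.
     * Succeeds if the record is unclaimed, the lease has expired,
     * or the caller already owns it (renewal).
     */
    synchronized boolean tryClaim(int schedulerId, String serverId, long nowMillis) {
        Lease current = leases.get(schedulerId);
        boolean available = current == null
                || current.expiresAtMillis <= nowMillis
                || current.owner.equals(serverId);
        if (!available) {
            return false;
        }
        leases.put(schedulerId, new Lease(serverId, nowMillis + leaseMillis));
        return true;
    }
}
```

With a lease, a crashed server simply stops renewing, so another node can take over once the lease runs out, at the cost of a takeover delay of up to one lease period.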

Regards,

Carlos Ruiz

Heng Sin Low
October 13, 2020, 6:14 AM

Hi,

I’ve pushed a pull request that adds a 1 to 3 minute delay to wait for the hazelcast service.

Regards,

Low

Carlos Ruiz
October 17, 2020, 8:35 PM

Thanks, this was tested in the normal single-server scenario. Testing with a configuration of hazelcast nodes is still pending.

Regards,

Carlos Ruiz

Assignee

Heng Sin Low

Reporter

Carlos Ruiz

Labels

None

Tested By

None

Affects versions

Priority

Minor