Beamline control software fault tolerance for MXCuBE 3

10th MXCuBE meeting, 16-18 January 2017, Grenoble, France

presented by Matias Guijarro, Beamline Control Unit, Software Group, ESRF

Beamline control software can fail in many ways

  • Unresponsive hardware: can't communicate, risk of blocking
  • Restarted devices: failed, missing or incomplete reinitialization
  • Unexpected hardware state: breaking state machines, can become unresponsive
  • Segfault from underlying third-party control library: exiting, no clean up
  • Uncaught exceptions + our own bugs: complete havoc

Failures should be handled at the control software level...

...but they still happen all the time

MXCuBE layers of control

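[Diagram: layers of control. The User Interface level sits on top of the Python hardware objects, which sit on top of the Hardware. Hardware objects reach the hardware through TANGO, Sardana, EPICS, Tine, spec or BLISS, or directly through a library (e.g. LIMA) or a socket, serial or USB link.]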

MXCuBE layers of control

  • What happens in case of beamline control software failure?

[Diagram: the Python hardware objects and the control layer beneath them: TANGO, Sardana, EPICS, Tine, spec, BLISS, or a library (e.g. LIMA), socket, serial, USB...]

Quite often MXCuBE has to be restarted

Particular case of MXCuBE 3

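[Diagram: in MXCuBE 3 the User interface (front-end) talks to the Web application server (back-end), which uses the Python hardware objects. These reach the Hardware through TANGO, Sardana, EPICS, Tine, spec or BLISS, or directly through a library (e.g. LIMA) or a socket, serial or USB link.]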

How to improve this situation?

Proposal: isolate Hardware Objects in their own process

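[Diagram: the User interface (front-end) and the Web application server (back-end) on one side; the Python hardware objects, their control layer (TANGO, Sardana, EPICS, Tine, spec, BLISS, or a library such as LIMA, socket, serial, USB...) and the Hardware on the other. A "?" marks the link between the server and the hardware objects.]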

How?

Hardware Objects living in their own process

  • Objects proxying
  • Communication between calling code (server) and hardware objects
      - RPC
      - Serialization
      - Events
  • Hardware objects process monitoring
  • Debugging

Objects proxying

  • Minimal changes to calling code (see __init__.py from MXCuBE 3)

...

hwr_directory = cmdline_options.hwr_directory
hwr = hwr.HardwareRepository(os.path.abspath(os.path.expanduser(hwr_directory)))
hwr.connect()

...

def complete_initialization(app):
        app.beamline = hwr.getHardwareObject(cmdline_options.beamline_setup)
        app.session = app.beamline.getObjectByRole("session")
        app.collect = app.beamline.getObjectByRole("collect")

        Utils.enable_snapshots(app.collect)

        app.diffractometer = app.beamline.getObjectByRole("diffractometer")

        if getattr(app.diffractometer, 'centring_motors_list', None) is None:
            # centring_motors_list is the list of roles corresponding to diffractometer motors
            app.diffractometer.centring_motors_list = app.diffractometer.getPositions().keys()

        app.db_connection = app.beamline.getObjectByRole("lims_client")
        app.empty_queue = pickle.dumps(hwr.getHardwareObject(cmdline_options.queue_model))
        app.sample_changer = app.beamline.getObjectByRole("sample_changer")
        app.rest_lims = app.beamline.getObjectByRole("lims_rest_client")
        app.queue = qutils.new_queue()

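Proposed replacement for the HardwareRepository(...) and getHardwareObject(cmdline_options.beamline_setup) calls:

isolatedHwr = startHWRProcess(os.path.abspath(os.path.expanduser(hwr_directory)))

isolatedHwr.getProxy(cmdline_options.beamline_setup)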

'getObjectByRole' has to return a proxy, too
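The proxying idea can be illustrated with a minimal sketch. It is not the MXCuBE implementation: `Motor`, `serve`, `HardwareObjectProxy` and `start_hwr_process` are hypothetical names, and the standard library `multiprocessing` module (with a forked child and a pipe) stands in for whatever transport is eventually chosen. The calling code keeps using ordinary method calls, and the proxy forwards them to the process that owns the real object.

```python
import multiprocessing as mp


class Motor:
    """Hypothetical stand-in for a hardware object."""

    def __init__(self):
        self.position = 0.0

    def move(self, target):
        self.position = target
        return self.position

    def getPosition(self):
        return self.position


def serve(conn, objects):
    """Child process loop: run (name, method, args, kwargs) requests."""
    while True:
        name, method, args, kwargs = conn.recv()
        try:
            result = getattr(objects[name], method)(*args, **kwargs)
            conn.send(("ok", result))
        except Exception as exc:
            # Report the failure instead of letting it kill the process.
            conn.send(("error", repr(exc)))


class HardwareObjectProxy:
    """Forwards any method call to the object living in the child process."""

    def __init__(self, conn, name):
        self._conn = conn
        self._name = name

    def __getattr__(self, method):
        # Only reached for attributes the proxy itself does not have.
        if method.startswith("_"):
            raise AttributeError(method)

        def remote_call(*args, **kwargs):
            self._conn.send((self._name, method, args, kwargs))
            status, payload = self._conn.recv()
            if status == "error":
                raise RuntimeError(payload)
            return payload

        return remote_call


def start_hwr_process(objects):
    """Fork a child that owns `objects`; return (process, connection)."""
    ctx = mp.get_context("fork")  # POSIX only; the child gets a copy of `objects`
    parent_conn, child_conn = ctx.Pipe()
    # daemon=True: the child is stopped when the parent exits.
    proc = ctx.Process(target=serve, args=(child_conn, objects), daemon=True)
    proc.start()
    child_conn.close()
    return proc, parent_conn


if __name__ == "__main__":
    proc, conn = start_hwr_process({"omega": Motor()})
    omega = HardwareObjectProxy(conn, "omega")
    print(omega.move(90.0), omega.getPosition())
    proc.terminate()
```

Because the proxy resolves method names at call time, the calling code needs no list of the methods each hardware object provides. The price is that a misspelled method name only fails once the request reaches the other process.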

Objects proxying in action

mxcube.beamline.getObjectByRole("session")

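[Diagram: an incoming request reaches the Web application server (back-end), which calls the "beamline" hwobj in the Python hardware objects process. The return value is tested ("value is hwobj?"); here it is the "session" hwobj, so a proxy is returned.]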

Return values travel over inter-process communication, so they have to be serialized. If a return value is a Hardware Object, a proxy to this object has to be returned to the caller instead.

Beware of unserializable values, e.g. greenlet objects
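The return-value rule can be sketched as follows, under stated assumptions: `HardwareObject` is a hypothetical stand-in for the real base class, `marshal_result` and `unmarshal_result` are illustrative names, and a `threading.Lock` plays the role of an unserializable value such as a greenlet.

```python
import pickle
import threading


class HardwareObject:
    """Hypothetical stand-in for the hardware object base class."""

    def __init__(self, name):
        self.name = name


def marshal_result(value):
    """Encode a return value for the trip back to the caller."""
    if isinstance(value, HardwareObject):
        # Never ship the object itself: send a reference that the
        # caller turns into a proxy.
        return pickle.dumps(("proxy", value.name))
    try:
        return pickle.dumps(("value", value))
    except Exception as exc:
        # Fail with a clear message rather than a broken reply.
        raise TypeError(
            "cannot serialize %s: %s" % (type(value).__name__, exc)
        ) from exc


def unmarshal_result(data, make_proxy):
    """Decode a reply; build a proxy when the reply is a reference."""
    kind, payload = pickle.loads(data)
    return make_proxy(payload) if kind == "proxy" else payload


if __name__ == "__main__":
    reply = marshal_result(HardwareObject("session"))
    print(unmarshal_result(reply, lambda name: "<proxy to %s>" % name))
    try:
        marshal_result(threading.Lock())
    except TypeError as exc:
        print(exc)
```

Checking for a hardware object before attempting to pickle matters: some hardware objects might pickle successfully, and the caller would then silently receive a detached copy instead of a proxy.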

Communication between caller and hardware objects

  • Which RPC protocol to use?

- Tango

- xml-rpc or equivalent (json-rpc, MessagePack, Thrift, etc.)

- Something home-brewed

  • Serialization

- want to be able to pass Python structures (lists, dicts, objects, etc.)

- pickle (native Python serialization)

  • Events

- nice flask-socketio feature: message queue (based on redis)

- any message from any process sent to the redis queue is forwarded to clients
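The serialization requirement can be made concrete with a small comparison. This is only an illustration using the standard library: the `positions` and `flags` payloads are invented, and JSON stands in for the text-based protocols listed above.

```python
import json
import pickle


def roundtrip_pickle(value):
    """Serialize and restore a value with pickle."""
    return pickle.loads(pickle.dumps(value, protocol=pickle.HIGHEST_PROTOCOL))


def roundtrip_json(value):
    """Serialize and restore a value with JSON."""
    return json.loads(json.dumps(value))


# Invented payloads: a dict holding a tuple, and a set.
positions = {"omega": 90.0, "sample": (0.1, 0.2)}
flags = {"wait", "centred"}

# pickle preserves the Python types exactly.
assert roundtrip_pickle(positions) == positions
assert roundtrip_pickle(flags) == flags

# JSON turns the tuple into a list...
assert roundtrip_json(positions)["sample"] == [0.1, 0.2]

# ...and cannot encode a set at all.
try:
    json.dumps(flags)
    json_handles_sets = True
except TypeError:
    json_handles_sets = False
assert not json_handles_sets
```

pickle is convenient here because both ends are Python processes under the same control. It should only be used between trusted processes, since unpickling data can execute arbitrary code.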

Hardware Objects process monitoring

  • Cornerstone of all this
  • Need to be able to detect and react to:

- process killed / restarted

- process becoming unresponsive

- dead process (segfault)

  • Possible implementation

- forking Hardware Objects process

- use of gipc for gevent-friendly fork + gevent-friendly IPC

- gevent-based keep-alive
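The detection cases above can be sketched with a keep-alive loop. This is a minimal sketch, not the proposed implementation: it uses the standard library `multiprocessing` module instead of gipc and gevent, and `hwobj_worker`, `start_worker` and `check` are hypothetical names. The "hang" and "crash" commands exist only to simulate faults.

```python
import multiprocessing as mp
import os
import signal
import time


def hwobj_worker(conn):
    """Child process: answer pings; 'hang' and 'crash' simulate faults."""
    while True:
        msg = conn.recv()
        if msg == "ping":
            conn.send("pong")
        elif msg == "hang":
            time.sleep(3600)  # blocked, e.g. on unresponsive hardware
        elif msg == "crash":
            os.kill(os.getpid(), signal.SIGSEGV)  # die as a segfault would


def start_worker():
    """Fork the worker and return (process, parent end of the pipe)."""
    ctx = mp.get_context("fork")  # POSIX only
    parent_conn, child_conn = ctx.Pipe()
    proc = ctx.Process(target=hwobj_worker, args=(child_conn,), daemon=True)
    proc.start()
    child_conn.close()  # so the parent sees EOF once the child is gone
    return proc, parent_conn


def check(proc, conn, timeout=1.0):
    """Classify the worker as 'ok', 'unresponsive' or 'dead'."""
    if not proc.is_alive():
        return "dead"
    try:
        conn.send("ping")
        if conn.poll(timeout) and conn.recv() == "pong":
            return "ok"
    except (EOFError, OSError):
        return "dead"  # pipe broke: the process went away mid-check
    return "unresponsive"


if __name__ == "__main__":
    proc, conn = start_worker()
    print(check(proc, conn))                # worker answers the ping
    conn.send("hang")
    print(check(proc, conn, timeout=0.2))   # no answer within the timeout
    proc.terminate()
    proc.join(5)
    print(check(proc, conn), proc.exitcode)
```

After a crash, `proc.exitcode` is negative when the process was terminated by a signal (the negated signal number), which distinguishes a segfault from a clean exit. A monitor could use this to decide whether to restart the process and rebuild the proxies.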

Hardware Objects process debugging

- remote REPL inside the running process

- connect to it with telnet
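A remote REPL of this kind can be sketched with the standard library alone. This is an assumption-laden illustration, not the tool used in MXCuBE: `ReplHandler`, `start_repl`, `read_until_prompt` and `send_line` are hypothetical names, there is no authentication, and the server should only listen on localhost.

```python
import code
import contextlib
import io
import socket
import socketserver
import threading


class ReplHandler(socketserver.StreamRequestHandler):
    """Evaluate each received line in the namespace shared by the server."""

    def handle(self):
        console = code.InteractiveConsole(self.server.namespace)
        self.wfile.write(b">>> ")
        for raw in self.rfile:
            line = raw.decode("utf-8", "replace").rstrip("\r\n")
            out = io.StringIO()
            # Note: this swaps sys.stdout/sys.stderr process-wide
            # while the line runs.
            with contextlib.redirect_stdout(out), contextlib.redirect_stderr(out):
                more = console.push(line)
            self.wfile.write(out.getvalue().encode("utf-8"))
            self.wfile.write(b"... " if more else b">>> ")


def start_repl(namespace, host="127.0.0.1", port=0):
    """Serve a REPL over TCP in a background thread; port 0 picks a free port."""
    server = socketserver.ThreadingTCPServer((host, port), ReplHandler)
    server.daemon_threads = True
    server.namespace = namespace
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def read_until_prompt(sock):
    """Client helper: collect output up to the next prompt."""
    data = b""
    while not data.endswith((b">>> ", b"... ")):
        chunk = sock.recv(1024)
        if not chunk:
            break
        data += chunk
    return data.decode("utf-8")


def send_line(sock, line):
    """Client helper: send one line, return what the REPL printed."""
    sock.sendall(line.encode("utf-8") + b"\n")
    return read_until_prompt(sock)


if __name__ == "__main__":
    state = {"position": 42}
    server = start_repl({"state": state})
    with socket.create_connection(server.server_address) as sock:
        read_until_prompt(sock)
        print(send_line(sock, "state['position'] + 1"))
    server.shutdown()
```

With the server running, `telnet 127.0.0.1 <port>` gives an interactive prompt inside the live process, where the objects placed in the namespace can be inspected and modified. In a gevent-based application, `gevent.backdoor.BackdoorServer` provides a comparable facility.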

Embedding vs. distributed objects

  • Hardware Objects can communicate directly with hardware...

E.g. when a third-party library is used (LIMA for example), or when control is embedded directly in code using sockets, serial, USB, etc.

  • ... or delegate hardware access to distributed objects

E.g. TANGO, Sardana, EPICS

  • What is the best approach to avoid or recover from beamline control software failure?