Beamline control software fault tolerance for MXCuBE 3
10th MXCuBE meeting, 16-18 January 2017, Grenoble, France
Presented by Matias Guijarro, Beamline Control Unit, Software Group, ESRF
Beamline control software can fail in many ways
- Unresponsive hardware: can't communicate, risk of blocking
- Restarted devices: failed, missing or incomplete reinitialization
- Unexpected hardware state: breaking state machines, can become unresponsive
- Segfault from underlying third-party control library: exiting, no clean up
- Uncaught exceptions + our own bugs: complete havoc
Failures should be handled at the control software level...
...but they still happen all the time
MXCuBE layers of control
[Diagram: layers of control]
- User interface level
- Python hardware objects
- Control layer: TANGO, Sardana, EPICS, Tine, spec, BLISS, a library (e.g. LIMA), or direct access (socket, serial, USB...)
- Hardware
MXCuBE layers of control
- What happens in case of beamline control software failure?
- Quite often, MXCuBE has to be restarted
Particular case of MXCuBE 3
[Diagram: layers of control in MXCuBE 3]
- User interface (front-end)
- Web application server (back-end)
- Python hardware objects
- Control layer: TANGO, Sardana, EPICS, Tine, spec, BLISS, a library (e.g. LIMA), or direct access (socket, serial, USB...)
- Hardware
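How to improve this situation?
Proposal: isolate Hardware Objects in their own process
[Diagram: proposed layers]
- User interface (front-end)
- Web application server (back-end)
- ? (how the server reaches the hardware objects is the open question)
- Python hardware objects, in their own process
- Control layer: TANGO, Sardana, EPICS, Tine, spec, BLISS, a library (e.g. LIMA), or direct access (socket, serial, USB...)
- Hardware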
How?
Hardware Objects living in their own process
- Objects proxying
- Communication between calling code (server) and hardware objects
  - RPC
  - Serialization
  - Events
- Hardware objects process monitoring
- Debugging
Objects proxying
- Minimal changes to calling code (see __init__.py from MXCuBE 3)
...
hwr_directory = cmdline_options.hwr_directory
hwr = hwr.HardwareRepository(os.path.abspath(os.path.expanduser(hwr_directory)))
hwr.connect()
...
def complete_initialization(app):
    app.beamline = hwr.getHardwareObject(cmdline_options.beamline_setup)
    app.session = app.beamline.getObjectByRole("session")
    app.collect = app.beamline.getObjectByRole("collect")
    Utils.enable_snapshots(app.collect)
    app.diffractometer = app.beamline.getObjectByRole("diffractometer")
    if getattr(app.diffractometer, 'centring_motors_list', None) is None:
        # centring_motors_list is the list of roles corresponding to diffractometer motors
        app.diffractometer.centring_motors_list = app.diffractometer.getPositions().keys()
    app.db_connection = app.beamline.getObjectByRole("lims_client")
    app.empty_queue = pickle.dumps(hwr.getHardwareObject(cmdline_options.queue_model))
    app.sample_changer = app.beamline.getObjectByRole("sample_changer")
    app.rest_lims = app.beamline.getObjectByRole("lims_rest_client")
    app.queue = qutils.new_queue()

# Proposed: start the Hardware Objects process instead of creating the repository in-process
isolatedHwr = startHWRProcess(os.path.abspath(os.path.expanduser(hwr_directory)))
# Proposed: ask for a proxy instead of calling hwr.getHardwareObject(cmdline_options.beamline_setup)
isolatedHwr.getProxy(cmdline_options.beamline_setup)
'getObjectByRole' has to return a proxy, too
Objects proxying in action
[Diagram: call flow for mxcube.beamline.getObjectByRole("session")]
- An incoming request reaches the web application server (back-end)
- The call goes through inter-process communication to the beamline hwobj in the Python hardware objects process
- The beamline hwobj returns the session hwobj as its return value
- Is the value a hwobj? If so, return a proxy
Return values have to be serialized. If a return value is a Hardware Object, a proxy to this object has to be returned to the caller instead.
Beware of unserializable values, e.g. greenlet objects
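The proxy rule above can be sketched in a few lines. This is only an illustration, not MXCuBE code: HardwareObject, HardwareProcess and Proxy are hypothetical names, and the "other process" is simulated in the same interpreter, with pickled bytes standing in for the inter-process channel.

```python
import pickle


class HardwareObject:
    """Hypothetical stand-in for an MXCuBE hardware object."""

    def __init__(self, name, children=None):
        self.name = name
        self._children = children or {}

    def getObjectByRole(self, role):
        return self._children[role]

    def get_name(self):
        return self.name


class HardwareProcess:
    """Plays the role of the isolated process that owns the real objects."""

    def __init__(self):
        self._objects = {}

    def register(self, hwobj):
        # Keep the object alive here and hand out an integer reference.
        self._objects[id(hwobj)] = hwobj
        return id(hwobj)

    def handle(self, request_bytes):
        obj_id, method, args = pickle.loads(request_bytes)
        result = getattr(self._objects[obj_id], method)(*args)
        if isinstance(result, HardwareObject):
            # Never ship the hardware object itself: send a reference.
            return pickle.dumps(("proxy", self.register(result)))
        return pickle.dumps(("value", result))


class Proxy:
    """Caller-side object: forwards every method call as a pickled request."""

    def __init__(self, process, obj_id):
        self._process = process
        self._obj_id = obj_id

    def __getattr__(self, method):
        def call(*args):
            request = pickle.dumps((self._obj_id, method, args))
            kind, payload = pickle.loads(self._process.handle(request))
            if kind == "proxy":
                return Proxy(self._process, payload)
            return payload

        return call


process = HardwareProcess()
session = HardwareObject("session")
beamline = HardwareObject("beamline", {"session": session})

beamline_proxy = Proxy(process, process.register(beamline))
session_proxy = beamline_proxy.getObjectByRole("session")
session_name = session_proxy.get_name()
```

Here getObjectByRole returns a Proxy rather than the session object, while get_name returns a plain string. Only tuples of integers and strings ever cross the simulated boundary.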
Communication between caller and hardware objects
- Which RPC protocol to use?
  - TANGO
  - XML-RPC or equivalent (JSON-RPC, MessagePack, Thrift, etc.)
  - Something home-brewed
- Serialization
  - want to be able to pass Python structures (lists, dicts, objects, etc.)
  - pickle (native Python serialization)
- Events
  - nice Flask-SocketIO feature: message queue (based on Redis)
  - any message from any process sent to the Redis queue is forwarded to clients
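A minimal sketch of the serialization point, assuming pickle is the chosen format: nested Python structures survive a round trip, and a guard can catch values that cannot be pickled before they are sent. The is_serializable helper is hypothetical, and a threading.Lock is used as a stand-in for an unserializable value such as a greenlet, so that the example needs only the standard library.

```python
import pickle
import threading


def is_serializable(value):
    """Return True if pickle can serialize `value` (hypothetical helper)."""
    try:
        pickle.dumps(value)
        return True
    except (pickle.PicklingError, TypeError, AttributeError):
        return False


# Nested Python structures round-trip unchanged.
payload = {
    "motors": ["phi", "kappa", "sampx"],
    "position": {"phi": 12.5, "kappa": 0.0},
    "limits": (-180.0, 180.0),
}
restored = pickle.loads(pickle.dumps(payload))

# A lock stands in for a greenlet: neither can be pickled.
lock_ok = is_serializable(threading.Lock())
payload_ok = is_serializable(payload)
```

Such a check could be run on return values in the hardware objects process, so that an unserializable value produces a clear error instead of a broken reply.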
Hardware Objects process monitoring
- Cornerstone of all this
- Need to be able to detect and react to:
  - process killed / restarted
  - process becoming unresponsive
  - dead process (segfault)
- Possible implementation
  - forking Hardware Objects process
  - use of gipc for gevent-friendly fork + gevent-friendly IPC
  - gevent-based keep-alive
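The detection logic can be sketched as follows. This is not the gipc/gevent implementation proposed above: it uses the standard-library subprocess module instead, and start_process and check_health are hypothetical names. The heartbeat timestamp is supplied by the caller; in a real setup it would be refreshed each time a keep-alive reply arrives.

```python
import subprocess
import sys
import time


def start_process(code):
    """Start a child Python interpreter running `code`.

    Stand-in for launching the Hardware Objects process.
    """
    return subprocess.Popen([sys.executable, "-c", code])


def check_health(proc, last_heartbeat, now, timeout=2.0):
    """Classify the child as 'dead', 'unresponsive' or 'alive'."""
    if proc.poll() is not None:
        # The process has exited. On POSIX a negative return code means
        # it was terminated by a signal (e.g. -11 for a segfault).
        return "dead"
    if now - last_heartbeat > timeout:
        # Still running, but no keep-alive reply within the timeout.
        return "unresponsive"
    return "alive"


# A child that exits abruptly, without any clean up.
crashed = start_process("import os; os._exit(11)")
crashed.wait(timeout=30)
crashed_state = check_health(crashed, time.monotonic(), time.monotonic())

# A child that keeps running but whose last heartbeat is 10 s old.
hung = start_process("import time; time.sleep(60)")
now = time.monotonic()
hung_state = check_health(hung, last_heartbeat=now - 10.0, now=now)
fresh_state = check_health(hung, last_heartbeat=now, now=now)

# Reacting to an unresponsive process: kill it (a restart would follow).
hung.kill()
hung.wait()
```

A supervisor loop would call check_health periodically and restart the process on "dead" or "unresponsive"; proxies held by the server then need to be reconnected to the new process.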
Hardware Objects process debugging
- Enable gevent BackdoorServer?
  - remote REPL inside the running process
  - connect to it with telnet
Embedding vs. distributed objects
- Hardware Objects can communicate directly with hardware...
  E.g. when a third-party library is used (LIMA for example), or when control is embedded directly in code using sockets, serial, USB, etc.
- ... or delegate hardware access to distributed objects
  E.g. TANGO, Sardana, EPICS
- What is the best approach to avoid or recover from beamline control software failure?