21st of December, 2021

Current annoying problems and solutions

(BLISS <= 1.9.x)

Problem #1

Acquisition is a CPU hog

The reading() function in acquisition objects gets data from hardware

 

This happens in parallel (multiple acquisition objects per scan)

 

Hence, it is running in separate greenlets - it has to be cooperative

 

But we observed that this is not always the case, depending on the acquisition: it can lead to unexpected timeouts

[Diagram: timeline of two reading tasks, P201 and MUSST, each handling many very small packets. One reading task runs for a "long" time before the task switch, which puts the other one in the timeout risk zone; CPU usage is high. Keep in mind that in reality there are more than 2 tasks.]

There is no "fair scheduling" in gevent (like with most event loops)

"good device": goes fast but with a buffer

"bad device": no buffering, needs fast comm.

Solution to problem #1

1. Introducing sleep() calls == increasing latency

2. Buffering == bigger chunks of data, less often

3. Going multi-threaded == sharing load over multiple CPUs
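Option 2 can be illustrated with a minimal sketch in plain Python (not BLISS code): small packets are accumulated and handed over in bigger chunks, so the reading task iterates less often. The names `chunked` and `min_size` are hypothetical.

```python
from typing import Iterable, Iterator, List


def chunked(packets: Iterable[bytes], min_size: int) -> Iterator[bytes]:
    """Group small packets into chunks of at least `min_size` bytes."""
    buffer: List[bytes] = []
    buffered = 0
    for packet in packets:
        buffer.append(packet)
        buffered += len(packet)
        if buffered >= min_size:
            yield b"".join(buffer)
            buffer.clear()
            buffered = 0
    if buffer:
        # Flush whatever is left once the packet source is exhausted.
        yield b"".join(buffer)


# 1000 packets of 4 bytes each are regrouped into a few larger chunks.
small_packets = [b"\x00" * 4 for _ in range(1000)]
chunks = list(chunked(small_packets, min_size=1024))
```

The trade-off is the one stated above: a larger `min_size` means fewer, bigger hand-overs, but each piece of data becomes available later.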

Problem #2

Other unexpected timeouts

Causes for timeouts

 

1) network failure (UNLIKELY)

2) device failure (UNLIKELY)

3) busy event loop (case #1, see previous slides)

4) paused process due to blocking system call (for example, statvfs)

Solution for problem #2

Ensure there are no blocking system calls in BLISS or Writer !

 

=> will use gevent.threadpool to execute the blocking calls, or another solution

 

Apart from `os.statvfs` (which we were reminded of yesterday... unfortunately), we should not have any blocking calls in the BLISS library and shell
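A minimal sketch of the idea, assuming a Unix system: the blocking `os.statvfs` call is pushed to a worker thread so the event loop stays free. To keep the example standard-library only, asyncio is swapped in for gevent here; in a gevent process the call would go through gevent's thread pool instead. The helper name `free_bytes` is hypothetical.

```python
import asyncio
import os


async def free_bytes(path: str) -> int:
    """Return the free space under `path` without blocking the event loop."""
    loop = asyncio.get_running_loop()
    # os.statvfs may block (for example on a slow network file system),
    # so it runs in the loop's default thread pool, not in the loop itself.
    stats = await loop.run_in_executor(None, os.statvfs, path)
    return stats.f_bavail * stats.f_frsize


# Other tasks of the loop keep running while statvfs is pending.
free = asyncio.run(free_bytes("."))
```

Only the calling task waits for the result; a slow file system no longer pauses the whole process.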

 

Problem #3

Concurrency issues with multiple processes

How to share objects in multiple processes ?

Objects could be independent in each process,

state would be shared via peer-to-peer communication !

 

Objects would be locked via a distributed lock manager, to avoid concurrency issues

Back in 2015... BLISS design

Independent objects, p2p state exchange, DLM for locking

Like Hydroxychloroquine against COVID, this approach is appealing in vitro, but it is not the same business in vivo

1. "independent objects" connect themselves to devices, but most devices do not accept multiple connections !

 

 

2. State exchange has to be made right == cannot be "hand-made" case-by-case, need a systematic approach

3. Distributed Lock has to be robust and correct

Use cases for "distributed" BLISS

concurrent access problem

1) multiple sessions sharing hardware

2) executing background activity: regulation, monitoring

3) beamline graphical applications: users need to access a console, to type commands and to execute scripts

Problem #4

Embedding BLISS in GUIs

Embedding BLISS in ESRF GUIs

The new ESRF GUIs are all web-based (MXCuBE, BsxCuBE, Daiquiri == ID21, ID13, BM05, etc)

 

The strategy is to go to web applications

 

Today, embedding BLISS into a web application == using BLISS as a library == not having a shell == having an extra process for the CLI == falling into problem #3

Problem #5

TMUX

TMUX is painful for users and causes instability

Copy/paste problem

 

Intermittent 100% CPU consumption

 

Mouse interaction

Solution to problems #3, #4 and #5

BLISS web shell

  • server-side plugin to be activated by Daiquiri, or to be started stand-alone
    • based on Flask blueprint
  • client React component
    • web sockets, xtermjs

Access to multiple sessions by different tabs in the web browser (for example)

BLISS web shell demo

Problem #6

BLISS and Workflows

BLISS and EWOKS

BLISS has to be able to:

1. start workflows designed via Ewoks (JSON file)

2. wait for result(s)

3. get result(s), if needed

 

Note about the difference between jobs and tasks (DAU concept)

Workflows are made of tasks

 

Job == global execution of tasks

(could be a Workflow or something else)

Solution to problem #6: EWOKS object returning a promise

from ... import run_job

uses BLISS Scan Watcher internally

promise = run_job("...execute_workflow", args=("path/to/file.yaml",))

 

promise.get()

Ewoks workflow

JSON (or YAML) files for EWOKS could be part of Beacon configuration files

The actual "job runner" is not decided yet (Zocalo? Celery? Something else?)

Ewoks could push results back to Redis, like scans do == compatible with Writer, Flint
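The promise pattern itself can be sketched with the standard library only. This is not the Ewoks or BLISS API: `Promise`, this local `run_job` and the placeholder `execute_workflow` are hypothetical, and a thread pool stands in for the job runner that is still undecided.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Any, Callable, Optional, Tuple

# Hypothetical local job runner, standing in for the undecided real one.
_executor = ThreadPoolExecutor(max_workers=2)


class Promise:
    """Thin wrapper exposing a get() call on a background job."""

    def __init__(self, future: Future) -> None:
        self._future = future

    def get(self, timeout: Optional[float] = None) -> Any:
        # Blocks until the job has finished, then returns its result.
        return self._future.result(timeout=timeout)


def run_job(func: Callable[..., Any], args: Tuple[Any, ...] = ()) -> Promise:
    """Start func(*args) in the background and return immediately."""
    return Promise(_executor.submit(func, *args))


def execute_workflow(path: str) -> dict:
    # Placeholder: a real implementation would load and run the workflow.
    return {"workflow": path, "status": "done"}


promise = run_job(execute_workflow, args=("path/to/file.yaml",))
result = promise.get(timeout=10.0)
```

The caller can start a workflow, carry on with other work, and only call `get()` when the result is actually needed.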

Python data analysis scripts

Not BCU business

Problem #6

Data storage in Redis

Data flow

[Diagram: the channels of the acquisition chain are producers and publish data with XADD; consumers read it with XREAD]

  • data publishing using redis streams
    • 1 stream per data channel

Redis streams

individual string values

we arbitrarily limit streams to a maximum of 2048 string values

1 string value can correspond to multiple data events

BLISS client API: knows scans structure, how to read data
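The effect of the cap can be shown with a toy in-memory model, a sketch that does not talk to Redis. `CappedStream`, `add` and `read` are hypothetical names that only mimic XADD and XREAD. In Redis itself a cap like this can be expressed with the MAXLEN option of XADD; how BLISS enforces its limit is not stated here.

```python
from collections import deque
from typing import Deque, List, Tuple

MAX_VALUES = 2048  # the arbitrary per-stream limit quoted above


class CappedStream:
    """Toy in-memory model of one capped stream (not Redis itself)."""

    def __init__(self, max_values: int = MAX_VALUES) -> None:
        self._entries: Deque[Tuple[int, str]] = deque(maxlen=max_values)
        self._next_id = 0

    def add(self, value: str) -> int:
        """Append one string value; the oldest entry is dropped when full."""
        entry_id = self._next_id
        self._entries.append((entry_id, value))
        self._next_id += 1
        return entry_id

    def read(self, last_id: int = -1) -> List[Tuple[int, str]]:
        """Return the entries newer than `last_id` that are still stored."""
        return [entry for entry in self._entries if entry[0] > last_id]


stream = CappedStream()
for i in range(3000):
    # One string value may carry several data events, here 10 per value.
    stream.add(f"events {i * 10}..{i * 10 + 9}")

available = stream.read()
```

A consumer that starts late, or falls more than 2048 values behind, can no longer read the oldest entries from the stream.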

Redis keeps data for a small amount of time

How small ?

  • today (BLISS 1.9)
    • scan data stays for 5 minutes, starting from end of scan
      • enough time for Writer, Flint, or BLISS-based ODA
    • dataset and scan nodes: TTL is set to 24H when ICAT confirms ingestion, or at the end of the experiment
    • sample, proposal, parent nodes: TTL set to 24H at the end of experiment
  • before
    • scan data was staying for 24 hours
    • data policy nodes were deleted like scan data nodes

Redis keeps data for a small amount of time

Why ?

- people do not take care of Redis memory, and do not allocate "enough" memory

- in case of low memory condition, errors are reported to users, unfair blame on BLISS

- too many keys can make some operations on the global Redis contents slower (could be improved on the BLISS side, if needed)

It has always been said that Redis is transient storage, not permanent storage

Solution to problem #6

We need to define, beamline by beamline and experiment by experiment, what the optimal settings are

  • what is easily configurable today: Redis memory limit
  • what we can expose to users: the 5 min and 1 day TTL, the limit of 2048 values per channel stream

Part of the solution: future ESRF-approved way to access a shared memory space, backed by infrastructure ($$$),

ongoing "ODA" discussions with V. Favre-Nicolin

but... there is always a limit !

Conclusion

Since 2015 and the initial design of BLISS we have gained experience...

  • need to use multi-threading for some reading loops
  • forget about peer-to-peer state exchange between objects
  • Redis memory cannot store all scan data for a long time

... and there are new demands

  • BLISS to be used from web GUIs
  • workflows
  • online data analysis at a larger scale

Usage is better defined

  • TMUX is painful
  • there is still a need for multiple sessions (== views on the beamline) in parallel

Conclusion

So, BLISS needs to be adjusted accordingly

But there is no big "design change"

Improvements will need the help/participation of BCU + different people in Software Group

BLISS problems and solutions (2021)

By Matias Guijarro

A quick view on some identified problems in BLISS and proposed solutions
