Climate Data Visualization

  • Data Visualization is one of the most difficult, most expensive and fastest-growing areas of computing with respect to big data
  • Climate data sets are unique and do not resemble industry data sets, which makes using 3rd-party tools more challenging
  • NCEI, unlike industry, faces resourcing challenges of many kinds
  • NCEI, as an institution, does not have a culture or history of building big-data full-stack systems

       GST SciTech    -    George Kierstein    -     12/21/2015

State of the Art

Lambda Architectures

Fact-Based

  • Events that impact the System's ability to perform its task are recorded as a 'Fact'

    EX:  Event: New Data
                Timestamp: 12/21/2015 10:35:04 PM UTC
                URL: https://dataserver.ncei.gov/data=B1&range=...
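
As a rough illustration of the 'Fact' above, the sketch below models one as a frozen Python dataclass. The class name `Fact`, its field names and the placeholder URL are assumptions made for this example; the URL on the slide is elided and is not reproduced here.

```python
from dataclasses import dataclass, FrozenInstanceError
from datetime import datetime, timezone


@dataclass(frozen=True)
class Fact:
    """One recorded event; frozen so it cannot be modified after creation."""
    event: str           # kind of event, e.g. "New Data"
    timestamp: datetime  # when the event occurred (UTC)
    url: str             # where the affected data can be found


# Placeholder URL (hypothetical), standing in for the elided one on the slide.
new_data = Fact(
    event="New Data",
    timestamp=datetime(2015, 12, 21, 22, 35, 4, tzinfo=timezone.utc),
    url="https://example.invalid/data",
)

# Any attempt to alter a recorded fact is rejected by the frozen dataclass.
try:
    new_data.event = "Changed"
    mutation_rejected = False
except FrozenInstanceError:
    mutation_rejected = True
```

Recording the timestamp with an explicit UTC timezone avoids ambiguity when facts from several systems are compared.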

Space/Time vs. Compute Cost

  • Data Visualization has a high compute cost.
  • One can trade space/time costs for compute costs by making architectural choices

Lambda Architectures

Fact-Based, Immutable

  • Facts are carefully stored as immutable records
  • Anything else the system needs to fulfill its requirements is computed and stored as a disposable helper that addresses performance.
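
The split between immutable facts and disposable helpers might look like the sketch below. It is only an illustration: the `Fact` tuple, the dataset id `B2` and the helper `build_update_counts` are hypothetical, and a real system would keep its fact log in durable storage rather than in a Python tuple.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class Fact(NamedTuple):
    """Immutable record of one event (defined here so the block stands alone)."""
    event: str
    dataset: str


# The fact log is the only authoritative state in this sketch.
FACT_LOG = (
    Fact("New Data", "B1"),
    Fact("New Data", "B2"),
    Fact("New Data", "B1"),
)


def build_update_counts(facts: Iterable[Fact]) -> Counter:
    """Derive a disposable helper: number of 'New Data' events per dataset."""
    return Counter(f.dataset for f in facts if f.event == "New Data")


view = build_update_counts(FACT_LOG)  # computed helper, kept for speed
view = None                           # it can be discarded at any time...
view = build_update_counts(FACT_LOG)  # ...and rebuilt from the facts alone
```

Because the helper can always be recomputed, a bug in it costs a rebuild rather than lost data.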

Lambda Architectures

Fact-Based, Immutable

  • Flexible re-use of system components can accommodate highly complex sets of system requirements.

Lambda Architectures

Numerous design advantages vs. traditional siloed architectures

(i.e. single-use, DB-centric system designs)

  • Robust and Flexible (easy re-use, easy large-scale multi-system deployment and re-configuration)
  • Human fault-tolerant
  • Inherently distributed, de-coupled and language agnostic
  • Complete provenance of system activity, as well as of the origin of data, is an emergent property
  • Interoperability with modern systems
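
To make the provenance point concrete: if every derived product lists the facts it was computed from, its origin can be read back without any extra bookkeeping. The sketch below assumes this; the names `Fact`, `Derived`, `provenance` and the sample entries are all hypothetical.

```python
from typing import Dict, List, NamedTuple, Tuple


class Fact(NamedTuple):
    fact_id: int
    event: str
    source: str  # where the data or action originated


class Derived(NamedTuple):
    name: str
    inputs: Tuple[int, ...]  # ids of the facts this product was computed from


# Made-up fact log for illustration only.
FACTS: Dict[int, Fact] = {
    1: Fact(1, "New Data", "dataset B1"),
    2: Fact(2, "New Data", "dataset B2"),
    3: Fact(3, "Reprocessed", "dataset B1"),
}

plot = Derived(name="B1 overview plot", inputs=(1, 3))


def provenance(product: Derived) -> List[str]:
    """List, in fact-id order, the facts a derived product depends on."""
    return [
        f"{FACTS[i].event} from {FACTS[i].source}"
        for i in sorted(product.inputs)
    ]


history = provenance(plot)
```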

Internal Deployment

Alternatives

'Home Rolled'

- Primary focus is on internal development, support and deployment

3rd-Party Systems

- Primary focus is on internal deployment. Low support overhead. Little or no improvement overhead.

Incremental Development

- Primary focus on development of re-usable organization-wide systems that underpin any modern visualization system. Once mature, visualization components can be developed and deployed organization-wide.

Internal Deployment

Giovanni

Best short-term option for low-cost, low-overhead visualization capabilities deployed in-house.

  • Stable, Well-funded development team.
  • Mature code-base with external deployment and usage as design focus.
  • Uses modern deployment tools and integration affordances (Docker, plugin APIs).
  • Not easily scalable, but addressing this is one area of focus for the Giovanni team.
  • 'Federated Giovanni' will offer community-driven upsides for free.

Internal Deployment

Incremental

Best long-term option for stable, robust and affordable visualization capabilities.

  • Computing derived data from big data sets (for visualization or use in another part of the system) requires a standard set of system capabilities:
          - API for data discovery   
          - API for data retrieval  
        (Data Access subsetting provides part of the required capabilities.)
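
The two capabilities listed above could be sketched as a pair of calls on a small in-memory catalog. This is not the Data Access API or any existing NCEI interface; `DatasetRecord`, `InMemoryCatalog`, `discover`, `retrieve` and the sample values are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class DatasetRecord:
    dataset_id: str
    title: str
    values: Tuple[float, ...]


class InMemoryCatalog:
    """Toy stand-in for a service offering discovery and retrieval."""

    def __init__(self, records: List[DatasetRecord]) -> None:
        self._records: Dict[str, DatasetRecord] = {
            r.dataset_id: r for r in records
        }

    def discover(self, keyword: str) -> List[str]:
        """Discovery: ids of datasets whose title mentions the keyword."""
        needle = keyword.lower()
        return sorted(
            dataset_id
            for dataset_id, record in self._records.items()
            if needle in record.title.lower()
        )

    def retrieve(
        self, dataset_id: str, start: int = 0, stop: Optional[int] = None
    ) -> Tuple[float, ...]:
        """Retrieval: a subset (slice) of one dataset's values."""
        return self._records[dataset_id].values[start:stop]


# Made-up sample data, not real measurements.
catalog = InMemoryCatalog([
    DatasetRecord("B1", "Surface temperature (sample)", (1.0, 2.0, 3.0, 4.0)),
    DatasetRecord("B2", "Precipitation (sample)", (0.0, 0.5)),
])

matches = catalog.discover("temperature")
subset = catalog.retrieve(matches[0], start=1, stop=3)
```

Keeping discovery and retrieval as separate calls lets a visualization component find data first and fetch only the subset it needs.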

Internal Deployment

Incremental

Best long-term option for stable, robust and affordable visualization capabilities.

  • The long time-scale is both an organizational challenge and an opportunity to maximize organizational efforts in a sustainable way (i.e. leveraging a broad set of internal efforts such as Data Access, OneArchive, etc.).

External Deployment

Wholesale externalization of CDR data

i.e. warehousing environmental data in bulk with large providers such as Amazon, Google, etc.

  • Efforts are slow-moving and have generally met with less enthusiasm than hoped, according to some involved in the 'Big Data' initiative.
  • This is understandable: these organizations have no obvious, quantitatively convincing market-based motivation to invest heavily in infrastructure in the short to medium term.

External Deployment

PlanetOS

i.e. Collaboration with a modern organization to vet our efforts with a profitable customer in the environmental data space.

  • High-level design guidance and development support with an actual customer who is experienced with environmental data.
  • Already part of the 'Big Data' initiative
  • Offers some level of visualization capabilities 'for free' as a partner.
  • Profitable (and well-funded) but small enough to be invested and responsive in a collaboration.

Recommendations

Internal Efforts

- Short-term: Giovanni (Azure cloud)

- Long-term: Organizational-scale adoption of Lambda architecture principles backed by micro-services that are used both internally for development and externally by partners and the public.

External Efforts

- Collaboration with Planet OS to define and develop discovery and access APIs that outline organizational-scale micro-service capabilities and provide a principled set of architectural design attributes that save work, time, and money.
