Climate Data Visualization
- Data Visualization is one of the most difficult, most expensive and fastest-growing areas of big-data computing
- Climate data sets are unique and don't resemble industry data sets, which makes using 3rd-party tools more challenging
- NCEI, unlike industry, faces resourcing challenges of many kinds
- NCEI, as an institution, does not have a culture or history of building big-data full-stack systems
GST SciTech - George Kierstein - 12/21/2015
State of the Art
Lambda Architectures
Fact-Based
- Events that impact the System's ability to perform its task are recorded as a 'Fact'
EX: Event: New Data
Timestamp: 12/21/2015 10:35:04 PM UTC
URL: https://dataserver.ncei.gov/data=B1&range=...
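A minimal sketch of how the 'New Data' fact above might be represented in code. The class name `Fact`, its field names, and the placeholder URL are assumptions made for illustration; the slides do not specify a storage format.

```python
from dataclasses import dataclass, FrozenInstanceError
from datetime import datetime, timezone


@dataclass(frozen=True)
class Fact:
    """One recorded event; frozen so it cannot be altered after creation."""
    event: str
    timestamp: datetime
    url: str


# Event name and timestamp mirror the slide's example.
# The URL is a hypothetical placeholder, not the real (elided) one.
fact = Fact(
    event="New Data",
    timestamp=datetime(2015, 12, 21, 22, 35, 4, tzinfo=timezone.utc),
    url="https://example.invalid/data",
)

try:
    fact.event = "Changed"  # any attempt to rewrite a fact is rejected
except FrozenInstanceError:
    print("facts cannot be modified once recorded")
```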
Space/Time vs. Compute Cost
- Data Visualization has a high compute cost.
- One can trade space/time costs for compute costs by making architectural choices
Lambda Architectures
Fact-Based, Immutable
- Facts are carefully stored as immutable
- All additional actions needed to ensure the system fulfills its requirements are computed and stored as disposable helpers that address performance.
- Flexible re-use of system components can accommodate highly complex sets of system requirements.
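The split between immutable facts and disposable helpers can be sketched as an append-only log plus a view that is always rebuildable from it. The names `StoredFact`, `FactLog` and `build_view`, and the sample dataset labels, are assumptions for illustration only.

```python
from collections import Counter
from typing import NamedTuple


class StoredFact(NamedTuple):
    """An immutable record of one event (hypothetical shape)."""
    event: str
    dataset: str


class FactLog:
    """Append-only store: facts can be added and read, never edited."""

    def __init__(self) -> None:
        self._facts: list[StoredFact] = []

    def append(self, fact: StoredFact) -> None:
        self._facts.append(fact)

    def facts(self) -> tuple[StoredFact, ...]:
        return tuple(self._facts)


def build_view(facts: tuple[StoredFact, ...]) -> Counter:
    """Disposable helper: number of 'New Data' events per dataset."""
    return Counter(f.dataset for f in facts if f.event == "New Data")


log = FactLog()
log.append(StoredFact("New Data", "B1"))
log.append(StoredFact("New Data", "B1"))
log.append(StoredFact("New Data", "B2"))

view = build_view(log.facts())  # can be thrown away and recomputed at will
```

Because the view is derived entirely from the log, a bug in `build_view` can be fixed and the view recomputed without any loss of the underlying facts.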
Lambda Architectures
Numerous design advantages vs. traditional silo-ed architectures
(i.e. Single-use, db-centric system designs)
- Robust and Flexible (easy re-use, easy large-scale multi-system deployment and re-configuration)
- Human fault-tolerant
- Inherently distributed, de-coupled and language agnostic
- Complete provenance of system activity, as well as of the origin of data, is an emergent property
- Interoperates with modern systems
Internal Deployment
Alternatives
'Home Rolled'
- Primary focus is on internal development, support and deployment.
3rd-Party Systems
- Primary focus is on internal deployment. Low support overhead. Little or no improvement overhead.
Incremental Development
- Primary focus is on development of re-usable organization-wide systems that underpin any modern visualization system. Once mature, visualization components can be developed and deployed organization-wide.
Internal Deployment
Giovanni
Best short-term option for low-cost, low-overhead visualization capabilities deployed in-house.
- Stable, Well-funded development team.
- Mature code-base with external deployment and usage as design focus.
- Uses modern deployment tools and integration affordances (Docker, plugin APIs).
- Not easily scalable, but addressing this is one area of focus for the Giovanni team.
- 'Federated Giovanni' will offer community-driven upsides for free.
Internal Deployment
Incremental
Best long-term option for stable, robust and affordable visualization capabilities.
- Computing derived data from big data sets (for visualization or use in another part of the system) requires a standard set of system capabilities:
- API for data discovery
- API for data retrieval
(Data Access subsetting provides part of required capabilities.)
- The long time-scale is both an organizational challenge and an opportunity to maximize organizational efforts in a sustainable way (i.e. leveraging a broad set of internal efforts such as Data Access, OneArchive, etc.).
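The discovery and retrieval capabilities listed for the incremental option could be sketched as two small operations over a catalog. Everything here is hypothetical: the class and method names, the in-memory storage, and the sample values are made up for illustration and do not describe an existing NCEI API.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class DatasetInfo:
    """Catalog entry describing one data set (hypothetical fields)."""
    dataset_id: str
    variable: str


class InMemoryCatalog:
    """Toy stand-in for discovery and retrieval services."""

    def __init__(self, datasets, records):
        self._datasets = list(datasets)
        self._records = dict(records)  # dataset_id -> [(date, value), ...]

    def discover(self, variable: str) -> list[DatasetInfo]:
        """Data discovery: which data sets carry this variable?"""
        return [d for d in self._datasets if d.variable == variable]

    def retrieve(self, dataset_id: str, start: date, end: date):
        """Data retrieval with temporal subsetting (inclusive bounds)."""
        return [(day, value) for day, value in self._records[dataset_id]
                if start <= day <= end]


# Sample values are invented purely to exercise the interface.
catalog = InMemoryCatalog(
    datasets=[DatasetInfo("B1", "temperature"), DatasetInfo("P1", "precip")],
    records={"B1": [(date(2015, 1, 1), 1.5), (date(2015, 6, 1), 17.0),
                    (date(2015, 12, 1), 3.2)]},
)

found = catalog.discover("temperature")
subset = catalog.retrieve("B1", date(2015, 5, 1), date(2015, 12, 31))
```

Keeping discovery and retrieval as separate calls lets other parts of a system, including visualization, reuse the same two entry points.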
External Deployment
Wholesale externalization of CDR data
i.e. Warehousing environmental data in bulk with large providers such as Amazon, Google, etc.
- Efforts are slow-moving and have generally been less enthusiastic than hoped, according to some involved in the 'Big Data' initiative.
- Understandable, as there is no obvious, quantitatively convincing market-based motivation for these organizations to invest heavily in infrastructure in the short to medium term.
External Deployment
Planet OS
i.e. Collaboration with a modern organization to vet our efforts with a profitable customer in the environmental data space.
- High-level design guidance and development support with an actual customer who is experienced with environmental data.
- Already part of the 'Big Data' initiative
- Offers some level of visualization capabilities 'for free' as a partner.
- Profitable (and well-funded) but small enough to be invested and responsive in a collaboration.
Recommendations
Internal Efforts
- Short-term: Giovanni (Azure cloud)
- Long-term: Organizational-scale adoption of Lambda architecture principles backed by micro-services that are used both internally for development and externally by partners and the public.
External Efforts
- Collaboration with Planet OS to define and develop discovery and access APIs. These APIs outline organizational-scale micro-service capabilities and provide a principled set of architectural design attributes that save work, time and money.
NCEI Climate Data Visualization
By gatewayspectacle