2015 SpaceApps Data Projects that were awesome
Jason Duley
Big (Data + Think) 2
04-21-2015
Topics
- Why Data Challenge?
- The Challenge
- Highlighted Solutions
- Next Steps
- Production Usage
Big (Data + Think) 2 - jason.duley@nasa.gov
Why Data Challenge?
Open Data publishers have limited time, resources and domain knowledge to develop metadata that makes datasets more accessible and discoverable for citizens working to solve data-driven problems
Why Data Challenge?
Background:
- Many are interested in new ways to leverage open data and integrate these information assets into innovative databases, services and applications
Problem:
- Inconsistent metadata, such as keyword tagging, limits the ability of information engines such as Google to discover assets
The Challenge
Discover keywords to describe the potential, hidden, secondary uses of open data using any technique that might help discover new keywords. Some seedling ideas:
- Crowdsourcing Approaches
  - display data asset information online, query people about how the assets can be used
- Predictive Analytics or Machine-Learning
  - compare metadata and the data of one information asset to another in order to find new keywords
- Unique Identifier Analysis of Published Data
  - search on the web, discover who already used the data and for what purpose, then catalog it
We provided a Challenge Starter Toolkit
#humans, #datatreasurehunting, #advanced
Highlighted Solutions
20 projects submitted for the Data Treasure Hunt challenge
Notables:
- Degrees of Data
- Open Data Gold Digger
- Keyword Distillery
- NYSpaceTag
Degrees Of Data
Approach:
- Solve using Twitter API
- Leverage people tagging their tweets with #hashtags
- Search for tweets containing the input keyword as a hashtag
- Examine all the other hashtags as "relevant" keywords.
- Count the number of occurrences of each hashtag
- Average the counts to set a threshold
- Output any keyword whose count is over that threshold
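The counting and thresholding steps above can be sketched in a few lines. This is a minimal illustration, not the team's code: it assumes the tweets have already been fetched and are available as plain strings, so no Twitter API call is shown, and the names `related_hashtags` and `HASHTAG` are hypothetical.

```python
import re
from collections import Counter

# Hypothetical helper pattern: a hashtag is '#' followed by word characters.
HASHTAG = re.compile(r"#(\w+)")

def related_hashtags(tweets, keyword):
    """Return hashtags that co-occur with `keyword` more often than average.

    `tweets` is assumed to be an iterable of tweet texts already retrieved.
    """
    keyword = keyword.lower().lstrip("#")
    counts = Counter()
    for text in tweets:
        # Count each hashtag at most once per tweet.
        tags = {t.lower() for t in HASHTAG.findall(text)}
        if keyword in tags:
            counts.update(tags - {keyword})
    if not counts:
        return []
    # The threshold is the mean occurrence count across co-occurring hashtags.
    threshold = sum(counts.values()) / len(counts)
    return sorted(tag for tag, n in counts.items() if n > threshold)
```

Using the mean as the cut-off means the output shrinks or grows with the spread of the counts; if every hashtag occurs equally often, nothing exceeds the threshold and the result is empty.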
Open Data Gold Digger
Approach:
- Focused on taking the conglomerates of raw data and mining this data into smaller streams of useful, relevant data
- Search for tweets containing the input keyword as a hashtag
- Examine all the other hashtags as "relevant" keywords.
- Count the number of occurrences of each hashtag
- Average the counts to set a threshold
- Output any keyword whose count is over that threshold
Keyword Distillery
Approach:
- Keywords were given scaled weights by inverting the number of search results returned
- Public data was then crawled, and each dataset was scanned for matching keywords
- Keyword frequency was calculated and stored in the relationship map
- Frequency was calculated by dividing the total number of keyword instances by the length of the document
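The two calculations described above can be sketched as follows. The slides do not say how the weights were scaled or how document length was measured, so this sketch assumes weights are normalized to sum to 1 and that length is a word count; `keyword_weights` and `keyword_frequency` are hypothetical names, and only single-word keywords are handled.

```python
import re

def keyword_weights(result_counts):
    """Weight each keyword by the inverse of its search-result count.

    Assumption: weights are scaled so that they sum to 1.
    Keywords with zero results are skipped.
    """
    inverse = {k: 1.0 / n for k, n in result_counts.items() if n > 0}
    total = sum(inverse.values())
    if not inverse:
        return {}
    return {k: v / total for k, v in inverse.items()}

def keyword_frequency(keyword, document):
    """Keyword instances divided by document length.

    Assumption: length is measured in words.
    """
    words = re.findall(r"\w+", document.lower())
    if not words:
        return 0.0
    return words.count(keyword.lower()) / len(words)
```

With inverse weighting, a rare keyword (few search results) counts for more than a common one, which is the same intuition behind inverse document frequency.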
NYSpaceTag
Approach:
- Using NLP scripts, 16k NASA datasets were processed and keywords were generated from dataset titles and descriptions
- Built a Recommendation Engine
  - devised a metric for measuring similarity of projects
  - based on synonym sets
- Visual Search Engine
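The slides do not describe the similarity metric itself. As one possible reading of "based on synonym sets", the sketch below expands each project's keywords with their synonyms and compares the expanded sets with Jaccard similarity. The metric, the `synonyms` lookup table, and the function names are all assumptions made for illustration; a real system would likely draw synonyms from a lexical database.

```python
def expand(keywords, synonyms):
    """Return the keywords plus any synonyms found in the lookup table."""
    expanded = set()
    for word in keywords:
        word = word.lower()
        expanded.add(word)
        expanded.update(synonyms.get(word, ()))
    return expanded

def similarity(keywords_a, keywords_b, synonyms):
    """Jaccard similarity of two synonym-expanded keyword sets (assumed metric)."""
    set_a = expand(keywords_a, synonyms)
    set_b = expand(keywords_b, synonyms)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0
```

For example, with a table that maps "moon" and "lunar" to each other, two projects tagged "moon, crater" and "lunar, crater" would score as identical even though they share only one literal keyword.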
Next Steps
- Work with Project team(s) to mature software
- Compare keyword generation to Alchemy API based approaches
- Package features as services
- Integrate into production use
- Calculate and show upper management the ROI from SpaceApps

SAC Data Projects
By Jason Duley
From the 2015 SpaceApps Challenge code-a-thon event, and specifically for the Data Treasure Hunt Challenge, we've identified a few projects created by our participants worth highlighting.