CS 510: Computing for Scientists
Orientation and Tools
Assistant Professor Justin Dressel
Faculty of Mathematics, Physics, and Computation
Schmid College of Science and Technology
Why Scientific Computing?
- Scientists typically have no formal training in software engineering, but write code anyway
- Trained software engineers rarely encounter problems with mathematical or performance requirements as demanding as those in science
- Only by understanding both halves of this problem can we produce adequate solutions
Course Assumptions
- Scientists need computation now more than ever
- (Big) Data science is an exploding and related field
- Scientific computing tools are rapidly evolving
- Multi-language solutions are now commonplace
- Mobile and cloud-friendly standards are now dominant
- Scientists come from diverse (non-CS) backgrounds
Course Strategy
- The best defense is a good offense
- Scientists should be self-sufficient learners
- The best way to learn is to do
- Breadth for perspective, depth from experience
- Data-driven and test-driven development
Trichotomy of Data
Scientific problems are data-centric
- Modeling: mathematical objects, physical states, CS types/structures
- Processing: transformations/morphisms, evolution/state updates, algorithms/programs
- Visualization: plots/graphs/movies/arrays
Scientific Problem Solving
1. Identify a concrete question
2. Identify the data to be modeled/processed/visualized
3. Decompose the question into smaller problems
4. Unit test known cases to verify each small problem
5. Implement solutions, ensuring all tests pass correctly
6. Compose smaller solutions to answer main question
7. Verify correctness of solutions with test cases
8. Validate correctness of approach overall
9. Profile performance of solutions
10. Optimize:
  - rethink data structures (2.)
  - refactor common code (3.)
  - redesign efficient algorithms (5.)
  - reimplement slow parts as faster modules (5.)
  - retest to catch and prevent bugs (7-8.)
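As a minimal sketch of the "unit test known cases" and "implement" steps, consider one small sub-problem: approximating a definite integral. The function name `trapezoid_area`, its parameters, and the chosen test cases are illustrative assumptions, not course material. The tests encode answers that are known analytically, so they can be written before the implementation is trusted.

```python
import math


def trapezoid_area(f, a, b, n=1000):
    """Approximate the integral of f over [a, b] using n trapezoids.

    Hypothetical helper for illustration only.
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    h = (b - a) / n
    # Endpoints carry half weight; interior points carry full weight.
    total = 0.5 * (f(a) + f(b))
    for k in range(1, n):
        total += f(a + k * h)
    return total * h


def test_constant_function():
    # Known case: the integral of 2 over [0, 3] is 6.
    assert math.isclose(trapezoid_area(lambda x: 2.0, 0.0, 3.0), 6.0)


def test_linear_function():
    # Known case: the trapezoid rule is exact for straight lines.
    assert math.isclose(trapezoid_area(lambda x: x, 0.0, 1.0), 0.5)


def test_sine_half_period():
    # Known case: the integral of sin over [0, pi] is 2 (approximate here).
    assert math.isclose(trapezoid_area(math.sin, 0.0, math.pi), 2.0, rel_tol=1e-5)
```

A test runner such as pytest would discover the `test_*` functions automatically; calling them by hand works as well.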
Problem (why)
Interface (what)
Test Cases (verify)
Implementation (how)
- Data structure choice
- Algorithm choice
- Language choice
- Other optimizations
Modular Design
Each small problem is an encapsulated module
To answer the main question, many modules must work together
Changing the implementation should not affect the interface
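One way to picture this, as a sketch under assumed names: the interface is "given n, return the n-th Fibonacci number", the test cases pin down the expected behavior, and two interchangeable implementations sit behind the same name. The functions `fibonacci_recursive`, `fibonacci_iterative`, and `check_known_cases` are hypothetical and chosen only for illustration.

```python
# Test cases (verify): known answers, independent of any implementation.
KNOWN_CASES = {0: 0, 1: 1, 2: 1, 10: 55, 20: 6765}


def fibonacci_recursive(n):
    """First implementation (how): direct but slow for large n."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n < 2:
        return n
    return fibonacci_recursive(n - 1) + fibonacci_recursive(n - 2)


def fibonacci_iterative(n):
    """Replacement implementation (how): same interface, linear time."""
    if n < 0:
        raise ValueError("n must be non-negative")
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


def check_known_cases(implementation):
    """Run the shared test cases against any implementation of the interface."""
    for n, expected in KNOWN_CASES.items():
        assert implementation(n) == expected


# Interface (what): callers use this name and never see which version is behind it.
fibonacci = fibonacci_iterative
```

Because both versions pass `check_known_cases`, swapping one for the other changes performance without changing what callers observe.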
Language Paradigms
"Object-oriented"
- Structure is primary (classes/objects)
- Functions belong to data (object methods)
- Data arranged in hierarchy (inheritance)
- Examples: C++, Java, Python, Ruby, Julia, Rust
"Functional"
- Process is primary (functions)
- Data belong to functions (closures)
- Functions arranged in hierarchy (closures)
- Examples: Mathematica, Python, Haskell, Julia, R
"Logical"
- Constraints primary (rules/clauses)
- Data and functions belong to clauses
- Hierarchical clauses become unified
- Examples: Prolog, Mercury, Curry, Python (LogPy, Pyke)
"Procedural"
- Instructions primary, limited types of data
- Examples: C, MATLAB, Python, Julia, Go, Rust
(Others: concurrent, actor-based, data-flow, rule-based, symbolic, etc.)
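Since Python appears in several of the lists above, a small sketch can show one task written in three of these styles. The task (sum the squares of the even values) and all names below are illustrative assumptions; the logical paradigm is omitted because it would need a third-party library.

```python
from functools import reduce


def sum_even_squares_procedural(values):
    """Procedural: explicit instructions that update a running total."""
    total = 0
    for v in values:
        if v % 2 == 0:
            total += v * v
    return total


def sum_even_squares_functional(values):
    """Functional: compose filter and reduce, with no mutable state."""
    evens = filter(lambda v: v % 2 == 0, values)
    return reduce(lambda acc, v: acc + v * v, evens, 0)


class Sample:
    """Object-oriented: the data own the functions that act on them."""

    def __init__(self, values):
        self.values = list(values)

    def sum_even_squares(self):
        return sum(v * v for v in self.values if v % 2 == 0)


data = [1, 2, 3, 4, 5, 6]
results = (
    sum_even_squares_procedural(data),
    sum_even_squares_functional(data),
    Sample(data).sum_even_squares(),
)
print(results)  # all three agree: 4 + 16 + 36 = 56
```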
Language Abstraction Levels
- Hardware level (Binary, Machine code, Assembly): code executed by the hardware processor directly as procedural instructions
- Low-level native compilation (C, Go, Rust): cross-architecture abstraction, compiled to machine code before execution
- High-level native compilation (C++, Haskell, LISP): sophisticated and expressive abstraction, still compiles to machine code first
- Just-In-Time (JIT) compilation (Julia, Java, MATLAB): cross-platform compiled code, converted to machine code during execution
- Interpreted bytecode (Python, R, Mathematica): code read and executed by an interpreter, which can use compiled libraries
Rule of thumb: increasing abstraction away from hardware (from low-level to high-level) increases portability, simplifies coding, decreases efficiency
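To make the "interpreted bytecode" level concrete, the standard-library `dis` module can list the bytecode instructions that CPython produces for a function. This is only a sketch: the function `kinetic_energy` is a made-up example, and the exact instruction names differ between Python versions, so no particular output is claimed.

```python
import dis


def kinetic_energy(mass, speed):
    """Made-up example function: classical kinetic energy."""
    return 0.5 * mass * speed ** 2


# Collect the names of the bytecode instructions the interpreter will execute.
instructions = [ins.opname for ins in dis.get_instructions(kinetic_energy)]
print(instructions)
```

Each listed instruction is executed by the interpreter loop rather than directly by the processor, which is one source of the efficiency cost named in the rule of thumb.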
BUT: Efficiency is multi-faceted
Low-level code is difficult to write correctly, and scientific problems are often hard to solve
A more effective strategy: solve problem with high-level language, then...
Order of Importance:
- Rethink efficient algorithms and data structures
- Exploit modular design
  - solve problems at a high level, but use fast low-level libraries for speed
- Profile code, refactor, and reimplement libraries
  - isolate slow parts, and replace just those with faster low-level libraries
- Parallelize code
  - run independent pieces of code on many processors (or GPUs)
- Reimplement entire correct solution (rapid prototyping):
  - Replace interpreters with Just-In-Time (JIT) compilers (e.g., JVM, .NET)
  - Replace JIT with high-level native code-compilation (e.g., LLVM, C++)
  - Abandon high-level code for low-level native compilation (e.g., C, Rust)
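The "isolate slow parts, and replace just those" idea can be sketched with the standard library alone. Here a hand-written insertion sort stands in for a slow high-level module, and the built-in `sorted` (implemented in C in CPython) stands in for the fast low-level library. The helper names `insertion_sort`, `builtin_sort`, and `time_call` are assumptions made for this illustration, and timings will vary by machine.

```python
import random
import timeit


def insertion_sort(values):
    """Slow pure-Python module: quadratic-time insertion sort."""
    result = list(values)
    for i in range(1, len(result)):
        item = result[i]
        j = i - 1
        # Shift larger elements right until the slot for item is found.
        while j >= 0 and result[j] > item:
            result[j + 1] = result[j]
            j -= 1
        result[j + 1] = item
    return result


def builtin_sort(values):
    """Drop-in replacement with the same interface, backed by compiled code."""
    return sorted(values)


def time_call(func, values, repeats=3):
    """Return the best wall-clock time in seconds over a few repeats."""
    return min(timeit.repeat(lambda: func(values), number=1, repeat=repeats))


rng = random.Random(510)  # fixed seed so the sample is reproducible
sample = [rng.random() for _ in range(1000)]

# Retest first: the replacement must agree with the original.
assert insertion_sort(sample) == builtin_sort(sample)

print("insertion_sort:", time_call(insertion_sort, sample), "s")
print("builtin_sort:  ", time_call(builtin_sort, sample), "s")
```

For whole programs, a profiler such as the standard-library `cProfile` module helps find which function deserves this treatment before any rewriting starts.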
"Blah blah blah!"
What does this all mean in practice?
- We can get very far using high-level languages first, provided that we use effective design principles
- Learning several language-levels is beneficial
In this course, we will focus on three main languages:
- Python: multi-paradigm, easy-to-learn, popular
- Julia: multi-paradigm, easy-to-learn, efficient
- C: close to hardware, hard-to-learn, foundational
We will also briefly become acquainted with C++, the object-oriented evolution of the C language.
However, our main focus will be on the problem-solving and software-development process
Our Initial Toolkit
- Distributed Change Control: change-tracked code collaboration (in the cloud)
- Notebook-based Development: reproducible research, explorative problem-solving
GitHub Account
- Create a GitHub account
- Join the Organization: https://github.com/chapman-cs510-2017f
- Find/read course material in the info repository therein
CoCalc Account
- Create a CoCalc account linked to your GitHub account
- Send me your username on Slack, so I can add you to the course project
- Open your course project once it is available. Familiarize yourself with the interface and documentation
- Create a general-use terminal file, Terminal.term, for later use
Step 0: Get Tools
Slack Account
- Find your email invite to http://scststudents.slack.com
- Join Slack and join channel: #cs510-2017f
- Use Slack to communicate with instructor / colleagues
Blackboard Use
- Find links to classwork and assignments on Blackboard each week
CS 510 - Orientation
By Justin Dressel
In this Chapman University course in the Data and Computational Sciences (CADS) graduate program, you will be introduced to a broad array of powerful and inter-operable tools for exploring and solving scientific problems in a computational setting.