CS 510: Computing for Scientists

Orientation and Tools

Assistant Professor Justin Dressel

Faculty of Mathematics, Physics, and Computation

Schmid College of Science and Technology

Why Scientific Computing?

Scientists typically have no formal training in software engineering, but write code anyway

Trained software engineers rarely encounter problems whose mathematical or performance requirements are as demanding as those in science

Only by understanding both halves of this problem can we produce adequate solutions

Course Assumptions

Scientists need computation now more than ever
(Big) Data science is an exploding and related field
Scientific computing tools are rapidly evolving
Multi-language solutions are now commonplace
Mobile and cloud-friendly standards are now dominant
Scientists come from diverse (non-CS) backgrounds

Course Strategy

The best defense is a good offense
Scientists should be self-sufficient learners
The best way to learn is to do
Breadth for perspective, depth from experience
Data-driven and test-driven development

Trichotomy of Data

Modeling
- Mathematical objects
- Physical states
- CS types/structures

Processing
- Transformations/Morphisms
- Evolution/state updates
- Algorithms/programs

Visualization
- Plots/graphs/movies/arrays

Scientific problems are data-centric

Scientific Problem Solving

1. Identify a concrete question
2. Identify the data to be modeled/processed/visualized
3. Decompose the question into smaller problems
4. Unit test known cases to verify each small problem
5. Implement solutions, ensuring all tests pass correctly
6. Compose smaller solutions to answer the main question
7. Verify correctness of solutions with test cases
8. Validate correctness of approach overall
9. Profile performance of solutions
10. Optimize:
    1. rethink data structures (back to step 2)
    2. refactor common code (back to step 3)
    3. redesign efficient algorithms (back to step 5)
    4. reimplement slow parts as faster modules (back to step 5)
    5. retest to catch and prevent bugs (back to steps 7-8)
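
The workflow above can be sketched on a toy problem. This is a minimal illustration, not course material: the question ("what is the mean kinetic energy of a set of particles?"), the function names, and the test values are all assumptions chosen for the example.

```python
import math

# Question and data (hypothetical): the mean kinetic energy of a set of
# particles, each given as a (mass, speed) pair.

# Decompose and implement: one small function per small problem.
def kinetic_energy(mass, speed):
    """Kinetic energy (1/2) m v^2 of a single particle."""
    return 0.5 * mass * speed ** 2

def mean(values):
    """Arithmetic mean of a non-empty collection of numbers."""
    values = list(values)
    if not values:
        raise ValueError("mean of an empty collection is undefined")
    return sum(values) / len(values)

# Unit test known cases for each small problem.
def test_kinetic_energy():
    assert kinetic_energy(2.0, 3.0) == 9.0
    assert kinetic_energy(1.0, 0.0) == 0.0

def test_mean():
    assert mean([1.0, 2.0, 3.0]) == 2.0

# Compose the small solutions to answer the main question.
def mean_kinetic_energy(particles):
    return mean(kinetic_energy(m, v) for m, v in particles)

# Verify the composed solution against a case worked out by hand.
def test_mean_kinetic_energy():
    assert math.isclose(mean_kinetic_energy([(2.0, 3.0), (2.0, 1.0)]), 5.0)

if __name__ == "__main__":
    test_kinetic_energy()
    test_mean()
    test_mean_kinetic_energy()
    print("all tests pass")
```

Profiling and optimization come only after these tests pass, and the same tests are rerun after every change.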

Problem (why)

Interface (what)

Test Cases (verify)

Implementation (how)
- Data structure choice
- Algorithm choice
- Language choice
- Other optimizations

Modular Design

Each small problem is an encapsulated module

To answer the main question, many modules must work together

Changing the implementation should not affect the interface
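
As a small illustration of separating interface from implementation, the sketch below gives two interchangeable implementations of one hypothetical interface (a cumulative sum that takes a list and returns a list) and checks both against the same test cases. The function names are assumptions made for the example.

```python
from itertools import accumulate

def cumulative_sum_loop(values):
    """Implementation A: explicit loop with a running total."""
    total = 0
    result = []
    for v in values:
        total += v
        result.append(total)
    return result

def cumulative_sum_accumulate(values):
    """Implementation B: same interface, built on the standard library."""
    return list(accumulate(values))

def check_interface(cumulative_sum):
    """Test cases written against the interface, not an implementation."""
    assert cumulative_sum([]) == []
    assert cumulative_sum([5]) == [5]
    assert cumulative_sum([1, 2, 3]) == [1, 3, 6]

# Either implementation can be swapped in without changing the callers.
for implementation in (cumulative_sum_loop, cumulative_sum_accumulate):
    check_interface(implementation)
```

Because the tests only depend on the interface, they keep working when the implementation is later replaced by a faster one.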

Language Paradigms

"Object-oriented"
Structure is primary (classes/objects)
Functions belong to data (object methods)
Data arranged in hierarchy (inheritance)
Examples: C++, Java, Python, Ruby, Julia, Rust

"Functional"
Process is primary (functions)
Data belong to functions (closures)
Functions arranged in hierarchy (closures)
Examples: Mathematica, Python, Haskell, Julia, R

"Logical"
Constraints primary (rules/clauses)
Data and functions belong to clauses
Hierarchical clauses become unified
Examples: Prolog, Mercury, Curry, Python (LogPy, Pyke)

"Procedural"

Instructions primary, limited types of data
e.g.: C, MATLAB, Python, Julia, Go, Rust

(Others: concurrent, actor-based, data-flow, rule-based, symbolic, etc.)
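
Since Python is multi-paradigm, one small task can be written in several of these styles side by side. The sketch below is only illustrative: the task (sum the squares of the even numbers in a list) and all names are assumptions chosen for the example.

```python
from functools import reduce

data = [1, 2, 3, 4, 5, 6]

# Procedural: a sequence of instructions updating a running total.
def sum_even_squares_procedural(values):
    total = 0
    for v in values:
        if v % 2 == 0:
            total += v * v
    return total

# Object-oriented: the data are primary, and the function belongs to them.
class Samples:
    def __init__(self, values):
        self.values = list(values)

    def sum_even_squares(self):
        return sum(v * v for v in self.values if v % 2 == 0)

# Functional: the process is primary, built by composing small functions.
def sum_even_squares_functional(values):
    evens = filter(lambda v: v % 2 == 0, values)
    squares = map(lambda v: v * v, evens)
    return reduce(lambda a, b: a + b, squares, 0)

results = (
    sum_even_squares_procedural(data),
    Samples(data).sum_even_squares(),
    sum_even_squares_functional(data),
)
print(results)  # all three styles give the same answer: 4 + 16 + 36 = 56
```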

Language Abstraction Levels

Hardware level (Binary, Machine code, Assembly)
Code executed by the hardware processor directly as procedural instructions
Low-level native compilation (C, Go, Rust)
Cross-architecture abstraction, compiled to machine code before execution
High-level native compilation (C++, Haskell, LISP)
Sophisticated and expressive abstraction, still compiles to machine code first
Just-In-Time (JIT) compilation (Julia, Java, Matlab)
Cross-platform compiled code, converted to machine code during execution
Interpreted bytecode (Python, R, Mathematica)
Code read and executed by an interpreter, which can use compiled libraries

Rule of thumb: increasing abstraction away from hardware (from low-level to high-level) increases portability, simplifies coding, and decreases efficiency
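
The "interpreted bytecode" level can be inspected from within Python itself. The sketch below assumes the standard CPython interpreter and uses its `dis` module; the function `add_one` is a hypothetical example, and the exact instruction names printed vary between Python versions.

```python
import dis

def add_one(x):
    return x + 1

# The compiled bytecode is stored on the function's code object as raw bytes.
raw_bytecode = add_one.__code__.co_code
print(type(raw_bytecode), len(raw_bytecode))

# dis decodes those bytes into named instructions run by the interpreter.
opnames = [instruction.opname for instruction in dis.get_instructions(add_one)]
print(opnames)
```

These instructions are what the interpreter executes one at a time, which is one reason a pure-Python loop is typically slower than the same loop compiled to machine code.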

BUT: Efficiency is multi-faceted

Low-level code is difficult to write correctly, and scientific problems are often hard to solve

A more effective strategy: solve the problem with a high-level language, then...

Order of Importance:

1. Rethink efficient algorithms and data structures
2. Exploit modular design
   - solve problems at a high level, but use fast low-level libraries for speed
3. Profile code, refactor, and reimplement libraries
   - isolate slow parts, and replace just those with faster low-level libraries
4. Parallelize code
   - run independent pieces of code on many processors (or GPUs)
5. Reimplement entire correct solution (rapid prototyping):
   1. Replace interpreters with Just-In-Time (JIT) compilers (e.g., JVM, .NET)
   2. Replace JIT with high-level native code-compilation (e.g., LLVM, C++)
   3. Abandon high-level code for low-level native compilation (e.g., C, Rust)
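
As a small illustration of rethinking data structures before reaching for a lower-level language, the sketch below times two versions of the same function that differ only in the structure used for membership tests. The function names and input sizes are assumptions chosen for the example, and the timings depend on the machine, so none are quoted here.

```python
import timeit

def count_common_list(a, b):
    """Count items of a that also appear in b, scanning the list b each time."""
    return sum(1 for x in a if x in b)

def count_common_set(a, b):
    """Same interface, but membership tests use a hash-based set."""
    b_set = set(b)
    return sum(1 for x in a if x in b_set)

a = list(range(0, 2000, 2))  # even numbers below 2000
b = list(range(0, 2000, 3))  # multiples of 3 below 2000

# Correctness first: both versions must agree before speed is compared.
assert count_common_list(a, b) == count_common_set(a, b)

# Then measure: time several calls of each version.
for function in (count_common_list, count_common_set):
    seconds = timeit.timeit(lambda: function(a, b), number=5)
    print(f"{function.__name__}: {seconds:.4f} s for 5 calls")
```

The list version does work proportional to len(a) * len(b), while the set version does work roughly proportional to len(a) + len(b), so the gap widens as the inputs grow.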

"Blah blah blah!"
What does this all mean in practice?

We can get very far using high-level languages first, provided that we use effective design principles
Learning several language-levels is beneficial

In this course, we will focus on three main languages:
1. Python : multi-paradigm, easy-to-learn, popular
2. Julia : multi-paradigm, easy-to-learn, efficient
3. C : close to hardware, hard-to-learn, foundational

We will also briefly become acquainted with C++, the object-oriented evolution of the C language.

However, our main focus will be on the problem-solving and software-development process

Our Initial Toolkit

Distributed Change Control
Change-tracked code collaboration (in the cloud)
- git
- GitHub
Notebook-based Development
Reproducible research, explorative problem-solving
- CoCalc (formerly Sage Math Cloud)
  - Jupyter (formerly IPython)
    - Notebooks for Python, Julia, R, etc.
  - Linux virtual machine in the cloud
    - Supports C, C++, LaTeX, Julia, Python, etc.
    - Easily connected to GitHub

Step 0: Get Tools

GitHub Account

Create a GitHub account
Join the Organization: https://github.com/chapman-cs510-2017f
Find/read course material in the info repository therein

CoCalc Account

Create a CoCalc account linked to your GitHub account
Send me your username on Slack, so I can add you to the course project
Open your course project once it is available. Familiarize yourself with the interface and documentation
Create a general-use terminal file (Terminal.term) for later use

Slack Account

Find your email invite to http://scststudents.slack.com
Join Slack and join channel: #cs510-2017f
Use Slack to communicate with instructor / colleagues

Blackboard Use

Find links to classwork and assignments on Blackboard each week

CS 510 - Orientation

By Justin Dressel

In this Chapman University course in the Computational and Data Sciences (CADS) graduate program, you will be introduced to a broad array of powerful and interoperable tools for exploring and solving scientific problems in a computational setting.
