CS 510: Computing for Scientists

Orientation and Tools

Assistant Professor Justin Dressel

Faculty of Mathematics, Physics, and Computation

Schmid College of Science and Technology

 

Why Scientific Computing? 

  • Scientists typically have no formal training in software engineering, but write code anyway 

 

  • Trained software engineers rarely encounter problems with mathematical or performance requirements as demanding as those in science

 

  • Only by understanding both halves of this problem can we produce adequate solutions

Course Assumptions

  • Scientists need computation now more than ever
  • (Big) Data science is an exploding and related field
  • Scientific computing tools are rapidly evolving
  • Multi-language solutions are now commonplace
  • Mobile and cloud-friendly standards are now dominant
  • Scientists come from diverse (non-CS) backgrounds

Course Strategy

  • The best defense is a good offense
  • Scientists should be self-sufficient learners
  • The best way to learn is to do
  • Breadth for perspective, depth from experience
  • Data-driven and test-driven development

Trichotomy of Data

Modeling

  • Mathematical objects
  • Physical states
  • CS types/structures

Processing

  • Transformations/Morphisms
  • Evolution/state updates
  • Algorithms/programs

Visualization

  • Plots/graphs/movies/arrays

Scientific problems are data-centric
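
As a small illustration of the three roles, the sketch below models a physical state as a data structure, processes it with a state-update function, and produces an array that could be handed to a plotting tool. The names State, step, and trajectory, and the choice of exponential decay with a forward-Euler update, are assumptions made for illustration and are not part of the course material.

```python
from dataclasses import dataclass


# Modeling: a physical state represented as a CS data structure.
@dataclass(frozen=True)
class State:
    t: float  # time
    x: float  # amount of some decaying quantity


# Processing: a state update (one forward-Euler step of dx/dt = -rate * x).
def step(state: State, dt: float, rate: float) -> State:
    return State(t=state.t + dt, x=state.x - rate * state.x * dt)


# Visualization input: an array of (t, x) pairs ready for a plotting tool.
def trajectory(initial: State, dt: float, rate: float,
               n_steps: int) -> list[tuple[float, float]]:
    points = [(initial.t, initial.x)]
    state = initial
    for _ in range(n_steps):
        state = step(state, dt, rate)
        points.append((state.t, state.x))
    return points


if __name__ == "__main__":
    for t, x in trajectory(State(t=0.0, x=1.0), dt=0.1, rate=1.0, n_steps=5):
        print(f"{t:.1f}  {x:.4f}")
```

Keeping the state, the update, and the plot-ready array separate means each can be changed or tested without touching the others.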

Scientific Problem Solving

  1. Identify a concrete question
  2. Identify the data to be modeled/processed/visualized
  3. Decompose the question into smaller problems  
  4. Write unit tests for known cases to verify each small problem
  5. Implement solutions, ensuring all tests pass correctly
  6. Compose smaller solutions to answer the main question
  7. Verify correctness of solutions with test cases
  8. Validate correctness of approach overall
  9. Profile performance of solutions
  10. Optimize:
    1. rethink data structures (2.)
    2. refactor common code (3.)
    3. redesign efficient algorithms (5.)
    4. reimplement slow parts as faster modules (5.)
    5. retest to catch and prevent bugs (7-8.)
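
The steps above can be sketched on a toy question: "what is the integral of a function over an interval?" The example below is a minimal illustration, assuming a composite trapezoid rule as the small problem. The function names trapezoid, test_trapezoid, and time_call are hypothetical, and the timing is a crude stand-in for a real profiler.

```python
import math
import time


# Steps 2-3: the data are sample points of f on [a, b]; the small problem is
# one composite trapezoid-rule estimate of the integral.
def trapezoid(f, a: float, b: float, n: int) -> float:
    """Estimate the integral of f over [a, b] using n equal-width panels."""
    if n < 1:
        raise ValueError("n must be at least 1")
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for i in range(1, n):
        total += f(a + i * h)
    return total * h


# Steps 4-5 and 7: unit tests on cases with known answers.
def test_trapezoid() -> None:
    # The trapezoid rule is exact for linear functions (up to rounding).
    assert math.isclose(trapezoid(lambda x: 2 * x + 1, 0.0, 1.0, 4), 2.0)
    # The integral of sin over [0, pi] is 2; allow a small discretization error.
    assert abs(trapezoid(math.sin, 0.0, math.pi, 1000) - 2.0) < 1e-5


# Step 9: a crude wall-clock measurement of one call.
def time_call(n: int) -> float:
    start = time.perf_counter()
    trapezoid(math.sin, 0.0, math.pi, n)
    return time.perf_counter() - start


if __name__ == "__main__":
    test_trapezoid()
    print("all tests pass")
    print(f"n = 100000 took {time_call(100_000):.4f} s")
```

Because the tests are written before any optimization, they can be rerun unchanged after each change in step 10.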

Problem (why)

Interface (what)

Test Cases (verify)

Implementation (how)

  • Data structure choice
  • Algorithm choice
  • Language choice
  • Other optimizations

Modular Design

Each small problem is an encapsulated module

To answer the main question, many modules must work together

Changing the implementation should not affect the interface
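
A minimal sketch of this idea: two implementations of the same interface, checked by the same test cases. The choice of population variance as the small problem, and the names variance_two_pass, variance_welford, and check, are assumptions for illustration only. Empty input is not handled here.

```python
import math
from typing import Callable, Sequence

# Interface (what): a function taking samples and returning the population variance.
Variance = Callable[[Sequence[float]], float]


# Implementation A (how): straightforward two-pass formula.
def variance_two_pass(samples: Sequence[float]) -> float:
    n = len(samples)
    mean = sum(samples) / n
    return sum((x - mean) ** 2 for x in samples) / n


# Implementation B (how): Welford's single-pass update, a different algorithm
# behind the same interface.
def variance_welford(samples: Sequence[float]) -> float:
    count, mean, m2 = 0, 0.0, 0.0
    for x in samples:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    return m2 / count


# Test cases (verify): written against the interface, so they apply to any
# implementation.
def check(variance: Variance) -> None:
    assert math.isclose(variance([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]), 4.0)
    assert math.isclose(variance([3.0, 3.0, 3.0]), 0.0, abs_tol=1e-12)


if __name__ == "__main__":
    for implementation in (variance_two_pass, variance_welford):
        check(implementation)
    print("both implementations satisfy the same tests")
```

Code that calls the module only through the interface does not need to change when one implementation is swapped for the other.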

Language Paradigms

"Object-oriented"

  • Structure is primary (classes/objects)
  • Functions belong to data (object methods)
  • Data arranged in hierarchy (inheritance)
  • Examples:
    C++, Java, Python, Ruby, Julia, Rust

"Functional"

  • Process is primary (functions)
  • Data belong to functions (closures)
  • Functions arranged in hierarchy (closures)
  • Examples:
    Mathematica, Python, Haskell, Julia, R

"Logical"

  • Constraints primary (rules/clauses)
  • Data and functions belong to clauses
  • Hierarchical clauses become unified
  • Examples:
    Prolog, Mercury, Curry, Python (LogPy, Pyke)

"Procedural"

  • Instructions primary, limited types of data
  • e.g.: C, MATLAB, Python, Julia, Go, Rust

(Others:  concurrent, actor-based, data-flow, rule-based, symbolic, etc.)
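
Since Python appears under several of the paradigms above, one small task can be written in each style for comparison. The sketch below computes a sum of squares three ways. The function and class names are hypothetical and chosen only for this illustration.

```python
from functools import reduce


# Procedural: explicit instructions that update a running total.
def sum_squares_procedural(values):
    total = 0
    for v in values:
        total += v * v
    return total


# Functional: the process is a composition of functions, with no mutation.
def sum_squares_functional(values):
    return reduce(lambda acc, sq: acc + sq, map(lambda v: v * v, values), 0)


# Object-oriented: the function belongs to the data as a method.
class Samples:
    def __init__(self, values):
        self.values = list(values)

    def sum_squares(self):
        return sum(v * v for v in self.values)


if __name__ == "__main__":
    data = [1, 2, 3]
    print(sum_squares_procedural(data))   # 14
    print(sum_squares_functional(data))   # 14
    print(Samples(data).sum_squares())    # 14
```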

Language Abstraction Levels

  • Hardware level (Binary, Machine code, Assembly)
    Code executed by the hardware processor directly as procedural instructions
  • Low-level native compilation (C, Go, Rust)
    Cross-architecture abstraction, compiled to machine code before execution
  • High-level native compilation (C++, Haskell, LISP)
    Sophisticated and expressive abstraction, still compiles to machine code first
  • Just-In-Time (JIT) compilation (Julia, Java, MATLAB)
    Cross-platform compiled code, converted to machine code during execution
  • Interpreted bytecode (Python, R, Mathematica)
    Code read and executed by an interpreter, which can use compiled libraries
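
To make the last level concrete, the standard-library dis module can list the bytecode instructions for a function. This is a small sketch that assumes the CPython interpreter; the function add is a hypothetical example, and the instruction names printed differ between Python versions.

```python
import dis


def add(a, b):
    return a + b


# List the bytecode instruction names that the CPython interpreter executes
# for add(). The exact names depend on the Python version.
opnames = [instruction.opname for instruction in dis.Bytecode(add)]

if __name__ == "__main__":
    print(opnames)
```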

Rule of thumb: increasing abstraction away from hardware (low-level to high-level)

      increases portability, simplifies coding, decreases efficiency

BUT: Efficiency is multi-faceted

Low-level code is difficult to write correctly, and scientific problems are often hard to solve

A more effective strategy: solve problem with high-level language, then...

Order of Importance:

  1. Rethink efficient algorithms and data structures
  2. Exploit modular design
    • solve problems at a high level, but use fast low-level libraries for speed
  3. Profile code, refactor, and reimplement libraries
    • isolate slow parts, and replace just those with faster low-level libraries
  4. Parallelize code
    • run independent pieces of code on many processors (or GPUs)
  5. Reimplement entire correct solution (rapid prototyping):
    1. Replace interpreters with Just-In-Time (JIT) compilers (e.g., JVM, .NET)
    2. Replace JIT with high-level native code-compilation (e.g., LLVM, C++)
    3. Abandon high-level code for low-level native compilation (e.g., C, Rust)
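
Items 2 and 3 of the efficiency list above (use fast low-level libraries, and profile to find the slow parts) can be sketched with the standard library alone. The example assumes CPython, where the built-in sum() is implemented in C. The names total_loop, total_builtin, and profile are hypothetical, and the timings will vary by machine, so none are quoted here.

```python
import math
import timeit


# High-level prototype: a pure-Python loop executed by the interpreter.
def total_loop(values):
    total = 0.0
    for v in values:
        total += v
    return total


# Same interface, with the hot loop delegated to the built-in sum(),
# which CPython implements in C.
def total_builtin(values):
    return sum(values, 0.0)


def profile(func, values, repeats=20):
    """Return the best wall-clock time in seconds over `repeats` single calls."""
    return min(timeit.repeat(lambda: func(values), number=1, repeat=repeats))


if __name__ == "__main__":
    data = [float(i) for i in range(100_000)]
    # Retest first: the replacement must agree with the prototype.
    assert math.isclose(total_loop(data), total_builtin(data))
    print(f"loop:    {profile(total_loop, data):.6f} s")
    print(f"builtin: {profile(total_builtin, data):.6f} s")
```

Only the slow part is replaced, and the agreement check is rerun before the timings are compared.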

"Blah blah blah!"
What does this all mean in practice?

  1. We can get very far using high-level languages first, provided that we use effective design principles
  2. Learning several language-levels is beneficial

    In this course, we will focus on three main languages:
    1. Python : multi-paradigm, easy-to-learn, popular
    2. Julia : multi-paradigm, easy-to-learn, efficient
    3. C : close to hardware, hard-to-learn, foundational
       

We will also briefly become acquainted with C++, the object-oriented evolution of the C language.

However, our main focus will be on the problem-solving and software-development process

Our Initial Toolkit

  • Distributed Change Control
    Change-tracked code collaboration (in the cloud)
  • Notebook-based Development
    Reproducible research, explorative problem-solving
    • CoCalc (formerly Sage Math Cloud)
      • Jupyter (formerly IPython)
        • Notebooks for Python, Julia, R, etc.
      • Linux virtual machine in the cloud
        • Supports C, C++, LaTeX, Julia, Python, etc.
        • Easily connected to GitHub

Step 0: Get Tools

GitHub Account

  1. Create a GitHub account
  2. Join the Organization:
    https://github.com/chapman-cs510-2017f
  3. Find/read course material in the info repository therein

CoCalc Account

  1. Create a CoCalc account linked to your GitHub account
  2. Send me your username on Slack, so I can add you to the course project
  3. Open your course project once it is available. Familiarize yourself with the interface and documentation
  4. Create a general-use terminal file, Terminal.term, for later use

Slack Account

  1. Find your email invite to
     http://scststudents.slack.com
  2. Join Slack and join channel:
    #cs510-2017f
  3. Use Slack to communicate with instructor / colleagues

Blackboard Use

  1. Find links to classwork and assignments on Blackboard each week

CS 510 - Orientation

By Justin Dressel

In this Chapman University course in the Computational and Data Sciences (CADS) graduate program, you will be introduced to a broad array of powerful and interoperable tools for exploring and solving scientific problems in a computational setting.
