Questions

What is the difference between supervised and unsupervised learning?

Supervised: a human provides the labels for the correct output given each input.

(Example: training images hand-labeled DOG or CAT.)

Unsupervised: the dataset somehow generates its own correct output, with no human labels.

How much storage do GPT and DALL·E rely on to work?

For the neural network itself, just 70 billion weights (~30 GB) and the code necessary to run the network (maybe ~15 GB).

For training... all the data used in the training set. For newer GPT models, this is basically the entire internet and then some. Such data can only be stored on "the cloud" (in massive datacenters run by the largest companies).

Is AI Art "original" or does it pull off the web?

Both can be true. Human art is "original" but every artist learns from artwork they've seen before.

I don't have a great definition for what it means to be original, so I don't know how to answer this :(

Do game designers have to model everything themselves?

Yes. Even if NeRFs could completely render any scene (they can't), you still can't interact with objects, or do physics with them, or change their appearance (e.g. for artistic reasons).

Can machines issue prompts which can be solved with machine learning?

At the moment, the ability to do this is very limited. We don't fully understand why.

Why are people so interested in AI if it can't solve problems?

Cynical answer: 🤑🤑🤑🤑

Actual answer: AI can solve really interesting problems, but it's going to be hard to tell what's actually working during the hype cycle.

 

(Personal take: the true tragedy of AI is that everyone who was hawking Bitcoin turned around and started hawking AI).

 

(Pause here to old-man-rant).

If all the research is going to waste, how is the AI field evolving so fast?

A flash flood which loses 90% of its intensity can still hit pretty hard.

Why isn't there a centralized place to record studies that didn't work?

There is: they're called journals. But even if researchers could get things published there, what incentive would they have to do it?

How do neural network layers know which building blocks to use? (e.g. Linear, ReLU).

What is the logic behind an architecture of a neural network?

The human designer selects which functions to use on which nodes; the network doesn't choose its own building blocks.
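A rough sketch of what that looks like in practice (assuming PyTorch; the layer sizes here are made up): the designer literally writes down the sequence of building blocks.

```python
# A minimal PyTorch sketch: the human designer, not the network, decides
# which building blocks appear and in what order. Sizes are made up.
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(784, 128),  # designer's choice: input and hidden sizes
    nn.ReLU(),            # designer's choice: nonlinearity
    nn.Linear(128, 10),   # designer's choice: output size
)
```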

How do researchers make sure the image examples given to the model are valid?

They don't.

Bad training data from the internet is a huge problem for ML, but I don't know of many systematic efforts to tackle it.

When people come up with processes that don't have an application (like NeRFs), what is the end goal in mind?

"Wouldn't it be cool if we could...."

Digitizing the real world

Scanners, depth sensors, and LASERS

1. How data is stored

2. How data is collected

3. Conversions

3D Data Representations

I've claimed throughout this class that 3D data is represented by triangle meshes.

The right 3D data representation depends on what we need to know about the object.

Do we need to know about its interior?

 

How accurately do we need to know the shape?

 

Do we need connectivity information or is just the shape okay?

Triangle meshes are the standard representation for graphics!

Triangle Meshes

  • Consist of vertices (points) and triangular faces (see the minimal sketch below)
  • Represent the surface of an object
  • Can be efficiently rendered by modern graphics hardware
  • Cannot (easily) record data stored inside the shape
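A minimal sketch of what a triangle mesh looks like as data (plain NumPy arrays; the shape is made up for the example):

```python
# A triangle mesh is just a list of vertex positions plus a list of faces,
# where each face stores the indices of its three corner vertices.
import numpy as np

vertices = np.array([   # 3D positions of the corners
    [0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
faces = np.array([      # each row is one triangle (indices into `vertices`)
    [0, 1, 2],
    [0, 1, 3],
    [0, 2, 3],
    [1, 2, 3],
])  # together, the four triangles form the surface of a tetrahedron
```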

Point Cloud

  • Just a bunch of points floating in space (see the sketch below)
  • Points may be dense (lots of points) or sparse (not many points)
  • Do not contain topology information: are these two separate objects or a single one?
  • Example: 👉👈
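By contrast, a point cloud is nothing but positions (a quick sketch):

```python
# A point cloud is just an N x 3 array of positions: no faces, no
# connectivity, no way to tell which points belong to which object.
import numpy as np

points = np.random.rand(10_000, 3)  # 10,000 points scattered in a unit cube
```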

Density Functions

  • Can be thought of as a mapping from each point in space to how dense the material is there.
  • Imagine X-rays, but in 3D
  • Functions can be explicit (specified by hand, as in the sketch below) or implicit (e.g. learned by a neural network).
  • Specifying density functions can be data intensive
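Here's a hedged sketch of an explicit density function, using a made-up Gaussian-blob formula (an implicit version would swap the formula for a trained network):

```python
# An *explicit* density function: a hand-specified formula mapping any
# 3D point to a density value. (An *implicit* version would replace this
# formula with, e.g., a trained neural network.)
import numpy as np

def density(x, y, z):
    r2 = x**2 + y**2 + z**2     # squared distance from the origin
    return np.exp(-r2 / 0.5)    # a soft blob: dense at the center, fading out

print(density(0.0, 0.0, 0.0))   # ~1.0 at the center
print(density(2.0, 0.0, 0.0))   # nearly 0.0 far away
```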

Voxels

  • Minecraft!
  • Like pixels but cubes
  • Each voxel can carry multiple data values, just like in pixel space (e.g. RGB, or density, or magnetic field, or...)
  • Tends to be expensive to store lots of small voxels, since you have to store zeros wherever the object isn't (see the sketch below).
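To see why, a quick back-of-the-envelope sketch (the resolution is chosen arbitrarily):

```python
# A voxel grid is a dense 3D array of values. Storage grows with the cube
# of the resolution, and every empty voxel still takes space.
import numpy as np

res = 512
grid = np.zeros((res, res, res), dtype=np.float32)  # one density value per voxel
print(grid.nbytes / 1e9, "GB")  # ~0.54 GB, even if the object fills almost none of it
```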

Many more representations than this!

 

  • Neural Implicits
  • Tetrahedral or Hexahedral Meshes
  • Level Sets
  • Signed Distance Functions (see the sketch below)

...and many more.
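As one example from that list, a signed distance function for a sphere is just a one-line formula (a minimal sketch, with a made-up center and radius):

```python
# Signed distance function (SDF) of a sphere: negative inside the shape,
# zero exactly on the surface, positive outside.
import numpy as np

def sphere_sdf(p, center=np.zeros(3), radius=1.0):
    return np.linalg.norm(p - center) - radius

print(sphere_sdf(np.array([0.0, 0.0, 0.0])))  # -1.0 (inside)
print(sphere_sdf(np.array([2.0, 0.0, 0.0])))  #  1.0 (outside)
```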

Converting Between Representations

We can usually convert between different representations

But we might lose some information along the way!

Density Function to Mesh Surface

But we can no longer recover the original density function!

Example: CT Scan

Once a full ring of scans has been completed, move the scanner head down (or move the patient up) and scan the next slice.

This looks a lot like the NeRF problem!

Reconstruct unknown scene geometry from a sequence of scans.

However, in the NeRF problem, we had sparse input data (very few images) and light stops at a surface.

 

When doing CT scans, we have dense input data (one snapshot every few degrees) and X-rays pass through hard tissue (even bone to a limited degree).

 

We can recover the density from these projections using the (inverse) Radon Transform.
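A hedged sketch of that pipeline using scikit-image (the standard Shepp-Logan test image stands in for a real patient slice):

```python
# Simulate CT: the Radon transform turns a density image into projections
# (a sinogram), and the inverse transform (filtered back projection)
# recovers the density from those projections.
import numpy as np
from skimage.data import shepp_logan_phantom
from skimage.transform import radon, iradon

image = shepp_logan_phantom()                   # synthetic "patient slice"
theta = np.linspace(0.0, 180.0, 180)            # one projection per degree: dense input!
sinogram = radon(image, theta=theta)            # what the scanner actually measures
reconstruction = iradon(sinogram, theta=theta)  # recovered density for this slice
```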

Result

A sequence of 2D images, each representing a different slice of the patient.

Fun fact: the DICOM image format, which is used to transmit X-ray and CT data in the US, was designed for arbitrary medical use, so it actually has fields that tell you whether the patient is human or not. I think this patient is a dog.

Surface Reconstruction

We can extract a triangle mesh from the voxel data the same way we did for 2D data!

Suppose we want to extract the surface with density = 3.0

(Figure: a grid cell with corner densities 1.0, 4.0, 100.0, and 50.0; the extracted surface crosses each edge whose endpoints lie on opposite sides of 3.0.)

Surface Reconstruction

In 3D, this yields the famous Marching Cubes algorithm.
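A hedged sketch of marching cubes via scikit-image, on a made-up voxel grid (a blob with density 100 at the center, fading outward):

```python
# Extract a triangle mesh at a chosen density threshold from a voxel grid.
import numpy as np
from skimage.measure import marching_cubes

# synthetic voxel grid: high density inside a ball, near zero outside
x, y, z = np.mgrid[-1:1:64j, -1:1:64j, -1:1:64j]
volume = 100.0 * np.exp(-(x**2 + y**2 + z**2) / 0.2)

# vertices and faces of the iso-surface where density == 3.0
verts, faces, normals, values = marching_cubes(volume, level=3.0)
```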

Gathering 3D Data

Unfortunately, we can't stick the world in a CT scanner.

We can't make one big enough, and the resulting X-ray exposure would do some pretty nasty things anyway.

For older projects, people built 3D collection arms.

What type of data does this collect?

In modern times, we prefer optical methods, since these are faster and cheaper to capture.

Optical Depth Data

Optical = using light.

Three general ways to acquire optical depth data.

  • Structured Light
  • Multi-View
  • Time-of-Flight

Structured Light Scanning

  1. Project a known pattern onto the 3D object
  2. Object deforms the pattern based on its depth
  3. Look at resulting pattern with camera and infer object shape

The map from the deformed pattern to the object's shape is a function!

How can we obtain this function?
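One answer (spoiler for the next slide): learn it. A hedged conceptual sketch, assuming we have training pairs of pattern images and ground-truth depth maps; the tiny network and tensor sizes are made up for illustration:

```python
# Treat "deformed-pattern image -> depth map" as a function and fit it
# with a small neural network.
import torch
import torch.nn as nn

model = nn.Sequential(                   # image in, per-pixel depth out
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 1, 3, padding=1),
)

pattern = torch.rand(1, 1, 64, 64)       # stand-in for a camera image of the pattern
true_depth = torch.rand(1, 1, 64, 64)    # stand-in for a measured depth map

loss = nn.functional.mse_loss(model(pattern), true_depth)
loss.backward()                          # one step of "learning the function"
```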

Structured Light

Advantages

  • Only requires a light projector and a regular camera: very cheap!
  • Most of the complexity is handed off to machine learning (which we know how to do)

Disadvantages

  • If the object absorbs too much light, the structured light might not be visible, and 3D reconstruction becomes bad.
  • Cannot have multiple scanners projecting patterns at once (they interfere)!

Most Ubiquitous Structured Light Scanner

FaceID uses structured light to build a 3D depth scan of the user's face.

Note that interference from multiple projectors is not an issue in practice, since people don't tend to try to unlock multiple iPhones simultaneously.

Time-of-Flight

Use the fact that it takes time for a photon to hit an object and return. Use an active photon emitter and the time of flight to determine the distance to an object.
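The arithmetic is one line (a minimal sketch):

```python
# Light travels to the object and back, so
# distance = (speed of light x round-trip time) / 2.
C = 299_792_458.0                 # speed of light in m/s

def tof_distance(round_trip_seconds):
    return C * round_trip_seconds / 2.0

print(tof_distance(6.67e-9))      # ~1 m: the photon came back after ~6.67 nanoseconds
```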

Time-of-Flight

You may know the most famous version of this technology, known as LIDAR (Light Detection and Ranging).

ToF measurements like LIDAR can give incredibly accurate readings, down to 0.5 mm, at frame rates of over 120 Hz.

Great when high-accuracy readings are needed from a single sensor with fast refresh.

Downsides: cost ($1000+ per unit), and it cannot see visible wavelengths (no color information).

Multi-View Camera

If we know the inter-camera distance and the angles to the same point, we can reconstruct how far away the point must be.

How hard can that be?
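For the simplest setup (two identical, parallel, calibrated cameras), the math is a one-liner; the numbers below are made up:

```python
# Stereo depth for two parallel cameras:
#   depth = focal_length * baseline / disparity
# where disparity is how many pixels the same point shifts between the two images.
def stereo_depth(focal_px, baseline_m, disparity_px):
    return focal_px * baseline_m / disparity_px

# 800-pixel focal length, cameras 10 cm apart, point shifts by 20 pixels
print(stereo_depth(800.0, 0.10, 20.0))  # 4.0 meters away
```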

Multi-View Camera

First, we need to know the directions of the cameras and the inter-camera distance very precisely.

 

Small errors in distance and facing can compound drastically over long distances. We tend to need very precisely placed cameras (e.g. in studio settings) or cameras placed together in a small physical rig.

How can we tell if it's the same point?

Finding Correspondences

Pretty hard!

Finding Correspondences is hard!

(Image from a 2024 paper from MIT.)

One solution

Don't try to find arbitrary correspondences between photos: track known, fixed points in the 3D world!

Aligning Scans

What type of data do these 3D cameras give us?

A pixel grid of depth values. When coupled to a regular camera, the resulting image data is often referred to as "RGBD" (RGB + Depth).
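A hedged sketch of how an RGBD depth image becomes a point cloud, assuming a simple pinhole camera model (the intrinsics fx, fy, cx, cy below are made-up values):

```python
# Back-project every pixel of a depth image into a 3D point.
import numpy as np

def depth_to_points(depth, fx, fy, cx, cy):
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))   # pixel coordinates
    z = depth
    x = (u - cx) * z / fx                            # pinhole camera model
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)  # N x 3 point cloud

points = depth_to_points(np.full((480, 640), 2.0),   # fake depth image: everything 2 m away
                         fx=525.0, fy=525.0, cx=320.0, cy=240.0)
```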

A camera cannot capture more than one side of an object at once. If we want to capture the entire object, we need to take multiple 3D readings and stitch them together somehow!

Suppose we have a really complex shape in 3D. How can we align it with another shape?

And so on, and so forth...

This is the Iterative Closest Point (ICP) algorithm, and it's the basic tool for aligning two shapes in 3D.

 

It does require good guesses for correspondences and initial poses (which can partially be provided by machine learning).
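A hedged sketch of a single ICP step with NumPy/SciPy (real ICP repeats this until the alignment stops improving, and real implementations are much more careful about bad correspondences):

```python
# One Iterative Closest Point step: match each source point to its nearest
# target point, then solve for the best rigid rotation + translation (SVD).
import numpy as np
from scipy.spatial import cKDTree

def icp_step(source, target):
    # 1. correspondences: nearest target point for every source point
    _, idx = cKDTree(target).query(source)
    matched = target[idx]

    # 2. best rigid transform between the matched sets (Kabsch algorithm)
    src_mean, tgt_mean = source.mean(axis=0), matched.mean(axis=0)
    H = (source - src_mean).T @ (matched - tgt_mean)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:    # guard against reflections
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = tgt_mean - R @ src_mean

    # 3. move the source points toward the target
    return source @ R.T + t
```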

Motion Capture for Performances

1. Place multiple cameras around the scene

We need to precisely know the locations of these cameras and the distances between them.

The markers on the actors' suits (coming up in step 2) serve two key purposes:

  1. Allow us to more easily find correspondences between the multiple views, since we're only considering real keypoints in the world instead of all possible pixels.

  2. Tend to be placed at key points (e.g. joints, faces) which define motion for virtual characters.

2. Place actors in goofy outfits

3. Record!

4. Find Correspondences between images to compute depth

On older films, this was literally done by a human, frame by frame.

Modern systems use machine learning and coherence of frames: if a keypoint is in one location and we're recording at 60fps, it can only have moved so far in the next frame.

But there's still a fair bit of error that has to be hand-corrected.
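A minimal sketch of that coherence constraint (all numbers made up):

```python
# At a given frame rate, a keypoint can only move so far between frames,
# so we only search a small window around its previous position.
def search_radius_px(max_speed_m_per_s, fps, pixels_per_meter):
    return max_speed_m_per_s / fps * pixels_per_meter

# a hand moving at up to 5 m/s, filmed at 60 fps, at roughly 200 px per meter
print(search_radius_px(5.0, 60, 200))  # ~17 px: only look this far from the last frame
```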

5. Generate Character Skeleton

From the 3D positions of important skeleton points, we can generate the 3D positions of the character skeleton.

If needed, at this point, we might perform registration (e.g. with ICP) to fix partial captures.

6. Animate as Usual

No Index Cards

Double-check presentation schedule: a few groups had to be moved!

[20]: Digitizing the World

By Kevin Song