Lecture 08: Race Conditions, Deadlock, and Data Integrity
Principles of Computer Systems
Spring 2019
Stanford University
Computer Science Department
Lecturer: Chris Gregg
- Comment on the end of last Wednesday's lecture:
Lecture 07 (review): Masking Signals and Deferring Handlers
// job-list-fixed.c
static void reapProcesses(int sig) {
while (true) {
pid_t pid = waitpid(-1, NULL, WNOHANG);
if (pid <= 0) break;
printf("Job %d removed from job list.\n", pid);
}
}
char * const kArguments[] = {"date", NULL};
int main(int argc, char *argv[]) {
signal(SIGCHLD, reapProcesses);
sigset_t set;
sigemptyset(&set);
sigaddset(&set, SIGCHLD);
for (size_t i = 0; i < 3; i++) {
sigprocmask(SIG_BLOCK, &set, NULL);
pid_t pid = fork();
if (pid == 0) {
sigprocmask(SIG_UNBLOCK, &set, NULL);
execvp(kArguments[0], kArguments);
}
sleep(1); // force parent off CPU
printf("Job %d added to job list.\n", pid);
sigprocmask(SIG_UNBLOCK, &set, NULL);
}
return 0;
}
- In discussing the job-list-fixed example, we considered whether the child's signal handler could get called if the program that the child launched with execvp had a child of its own, and that child ended.
  - I mistakenly discussed what might need to be done to avoid this, but it turns out that nothing needs to be done!
  - Once the original child starts another program with execvp, all of the original code is gone. Therefore, the signal handler cannot be called, because it no longer exists.
  - This is distinct from the idea that blocked signals are still blocked across an execvp boundary.
int kill(pid_t pid, int signum);
int raise(int signum); // equivalent to kill(getpid(), signum);
Signal extras: kill and raise
- Processes can message other processes using signals via the kill system call. And processes can even send themselves signals using raise.
- The kill system call is analogous to the /bin/kill shell command.
  - Unfortunately named, since kill implies SIGKILL implies death.
  - So named, because the default action of most signals in early UNIX implementations was to just terminate the target process.
- We generally ignore the return value of kill and raise. Just make sure you call them properly.
- The pid parameter is overloaded to provide more flexible signaling.
  - When pid is a positive number, the target is the process with that pid.
  - When pid is a negative number less than -1, the targets are all processes within the process group abs(pid). We'll rely on this in Assignment 4.
  - pid can also be 0 or -1, but we don't need to worry about those. See the man page for kill if you're curious.
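As a minimal sketch of one process signaling another with kill, the function below forks a child that sleeps forever, sends it a signal, and reports which signal ended it. The name signalChildAndReport is hypothetical and not part of the lecture code; the sketch assumes the chosen signal still has its default (terminating) action in the child.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical helper (not from the lecture): forks a child that sleeps
// forever, sends it signum via kill, and returns the number of the signal
// that terminated the child, or -1 if the child was not ended by a signal.
int signalChildAndReport(int signum) {
  pid_t pid = fork();
  if (pid < 0) return -1;      // never pass -1 on to kill: it would target every process we may signal
  if (pid == 0) {
    for (;;) pause();          // child: sleep until a signal arrives
  }
  kill(pid, signum);           // positive pid: target exactly that process
  int status;
  waitpid(pid, &status, 0);    // reap the child and collect its exit information
  return WIFSIGNALED(status) ? WTERMSIG(status) : -1;
}
```

Calling signalChildAndReport(SIGTERM) should report SIGTERM rather than SIGKILL, which illustrates that kill delivers whatever signal is named, not necessarily a lethal one.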
Lecture 07 (review): Masking Signals and Deferring Handlers
- The job-list-broken and job-list-fixed examples from the prior slide deck highlight a key issue that comes with the introduction of signals and signal handling.
  - Neither job-list-broken nor job-list-fixed can anticipate when a child process will finish up. That means neither has any control over when SIGCHLD signals arrive.
  - Processes do, however, have some control over how they respond to SIGCHLD signals.
    - They install custom SIGCHLD handlers to surface information about which process exited. We've seen a lot of that already.
    - When a process elects to use signal handling, it shouldn't be penalized by having to live with the concurrency issues that come with it. That would only encourage programmers to avoid signals and signal handling, even when it's the best thing to do.
    - That's why the kernel provides the option to defer a signal handler to run only when it can't cause problems. That's what our job-list-fixed program does.
    - It's true that a program could abuse this power by blocking signals for longer than necessary, but we have no choice but to assume the program wants to use signal handlers properly, else it wouldn't be installing them in the first place.
Lecture 08: Race Conditions, Deadlock, and Data Integrity
- Let's revisit the simplesh example from last week.
  - The problem to be addressed: Background processes are left as zombies for the lifetime of the shell. At the time we implemented simplesh, we had no choice, because we hadn't learned about signals or signal handlers yet.
// simplesh.c
int main(int argc, char *argv[]) {
while (true) {
// code to initialize command, argv, and isbg omitted for brevity
pid_t pid = fork();
if (pid == 0) execvp(argv[0], argv);
if (isbg) {
printf("%d %s\n", pid, command);
} else {
waitpid(pid, NULL, 0);
}
}
printf("\n");
return 0;
}
- Now we know about SIGCHLD signals and how to install SIGCHLD handlers to reap zombie processes. Let's upgrade our simplesh implementation to reap all process resources.
// simplesh-with-redundancy.c
static void reapProcesses(int sig) {
while (waitpid(-1, NULL, WNOHANG) > 0) {;} // nonblocking, iterate until retval is -1 or 0
}
int main(int argc, char *argv[]) {
signal(SIGCHLD, reapProcesses);
while (true) {
// code to initialize command, argv, and isbg omitted for brevity
pid_t pid = fork();
if (pid == 0) {
execvp(argv[0], argv);
printf("%s: Command not found\n", argv[0]);
exit(0);
}
if (isbg) {
printf("%d %s\n", pid, command);
} else {
waitpid(pid, NULL, 0);
}
}
printf("\n");
return 0;
}
- The last version actually works, but it relies on a sketchy call to waitpid to halt the shell until its foreground process has exited.
  - When the user creates a foreground process, normal execution flow advances to an isolated waitpid call to block until that process has terminated.
  - When the foreground process finishes, however, the SIGCHLD handler is invoked, and its waitpid call is the one that culls the foreground process's resources.
  - When the SIGCHLD handler exits, normal execution resumes, and the original call to waitpid returns -1 to state that there is no trace of a process with the supplied pid.
  - The version on the last slide is effectively calling waitpid from main just to block until the foreground process vanishes.
  - Even if you're content with this unorthodox use of waitpid (i.e. invoking a system call when you know it will fail), the waitpid call is redundant and replicates functionality better managed in the SIGCHLD handler.
    - We should only be calling waitpid in one place: the SIGCHLD handler.
    - This will be all the more apparent when we implement shells (e.g. Assignment 4's stsh) where multiple processes are running in the foreground as part of a pipeline (e.g. more words.txt | tee copy.txt | sort | uniq).
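The "returns -1" behavior can be reproduced without a handler. In the sketch below, two waitpid calls are made back to back: the first plays the role of the handler's call and reaps the child, and the second plays the role of the call in main and fails with errno set to ECHILD. The function name secondWaitpidFailsWithECHILD is hypothetical, and making the calls sequentially (rather than from a real SIGCHLD handler) is a simplification to keep the outcome deterministic.

```c
#include <errno.h>
#include <stdbool.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical helper (not from the lecture): forks a child that exits
// immediately, then waits on it twice. Returns true if the first waitpid
// returns the child's pid and the second returns -1 with errno == ECHILD.
bool secondWaitpidFailsWithECHILD(void) {
  pid_t pid = fork();
  if (pid < 0) return false;
  if (pid == 0) _exit(0);                  // child: finish right away

  pid_t first = waitpid(pid, NULL, 0);     // reaps the child (the handler's role)
  errno = 0;
  pid_t second = waitpid(pid, NULL, 0);    // nothing left to reap (main's role)
  return first == pid && second == -1 && errno == ECHILD;
}
```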
- Here's an updated version that's careful to call waitpid from only one place.
// simplesh-with-race-and-spin.c
static pid_t fgpid = 0; // global, initially 0, and 0 means no foreground process
static void reapProcesses(int sig) {
while (true) {
pid_t pid = waitpid(-1, NULL, WNOHANG);
if (pid <= 0) break;
if (pid == fgpid) fgpid = 0; // clear foreground process
}
}
static void waitForForegroundProcess(pid_t pid) {
fgpid = pid;
while (fgpid == pid) {;}
}
int main(int argc, char *argv[]) {
signal(SIGCHLD, reapProcesses);
while (true) {
// code to initialize command, argv, and isbg omitted for brevity
pid_t pid = fork();
if (pid == 0) execvp(argv[0], argv);
if (isbg) {
printf("%d %s\n", pid, command);
} else {
waitForForegroundProcess(pid);
}
}
printf("\n");
return 0;
}
- The version on the last slide introduces a global variable called fgpid to hold the process id of the foreground process. When there's no foreground process, fgpid is 0.
  - Because we don't control the signature of reapProcesses, we have no choice but to make fgpid a global.
  - Every time a new foreground process is created, fgpid is set to hold that process's pid. The shell then blocks by spinning in place until fgpid is cleared by reapProcesses.
- This version consolidates the waitpid code to reside in the handler and nowhere else.
- This version introduces two serious problems, so it's far from an A+ solution.
  - It's possible the foreground process finishes and reapProcesses is invoked on its behalf before normal execution flow updates fgpid. If that happens, the shell will spin forever and never advance up to the shell prompt. This is a race condition, and race conditions are no-nos.
  - The while (fgpid == pid) {;} is also a no-no. This allows the shell to spin on the CPU even when it can't do any meaningful work.
    - It would be substantially better for simplesh to yield the CPU and to only be considered for CPU time when there's a chance the foreground process has exited.
- The race condition can be cured by blocking SIGCHLD before forking, and only lifting that block after the global fgpid has been set.
  - Here's a version of the code that employs signal blocking to remove this race condition.
// simplesh-with-spin.c
// code for reapProcesses omitted, because it's the same as before
static void waitForForegroundProcess(pid_t pid) {
fgpid = pid;
unblockSIGCHLD(); // lift only after fgpid has been set
while (fgpid == pid) {;}
}
int main(int argc, char *argv[]) {
signal(SIGCHLD, reapProcesses);
while (true) {
// code to initialize command, argv, and isbg omitted for brevity
blockSIGCHLD();
pid_t pid = fork();
if (pid == 0) {
unblockSIGCHLD();
execvp(argv[0], argv);
}
if (isbg) {
printf("%d %s\n", pid, command);
unblockSIGCHLD();
} else {
waitForForegroundProcess(pid);
}
}
}
// simplesh-utils.c
// includes a collection of helper functions
static void toggleSIGCHLDBlock(int how) {
sigset_t mask;
sigemptyset(&mask);
sigaddset(&mask, SIGCHLD);
sigprocmask(how, &mask, NULL);
}
void blockSIGCHLD() {
toggleSIGCHLDBlock(SIG_BLOCK);
}
void unblockSIGCHLD() {
toggleSIGCHLDBlock(SIG_UNBLOCK);
}
Note that we call unblockSIGCHLD in the child, before the execvp call. We do so because the child will otherwise inherit the signal block.
- Race condition is now gone!
  - Note that we call blockSIGCHLD before fork, and we don't lift the block until fgpid has been set to the pid of the new foreground process.
  - We also call unblockSIGCHLD in the child right before the execvp call.
    - The child executable could very well depend on multiprocessing. If so, it would certainly call fork and rely on SIGCHLD signals and signal handling.
    - If we forget to call unblockSIGCHLD, the child process inherits the SIGCHLD block across the execvp boundary. That would compromise the child's ability to work properly.
  - We also need to call unblockSIGCHLD for background processes. We do so after bookkeeping information is printf-ed to the screen, as we did for job-list-fixed.
  - We have not addressed the CPU spin issue, and we really need to.
    - We could change the while loop from while (fgpid == pid) {;} to while (fgpid == pid) {usleep(100000);}.
    - The usleep call will push the shell off the CPU every time it realizes it shouldn't have gotten it in the first place. But we'd really prefer to keep the shell off the CPU until the OS has some information suggesting the foreground process is done.
- The C libraries provide a pause function, which forces the process to sleep until some unblocked signal arrives. This sounds promising, because we know fgpid can only be changed because a SIGCHLD signal comes in and reapProcesses is executed.
  - A version of simplesh whose waitForForegroundProcess implementation relies on pause is presented below (simplesh-with-pause-1.c).
  - The problem here? SIGCHLD may arrive after fgpid == pid evaluates to true but before the call to pause it's committed to. That would be unfortunate, because it's possible simplesh isn't managing any other processes, which means that no other signals, much less SIGCHLD signals, will arrive to lift simplesh out of its pause call. That would leave simplesh in a state of deadlock.
  - You might think the second version below (simplesh-with-pause-2.c) might help, but it has the same problem!
// simplesh-with-pause-1.c
static void waitForForegroundProcess(pid_t pid) {
fgpid = pid;
unblockSIGCHLD();
while (fgpid == pid) {
pause();
}
}
// simplesh-with-pause-2.c
static void waitForForegroundProcess(pid_t pid) {
fgpid = pid;
while (fgpid == pid) {
unblockSIGCHLD();
pause();
blockSIGCHLD();
}
unblockSIGCHLD();
}
- The problem with both versions of waitForForegroundProcess on the prior slide is that each lifts the block on SIGCHLD before going to sleep via pause.
- The one SIGCHLD you're relying on to notify the parent that the child has finished could very well arrive in the narrow space between lift and sleep. That would inspire deadlock.
- The solution is to rely on a more specialized version of pause called sigsuspend, which asks that the OS change the blocked set to the one provided, but only after the caller has been forced off the CPU (the mask change and the sleep happen as one atomic step, so there is no gap for the signal to slip into). When some unblocked signal arrives, the process gets the CPU, the signal is handled, the original blocked set is restored, and sigsuspend returns.
  - This is the model solution to our problem, and one you should emulate in your Assignment 3 farm and your Assignment 4 stsh.
// simplesh-all-better.c
static void waitForForegroundProcess(pid_t pid) {
fgpid = pid;
sigset_t empty;
sigemptyset(&empty);
while (fgpid == pid) {
sigsuspend(&empty);
}
unblockSIGCHLD();
}
- Let's go through an example that is the kind of signals problem you may see on the midterm exam.
  - Indeed, the problem is from a past midterm in CS 110:
    - Consider this program and its execution. Assume that all processes run to completion, all system and printf calls succeed, and that all calls to printf are atomic. Assume nothing about scheduling or time slice durations.
static void bat(int unused) {
printf("pirate\n");
exit(0);
}
int main(int argc, char *argv[]) {
signal(SIGUSR1, bat);
pid_t pid = fork();
if (pid == 0) {
printf("ghost\n");
return 0;
}
kill(pid, SIGUSR1);
printf("ninja\n"); return 0;
}
- For each of the five columns, write a yes or no in the header line. Place a yes if the text below it represents a possible output, and place a no otherwise.
- Let's go through another example that is the kind of signals problem you may see on the midterm exam.
  - Consider this program and its execution. Assume that all processes run to completion, all system and printf calls succeed, and that all calls to printf are atomic. Assume nothing about scheduling or time slice durations.
int main(int argc, char *argv[]) {
pid_t pid;
int counter = 0;
while (counter < 2) {
pid = fork();
if (pid > 0) break;
counter++;
printf("%d", counter);
}
if (counter > 0) printf("%d", counter);
if (pid > 0) {
waitpid(pid, NULL, 0);
counter += 5;
printf("%d", counter);
}
return 0;
}
- List all possible outputs
  - Possible Output 1: 112265
  - Possible Output 2: 121265
  - Possible Output 3: 122165
- If the > of the counter > 0 test is changed to a >=, then counter values of zero would be included in each possible output. How many different outputs are now possible? (No need to list the outputs; just present the number.)
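  - 18 outputs now (6 x the first number). Only the original process has a counter of 0, and it prints that 0 at some point before its final 5, so the 0 can land in any of 6 positions within each of the 3 earlier outputs.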
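Both counts can be checked by brute force. The sketch below models each printf as an event, encodes the ordering constraints imposed by fork and waitpid, and enumerates every ordering that respects them. The event numbering, the kOrders table, and the function countOutputs are this sketch's own modeling of the program, not something taken from the exam.

```c
#include <stdbool.h>
#include <string.h>

#define MAX_EVENTS 7
#define MAX_OUTPUTS 64

typedef struct { int before, after; } Order;

// Events 0-1: the child's two "1"s. Events 2-3: the grandchild's two "2"s.
// Event 4: the child's "6". Event 5: the original's "5". Event 6: the original's "0".
static const char kSymbols[MAX_EVENTS] = {'1', '1', '2', '2', '6', '5', '0'};
static const Order kOrders[] = {
  {0, 1},          // the child prints its in-loop 1 before its post-loop 1
  {0, 2},          // the grandchild is forked only after the child's in-loop 1
  {2, 3},          // the grandchild prints its in-loop 2 before its post-loop 2
  {1, 4}, {3, 4},  // the child prints 6 after its own 1s and after waitpid on the grandchild
  {4, 5},          // the original prints 5 only after waitpid on the child
  {6, 5},          // the original prints 0 before it prints 5
};

static char seen[MAX_OUTPUTS][MAX_EVENTS + 1];
static int numSeen;

// Stores s if it has not been recorded yet.
static void record(const char *s) {
  for (int i = 0; i < numSeen; i++) {
    if (strcmp(seen[i], s) == 0) return;
  }
  strcpy(seen[numSeen++], s);
}

// Extends the partial output with every event whose predecessors have all run.
static void extend(char *out, int len, bool used[], int n) {
  if (len == n) { out[len] = '\0'; record(out); return; }
  for (int e = 0; e < n; e++) {
    if (used[e]) continue;
    bool ready = true;
    for (size_t k = 0; k < sizeof(kOrders) / sizeof(kOrders[0]); k++) {
      const Order *o = &kOrders[k];
      if (o->after == e && o->before < n && !used[o->before]) ready = false;
    }
    if (!ready) continue;
    used[e] = true;
    out[len] = kSymbols[e];
    extend(out, len + 1, used, n);
    used[e] = false;
  }
}

// Counts distinct outputs; printZero selects the >= variant of the test.
int countOutputs(bool printZero) {
  int n = printZero ? 7 : 6;
  char out[MAX_EVENTS + 1];
  bool used[MAX_EVENTS] = {false};
  numSeen = 0;
  extend(out, 0, used, n);
  return numSeen;
}
```

Under this model, countOutputs(false) yields the 3 outputs listed above and countOutputs(true) yields 18.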
Lecture 08: Race Conditions, Deadlock, and Data Integrity
By Chris Gregg