The Linux Implementation of POSIX Thread Cancellation Points

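Abstract:

Starting from a small example in which pthread_cancel causes a multithreaded deadlock on Linux, this article explains how Linux implements POSIX thread cancellation points and how to avoid the resulting deadlock.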

1. A small example of a deadlock caused by pthread_cancel
2. Cancellation points
3. Cancellation types
4. How Linux implements cancellation points
5. Why the example deadlocks
6. How to avoid the deadlock
7. Conclusion
8. References

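1. A small example of a deadlock caused by pthread_cancel

Below is a small example that deadlocks on Linux. The program only uses a condition variable and a mutex for simple thread synchronization. thread0 starts first, locks mutex, and calls pthread_cond_wait, which puts thread tid[0] on the list of threads waiting for the condition and then unlocks mutex. thread1 sleeps for 10 seconds after it starts, by which time pthread_cond_wait should have unlocked mutex. Thread tid[1] then locks mutex, broadcasts to wake every thread waiting on cond, and unlocks mutex. Once mutex is unlocked, pthread_cond_wait in thread tid[0] re-locks mutex and returns, and finally tid[0] unlocks mutex.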

#include <pthread.h>
#include <stdlib.h>   /* exit */
#include <unistd.h>   /* sleep */

pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t  cond  = PTHREAD_COND_INITIALIZER;

/* Waits on cond; this is the thread that main cancels. */
void *thread0(void *arg)
{
    pthread_mutex_lock(&mutex);
    pthread_cond_wait(&cond, &mutex);
    pthread_mutex_unlock(&mutex);
    pthread_exit(NULL);
}

/* Sleeps, then broadcasts on cond while holding mutex. */
void *thread1(void *arg)
{
    sleep(10);
    pthread_mutex_lock(&mutex);
    pthread_cond_broadcast(&cond);
    pthread_mutex_unlock(&mutex);
    pthread_exit(NULL);
}

int main(void)
{
    pthread_t tid[2];
    if (pthread_create(&tid[0], NULL, &thread0, NULL) != 0) {
        exit(1);
    }
    if (pthread_create(&tid[1], NULL, &thread1, NULL) != 0) {
        exit(1);
    }
    sleep(5);
    pthread_cancel(tid[0]);   /* request cancellation of thread0 */

    pthread_join(tid[0], NULL);
    pthread_join(tid[1], NULL);

    pthread_mutex_destroy(&mutex);
    pthread_cond_destroy(&cond);
    return 0;
}

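Nothing looks wrong at first, but main calls pthread_cancel to cancel thread tid[0]. When the program is compiled and run, it never terminates. It looks as if pthread_cancel cancelled tid[0] without pthread_mutex_unlock ever being executed, so that mutex stays locked forever and thread tid[1] waits endlessly. Is that really what happens?

2. Cancellation points

Note that pthread_cancel does not wait for the thread to terminate; it only makes a request. After the request is issued, the thread keeps running until it reaches a cancellation point. A cancellation point is a place where the thread checks whether it has been cancelled and acts on the request. The pthread_cancel manual page lists the following POSIX thread functions as cancellation points: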

pthread_join(3)
pthread_cond_wait(3)
pthread_cond_timedwait(3)
pthread_testcancel(3)
sem_wait(3)
sigwait(3)

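pthread_cond_wait is one of them. (The list is not exhaustive; pthreads(7) gives the full set of cancellation points required by POSIX.)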

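What is puzzling, though, is that every article on cancellation points only says that a cancelled thread keeps running until it reaches a cancellation point, where the cancellation takes effect. Yet in the example above, main has already slept for 5 seconds before it calls pthread_cancel. So at the moment pthread_cancel is called, has thread0 reached pthread_cond_wait or not?

If thread0 has already reached pthread_cond_wait, then by that description it should run on to the next cancellation point and be cancelled there. There is no later cancellation point, so thread0 should run to pthread_exit and finish, by which time mutex would have been unlocked. There should be no deadlock.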

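3. Cancellation types

The usual phrasing, "such-and-such function is a cancellation point", is easy to misread, because executing a function takes a span of time rather than a single instant. In this implementation, the real cancellation point is the interval inside such a function between the moment the cancellation type is switched to PTHREAD_CANCEL_ASYNCHRONOUS and the moment it is switched back to PTHREAD_CANCEL_DEFERRED.

POSIX defines two cancellation types. Deferred cancellation (PTHREAD_CANCEL_DEFERRED) is the default: no actual cancellation happens until the thread reaches a cancellation point. With asynchronous cancellation (PTHREAD_CANCEL_ASYNCHRONOUS), the thread can be cancelled at any time.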

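4. How Linux implements cancellation points

Now let us look at how Linux implements cancellation points. (Strictly speaking this is the GNU implementation, because the pthread library is implemented in glibc.) The pthread library used on Linux today is NPTL, which ships as part of glibc.

Take pthread_cond_wait as an example. In glibc-2.6/nptl/pthread_cond_wait.c: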

      /* Enable asynchronous cancellation. Required by the standard. */
      cbuffer.oldtype = __pthread_enable_asynccancel ();

      /* Wait until woken by signal or broadcast. */
      lll_futex_wait (&cond->__data.__futex, futex_val);

      /* Disable asynchronous cancellation. */
      __pthread_disable_asynccancel (cbuffer.oldtype);

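As we can see, before the thread starts waiting, pthread_cond_wait switches the thread's cancellation type to asynchronous (__pthread_enable_asynccancel). When the thread is woken up, the type is switched back to its previous value, deferred by default (__pthread_disable_asynccancel).

This means that any cancellation request received before __pthread_enable_asynccancel is handled once __pthread_enable_asynccancel has run, and any request received before __pthread_disable_asynccancel is handled before __pthread_disable_asynccancel. The real cancellation point is therefore the interval between these two calls.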

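5. Why the example deadlocks

By the time main calls pthread_cancel, thread0 has already entered pthread_cond_wait and is blocked in lll_futex_wait, on the list of threads waiting for the condition. This can be verified by setting breakpoints on these functions in GDB.

When pthread_cancel is called, thread tid[0] is still waiting. The request arrives before __pthread_disable_asynccancel, so it is acted on immediately. However, pthread_cond_wait has registered a thread cleanup handler (glibc-2.6/nptl/pthread_cond_wait.c):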

  /* Before we block we enable cancellation. Therefore we have to
     install a cancellation handler. */
  __pthread_cleanup_push (&buffer, __condvar_cleanup, &cbuffer);

So what does this cleanup handler, __condvar_cleanup, do? Look at the end of its implementation (glibc-2.6/nptl/pthread_cond_wait.c):

  /* Get the mutex before returning unless asynchronous cancellation
     is in effect. */
  __pthread_mutex_cond_lock (cbuffer->mutex);
}

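So __condvar_cleanup re-locks mutex as its last step. (POSIX requires this: a thread cancelled while blocked on a condition variable re-acquires the mutex before its cleanup handlers run.) thread0 then terminates while still holding mutex, and nothing will ever unlock it. thread1 is still asleep at that moment (sleep(10)); when it wakes up and calls pthread_mutex_lock, it blocks forever.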

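6. How to avoid the deadlock

Cleanup handlers registered with pthread_cleanup_push run in last-in, first-out (LIFO) order, so we can register a cleanup handler of our own before calling pthread_cond_wait: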

/* Cleanup handler: release the mutex that __condvar_cleanup re-locked. */
void cleanup(void *arg)
{
    pthread_mutex_unlock(&mutex);
}

void *thread0(void *arg)
{
    pthread_cleanup_push(cleanup, NULL);   /* thread cleanup handler */
    pthread_mutex_lock(&mutex);
    pthread_cond_wait(&cond, &mutex);
    pthread_mutex_unlock(&mutex);
    pthread_cleanup_pop(0);                /* normal path: do not run it */
    pthread_exit(NULL);
}

Now, when the thread is cancelled, the handler registered inside pthread_cond_wait, __condvar_cleanup, runs first and locks mutex. Then the handler registered in thread0, cleanup, runs and unlocks it. The deadlock is avoided.

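7. Conclusion

Synchronization in multithreaded programs has always been a headache. The cancellation point concept, which POSIX introduced to avoid the resource problems caused by cancelling a thread at an arbitrary instant, is a very good design, but careless use of pthread_cancel can still cause synchronization problems. Knowing how POSIX cancellation points are implemented on Linux makes the mechanism easier to understand and to use well.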

8. References

[1] W. Richard Stevens, Stephen A. Rago: Advanced Programming in the UNIX Environment, 2nd Edition.
[2] Linux Manpage
