
Operating System Notes (2): Kernel Synchronization Methods


When we introduced linked lists and kernel preemption in Lab 2, we completely overlooked the pseudo-concurrency problem on a single-core system, so the code released with Lab 2 contained serious race condition bugs. This article discusses how to implement effective synchronization in the kernel to avoid this class of problem. Much of it draws on Chapters 9 and 10 of Linux Kernel Development, 3rd Edition.

Background and Analysis

Here I quote directly from Linux Kernel Development.

Causes of Concurrency

In user-space, the need for synchronization stems from the fact that programs are scheduled preemptively at the will of the scheduler. Because a process can be preempted at any time and another process can be scheduled onto the processor, a process can be involuntarily preempted in the middle of accessing a critical region.

If the newly scheduled process then enters the same critical region (say, if the two processes manipulate the same shared memory or write to the same file descriptor), a race condition can occur. The same problem can occur with multiple single-threaded processes sharing files, or within a single program with signals, because signals can occur asynchronously.

This type of concurrency — in which two things do not actually happen at the same time but interleave with each other such that they might as well — is called pseudo-concurrency.

If you have a symmetrical multiprocessing (SMP) machine, two processes can actually be executed in a critical region at the exact same time. That is called true concurrency. Although the causes and semantics of true versus pseudo concurrency are different, they both result in the same race conditions and require the same sort of protection.

This passage introduces the concepts of pseudo-concurrency and true concurrency. On the single-core system used in this course's labs, only pseudo-concurrency is an issue.

Causes of Concurrency in the Kernel

The kernel has similar causes of concurrency:

  • Interrupts — An interrupt can occur asynchronously at almost any time, interrupting the currently executing code.
  • Softirqs and tasklets — The kernel can raise or schedule a softirq or tasklet at almost any time, interrupting the currently executing code.
  • Kernel preemption — Because the kernel is preemptive, one task in the kernel can preempt another.
  • Sleeping and synchronization with user-space — A task in the kernel can sleep and thus invoke the scheduler, resulting in the running of a new process.
  • Symmetrical multiprocessing — Two or more processors can execute kernel code at exactly the same time.

Kernel developers need to understand and prepare for these causes of concurrency.

It is a major bug if:

  • An interrupt occurs in the middle of code manipulating a resource, and the interrupt handler can access the same resource.
  • Kernel code is preempted while accessing shared resources.
  • Kernel code sleeps while inside a critical section.
  • Two processors simultaneously access the same piece of data.

With a clear picture of what data needs protection, it is not hard to provide the locking needed to keep the system stable. The hard part is identifying these conditions and realizing that to prevent concurrency, you need some form of protection.

This passage lists the situations that can give rise to concurrency inside the kernel: interrupts, softirqs and tasklets, kernel preemption, sleeping and synchronization with user-space, and multiprocessor systems, and gives examples of conditions that can lead to race conditions.

Race Conditions in Lab 2

The linked list introduced in Lab 2 is used throughout the kernel's data structures, for example in the run queue for process management and in the kernel-thread creation queue. In the initial Lab 2 release we only made the context save/restore paths (such as trap handling and __switch_to) non-interruptible, and overlooked the race conditions that arise when shared data structures such as kmem_cache and the various linked lists are accessed concurrently.

Since the lab runs on a single-core system, the only source of concurrency is the interleaved (rather than truly simultaneous) execution of different threads, so instead of implementing locks we chose to achieve synchronization by disabling and enabling interrupts.

At first we simply disabled interrupts before each critical section and re-enabled them afterwards. This, however, breaks down when critical sections nest: if interrupts are disabled again inside a section that has already disabled them, the inner section's enable operation wrongly turns interrupts back on and defeats the protection of the outer section. In the end we adopted the approach of the Linux kernel's local_irq_save() and local_irq_restore(), which save and restore the previous interrupt state and therefore handle nesting correctly.
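
A minimal, self-contained sketch of why saving and restoring the flags fixes the nesting problem (the interrupt state here is simulated with a plain variable; this is not the lab's or the kernel's actual code):

#include <stdio.h>

/* Simulated interrupt-enable state standing in for a real CPU flags register;
 * the whole example is a sketch, not the lab's or Linux's actual code. */
static unsigned long cpu_flags = 1;           /* bit 0: interrupts enabled */

typedef unsigned long irqflags_t;

static irqflags_t local_irq_save_sketch(void)
{
    irqflags_t flags = cpu_flags;             /* remember the current state */
    cpu_flags &= ~1UL;                        /* then disable interrupts */
    return flags;
}

static void local_irq_restore_sketch(irqflags_t flags)
{
    /* Restore the saved state instead of unconditionally enabling, so an
     * inner critical section cannot re-enable interrupts too early. */
    cpu_flags = flags;
}

int main(void)
{
    irqflags_t outer = local_irq_save_sketch();   /* outer critical section */
    irqflags_t inner = local_irq_save_sketch();   /* nested critical section */
    local_irq_restore_sketch(inner);
    printf("after inner restore: irqs %s\n", (cpu_flags & 1) ? "on" : "off"); /* off */
    local_irq_restore_sketch(outer);
    printf("after outer restore: irqs %s\n", (cpu_flags & 1) ? "on" : "off"); /* on */
    return 0;
}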

Synchronization in Linux

The synchronization infrastructure provided by the Linux kernel is only listed here, without further introduction:

  • Atomic operations
  • Locks:
    • Spinlock
    • Mutex
    • Reader-writer lock (RW lock)
    • Local lock: used to protect per-CPU data
  • RCU

Instead, we move straight to analyzing concrete examples of synchronization in the kernel.

Handling Nested Critical Sections

The analysis in this section draws on Four short stories about preempt_count(). That article recounts kernel developers' debates over how kernel code should behave in different contexts and is well worth reading.

The kernel has many sources of concurrency, and its critical sections come in different kinds. To handle nesting correctly, the Linux kernel maintains a counter for each task that separately tracks:

  • the nesting depth of preemption disabling
  • the nesting depths of softirq, hardirq, and NMI contexts

Some of the related helper functions are:

preempt_count()
in_hardirq()
in_nmi()
irqs_disabled()
...

With this counter the kernel can correctly handle nesting between the different kinds of critical sections.
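
In the real kernel these counts are packed into different bit fields of a single per-task integer, so helpers such as in_hardirq() are simple mask tests on it. As a simplified sketch of the idea (names and fields are illustrative, not the kernel's actual layout or API):

/* Simplified sketch of a per-task nesting counter in the spirit of
 * preempt_count(); everything here is illustrative only. */

struct task_sketch {
    unsigned int preempt_count;       /* 0 means preemption is currently allowed */
};

static struct task_sketch current_task;   /* stand-in for "current" */

static void preempt_disable_sketch(void)
{
    current_task.preempt_count++;     /* enter one more level of critical section */
}

static void preempt_enable_sketch(void)
{
    current_task.preempt_count--;     /* leave one level of nesting */
    if (current_task.preempt_count == 0) {
        /* only the outermost enable may allow rescheduling again */
    }
}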

Slab

  • Create:

    Because this operation may modify data structures at several levels of the slab allocator, the entire operation is protected by a mutex.

    mm/slab_common.c
    struct kmem_cache *__kmem_cache_create_args(const char *name,
            unsigned int object_size,
            struct kmem_cache_args *args,
            slab_flags_t flags)
    {
        mutex_lock(&slab_mutex);
        //...
        goto out_unlock;
    out_unlock:
        mutex_unlock(&slab_mutex);
        return s;
    }
    
  • Alloc: supports both a fast path and a slow path

    kmem_cache_alloc()
    → slab_alloc_node()
    
    • Fast path:

      kfence_alloc()
      

      An object is allocated from the local CPU's cache using atomic operations, with no lock required (see the sketch after this list).

    • Slow path:

      __slab_alloc_node()
      → __slab_alloc()
      → ___slab_alloc()
      

      The slab cache keeps some per-CPU data, so a local lock is taken together with disabling interrupts to prevent concurrent access on the same CPU.

      mm/slub.c
      local_lock_irqsave(&s->cpu_slab->lock, flags);
      freelist = get_freelist(s, slab);
      lockdep_assert_held(this_cpu_ptr(&s->cpu_slab->lock));
      c->freelist = get_freepointer(s, freelist);
      local_unlock_irqrestore(&s->cpu_slab->lock, flags);
      return freelist;
      

      Inside get_freelist(), updates to the slab's freelist are serialized with a bit spin lock on the slab's page flags, preventing other CPUs from modifying the same slab concurrently.

      mm/slub.c
      bit_spin_lock(PG_locked, &slab->__page_flags);
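
As referenced in the fast-path item above, the lockless fast path pops an object off the current CPU's freelist with a compare-and-swap instead of taking a lock. A very rough sketch of that idea, using C11 atomics rather than the kernel's per-CPU helpers and omitting the transaction-id/ABA handling the real code needs:

#include <stdatomic.h>
#include <stddef.h>

/* Rough sketch of a lockless freelist pop in the spirit of the SLUB fast
 * path; the real kernel uses per-CPU variables plus a transaction id to
 * guard against migration and ABA, which this sketch omits. */

struct object_sketch {
    struct object_sketch *next;       /* free objects linked through their first word */
};

static _Atomic(struct object_sketch *) cpu_freelist;  /* stand-in for c->freelist */

static struct object_sketch *fastpath_alloc_sketch(void)
{
    struct object_sketch *old, *new;

    do {
        old = atomic_load(&cpu_freelist);
        if (!old)
            return NULL;              /* freelist empty: fall back to the slow path */
        new = old->next;
        /* retry if another allocation raced with us and changed the head */
    } while (!atomic_compare_exchange_weak(&cpu_freelist, &old, new));

    return old;
}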
      

Scheduling

What the scheduler mainly protects is the runqueue. Each CPU has its own runqueue holding the processes that are runnable on that CPU.

__schedule() protects the runqueue with a spinlock:

kernel/sched/core.c
rq_lock(rq, &rf);
kernel/sched/sched.h
static inline void rq_lock(struct rq *rq, struct rq_flags *rf)
    __acquires(rq->lock)
{
    raw_spin_rq_lock(rq);
    rq_pin_lock(rq, rf);
}

The lock is released in finish_task_switch(), which runs after the context switch (context_switch()) completes:

kernel/sched/core.c
finish_lock_switch(rq);
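
Note that the lock is therefore acquired in the context of the task giving up the CPU and released in the context of the task that was just switched in. A rough sketch of this hand-off pattern (names and structure are illustrative, not the kernel's exact interfaces):

/* Sketch of the runqueue-lock hand-off across a context switch. */

struct rq_sketch {
    int locked;                       /* stand-in for the runqueue spinlock */
};

static void rq_lock_sketch(struct rq_sketch *rq)   { rq->locked = 1; }
static void rq_unlock_sketch(struct rq_sketch *rq) { rq->locked = 0; }

/* Runs in the context of the task that was just switched *to*. */
static void finish_task_switch_sketch(struct rq_sketch *rq)
{
    rq_unlock_sketch(rq);             /* releases the lock taken in schedule_sketch() */
}

static void schedule_sketch(struct rq_sketch *rq)
{
    rq_lock_sketch(rq);               /* taken by the task giving up the CPU */
    /* pick the next task and switch to it; execution continues in the next
     * task, which then calls finish_task_switch to drop the lock */
    finish_task_switch_sketch(rq);
}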

Interrupts are controlled accordingly as well, as described in the previous article.