Patching until the COWs come home

Author: Vlastimil Babka

Introduction

This post contains the original text of "Patching until the COWs come home" and its Chinese translation.

Part 1

The kernel's memory-management subsystem is built upon many concepts, one of which is called "copy on write", or "COW". The idea behind COW is conceptually simple, but its details are tricky and its past is troublesome. Any change to its implementation can have unexpected consequences and cause subtle breakage for existing workloads. So it is somewhat surprising that last year we saw two major changes to the kernel's COW code; less surprising is the fact that, both times, these changes had unexpected consequences and broke things. Some of the resulting problems are still not fixed today, almost ten months after the first change, while the original reason for the changes (a security vulnerability) is also not fully fixed. Read on for a description of COW, the vulnerability, and the initial fix; the concluding article in the series will describe the complications that arose thereafter.

内核的内存管理子系统基于许多概念,其中之一被称为"写时复制",或"COW"。COW 背后的思想在概念上很简单,但其细节复杂且其历史充满问题。对其实现的任何更改都可能产生意想不到的后果,并导致现有工作负载出现微妙的问题。因此,去年我们看到了内核 COW 代码的两个重大变化,这多少有些令人惊讶;而这两个变化都产生了意想不到的后果并导致了一些问题,这一点就不那么令人惊讶了。一些由此产生的问题至今仍未解决,在第一次更改近十个月后,而更改的最初原因——一个安全漏洞——也尚未完全修复。继续阅读,了解 COW、漏洞和初始修复的描述;系列结论文章将描述此后出现的问题。

Copy on write is a standard mechanism for sharing a single instance of an object between processes in a situation where each process has the illusion of an independent, private copy of that object. Examples include memory pages shared between processes or data extents shared between files. To see how COW is used in the memory-management subsystem, consider what happens when a process calls fork(): the pages in that process's private memory areas should no longer be shared between the parent and child. But, instead of creating new copies of those pages for the child process during the fork() call, the kernel will simply map the parent's pages in the child's page tables. Importantly, the page-table entries in both parent and child are set as read-only (write-protected).

写时复制是一种标准机制,用于在多个进程间共享单个对象实例,每个进程都感觉拥有该对象的独立、私有副本。例如,进程间共享的内存页或文件间共享的数据范围。要了解写时复制在内存管理子系统中的使用方式,可以思考当进程调用 fork() 时会发生什么:该进程的私有内存区域中的页不再与父进程共享。但是,在 fork() 调用期间,内核不会为子进程创建这些页面的新副本,而是将父进程的页面映射到子进程的页表中。重要的是,父进程和子进程中的页表项都被设置为只读(写保护)。

If either process attempts to write to one of these pages, a page fault will occur, and the kernel's page-fault handler will create a new copy of the page, replacing the page-table entry (PTE) in the faulting process with a PTE that references the new page, but which allows the write to proceed. This action is often referred to as "breaking COW". If the other process then tries to write to that same page, another page fault will occur, as that process's PTE is still marked read-only. But now the page-fault handler will recognize that the page is no longer shared, so the PTE can just be made writable and the process can resume.

如果任一进程尝试写入这些页面之一,将会发生页面错误,并且内核的页面错误处理程序将创建页面的新副本,用指向新页面的 PTE(页表项)替换出错进程中的 PTE,但允许写入继续进行。此操作通常被称为"打破 COW"。如果另一个进程随后尝试写入同一页面,又会发生页面错误,因为该进程的 PTE 仍然标记为只读。但现在页面错误处理程序会识别页面不再共享,因此 PTE 可以直接设为可写,进程即可继续执行。
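The private-copy illusion described above can be observed from user space. What follows is a minimal sketch rather than anything taken from the article; the function name cow_keeps_parent_copy_private() and the buffer contents are made up for illustration. The child overwrites a buffer it inherited across fork(), and the parent then checks that its own view is unchanged.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Private, writable data; after fork() both processes map the same page. */
static char buffer[32] = "original";

/*
 * Hypothetical helper: returns 1 if the parent still sees "original" after
 * the child has overwritten its own view of the buffer, 0 if the write
 * leaked into the parent, and -1 on error.
 */
int cow_keeps_parent_copy_private(void)
{
    static const char child_text[] = "changed by child";
    assert(sizeof(child_text) <= sizeof(buffer));

    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: the write triggers a fault and lands in a private copy. */
        strcpy(buffer, child_text);
        _exit(strcmp(buffer, child_text) == 0 ? 0 : 1);
    }

    int status;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
        return -1;

    /* Parent: the child's write must not be visible here. */
    return strcmp(buffer, "original") == 0;
}
```

The sketch only shows the visible semantics; whether the kernel copied the page or simply made an unshared page writable is not observable from this code.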

The benefits of this scheme are lower memory consumption and a reduction of CPU time spent copying pages during fork() calls. Often the price of copying is never paid for many of the pages because the child might call exit() or exec() before either the parent or the child writes to those pages.

该方案的优点是降低内存消耗,并减少在 fork() 调用期间复制页面的 CPU 时间。由于子进程可能在父进程或子进程写入这些页面之前调用 exit() 或 exec() ,因此许多页面的复制成本实际上从未被支付。

While the COW mechanism looks simple, the devil is in the details, as has been shown already in the past. The recent trouble in this area started in 2020; it resulted in two major changes while attempting to fix a vulnerability — which is actually still not fixed in all scenarios — and resulted in many corner cases, some of which are still not fully ironed out.

虽然 COW 机制看起来简单,但细节中的魔鬼早已在过去的讨论中显现。近期在这个领域的麻烦始于 2020 年;在试图修复一个漏洞的过程中,导致了两次重大变更——而这个漏洞实际上在所有场景下仍未被修复——并导致了许多边界情况,其中一些至今仍未完全解决。

The trouble begins 麻烦开始

The first public sign of issues with the COW mechanism appeared in the form of commit 17839856fd58 ("gup: document and work around 'COW can break either way' issue") at the end of May 2020. The changelog doesn't fully describe the problem scenario, but what is there is ominous enough:

COW 机制问题的第一个公开迹象出现在 2020 年 5 月底的提交记录 17839856fd58("gup: 记录和解决'COW 可能以任何方式失效'的问题")。变更日志并未完全描述问题场景,但其中所描述的内容已经足够令人担忧:

End result: the get_user_pages() call might result in a page pointer that is no longer associated with the original VM, and is associated with - and controlled by - another VM having taken it over instead.

最终结果:get_user_pages()调用可能会返回一个不再与原始 VM 关联的页指针,而是与另一个接管了该页的 VM 关联并由其控制。

Any doubts about whether the commit fixed a security vulnerability vanish when one notices the Reported-by tag mentioning Jann Horn; presumably Horn's report went through the appropriate non-public security channels. The practice of making fixes to some vulnerabilities immediately public without explicitly marking them as such is not new, especially in the COW area. Nevertheless, the related Project Zero issue was made public in August, and CVE-2020-29374 was assigned in December; both point to the above-mentioned commit as the fix.

对提交是否修复了安全漏洞的任何疑虑,当注意到 Reported-by 标签提到了 Jann Horn 时便烟消云散;据推测,Horn 的报告是通过了适当的非公开安全渠道。立即公开某些漏洞的修复而不明确标记这一做法并不新鲜,尤其是在 COW 领域。尽管如此,相关的 Project Zero 问题在 8 月份公开,CVE-2020-29374 在 12 月份被分配;两者都指向上述提交作为修复。

As the Project Zero issue includes proof-of-concept (PoC) code, we can look at the fix with that code in mind and not rely on the incomplete commit log. The most important parts of the PoC are the following:

由于 Project Zero 问题包括概念验证(PoC)代码,我们可以带着这段代码的思路来审视修复,而不依赖于不完整的提交日志。PoC 最重要的部分如下:

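    /* Headers and main() are scaffolding added so that the excerpt compiles. */
    #define _GNU_SOURCE          /* for vmsplice() */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/uio.h>
    #include <unistd.h>

    static void *data;

    int main(void)
    {
        /* One page-aligned, page-sized private anonymous buffer. */
        if (posix_memalign(&data, 0x1000, 0x1000) != 0)
            return 1;
        strcpy(data, "BORING DATA");

        if (fork() == 0) {
            /* child */
            int pipe_fds[2];
            struct iovec iov = { .iov_base = data, .iov_len = 0x1000 };
            char buf[0x1000];

            pipe(pipe_fds);
            /* The pipe takes a GUP reference on the page; nothing is copied. */
            vmsplice(pipe_fds[1], &iov, 1, 0);
            /* Drop the child's mapping; the pipe's reference remains. */
            munmap(data, 0x1000);

            sleep(2);
            read(pipe_fds[0], buf, 0x1000);
            printf("read string from child: %s\n", buf);
        } else {
            /* parent */
            sleep(1);
            strcpy(data, "THIS IS SECRET");
        }
        return 0;
    }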

The code starts by allocating an anonymous, private page and writing some data there; it then calls fork(). At that point, the page becomes a COW page: it is write-protected for the parent process by making the corresponding page-table entry read-only, and for the child process an identical PTE is created. Then, while the parent is blocked inside sleep(), the child creates a pipe and passes the page to that pipe with vmsplice(), a system call that is similar to write() but which allows a zero-copy data transfer of the page's contents. In order to achieve that, the kernel takes a reference on the source page (by increasing its reference count) through get_user_pages() or one of its variants; the set of these functions is often referred to as "GUP". The child then unmaps the page from its own page tables (but retains the reference in the pipe) and goes to sleep.

代码首先分配一个匿名的私有页面并写入一些数据;然后调用 fork() 。此时,该页面成为写时复制页面:父进程中对应的页表项被设置为只读,从而对父进程写保护,同时为子进程创建一个相同的 PTE。然后,当父进程在 sleep() 中阻塞时,子进程创建一个管道,并使用 vmsplice() 将页面传递给该管道,这是一个类似于 write() 但允许页面内容零拷贝传输的系统调用。为了实现这一点,内核通过 get_user_pages() 或其变体在源页面上获取一个引用(通过增加其引用计数);这些函数的集合通常被称为"GUP"。然后,子进程将其页表中的页面取消映射(但在管道中保留引用),并进入睡眠状态。

The parent wakes up from its sleep and writes new data to the page. The page-table entry is write-protected, so the write causes a page fault. The page-fault handler can tell that this is a fault on a COW page because the mapping allows write access while the PTE is write-protected. If there were more processes mapping the page, then the content would have to be copied (breaking COW), but if there is a single mapping, the page can simply be made writable. The kernel relies on the value returned by page_mapcount() to determine how many mappings exist.

父进程从睡眠中醒来,并将新数据写入页面。由于页面表项是写保护的,因此写入操作会触发页面错误。页面错误处理程序可以判断这是对 COW 页面的错误,因为映射允许写访问,而 PTE 是写保护的。如果还有其他进程映射该页面,则内容必须被复制(这将破坏 COW),但如果只有一个映射,页面可以直接设置为可写。内核依赖于 page_mapcount() 返回的值来确定存在多少个映射。

Here is the problem: page_mapcount() at this point in the PoC's execution includes only the parent's mapping, because the child has already called munmap() on that page. This function does not take into account the fact that the child can still access the parent's page through the pipe; it ignores the elevated page reference count. Thus, the kernel allows the parent process to write new data into the page, which is no longer considered to be a COW page. Finally, the child wakes up and reads that new data from the pipe, which might include sensitive information that the parent did not expect the child to see.

问题是:在 PoC 执行的这个阶段, page_mapcount() 仅包含父进程的映射,因为子进程已经在这个页面上调用了 munmap() 。这个函数没有考虑到子进程仍然可以通过管道访问父进程的页面,它忽略了提升的页面引用计数。因此,内核允许父进程向页面写入新数据,而该页面不再被视为 COW 页面。最后,子进程醒来并从管道中读取这些新数据,这些数据可能包括父进程没有期望子进程看到的敏感信息。

Corralling the problem 控制问题

One might rightfully ask why this potential for leaking data from parent to child matters in practice, as both processes are normally executing code from the same binary and the fork() only acts as a branch in the code. So we can assume that either the binary is trusted, and thus the child process is too, or it is not, in which case we probably should not let the parent access any sensitive data in the first place. And, in the scenario where fork() from a trusted binary is followed by an exec() of a potentially malicious binary, exec() removes all shared pages from the address space of the child process before loading the new binary.

人们可能会合理地询问,为什么从父进程到子进程的数据泄露潜力在实践中可能很重要,因为这两个进程通常都在执行同一个二进制的代码,而 fork() 仅作为代码中的一个分支。因此,我们可以假设,要么二进制是可信的,那么子进程也是可信的,要么它不可信,那么我们本来就不应该让父进程访问任何敏感数据。并且,在从可信二进制中的 fork() 跟随到潜在恶意二进制的 exec() 的场景中, exec() 会在加载新二进制之前从子进程的地址空间中移除所有共享页面。

But, as the Project Zero issue mentions, there are environments, such as Android, where each process is forked from a zygote process without a subsequent exec(), for performance reasons. That could lead to a situation that looks a lot like the PoC exploit for this bug.

但是,正如 Project Zero 问题所提到的,在 Android 等环境中,出于性能考虑,每个进程都是从 zygote 进程中 fork 出来的,而没有后续的 exec() ,这可能导致一种看起来非常像这个 bug 的 PoC 漏洞利用的情况。

Moreover, the vmsplice() syscall might just be a symptom of a broader issue, since there are many other callers of the GUP functions in the kernel. So it is a good idea in general not to let a child process hold on to a page shared through the COW mechanism with the parent while letting the parent write new contents to the page.

此外, vmsplice() 系统调用可能只是更广泛问题的症状,因为内核中还有许多其他调用 GUP 函数的代码。因此,通常来说,不让子进程持有通过 COW 机制与父进程共享的页面,同时让父进程写入新内容到该页面,是一个好主意。

To prevent exploits of this behavior, commit 17839856fd58 made it impossible to get a reference (even a read-only reference) via GUP to a COW-shared page. All such attempts now result in breaking COW and returning a reference to the new copy instead. Thus, in the PoC code above, calling vmsplice() now causes the child process to replace the shared COW page in the corresponding page table entry with a new page, which is then passed to the pipe. Afterward, the child no longer has any way to access the parent's page and the new contents written there.

为了防止利用这种行为,提交 17839856fd58 使通过 GUP 获取对 COW 共享页面的引用(即使是只读引用)变得不可能。现在所有此类尝试都会破坏 COW 并返回对新副本的引用。因此,在上面的 PoC 代码中,调用 vmsplice() 现在会导致子进程用新页面替换对应页表项中的共享 COW 页面,然后将该页面传递给管道。之后,子进程将不再有任何方式访问父进程的页面及其上写入的新内容。

The commit notes the possibility of worse performance for some GUP users, especially those that rely on a lockless variant of the interface like get_user_pages_fast(). The changelog continues that finer-grained rules could be added later for situations where it is clear that it is safe to keep sharing the COW page because it can never be overwritten with new, potentially sensitive contents. The system-wide zero-page would be one example of this sort of situation. But otherwise, Linus Torvalds (the author of the change) expected no fundamental issues with this aggressively COW-breaking approach for GUP. Linux 5.8 was duly released with this commit.

该提交备注了某些 GUP 用户可能会遇到性能下降的情况,尤其是那些依赖无锁变体接口的用户,如 get_user_pages_fast() 。变更日志继续说明,未来可能会为那些明确可以安全共享 COW 页面的情况添加更细粒度的规则,因为这种页面永远不会被新内容(可能包含敏感信息)覆盖。系统级的零页就是这种情况的一个例子。但除此之外,Linus Torvalds(该变更的作者)预计这种激进打破 COW 的方法对 GUP 不会存在根本性问题。Linux 5.8 如期发布了这个提交。

And this, one might think, was the end of the problem. But, as was mentioned at the outset, COW is a complicated and subtle beast. In truth, the problems were just beginning. The second half of this article will delve into how the COW fix led to a stampede of new problems that still have yet to be completely solved.

而人们或许会认为,问题到此为止了。然而,正如一开始所提到的,COW 是一个复杂而微妙的存在。事实上,问题才刚刚开始。本文的后半部分将深入探讨 COW 修复如何引发了一系列新问题,这些问题至今仍未完全解决。

Part 2

Part 1 of this series described the copy-on-write (COW) mechanism used to avoid unnecessary copying of pages in memory, then went into the details of a bug in that mechanism that could result in the disclosure of sensitive data. A patch written by Linus Torvalds and merged for the 5.8 kernel appeared to fix that problem without unfortunate side effects elsewhere in the system. But COW is a complicated beast and surprises are not uncommon; this particular story was nowhere near as close to an end as had been thought.

本系列的第一个部分描述了用于避免内存中页面不必要复制的写时复制(COW)机制,然后深入探讨了该机制中的一个可能导致敏感数据泄露的漏洞。由林纳斯·托瓦尔德斯编写的补丁合并到 5.8 内核中,似乎解决了这个问题,而没有在系统的其他地方产生不幸的副作用。但 COW 是一个复杂的生物,意外并不少见;这个故事远没有人们最初认为的那么接近结束。

Torvalds's expectations quickly turned out to be overly optimistic. In August 2020, a bug was reported by Peter Xu; it affected userfaultfd(), which is a subsystem for handling page faults in a user-space process. This mechanism allows the handling process to (among other things) write-protect ranges of memory and be notified of attempts to write to that range. One use case for this feature is to prevent pages from being modified while the monitoring process writes their contents to secondary storage. That write can, however, result in a read-only get_user_pages() (GUP) call on the write-protected pages, which should be fine. Remember, though, that Torvalds's fix worked by changing read-only get_user_pages() calls to look like calls for write access; this was done to force the breaking of COW references on the pages in question. In the userfaultfd() case, that generates an unexpected write fault in the monitoring process, with the result that this process hangs.

托瓦尔德斯的期望很快被证明过于乐观。2020 年 8 月,彼得·徐报告了一个漏洞;该漏洞影响了 userfaultfd() ,这是一个用于处理用户空间进程页面错误的子系统。该机制允许处理进程(在其它事情中)写保护内存范围,并通知尝试写入该范围的尝试。这个特性的一个用例是防止在监控进程将它们的内容写入辅助存储时修改页面。然而,这种写操作可能导致在写保护的页面上进行只读 get_user_pages() (GUP)调用,这应该没问题。但请记住,托瓦尔德斯的修复是通过将只读 get_user_pages() 调用改为看起来像写访问的调用来工作的;这是为了强制在相关页面上打破 COW 引用。在 userfaultfd() 情况下,这会在监控进程中生成一个意外的写故障,结果导致该进程挂起。

The initial version of Xu's fix went in the direction of more fine-grained rules for breaking COW by GUP, as had been anticipated in the original fix, and added some userfaultfd()-specific handling. But during the discussion, Torvalds instead proposed a completely different approach, which resulted in another patch set from Xu. These patches essentially revert Torvalds's change and abandon the approach of always breaking COW for GUP calls. Instead, do_wp_page(), which handles write faults to a write-protected page, is modified by commit 09854ba94c6a ("mm: do_wp_page() simplification") to more strictly check if the page is shared by multiple processes.

徐的初始修复版本按照预期,朝着为 GUP 打破 COW 提供更细粒度的规则方向发展,并增加了一些针对 userfaultfd() 的特定处理。但在讨论过程中,托瓦尔德(Torvalds)提出了完全不同的方法,这导致了徐提交了另一组补丁。这些补丁基本上撤销了托瓦尔德的变更,并放弃了为 GUP 调用始终打破 COW 的方法。相反,处理写入受保护页面的 do_wp_page() 被提交 09854ba94c6a("mm: do_wp_page()简化")更严格地检查页面是否被多个进程共享。

With that commit in place, writes to a page without breaking COW are only allowed if the page is mapped by a single process and its reference count is not otherwise elevated (such as by an outstanding GUP). In the original vmsplice() PoC, the result is that the child process calling vmsplice() is again able to get the reference to the page shared with the parent, and retain the reference after unmapping the page from its own page tables. However, as soon as the parent tries to write to the page, the page-fault handler notices its elevated reference count and decides to break COW, giving the parent a new copy that the child cannot access. The idea and implementation are also simpler and should have performance benefits thanks to reduced page locking, as Torvalds expected and the Intel testing bot later confirmed.

有了这个提交,只有当页面由单个进程映射且其引用计数没有被其他方式提升(例如,通过未完成的 GUP)时,才允许写入页面而不破坏 COW。在原始的 vmsplice() PoC 中,结果是调用 vmsplice() 的子进程能够再次获取与父进程共享的页面的引用,并在从自己的页表取消映射页面后保留该引用。然而,一旦父进程尝试写入页面,页面异常处理程序注意到其引用计数被提升,并决定破坏 COW,给父进程一个子进程无法访问的新副本。这个想法和实现也更为简单,并且由于减少了页面锁定,应该会带来性能上的好处,正如 Torvalds 预期的那样,后来 Intel 测试机器人也证实了这一点。

Another bug was reported in September by Mikulas Patocka, who observed that running strace on a DAX filesystem (a filesystem providing direct access to the persistent memory on which it is stored) triggers a kernel warning and kills the traced process. That bug was apparently not fully analyzed for the root cause, but git bisect pointed to the same COW fix, and the patches intended to fix the userfaultfd() bug also fixed the strace bug.

又有另一个漏洞是在 9 月被 Mikulas Patocka 报告的,他观察到在 DAX 文件系统(一种可以直接访问其存储的持久内存的文件系统)上运行 strace 会触发内核警告并杀死被追踪的进程。这个漏洞显然没有完全分析其根本原因,但 git bisect 指向了相同的 COW 修复,而旨在修复 userfaultfd() 漏洞的补丁也修复了 strace 漏洞。

Trouble with RDMA RDMA 问题

At the time, apparently only Hugh Dickins was concerned about relying on the elevated reference count for this decision. He cited several examples where this kind of approach had created problems in the 2.6 days. His worries went unanswered, though, and the patch set was merged by Torvalds a few days later; it was released in Linux 5.9-rc5. But Dickins's worries turned out to be justified; Jason Gunthorpe reported just one day later that the new approach broke the RDMA self-tests.

当时,显然只有 Hugh Dickins 对这个决策依赖提升的引用计数表示担忧。他引用了几个在 2.6 时代这种做法造成问题的例子。然而,他的担忧没有得到回应,几天后 Torvalds 合并了补丁集;它发布在 Linux 5.9-rc5 中。但 Dickins 的担忧最终被证明是合理的;Jason Gunthorpe 在一天后报告说新的方法破坏了 RDMA 自检测试。

The problem this time appears to be that the RDMA self-test creates an anonymous private mapping and then calls a special form of GUP called pin_user_pages() which is, broadly speaking, a kernel interface allowing drivers (such as RDMA) to ensure that pages do not go away from under them while they perform data transfers to or from those pages. Then the self-test calls fork() to spawn a short-lived child. The child process does not actually touch the pages but, due to the fork() call, the page-table entries become write-protected for COW, and remain write-protected after the child exits. Further writes from the parent process to the pages are expected to modify the pages pinned for the RDMA operations. But, due to the elevated reference count, the writes result in an "unexpected" COW break, even if the process is the only one mapping the pages — a direct result of commit 09854ba94c6a.

这次的问题似乎在于 RDMA 自检创建了一个匿名的私有映射,然后调用了一种特殊的 GUP 形式 pin_user_pages() 。从广义上讲,这是一个内核接口,允许驱动程序(如 RDMA)在向或从这些页面传输数据时确保页面不会从它们下面消失。然后自检调用 fork() 来创建一个短期子进程。子进程实际上并没有接触这些页面,但由于 fork() 调用,页表项对 COW 变为写保护状态,并且在子进程退出后仍然保持写保护状态。父进程对这些页面的进一步写入预计会修改为 RDMA 操作保留的页面。但由于引用计数被提升,即使进程是唯一映射这些页面的进程,写入也会导致一个"意外"的 COW 中断——这是提交 09854ba94c6a 的直接结果。

Note that it would be possible to fix this test by calling madvise(MADV_DONTFORK) on the mapping submitted to RDMA, which would prevent that mapping from being included in the child's address space. That change would make the test more robust, because relying on the child process to exit quickly enough might be unreliable and lead to unexpected COW breaks even before commit 09854ba94c6a. However, there might be other programs that use RDMA and do not call madvise(MADV_DONTFORK) before fork(); even if these programs might not be fully robust, it would not be acceptable to break them with a kernel change. Also, as Gunthorpe noted, it is not easy to fix every RDMA (or page-pinning in general) user, even when one wants to.

请注意,可以通过在提交给 RDMA 的映射上调用 madvise(MADV_DONTFORK) 来修复此测试,这将阻止该映射被包含在子进程的地址空间中。这个改动会使测试更健壮,因为依赖子进程快速退出可能不可靠,并且可能导致在提交 09854ba94c6a 之前就出现意外的 COW 中断。然而,可能存在其他使用 RDMA 但在 fork() 之前不调用 madvise(MADV_DONTFORK) 的程序;即使这些程序可能不够健壮,使用内核更改来破坏它们也是不可接受的。此外,正如 Gunthorpe 指出的,即使想要修复,也很难修复所有 RDMA(或通用页面固定)用户。
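For readers unfamiliar with MADV_DONTFORK, here is a small user-space sketch of its effect. It is not the RDMA self-test; the function name is hypothetical, and the use of mincore() to probe the child's address space is just one convenient way to check for a mapping without touching it. The parent marks an anonymous buffer with MADV_DONTFORK, forks, and the child reports whether the buffer's address range is mapped at all.

```c
#define _DEFAULT_SOURCE          /* for MADV_DONTFORK, MAP_ANONYMOUS, mincore() */
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns 1 if a MADV_DONTFORK mapping is absent from
 * the child's address space, 0 if the child still has it, -1 on setup errors.
 */
int dontfork_mapping_absent_in_child(void)
{
    long page_size = sysconf(_SC_PAGESIZE);
    if (page_size <= 0)
        return -1;
    size_t len = (size_t)page_size;

    unsigned char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return -1;
    buf[0] = 1;  /* fault the page in, as a buffer handed to a device would be */

    if (madvise(buf, len, MADV_DONTFORK) != 0) {
        munmap(buf, len);
        return -1;
    }

    pid_t pid = fork();
    if (pid < 0) {
        munmap(buf, len);
        return -1;
    }
    if (pid == 0) {
        unsigned char vec;
        /* mincore() fails with ENOMEM when the range is not mapped. */
        int rc = mincore(buf, len, &vec);
        _exit(rc == -1 && errno == ENOMEM ? 0 : 1);
    }

    int status;
    int result = -1;
    if (waitpid(pid, &status, 0) == pid && WIFEXITED(status))
        result = (WEXITSTATUS(status) == 0);
    munmap(buf, len);
    return result;
}
```

Because the child never maps the buffer, fork() has no reason to write-protect the parent's page-table entries for it, which is why this call sidesteps the COW question for such buffers.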

The 5.9 kernel was late in the release-candidate phase at this point, so an urgent fix was needed; it eventually appeared in the form of yet another patch set from Xu. There was no fundamental change of approach this time. During a fork() call, if any pinned pages are encountered, the child will get a copy immediately instead of sharing a COW page with the parent. The test for a page being pinned is not exact and may have false-positive results if the page has a significantly increased reference count for other reasons, but copying a few more pages during fork() than is strictly needed should not hurt performance. To minimize the increased fork() overhead, a preparatory patch caused the pin_user_pages() call to mark the process with a flag indicating that the process has pinned some pages at some point; that allows the newly added checks for pinned pages to be skipped for all processes that do not call pin_user_pages() before fork(), which should be the majority of them.

此时,5.9 内核已处于发布候选阶段的后期,因此需要紧急修复;最终以 Xu 提交的又一个补丁集的形式出现。这次没有根本性的方法改变。在 fork() 调用期间,如果遇到任何被固定的页面,子进程将立即获得一份副本,而不是与父进程共享 COW 页面。对页面是否被固定的检测并不精确,如果页面由于其他原因引用计数显著升高,可能会出现误报,但在 fork() 期间比严格需要多复制几个页面应该不会影响性能。为了尽量减少 fork() 增加的开销,一个预备补丁让 pin_user_pages() 调用给进程打上一个标志,表明该进程曾在某个时刻固定过页面;这样,所有在 fork() 之前没有调用过 pin_user_pages() 的进程都可以跳过新增的固定页面检查,而这应该是大多数进程的情况。

With some minor follow-up fixes, the final 5.9 kernel was released in October with the RDMA issue addressed. More followup work was done to address a theoretical case where the parent process would perform page pinning in parallel with a fork() call; those patches were eventually merged for 5.11-rc1.

通过一些后续的小修复,最终 5.9 内核在 10 月份发布,RDMA 问题得到了解决。为了处理一个理论上的情况——即父进程在 fork() 调用时并行执行页面固定——进行了更多的后续工作,这些补丁最终被合并到 5.11-rc1 中。

An unwanted holiday present 一个不想要的假日礼物

That should have concluded our saga. Shortly before Christmas, though, Nadav Amit reported a userfaultfd() self-test failure that was eventually linked by Yu Zhao to commit 09854ba94c6a. Again, a write-protected page (created by a userfaultfd() operation) was being copied due to its elevated reference count, this time leaving another CPU with a stale TLB entry pointing to the original page. The rather complicated full scenario is described in the latest version of the fix, which notes that the missing TLB flush is actually an old bug, but the more aggressive COW breaking of commit 09854ba94c6a made it visible.

这应该结束了我们的故事。然而,圣诞节前夕,Nadav Amit 报告了一个 userfaultfd() 自检失败,最终由 Yu Zhao 将其关联到提交记录 09854ba94c6a。同样,由于高引用计数,一个受写保护的页面(由 userfaultfd() 操作创建)被复制,这次导致另一个 CPU 有一个指向原始页面的过时 TLB 条目。相当复杂的完整场景在最新版本的修复中进行了描述,该修复指出缺失的 TLB 刷新实际上是一个旧问题,但提交记录 09854ba94c6a 更激进的 COW 破坏使其变得可见。

A similar problem was reported by Zhao for the soft-dirty mechanism, which allows the monitoring (with low overhead but also low granularity) of which pages a process has written to. This is done by writing to the /proc/pid/clear_refs file, which causes all of the process's page-table entries to become write-protected. The page-fault handler then, in response to write faults, sets a soft-dirty bit in the page-table entry; these bits can be read from /proc/pid/pagemap to determine which pages were written to since the clear_refs operation.

赵报告了一个类似的问题,涉及软脏机制,该机制允许以低开销但粒度也较低的方式监控进程已写入的页面。这是通过写入 /proc/pid/clear_refs 文件实现的,这会导致进程的所有页表项变为写保护。随后,页错误处理程序在响应写错误时,在页表项中设置软脏位;这些位可以从 /proc/pid/pagemap 读取,以确定自 clear_refs 操作以来已写入的页面。
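The soft-dirty interface just described can be driven from user space roughly as follows. This is a hedged sketch: the helper names are invented, error handling is minimal, and it assumes the documented conventions that writing "4" to clear_refs resets the soft-dirty bits and that bit 55 of each 64-bit pagemap entry carries the soft-dirty flag. On kernels built without soft-dirty support the bit is not meaningful.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Bit 55 of a pagemap entry is the soft-dirty flag. */
int pagemap_entry_is_soft_dirty(uint64_t entry)
{
    return (int)((entry >> 55) & 1);
}

/* Writing "4" to clear_refs clears the soft-dirty bits of this process. */
int clear_soft_dirty_bits(void)
{
    int fd = open("/proc/self/clear_refs", O_WRONLY);
    if (fd < 0)
        return -1;
    ssize_t n = write(fd, "4", 1);
    close(fd);
    return n == 1 ? 0 : -1;
}

/* Returns 1 or 0 for the page containing addr, or -1 if pagemap is unusable. */
int page_is_soft_dirty(const void *addr)
{
    long page_size = sysconf(_SC_PAGESIZE);
    if (page_size <= 0)
        return -1;

    int fd = open("/proc/self/pagemap", O_RDONLY);
    if (fd < 0)
        return -1;

    /* pagemap holds one 64-bit entry per virtual page, indexed by page number. */
    uint64_t entry = 0;
    off_t offset = (off_t)((uintptr_t)addr / (uintptr_t)page_size)
                   * (off_t)sizeof(entry);
    ssize_t n = pread(fd, &entry, sizeof(entry), offset);
    close(fd);
    if (n != (ssize_t)sizeof(entry))
        return -1;
    return pagemap_entry_is_soft_dirty(entry);
}
```

A typical use would be to call clear_soft_dirty_bits(), let the program run for a while, and then call page_is_soft_dirty() on the buffers of interest; a write in between is what triggers the write fault that the article goes on to discuss.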

Investigation of these bugs led to a long discussion during which even more issues became apparent, and there was some disagreement about the best approach to fix them. As Zhao noted, the core issue is a race between two actions within the kernel:

对这些错误的调查引发了一场长时间的讨论,期间出现了更多问题,并且对于如何修复它们的最佳方法存在一些分歧。正如赵所指出的,核心问题是在内核中两个操作之间的竞争:

The page-fault handler copying pages (where it previously wouldn't have), and

页错误处理程序复制页面(这是它之前不会做的),以及

The associated page-table entries being modified by either change_pte_range(), which is used by userfaultfd(), or clear_soft_dirty(), which handles writes to /proc/pid/clear_refs.

相关的页表项被 change_pte_range() 修改, change_pte_range() 被 userfaultfd() 使用,或者被 clear_soft_dirty() 修改, clear_soft_dirty() 处理写入 /proc/pid/clear_refs 的操作。

Both of the above actions happen under the mmap_lock, but that lock is taken for reading, allowing the actions to proceed concurrently. Torvalds argued that these write-protecting operations should thus simply take the mmap_lock for writing. Andrea Arcangeli was, however, unhappy about the solution, citing the absence of write locking as one of the advantages of userfaultfd() over mprotect().

上述两种操作都在 mmap_lock 下进行,但这个锁是以读模式获取的,允许这些操作并发进行。Torvalds 认为,因此这些写保护操作应该简单地以写模式获取 mmap_lock 。然而,Andrea Arcangeli 对这一解决方案并不满意,他提到写锁的缺失是 userfaultfd() 相对于 mprotect() 的一个优势。

Arcangeli proposed a different fix for the userfaultfd() issue; it was more complicated, but avoided taking the mmap_lock for writing. However, in the cover letter, he also argued that the page-reference-count-based test for breaking COW was still problematic and that the issues being fixed were just "the tip of the iceberg", with more breakage to be expected. Thus, he said, it would be best if all commits merged so far to fix the original vmsplice() vulnerability were reverted, and vmsplice() should become a privileged operation until specifically fixed. He also echoed Hugh's earlier worries about the approach of breaking COW depending solely on the elevated reference count, and said that, if there were issues with the "GUP causes COW break" approach of commit 17839856fd58, they should have been fixed instead. As a result, he self-NAKed his own userfaultfd() fix.

Arcangeli 提出了一个不同的修复方案来解决 userfaultfd() 问题;这个方案更复杂,但避免了使用 mmap_lock 进行写操作。然而,在封面信中,他还认为基于页面引用计数的 COW 断裂测试仍然存在问题,并且正在修复的问题只是“冰山一角”,预计还会有更多问题出现。因此,他说,最好是撤销所有到目前为止合并的用于修复原始 vmsplice() 漏洞的提交,而 vmsplice() 应该成为一个特权操作,直到被特别修复。他还重申了 Hugh 之前对仅依赖提升引用计数来破坏 COW 方法的担忧,并表示,如果提交 17839856fd58 中“GUP 导致 COW 断裂”的方法存在问题,它们应该已经被修复了。结果,他否定了自己的 userfaultfd() 修复方案。

Torvalds, however, was not convinced; he noted that both approaches tried so far have corner cases, but the current one is conceptually much simpler for the core memory-management subsystem. He said that it would be better to stick to that one and deal with the exotic corner cases. Arcangeli then highlighted one concrete example of his worries by stating that the vmsplice() vulnerability becomes reproducible again after commit 09854ba94c6a if the PoC code is changed to use a transparent huge page (THP) instead of a base (4096-byte) page. He included the necessary patch for those who had access to the original reproducer, not aware that it was made public with the rest of the Project Zero issue. The author has verified that the patched PoC indeed reproduces the issue as of the 5.12-rc2 kernel.

然而,Torvalds 并不认同;他指出迄今为止尝试的两种方法都有边界情况,但当前的方法对核心内存管理子系统而言在概念上要简单得多。他说最好坚持这个方法,并处理那些罕见的边界情况。Arcangeli 随后用一个具体的例子来强调他的担忧:如果把 PoC 代码改为使用透明大页面(THP)而不是基础(4096 字节)页面,那么在提交 09854ba94c6a 之后, vmsplice() 漏洞又会变得可重现。他附上了必要的补丁,供那些能够拿到原始重现程序的人使用,却不知道该重现程序已随 Project Zero 问题的其余内容一同公开。作者已经验证,在 5.12-rc2 内核上,打过补丁的 PoC 确实能重现该问题。

Although the issue was acknowledged in subsequent discussion, it hasn't been fixed yet. The problem is that, while the code handling write faults on write-protected base pages relies on the page's reference count, the THP variant relies on page_trans_huge_mapcount(), which is equivalent to the page's mapping count. As explained earlier, the mapping count is equal to one after the child process unmaps the page from its own address space, even though the child retains access to the page through the vmsplice() pipe. Gunthorpe suggested that the GUP call performed by vmsplice() should be adjusted to break COW immediately; that approach is again similar to the original fix in commit 17839856fd58, but limited to a class of long-term pins by GUP, including those created by vmsplice(). So far, that idea doesn't seem to have been implemented; even if it were, there might be other, less obvious ways besides vmsplice() to exploit the underlying issue, although that concern might be just theoretical.

尽管该问题在后续讨论中得到了承认,但至今仍未得到修复。问题在于,处理写保护基础页面的写故障的代码依赖于页面的引用计数,而 THP 变体则依赖于 page_trans_huge_mapcount() ,它等同于页面的映射计数。如前所述,当子进程将其从自己的地址空间中取消映射页面后,映射计数等于一,尽管子进程仍通过 vmsplice() 管道访问该页面。Gunthorpe 建议 vmsplice() 执行的 GUP 调用应进行调整,以立即中断 COW——这种方法再次类似于提交记录 17839856fd58 中的原始修复,但仅限于由 GUP 创建的一类长期固定,包括 vmsplice() 创建的那些。到目前为止,该想法似乎尚未得到实现;即使它被实现,除了 vmsplice() 之外,可能还有其他、不太明显的方式来利用潜在问题,尽管这种担忧可能只是理论上的。

New rules 新规则

The last round of discussion so far (at least on the public mailing lists) occurred around the middle of January 2021. Arcangeli again proposed to effectively restore the state before Linux 5.8 and deal with the vmsplice() vulnerability in a different way. He included a PoC to demonstrate that, with the current page-reference-count-based approach for breaking COW, data loss can happen. A process that performs an O_DIRECT read() from a file to a buffer, while simultaneously writing (by the CPU) to a different buffer within the same page from another thread, might effectively lose the data being read if a third thread (or a different process) writes to the /proc/pid/clear_refs file. That constitutes an ABI break and, unlike the earlier TLB flush issues, is not fixed by taking mmap_lock for writing. He also reiterated the unfixed vmsplice() vulnerability with THPs.

到目前为止(至少在公开邮件列表上)的最后一轮讨论发生在 2021 年 1 月中旬。Arcangeli 再次提议有效恢复 Linux 5.8 之前的系统状态,并以不同的方式处理 vmsplice() 漏洞。他包含了一个 PoC 来证明,使用当前的基于页面引用计数的 COW 破坏方法,数据丢失是可能发生的。一个进程从文件执行 O_DIRECT read() 到缓冲区,同时 CPU 在同一页内的不同缓冲区中由另一个线程写入,如果第三个线程(或不同的进程)写入 /proc/pid/clear_refs 文件,正在读取的数据可能会有效丢失。这构成了 ABI 中断,与早期的 TLB 刷新问题不同,这不是通过采取 mmap_lock 写入来修复的。他还重申了 THPs 中未修复的 vmsplice() 漏洞。

Torvalds replied that, instead of the revert, the clear_refs implementation should be fixed and included a draft patch to that effect (he later merged this patch into 5.11-rc4 as commit 9348b73c2e1bf). The idea is that, if a page appears to be pinned, the clear_refs processing will simply not write-protect its page-table entry, and thus, later, the write from a CPU will not cause a page fault and COW break. Users of the soft-dirty mechanism will see the page as always dirty, which should be a fair result in the presence of DMA traffic that is allowed to write to the page. Then he expanded upon the general rules governing how to deal with pinning of pages for DMA transfers and write-protecting. They can be summarized as:

托瓦尔德斯回复说,与其回滚,不如修复 clear_refs 实现并包含一个草案补丁(他后来将这个补丁合并到 5.11-rc4 中,作为提交 9348b73c2e1bf)。这个想法是,如果页面看起来被固定了, clear_refs 处理将简单地不写保护其页表项,因此,稍后 CPU 的写操作不会导致页面错误和 COW 中断。使用 soft-dirty 机制的用户会看到页面始终是脏的,这在允许 DMA 流量写入页面的情况下应该是一个公平的结果。然后他扩展了关于如何处理 DMA 传输页面固定和写保护的通用规则。可以总结为:

When considering whether to just allow a write on a write-protected PTE, or to instead create a copy, and it is not certain that the process is the exclusive owner of the page, always create a copy. The elevated page reference count is an indication of not being an exclusive owner of the page.

在考虑是否只是允许在写保护的 PTE 上写入,或者创建一个副本,并且不确定进程是否是页面的唯一所有者时,始终创建一个副本。提升的页面引用计数表明不是页面的唯一所有者。

If the page is pinned with a cache-coherent GUP (such as for write DMA transfers) the page-table entry has to also be writable. It doesn't make sense to make it read-only if a DMA transfer can write to the page anyway.

如果页面被缓存一致性 GUP(例如用于写 DMA 传输)固定,那么页表项也必须是可写的。如果 DMA 传输可以写入页面,那么将其设为只读是没有意义的。

While these rules are conceptually simple, the devil is still in the details. What if the DMA transfers are only meant for reading? Gunthorpe mentioned a virtual-machine, live-migration scenario where the machine's memory is pinned by RDMA and then clear_refs is used, presumably to detect which pages have to be migrated again because their contents changed. If clear_refs processing refuses to write-protect the pinned pages and leaves them marked as dirty, this scheme becomes inefficient, as all of the virtual machine's memory will appear to be dirty at all times. And, unlike clear_refs, userfaultfd() has not been patched at all, so in combination with RDMA, unexpected COW breaks would occur instead.

虽然这些规则在概念上很简单,但魔鬼仍然藏在细节中。如果 DMA 传输仅用于读取呢?Gunthorpe 提到了一个虚拟机热迁移场景,其中机器的内存被 RDMA 固定,然后使用 clear_refs ,据推测是为了检测哪些页面因内容发生变化而需要再次迁移。如果 clear_refs 处理拒绝对被固定的页面进行写保护,并让它们保持脏标记,这种方案就会变得低效,因为虚拟机的所有内存在任何时候都会显示为脏。而且与 clear_refs 不同, userfaultfd() 完全没有被修补,因此与 RDMA 结合使用时,反而会发生意外的 COW 中断。
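
The live-migration inefficiency described above can be illustrated with a toy dirty tracker. This is a sketch, assuming a simplified model in which each page carries only a pin flag and a soft-dirty flag; `struct toy_tracked_page`, `toy_clear_refs()`, and `toy_count_dirty()` are hypothetical names and do not correspond to kernel functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-page state for a toy dirty tracker. */
struct toy_tracked_page {
    bool pinned;      /* assumed: page is pinned for RDMA */
    bool soft_dirty;  /* what the tracker reports to user space */
};

/*
 * Model of the patched clear_refs behavior: pinned pages are skipped,
 * so their dirty state is never reset.
 */
void toy_clear_refs(struct toy_tracked_page *pages, size_t n)
{
    assert(pages != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (!pages[i].pinned)
            pages[i].soft_dirty = false;
    }
}

/* Count the pages a migration pass would have to send again. */
size_t toy_count_dirty(const struct toy_tracked_page *pages, size_t n)
{
    size_t dirty = 0;

    assert(pages != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (pages[i].soft_dirty)
            dirty++;
    }
    return dirty;
}
```

If every page of the guest is pinned, `toy_count_dirty()` returns the full page count after each `toy_clear_refs()` call even when nothing was written, so each migration round would resend all of the memory.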

To fix these scenarios within the new COW rules, there might be better heuristics for determining exclusive ownership than a non-elevated page reference count or a writable PTE. Similarly, the test for whether a page is pinned can produce false positives; the function is called page_maybe_dma_pinned() after all. Several ideas were floated, such as adding a new page flag or counting DMA pins and long-term pins more precisely, with the sub-counters carved out of the bits used for the general reference count. Neither idea would be easy to implement: struct page is already tightly packed, and there are known attacks that inflate the reference count, so reducing the limit on the reference count would risk a denial-of-service attack. But the discussion seems to have wound down without any concrete patches. The last statement, in the last message of the thread, from David Hildenbrand, sums it up well: "Complicated problem :)"

要在新的 COW 规则中修复这些场景,可能存在比非提升页面引用计数或 PTE 可写性更好的启发式方法来确定独占所有权。类似地,检查页面是否被固定也可能出现假阳性,毕竟这个函数的名字就叫 page_maybe_dma_pinned() 。提出了几个想法,例如添加一个新的页面标志或更精确地统计 DMA 固定和长期固定的数量,其中子计数器将是从用于通用引用计数的位中划分出来的。考虑到 struct page 已经非常拥挤,以及存在已知攻击可以提升引用计数,从而在减少引用计数限制时可能导致拒绝服务攻击,这两种想法都很难实现。但讨论似乎已经平息,没有任何具体的补丁。邮件讨论串最后一条消息中 David Hildenbrand 的最后一句话很好地总结了这一点:"复杂的问题 :)"
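
The reason the pin test can only answer "maybe" becomes clearer with a small model of pins that share a field with ordinary references. This is a sketch, assuming a single counter in which each pin adds a fixed bias; the value 1024 mirrors what the kernel's GUP_PIN_COUNTING_BIAS is understood to be, and the `toy_` names are hypothetical.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Assumed bias added per pin; the toy name is not a kernel identifier. */
#define TOY_PIN_BIAS 1024u

/* Hypothetical page with one counter shared by references and pins. */
struct toy_refpage {
    unsigned int refcount;
};

/* An ordinary reference adds 1 to the shared counter. */
void toy_get_page(struct toy_refpage *page)
{
    assert(page->refcount < UINT_MAX);
    page->refcount += 1;
}

/* A DMA pin adds the whole bias to the same counter. */
void toy_pin_page(struct toy_refpage *page)
{
    assert(page->refcount <= UINT_MAX - TOY_PIN_BIAS);
    page->refcount += TOY_PIN_BIAS;
}

/*
 * The test cannot tell one pin from 1024 ordinary references,
 * hence "maybe": it may report a pin that does not exist.
 */
bool toy_maybe_dma_pinned(const struct toy_refpage *page)
{
    return page->refcount >= TOY_PIN_BIAS;
}
```

In this model a heavily referenced page is indistinguishable from a pinned one. Carving additional sub-counters out of the same field would leave fewer bits for ordinary references, which is the reference-count limit concern raised in the discussion.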

Are we done yet? 我们完成了吗?

So where are we now? In order to fix an information-leak vulnerability with arguably limited potential for exploitation outside of Android, the 5.8 kernel was released with a major change to the COW mechanism. In response to the bugs reported afterward, another major change was made in the 5.9 kernel and is now part of the 5.10 LTS series. More bugs have been fixed in 5.11 and a fix for the userfaultfd() TLB flushing issue seems to be on the way. However, some scenarios for using RDMA with either the soft-dirty or userfaultfd() mechanism may now be broken.

那么现在我们处于什么位置?为了修复一个在 Android 之外利用潜力可以说有限的信息泄露漏洞,5.8 内核发布时对 COW 机制进行了重大更改。由于随后报告的 bug,5.9 内核又进行了另一项重大更改,现在已成为 5.10 LTS 系列的一部分。5.11 中修复了更多 bug,而 userfaultfd() TLB 刷新问题的修复似乎即将到来。然而,将 RDMA 与 soft-dirty 或 userfaultfd() 机制结合使用的某些场景现在可能已被破坏。

The current COW implementation is based on sound principles, and hopefully the worst corner cases have now been ironed out. As a result, we may have arrived at a better and more future-proof model for the copy-on-write mechanism than we had before the 5.8 kernel. Yet, given the history of this area, it would not be at all surprising to see more bug reports pop up in the future.

当前的 COW 实现基于可靠的原理,希望最糟糕的边界情况现在已经得到解决。因此,作为结果,我们可能已经得到了比 5.8 内核之前更好的、更面向未来的写时复制机制模型。然而,考虑到这个领域的历史,未来出现更多漏洞报告一点也不令人意外。

Somewhat ironically, the original vulnerability that triggered the whole ordeal is still exploitable when transparent huge pages are in use. At this point, a fix targeted just for vmsplice() to break COW immediately might be the safest option, especially for backporting to older LTS kernels. However, the unfixed THP vulnerability might also be a sign that transparent huge pages do not actually follow the new COW model created for base pages, and if it's not feasible to adjust them (reference counting for THPs is a complicated topic on its own), this might not be the end of the story.

有点讽刺的是,当透明大页启用时,引发整个事件的最初漏洞仍然可被利用。此时,专门针对 vmsplice() 进行修复以立即打破 COW 可能是最安全的选择,特别是对于向后移植到旧版 LTS 内核的情况。然而,未修复的透明大页漏洞也可能表明透明大页并未真正遵循为基本页创建的新 COW 模型,如果调整它们不可行(透明大页的引用计数本身就是一个复杂话题),这或许并非故事的结局。