
Commit 51fcce5

npiggin authored and gregkh committed
sparc64: remove mm_cpumask clearing to fix kthread_use_mm race
[ Upstream commit bafb056 ]

The de facto (and apparently uncommented) standard for using an mm had,
thanks to this code in sparc if nothing else, been that you must have a
reference on mm_users *and that reference must have been obtained with
mmget()*, i.e., from a thread with a reference to mm_users that had
used the mm.

The introduction of mmget_not_zero() in commit d2005e3 ("userfaultfd:
don't pin the user memory in userfaultfd_file_create()") allowed
mm_count holders to operate on user mappings asynchronously from the
actual threads using the mm, but they were not to load those mappings
into their TLB (i.e., walking vmas and page tables is okay,
kthread_use_mm() is not).

io_uring 2b188cc ("Add io_uring IO interface") added code which does a
kthread_use_mm() from a mmget_not_zero() refcount.

The problem with this is that code which previously assumed "mm ==
current->mm and mm->mm_users == 1 implies the mm will remain
single-threaded at least until this thread creates another mm_users
reference" has now broken.

arch/sparc/kernel/smp_64.c:

	if (atomic_read(&mm->mm_users) == 1) {
		cpumask_copy(mm_cpumask(mm), cpumask_of(cpu));
		goto local_flush_and_out;
	}

vs fs/io_uring.c:

	if (unlikely(!(ctx->flags & IORING_SETUP_SQPOLL) ||
		     !mmget_not_zero(ctx->sqo_mm)))
		return -EFAULT;
	kthread_use_mm(ctx->sqo_mm);

mmget_not_zero() could come in right after the mm_users == 1 test, then
kthread_use_mm() which sets its CPU in the mm_cpumask. That update
could be lost if cpumask_copy() occurs afterward.

I propose we fix this by allowing mmget_not_zero() to be a first-class
reference, and not have this obscure undocumented and unchecked
restriction.

The basic fix for sparc64 is to remove its mm_cpumask clearing code.
The optimisation could be effectively restored by sending IPIs to
mm_cpumask members and having them remove themselves from mm_cpumask.
This is more tricky, so I leave it as an exercise for someone with a
sparc64 SMP machine. powerpc has a (currently similarly broken)
example.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200914045219.3736466-4-npiggin@gmail.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
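A minimal sketch of the racy interleaving described in the message
above (CPU numbers are illustrative, not taken from the commit):

	CPU0: task using mm                CPU1: io_uring SQPOLL kthread
	-------------------                -----------------------------
	smp_flush_tlb_mm(mm)
	  atomic_read(&mm->mm_users) == 1
	    /* true, take fast path */
	                                   mmget_not_zero(ctx->sqo_mm)
	                                   kthread_use_mm(ctx->sqo_mm)
	                                     /* sets CPU1's bit in
	                                        mm_cpumask(mm) */
	  cpumask_copy(mm_cpumask(mm),
	               cpumask_of(0))
	    /* mask overwritten: CPU1's bit is lost, so later
	       flushes skip CPU1 and its stale TLB entries survive */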
1 parent e3783eb commit 51fcce5

1 file changed: 14 additions & 51 deletions

arch/sparc/kernel/smp_64.c
@@ -1039,38 +1039,9 @@ void smp_fetch_global_pmu(void)
  * are flush_tlb_*() routines, and these run after flush_cache_*()
  * which performs the flushw.
  *
- * The SMP TLB coherency scheme we use works as follows:
- *
- * 1) mm->cpu_vm_mask is a bit mask of which cpus an address
- *    space has (potentially) executed on, this is the heuristic
- *    we use to avoid doing cross calls.
- *
- *    Also, for flushing from kswapd and also for clones, we
- *    use cpu_vm_mask as the list of cpus to make run the TLB.
- *
- * 2) TLB context numbers are shared globally across all processors
- *    in the system, this allows us to play several games to avoid
- *    cross calls.
- *
- *    One invariant is that when a cpu switches to a process, and
- *    that processes tsk->active_mm->cpu_vm_mask does not have the
- *    current cpu's bit set, that tlb context is flushed locally.
- *
- *    If the address space is non-shared (ie. mm->count == 1) we avoid
- *    cross calls when we want to flush the currently running process's
- *    tlb state.  This is done by clearing all cpu bits except the current
- *    processor's in current->mm->cpu_vm_mask and performing the
- *    flush locally only.  This will force any subsequent cpus which run
- *    this task to flush the context from the local tlb if the process
- *    migrates to another cpu (again).
- *
- * 3) For shared address spaces (threads) and swapping we bite the
- *    bullet for most cases and perform the cross call (but only to
- *    the cpus listed in cpu_vm_mask).
- *
- * The performance gain from "optimizing" away the cross call for threads is
- * questionable (in theory the big win for threads is the massive sharing of
- * address space state across processors).
+ * mm->cpu_vm_mask is a bit mask of which cpus an address
+ * space has (potentially) executed on, this is the heuristic
+ * we use to limit cross calls.
  */

 /* This currently is only used by the hugetlb arch pre-fault
@@ -1080,18 +1051,13 @@ void smp_fetch_global_pmu(void)
 void smp_flush_tlb_mm(struct mm_struct *mm)
 {
 	u32 ctx = CTX_HWBITS(mm->context);
-	int cpu = get_cpu();

-	if (atomic_read(&mm->mm_users) == 1) {
-		cpumask_copy(mm_cpumask(mm), cpumask_of(cpu));
-		goto local_flush_and_out;
-	}
+	get_cpu();

 	smp_cross_call_masked(&xcall_flush_tlb_mm,
 			      ctx, 0, 0,
 			      mm_cpumask(mm));

-local_flush_and_out:
 	__flush_tlb_mm(ctx, SECONDARY_CONTEXT);

 	put_cpu();
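For readability, here is smp_flush_tlb_mm() as it reads after this
hunk, assembled from the context and + lines above: the local CPU
always flushes its own TLB, the cross call always goes to every CPU
recorded in mm_cpumask(mm), and no path rewrites the mask anymore.

	void smp_flush_tlb_mm(struct mm_struct *mm)
	{
		u32 ctx = CTX_HWBITS(mm->context);

		get_cpu();

		smp_cross_call_masked(&xcall_flush_tlb_mm,
				      ctx, 0, 0,
				      mm_cpumask(mm));

		__flush_tlb_mm(ctx, SECONDARY_CONTEXT);

		put_cpu();
	}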
@@ -1114,17 +1080,15 @@ void smp_flush_tlb_pending(struct mm_struct *mm, unsigned long nr, unsigned long
 {
 	u32 ctx = CTX_HWBITS(mm->context);
 	struct tlb_pending_info info;
-	int cpu = get_cpu();
+
+	get_cpu();

 	info.ctx = ctx;
 	info.nr = nr;
 	info.vaddrs = vaddrs;

-	if (mm == current->mm && atomic_read(&mm->mm_users) == 1)
-		cpumask_copy(mm_cpumask(mm), cpumask_of(cpu));
-	else
-		smp_call_function_many(mm_cpumask(mm), tlb_pending_func,
-				       &info, 1);
+	smp_call_function_many(mm_cpumask(mm), tlb_pending_func,
+			       &info, 1);

 	__flush_tlb_pending(ctx, nr, vaddrs);

@@ -1134,14 +1098,13 @@ void smp_flush_tlb_pending(struct mm_struct *mm, unsigned long nr, unsigned long
 void smp_flush_tlb_page(struct mm_struct *mm, unsigned long vaddr)
 {
 	unsigned long context = CTX_HWBITS(mm->context);
-	int cpu = get_cpu();

-	if (mm == current->mm && atomic_read(&mm->mm_users) == 1)
-		cpumask_copy(mm_cpumask(mm), cpumask_of(cpu));
-	else
-		smp_cross_call_masked(&xcall_flush_tlb_page,
-				      context, vaddr, 0,
-				      mm_cpumask(mm));
+	get_cpu();
+
+	smp_cross_call_masked(&xcall_flush_tlb_page,
+			      context, vaddr, 0,
+			      mm_cpumask(mm));
+
 	__flush_tlb_page(context, vaddr);

 	put_cpu();
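The commit message leaves restoring the optimisation as an exercise:
send IPIs to the mm_cpumask members and have each remove itself from
the mask. A minimal sketch of that idea follows; tlb_mm_trim_func() is
an invented name, and this is untested illustration (assuming the
surrounding smp_64.c context), not part of this patch.

	/* Hypothetical: each CPU flushes and clears only its own bit,
	 * so a concurrent kthread_use_mm() setting another CPU's bit
	 * can no longer be overwritten. */
	static void tlb_mm_trim_func(void *arg)
	{
		struct mm_struct *mm = arg;

		__flush_tlb_mm(CTX_HWBITS(mm->context), SECONDARY_CONTEXT);

		if (current->active_mm != mm)
			cpumask_clear_cpu(smp_processor_id(), mm_cpumask(mm));
	}

	/* Caller side, e.g. from smp_flush_tlb_mm():
	 *	on_each_cpu_mask(mm_cpumask(mm), tlb_mm_trim_func, mm, 1);
	 */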
