
March 16, 2026 · 10 min read
Multithreaded apps often face memory bottlenecks that limit performance as more threads or CPU cores are added. These issues can cause slowdowns, low CPU utilization, and poor scaling. Key challenges include lock contention, cache coherency problems, and memory fragmentation. Here's how you can address them:
Adding threads might seem like a straightforward way to improve performance, but it often leads to unexpected slowdowns. Threads can end up waiting, CPUs stay underutilized, and response times worsen. This happens due to three primary memory-related issues.
Many standard memory allocators — older libc mallocs and the Windows C runtime among them — guard their internal structures with a single global lock. That design forces threads to queue on every malloc() or free() call, and in heavily threaded applications this waiting can consume up to 98% of a thread's runtime.
"If your application does not scale on new multiprocessor, multicore, multithread hardware, the problem might be lock contention in the memory allocator."
- Rickey C. Weisner, Oracle Solaris Technical Expert
As more cores are added, the problem becomes even more pronounced. Instead of speeding up, applications using standard allocators on multiprocessor systems can slow down by a factor of 10. In some cases, CPU utilization may hover around 25%, even with hundreds of active threads, because so many threads are stuck waiting for the allocator.
Even so-called "lock-free" methods, like Compare-and-Swap (CAS), face challenges. When multiple threads attempt to update the same memory location simultaneously, CAS operations often fail, forcing threads into retry loops. For example, performing 1 million increments across four threads takes 0.73 seconds with a CAS retry loop, compared to just 0.23 seconds using a dedicated atomic fetch-and-add instruction.
But lock contention is only one piece of the puzzle. Cache coherency and memory bandwidth issues also play a major role in limiting performance.
Modern CPUs rely on the MESI protocol (Modified, Exclusive, Shared, Invalid) to keep caches synchronized across cores. When one core writes to a cache line, it must first invalidate that line in every other core's cache. When cores repeatedly contend for the same lines, this ping-pong of invalidations — often called cache thrashing — clogs the interconnect and drags down performance.
False sharing amplifies the problem. This occurs when threads modify different variables that happen to reside on the same 64-byte cache line. Since hardware can't track individual bytes, it invalidates the entire cache line for all cores. On a Zen4 system with 16 cores, having 32 threads access the same shared cache line can take 300 times longer than single-threaded access.
| Cache State (MESI) | Description |
|---|---|
| Modified | Data is in the cache and has been altered (dirty) |
| Exclusive | Data is in the cache and matches main memory (clean) |
| Shared | Data may exist in multiple caches and matches main memory |
| Invalid | Data is no longer valid and must be reloaded |
While cache-related issues are a significant bottleneck, another challenge lies in how memory is managed over time - fragmentation.
Fragmentation can quietly erode the performance of long-running applications. Even when gigabytes of RAM are technically free, scattered and non-contiguous memory blocks can render much of it unusable.
"Having free memory and having usable free memory are two completely different things."
- Sohail x Codes
Fragmentation forces memory allocators to work harder to find suitable blocks, wasting CPU cycles. It also spreads data across memory in a way that increases cache misses, further degrading performance. On Linux systems, fragmentation can even trigger "direct compaction", where the kernel pauses processes to defragment memory during allocation, leading to noticeable latency spikes.
Over time, these issues compound. Applications may experience slower response times, gradual performance drops, and allocation failures, even though total memory usage appears stable. Two types of fragmentation contribute to this problem: internal fragmentation, where allocated blocks are larger than the data they hold and the difference is wasted, and external fragmentation, where free memory is scattered into pieces too small to be usable.
Performance Comparison of Memory Allocators in Multithreaded Applications
Memory bottlenecks in multithreaded applications can be tackled with a combination of tools and techniques. The main strategies include using scalable memory allocators, optimizing thread scheduling, and ensuring balanced memory usage across threads.
Standard system allocators like glibc's ptmalloc often struggle in high-concurrency environments. Replacing them with scalable memory allocators can significantly improve performance - sometimes without requiring any code changes.
These allocators rely on thread-local or per-CPU caches to handle most memory operations without needing global locks. For instance, TCMalloc can execute a malloc/free operation in about 50 nanoseconds, compared to the 300 nanoseconds typical of ptmalloc2. In multithreaded scenarios, TCMalloc achieves 7–9 million operations per second for small allocations, while ptmalloc2 maxes out at around 4 million and drops below 1 million as the allocation size grows.
"TCMalloc is faster than the glibc 2.3 malloc... and other mallocs that I have tested."
- Sanjay Ghemawat, Fellow, Google
In a study conducted in July 2020, researcher Matthew A. Moreno profiled the DISHTINY software (version 53vgh) on Michigan State University's ICER lac-247 node. Comparing the default glibc allocator, Hoard, and Microsoft's mimalloc across 1, 4, and 16 threads, mimalloc consistently performed better at higher thread counts. It reduced runtime by 10–20% and scaled more effectively as synchronization overhead increased.
Swapping in a scalable allocator is often straightforward. Most can be integrated into applications using the LD_PRELOAD environment variable, pointing to the allocator's .so or .dll file. This approach avoids the need for recompilation.
| Allocator | Small Object Speed (malloc/free) | Primary Mechanism | Notes |
|---|---|---|---|
| TCMalloc | ~50 ns | Per-CPU/Thread Caches | Low contention, fast small-object handling |
| ptmalloc2 | ~300 ns | Per-thread arenas | Standard system default; prone to memory blowup |
| Hoard | N/A | Global/Private Heaps | Reduces false sharing and memory blowup |
| mimalloc | N/A | Thread-local shards | High performance at high thread counts |
While scalable allocators are a great starting point, reducing contention further requires smarter thread scheduling and backoff strategies.
Even with advanced allocators, threads can still compete for shared resources. Backoff techniques help by introducing small delays between access attempts, reducing the likelihood of cache line invalidations triggered by the MESI coherency protocol.
"Reducing demand to the shared memory location decreases memory contention and allows for other threads to make quicker progress. This improves aggregated throughput and fairness."
- Ola Liljedahl, Arm Community
For short delays, spin-waiting in user space is often more efficient than invoking operating system calls like usleep. On modern Arm architectures (v8.7+), the WFET instruction allows threads to enter a low-power state during backoff, improving energy efficiency while waiting.
Optimized thread scheduling also plays a significant role. Techniques like "Static Thread Initialization" and assigning tasks that fit within a thread's private L2 cache can lead to performance boosts of up to 2.6× and 1.47×, respectively.
To maintain consistent performance, balancing memory usage across threads is equally important. Thread-local storage (TLS) assigns dedicated memory pools to individual threads, enabling allocations and deallocations without synchronization locks.
Modern allocators like TCMalloc have evolved to support per-CPU caching instead of per-thread caches, which helps control memory overhead even as the number of threads increases.
Preventing false sharing is another critical step. By aligning data modified by different threads to separate cache lines (typically 64 bytes), you can avoid unnecessary performance penalties. In C++17, for instance, the alignas(std::hardware_destructive_interference_size) directive provides a portable way to prevent false sharing.
For cases where one thread allocates memory and another frees it, implementing "pending free requests lists" can minimize synchronization overhead during cross-thread deallocation. Additionally, grouping objects with similar lifetimes into dedicated memory pools reduces fragmentation and simplifies cleanup.
"The benefits of using a scalable memory allocator can easily be a 20-30% performance, and we have even heard of 4X program performance in extreme cases by simply relinking with a scalable memory allocator."
- Springer
Mobile apps operate under strict memory limitations - typically starting with 24–50 MB and peaking around 200–500 MB - making efficient memory management a top priority. On Android, the generational heap organizes objects by lifespan, which helps optimize the frequency of garbage collection.
To keep animations and interactions smooth at 60 FPS, each frame must render within 16.67 milliseconds. A mere 4-millisecond pause caused by garbage collection can eat up 25% of this time, leading to visible disruptions.
Frequent memory allocation and deallocation in performance-critical loops not only trigger garbage collections but also cause UI stutters and drain battery life. One effective solution is object pooling, which minimizes memory churn. However, this approach demands careful synchronization in multithreaded environments.
Android offers lifecycle-aware tools to further optimize memory usage. The onTrimMemory() callback allows apps to release non-essential resources - like bitmap caches or database connections - when the app is hidden or system resources are tight. Additionally, developers can use ActivityManager.getMemoryInfo() to check the device's memory state before starting resource-heavy tasks.
Replacing memory-heavy classes such as HashMap with Android-specific containers like SparseArray, SparseBooleanArray, or LongSparseArray can also cut down on autoboxing and unnecessary object creation. For cross-platform solutions like React Native, leveraging the JavaScript Interface (JSI) enables zero-copy data transfers (e.g., ArrayBuffers) between JavaScript and native C++ code, avoiding redundant memory usage.
While mobile apps face similar memory challenges as desktop apps, their tighter constraints demand more targeted solutions. Addressing these issues requires a mix of system-level optimizations and mobile-specific techniques to effectively manage multithreaded operations.
Languages and frameworks like Kotlin, Swift, and Android's WorkManager are designed to tackle the unique challenges of mobile development, offering tools to streamline multithreading and reduce memory demands. Kotlin's coroutines, Kotlin/Native's shared heap, Swift's Grand Central Dispatch, and Android's WorkManager provide efficient ways to handle asynchronous tasks while keeping memory usage in check.
Static dependency injection is another way to cut down runtime overhead. When it comes to data serialization, using "lite" versions of Protocol Buffers can significantly reduce both code size and memory consumption.
At Dots Mobile, we rely on frameworks like Swift, Kotlin, and Python to create high-performing, scalable apps that excel on resource-constrained devices. From the very start, our development process includes memory profiling to ensure that even AI-powered and complex mobile apps deliver seamless user experiences under heavy loads.
In multithreaded applications, memory bottlenecks can severely impact performance. For instance, threads using standard allocators may spend as much as 98% of their time waiting for the allocator's locks.
To address this, consider switching to scalable memory allocators like Hoard to minimize lock contention. Additionally, implementing memory pooling can cut allocation rates by 50–90%, while using cache alignment helps avoid false sharing.
"Memory optimization in C++ isn't about clever tricks. It's about understanding how modern hardware actually works, then structuring your code to cooperate with it rather than fight against it." - Jakub Jirák, Performance Engineer
By adopting multithread-optimized allocators, you can achieve performance boosts exceeding 40× on systems with many threads. Similarly, memory pooling can enhance throughput by 10–30% in applications with high allocation demands. On mobile devices running at 60 FPS, these optimizations can eliminate stutters and ensure smooth animations.
These strategies outline a clear path to improving memory performance in multithreaded environments.
Start by profiling your application with tools like plockstat, Java Flight Recorder, or Android Studio's Memory Profiler to identify memory bottlenecks. Once you've pinpointed the issue, apply targeted solutions such as thread-local allocation buffers, optimized data layouts, or NUMA-aware initialization.
Building high-performance apps requires a deep understanding of memory architecture and platform-specific constraints. At Dots Mobile, we incorporate memory profiling from the very beginning, using frameworks like Swift, Kotlin, and Python to develop applications that handle multithreaded workloads efficiently. Whether you're creating an AI-driven fitness app or a robust enterprise solution, collaborating with experts who understand these challenges ensures your app performs flawlessly on modern hardware.
To figure out if allocator lock contention is behind your slowdown, profile how much time threads spend waiting on locks during memory allocation and deallocation; tools like plockstat can attribute wait time to specific locks. If threads spend a large share of their runtime blocked inside malloc() or free(), the allocator itself is the root of the problem.
To spot false sharing, you can use tools designed to detect cache line conflicts or analyze cache misses with profiling tools. Addressing the issue involves aligning data to cache line boundaries or inserting padding to ensure threads don't share the same cache line. These adjustments can resolve performance problems linked to false sharing in multithreaded applications.
To cut down on frequent allocations and deallocations of short-lived objects, memory pooling can be a game-changer. By reusing objects instead of constantly creating and destroying them, you reduce the strain on the garbage collector and lower the risk of memory fragmentation. This approach shines in high-performance situations where temporary objects are created in large numbers.
On the other hand, switching allocators is a more tailored solution for managing memory. It allows you to fine-tune allocation strategies to match specific workloads. This can be particularly useful when dealing with large or unpredictable memory patterns, helping to optimize performance and reduce fragmentation at a deeper level.