🤨Memory Leak Analysis - large objects heap inside CLR 6

tags
windbg
postmortem
.net
type
Post
summary
status
Published
slug
large-objects-heap-inside-of-clr
date
May 12, 2023
This article provides a memory leak analysis sample, which well explains LOH heap recycle opreation designment.
One of my customers observed their web application hosting on an IIS server facing memory leak issue - although they are using a powerful machine to host the application, when inbound requests volume increasing fastly, they can see obvious memory leak issue in Task Manager.
The w3wp.exe - IIS worker process allocated memory volume will increase 1~2 GB in several minutes, and never be released. This symptom triggers GC frequently, requests processing time increasing quickly as well. Sometiem users may need to wait over 1~2 mins for simple actions on the page.
They collect full user dumps at the first time and ask us to help analysis the root cause of the slowness.

Dump Analysis

First, we use !ThreadPool command to take a galance of the perfomace indicators.
notion image
Ok, the CPU utilization is not quiet high, how about the threads status?
notion image
We can just focus on the thread which is costing most of CPU time. The thread was suspended, and is waiting for the GC processing.
In the CLR 6, the server garbage collection can be non-concurrent or background. Concurrent (or background) garbage collection enables managed threads to continue operations during a garbage collection.
Actually I was curious with their machine performance when first see this result, in most of situations CPU utilization could raise to 90% on GC stage, especially for the full GC - recycle Gen 2 virtual memory.
Then I investigate into the memory configuration of the issue status.
0:138> !eeheap -GC Number of GC Heaps: 24 ------------------------------ Heap 0 (000001FD2479A010) generation 0 starts at 0x000001FD2626FBB8 generation 1 starts at 0x000001FD26212738 generation 2 starts at 0x000001FD24DA1000 ephemeral segment allocation context: none segment begin allocated committed allocated size committed size 000001FD24DA0000 000001FD24DA1000 000001FD2649B378 000001FD26F80000 0x16fa378(24093560) 0x21df000(35516416) Large object heap starts at 0x000001FD64DA1000 segment begin allocated committed allocated size committed size 000001FD64DA0000 000001FD64DA1000 000001FD6C311780 000001FD6C312000 0x7570780(123144064) 0x7571000(123146240) Pinned object heap starts at 0x000001FD74DA1000 000001FD74DA0000 000001FD74DA1000 000001FD74EB1CA0 000001FD74EB2000 0x110ca0(1117344) 0x111000(1118208) Allocated Heap Size: Size: 0x8d7b798 (148354968) bytes. Committed Heap Size: Size: 0x9861000 (159780864) bytes. ... TL;DR ------------------------------ Heap 23 (000002063C14F4E0) generation 0 starts at 0x00000205C65B82D0 generation 1 starts at 0x00000205C64BCC88 generation 2 starts at 0x00000205C4DA1000 ephemeral segment allocation context: none segment begin allocated committed allocated size committed size 00000205C4DA0000 00000205C4DA1000 00000205C69702E8 00000205C7237000 0x1bcf2e8(29160168) 0x2496000(38363136) Large object heap starts at 0x0000020604DA1000 segment begin allocated committed allocated size committed size 0000020604DA0000 0000020604DA1000 000002060BAA9928 000002060BAAA000 0x6d08928(114329896) 0x6d09000(114331648) Pinned object heap starts at 0x0000020614DA1000 0000020614DA0000 0000020614DA1000 0000020614DA1018 0000020614DA2000 0x18(24) 0x1000(4096) Allocated Heap Size: Size: 0x88d7c28 (143490088) bytes. Committed Heap Size: Size: 0x91a0000 (152698880) bytes. ------------------------------ GC Allocated Heap Size: Size: 0xbcd8d9c0 (3168328128) bytes. GC Committed Heap Size: Size: 0xcd9ce000 (3449610240) bytes.
😅 A powerful machine, huh? In Windows Server, each CPU physical core will execute a GC thread manage a part of heap, there are 24 heaps - that mean their server has 24 cores. I believe this is enough for most of web apps.
💡
A heap and a dedicated thread to perform garbage collection are provided for each logical CPU in Windows Server, and the heaps are collected at the same time. Each heap contains a small object heap and a large object heap, and all heaps can be accessed by user code. Objects on different heaps can refer to each other.
We can pick one of the heaps and investigate the memory configuration on it. I choose Heap 23 , which the LOH between 0000020604DA1000 and 000002060BAA9928 (highlighted above), as it’s the largest section on the heap.
0:138> !dumpheap -stat 0000020604DA1000 000002060BAA9928 Statistics: MT Count TotalSize Class Name 00007ff8a72b4348 1 116752 System.Reflection.RuntimeMethodInfo[] 00007ff8aa130968 1 131096 MOF.Etimad.Monafasat.Core.Entities.Invitation[] 00007ff8a714b578 1 131096 System.Object[] 000001fd244ede20 94 218496 Free 00007ff8a910e0d8 10 1966320 System.Int64[] 00007ff8a71fd698 27 12048974 System.String 00007ff8a74f3650 18 14931830 System.Byte[] 00007ff8a72971b8 35 84785176 System.Char[] Total 187 objects
Fortunately, the symptom is very clear here, 35 pieces of System.Char arrays are using almost 84MB memory. In the CLR, object size over 85,000 bytes will be taken as a large object and will be archived in large object heap.
Usually, we can use !gcroot to print the object’s GC root, which could point out the method refering the object in program. I tried for several System.Char[] objects but none of them has reference.
That’s not a good signal, no roots does not mean the LOH memory could be collected directly. Because the LOH and generation 2 are collected together, if either one's threshold is exceeded, a generation 2 collection is triggered. If a generation 2 collection is triggered because of the LOH, generation 2 won't necessarily be much smaller after the GC. When there is not much data on Gen 2, this has minimal impact, but if Gen 2 is large, it can cause performance problems if many generation 2 GCs are triggered.
0:138> !gchi GC Heap statistics: Heap Gen0 Gen1 Gen2 LOH ====== =========== ========== =========== ============= Heap0 2,275,264 382,080 21,436,216 123,144,064 Heap1 1,069,536 546,008 18,466,488 61,278,560 Heap2 2,967,056 104,392 20,732,216 114,120,600 Heap3 13,539,056 571,488 22,350,912 47,636,992 Heap4 3,915,584 194,576 17,859,008 48,013,672 Heap5 53,000,992 411,160 19,776,080 41,229,264 Heap6 2,969,664 169,848 18,486,360 132,603,512 Heap7 9,984,640 864,672 21,995,016 57,982,760 Heap8 7,993,704 252,360 19,683,312 101,794,224 Heap9 45,963,704 755,632 18,992,280 108,969,672 Heap10 14,638,752 945,288 20,602,672 102,033,408 Heap11 3,205,504 498,176 22,882,064 182,954,584 Heap12 204,824 312,016 18,824,928 57,607,608 Heap13 21,562,216 438,464 19,549,528 135,356,360 Heap14 3,169,744 249,008 19,721,512 191,905,136 Heap15 3,326,176 368,768 18,584,456 74,043,776 Heap16 13,389,640 864,416 23,784,952 133,585,920 Heap17 41,121,752 919,688 20,857,128 47,720,680 Heap18 84,615,544 536,216 19,535,024 101,803,680 Heap19 5,924,520 604,760 20,491,200 68,423,824 Heap20 7,246,200 490,696 19,832,072 60,752,256 Heap21 5,828,016 484,168 22,270,272 81,899,416 Heap22 3,519,296 1,696,032 24,329,448 110,917,536 Heap23 3,899,416 1,029,704 24,231,048 114,329,896 Total 355,330,800 13,689,616 495,274,192 2,300,107,400 [Segments] [Free Space Fragmentation]
But actually the Gen2 size is very big in here 😅 If many large objects are allocated on a temporary basis and you have a large SOH, you could be spending too much time doing GCs.
This could be the immediate cause of the slowness issue. But how could we identify the root cause in the application and fix it in the application code? We cannot use !gcroot to trace the reference methods, as the large objects here are totally no roots…

Identify the root cause in application

Another way is analysis the objects contents from business logic sides, I was suspecting the System.Char[] objects are used to save some large files. After reviewing several objects boringly I understand what they are.
0:138> !do 00000206091cbca8 Name: System.Char[] MethodTable: 00007ff8a72971b8 EEClass: 00007ff8a7297138 Tracked Type: false Size: 33554456(0x2000018) bytes Array: Rank 1, Number of elements 16777216, Type Char (Print Array) Content: MIMETypeField":".PDF","FileLengthField":"12416876","FileVersionField":null,"ClassField":null,"FolderIdField":null}.BvYmoKPDwgCi9 Fields: None
All of these objects are PDFs 🫠 I really cannot understand why they load the entire PDF in memory without dispose it explitly. I was thought the object lifecycle management is very easy in csharp - you just need to implement the Dispose() method, or encapsure the file operation in using statement.
I also find a library they are using to HTML/PDF transferring is obsoleted as well.
0:138> lm start end module name ... 00000206`3ef70000 00000206`3ef8c000 Swashbuckle_AspNetCore_SwaggerGen (deferred) 00000206`3ef90000 00000206`3efa0000 DinkToPdf (deferred) 00000206`3efa0000 00000206`3f12a000 Hangfire_Core (deferred) ...
I cannot hundred percent to make sure this DinkToPdf library has some issue with its implemetation, but I think there could be some reasons that the owner highlighted this library was not tested with IIS. 🤪
notion image
Thus I suggest the customer start using ASP.NET Core Out-of-Process hosting model on the server, and the memory leak issue does not happen again…
These kind of issues could be detected not only in .NET but also Java. To save the time of postmorterm analysis, please apply some basic knowledge including but not limited to:
  1. Finish what you start when implement some IO opreations. Like close the DB/HTTP connections after using it, and release the occupied memory what the files operations finished.
  1. Stop using incorrect ways to contact strings. Don’t use += for the long contects. Use StringBuilder for such use cases.

Reference


 

lucky_bricks © 2018 - 2024