<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<meta http-equiv="X-UA-Compatible" content="ie=edge">
<title>PerfView User's Guide</title>
<style>
body {
font-family: Segoe UI,SegoeUI,Segoe WP,Helvetica Neue,Helvetica,Tahoma,Arial,sans-serif;
font-weight: 400;
text-rendering: optimizeLegibility;
-webkit-font-smoothing: antialiased;
}
hr {
border-top: 3px double gray;
}
</style>
</head>
<body>
<!-- ************************************************************************************* -->
<h1>
<a id="UsersGuide">PerfView User's Guide</a>
</h1>
<p>
PerfView is a tool for quickly and easily collecting and viewing both time and memory
performance data. PerfView uses the <a href="http://msdn.microsoft.com/en-us/library/bb968803(v=VS.85).aspx">
Event Tracing for Windows (ETW)
</a> feature of the operating system, which can
collect information machine wide about a variety of useful events as described in the
<a href="#AdvancedCollection">advanced collection</a> section. ETW is the same powerful
technology the <strong>Windows performance group uses almost exclusively</strong>
to track and understand the performance of Windows, and is the basis for their
<a href="http://msdn.microsoft.com/en-us/performance/default.aspx">Xperf</a> tool.
PerfView can be thought of as a simplified and user friendly version
of that tool. In addition, PerfView has the ability to collect .NET GC heap information
for doing memory investigations (even for very large GC heaps). PerfView's ability
to decode .NET symbolic information as well as the GC heap makes <strong>
PerfView ideal
for managed code investigations
</strong>.
</p>
<p>
<strong>Deploying and Using PerfView</strong>
</p>
<p>
PerfView was designed to be easy to deploy and use. To deploy PerfView
simply copy the PerfView.exe to the computer you wish to use it on.
No additional files or installation step is needed. PerfView features
are 'self-discoverable'. The initial display is a 'quick
start' guide that leads you through collecting and viewing your first set of
profile data. There is also a built in <a href="#Tutorial">tutorial</a>. Hovering
the mouse over most GUI controls will give you short explanations, and hyperlinks
send you to the most appropriate part of this user's guide. Finally, PerfView is
'right click enabled', which means that when you want to manipulate data in some
way, right clicking shows you what PerfView can do for you.
</p>
<p>
PerfView is a V4.6.2 .NET application. Thus you need to have a V4.6.2 .NET runtime
installed on the machine on which you actually run PerfView.
Windows 10 and Windows Server 2016 already include .NET V4.6.2.
On other supported operating systems you can install .NET 4.6.2 from the standalone installer. PerfView is not supported
on Win2K3 or WinXP. While PerfView itself needs a V4.6.2 runtime,
it can collect data on processes that use the V2.0 and V4.0 runtimes. On machines that don't
have V4.6.2 or later of the .NET runtime installed, it is also possible to collect ETL data
with another tool (e.g. XPERF or PerfMonitor) and then copy the data file to a machine
with V4.6.2 and view it with PerfView.
</p>
<p>
<a id="WhatPerfViewCanDoForYou"><strong>What can PerfView do for you?</strong></a>
</p>
<p>
PerfView was designed to collect and analyze both time and memory scenarios.
</p>
<ol>
<li>
<strong><a id="CPUInvestigation">CPU Investigation</a>:</strong> One of the more useful events
(and one that is turned on
by default) is the 'profile' sampling event. This event samples
the instruction pointer of each of the machine's CPUs every millisecond.
Each sample captures the complete call stack of the thread currently executing, giving
very detailed and useful information about what that thread was doing at both high
and low levels of abstraction. PerfView aggregates these stack traces
and presents them in a <a href="#StackViewer">stack viewer</a> that has powerful
grouping operations that make understanding this data significantly simpler than
in most profilers. If your application's performance problem
is associated with excessive CPU usage, then PerfView will tell you that and give
you the tools you need to understand exactly what portion of your application is
misbehaving. See <a href="#StartingAnAnalysis">Starting a CPU Analysis</a> for more.
</li>
<li>
<strong>Managed Memory Investigations:</strong> PerfView also has the ability to take a snapshot
of the .NET GC heap. Because these heaps can be very large, PerfView allows control
over how large of a sample is taken, and goes to some trouble to take a representative
sample if the heap is too big to capture in its entirety. It then converts the graph
of objects in the heap into a tree, and displays this in the same <a href="#StackViewer">stack viewer</a>
that was used for CPU investigations. See <a href="#InvestigatingMemoryData">Investigating Memory</a> and
<a href="#StartingAnAnalysisGCHeap">Starting a GC Heap Analysis</a> for more
</li>
<li>
<strong>Response Time Investigations: </strong> By collecting with the 'ThreadTime' option
enough information is collected so that PerfView can measure what every
thread (blocked or not) is doing, gather all the thread time associated with every request, and display
it as a tree. This is what the 'Thread Time (with Start-Stop Activities)' view shows.
See <a href="#MakingServerInvestigationEasy">Making Server Investigation Easy</a> for more.
</li>
<li>
<strong>Wall Clock / Blocked Time Investigations:</strong> If your program is too slow, but it is not consuming
excessive CPU, then it must be blocked waiting on something else (disk, network,
...). PerfView can instruct the OS to log events whenever threads sleep or wake
up, and has a display for visualizing where your program is waiting.
See <a href="#BlockedTimeInvestigation">Blocked / Wall Clock Time Investigation</a> for more.
</li>
<li>
<strong>Unmanaged Memory Investigations:</strong> You can also turn on events every time the OS heap memory
allocator allocates or frees an object. Using these events you can see what call
stacks are responsible for the most net unmanaged memory allocations.
See <a href="#InvestigatingMemoryData">Investigating Memory</a> and
<a href="#UnmanagedMemoryAnalysis">Unmanaged Heap Analysis</a> for more.
</li>
<li>
<strong>Linux CPU Investigations:</strong> PerfView has the ability to read the output of the Linux 'Perf Events'
collector that is built into the Linux kernel.
See <a href="#ViewingLinuxData">Viewing Linux Data</a> for more.
</li>
<li>
<strong>Viewing your own hierarchical data in PerfView's stack viewer:</strong> PerfView's stack viewer is
powerful, but it is also very flexible. PerfView defines a very simple XML or JSON format that
it can read into this viewer. This allows you to easily generate data that you can then
view in PerfView's powerful stack viewer.
See <a href="#ViewingExternalData">Viewing External Data</a> for more.
</li>
</ol>
<p>
See also <a href="#ReferenceGuide">PerfView Reference Guide</a>.
</p>
<hr />
<!-- **************** -->
<h3>
<a id="Feedback">Sending feedback / Asking Questions about PerfView</a>
</h3>
<p>
Hopefully the documentation does a reasonably good job of answering your most common
questions about PerfView and performance investigation in general. If you have a
question, you should certainly start by searching the user's guide for information.
</p>
<p>
Inevitably however, there will be questions that the docs don't answer, or features
you would like to have that don't yet exist, or bugs you want to report. PerfView
is a <a href="https://github.com/Microsoft/perfview">GitHub open source project</a>
and you should log questions, bugs or other feedback at
</p>
<center>
<a href="https://github.com/Microsoft/perfview/issues"> PerfView Issues </a>
</center>
<p>
If you are just asking a question there is a Label called 'Question' that you can
use to indicate that. If it is a bug, it REALLY helps if you supply enough information
to reproduce the bug. Typically this includes the data file you are operating on.
You can drag small files into the issue itself, however more likely you will need
to put the data file in the cloud somewhere and refer to it in the issue. Finally
if you are making a suggestion, the more specific you can be the better. Large features
are much less likely to ever be implemented unless you yourself help with the implementation.
Please keep that in mind.
</p>
<!-- **************** -->
<hr />
<h3>
<a id="LatestVersion">Getting the latest version of PerfView</a>
</h3>
<p>
You can get the latest version of PerfView by going to the <a href="https://github.com/Microsoft/perfview/blob/main/documentation/Downloading.md">PerfView GitHub Download Page</a>
</p>
<hr />
<hr />
<!-- ********************************** -->
<h2>
<a id="Tutorial">Tutorial of a Time-Based Investigation</a>
</h2>
<p>
See Also <a href="#TutorialGCHeap">Tutorial of a GC Heap Memory Investigation</a>
</p>
<p>
Perhaps the best way to get started is to simply try out the tutorial example.
On Windows 7 it is recommended that you dock your help window as described in <a href="#HelpTip">help tips</a>.
PerfView comes with two tutorial examples 'built in'.
Also, we strongly suggest that any application you write have a performance plan as
described in <a href="http://msdn.microsoft.com/en-us/magazine/cc500596.aspx">part1</a>
and <a href="http://msdn.microsoft.com/en-us/magazine/cc507639.aspx">part2</a> of
<a href="http://msdn.microsoft.com/en-us/magazine/cc500596.aspx">
Measure Early and Often
for Performance
</a>.
</p>
<ol>
<li>
<strong>Tutorial.exe </strong>- A simple program that calls 'DateTime.Now'
repeatedly until it detects that 5 seconds have gone by. To make this example
more interesting, it does this using two mutually recursive methods (RecSpin, and
RecSpinHelper). Each of these helpers spins for a second and then calls the
other helper to spin for the rest of the time. See <a href="Tutorial.cs.txt">Tutorial.cs</a>
for the complete source.
</li>
</ol>
<p>
To run the 'Tutorial' example:
</p>
<ol>
<li>
Click on the 'Run a command' hyperlink on the main page. This will
bring up a dialog asking for the command to run and the name of the data file to create.
</li>
<li>Enter 'Tutorial.exe' in the 'command' text box and hit &lt;Enter&gt;. </li>
<li>
Unless you started PerfView from an elevated environment, the operating system will
bring up a User Account Control prompt to run as administrator (collecting profile data
is a privileged activity). Click OK to accept.
</li>
<li>
At this point it will begin running the command. The Status bar will blink
to indicate that it is working on your command. You can monitor its
progress by hitting the 'Log' button in the lower right corner. After
it has completed it brings up a process selection dialog box. PerfView is asking
which process you are focused on. In this case we are interested in the 'Tutorial'
process, so we should select that. If you are interested in all processes there is
a button for that too.
</li>
</ol>
<p>
You can also run the tutorial example by typing '<strong>PerfView run tutorial</strong>'
at the command line. See <a href="#CollectingFromCommandLine">
collecting
data from the command line
</a> for more.
</p>
<p>
After selecting 'Tutorial.exe' as the process of interest, PerfView brings up the
<a href="#StackViewer">stack viewer</a> looking something like this:
</p>
<center>
<img src="images/stackViewer.png" alt="StackView" />
</center>
<p>
This view shows you where CPU time was spent. PerfView took a sample
of where each processor was (including the full stack) every millisecond (see <a href="#UnderstandingPerfData">understanding perf data</a>), and the stack viewer
shows these samples. Because we told PerfView we were only interested
in the Tutorial.exe process this view has been restricted (by '<a href="#IncPatsTextBox">IncPats</a>')
to only show you samples that were spent in that process.
</p>
<p>
It is always best to begin your investigation by looking at the summary information
at the top of the view. This allows you to confirm that indeed the bulk
of your performance problem is related to CPU usage before you go chasing down exactly
where CPU is spent. This is what the summary statistics are for. We
see that the process spent 84% of its wall clock time consuming CPU, which merits
further investigation. Next we simply look at the <a href="#WhenColumn">'When' column</a>
for the 'Main' method in the program.
This column shows how CPU was used for that method (or any method it calls) over
the collection time interval. Time is broken into 32 'TimeBuckets'
(in this case we see from the summary statistics that each bucket was 197 msec long),
and a number or letter represents what % of 1 CPU is used. 9s and As mean
you are close to 100% and we can see that over the lifetime of the main method we
are close to 100% utilization of 1 CPU most of the time.
Areas outside the main program are probably not interesting to us (they deal with
runtime startup and the times before and after process launch), so we probably want
to 'zoom in' to that area.
</p>
<h4>
<a id="ZoomingToARangeOfInterest">Zooming in to a time range of interest</a>
</h4>
<p>
It is pretty common that you are only interested in part of the trace. For
example you may only care about startup time, or the time from when a mouse was
clicked and when the menu was displayed. Thus zooming in is typically
one of the first operations you will want to do. Zooming in is really just selecting
a region of time for investigation. The region of time is displayed
in the '<a href="#StartTextBox">start</a>' and '<a href="#EndTextBox">end</a>'
textboxes. These can be set in three ways
</p>
<ol>
<li>Manually entering values into the text boxes.</li>
<li>
Selecting two cells (typically the 'First' and 'Last' cells) of
a particular method of interest, right clicking and selecting 'SetTimeRange'
</li>
<li>
Selecting a 'When' cell. If you click the cell again, the cell will become
editable, at which point you can select a region of text, right click, and select
'SetTimeRange' (or hit Alt-R) to select the time range associated with your
selected characters.
</li>
</ol>
<p>
Try out each of these techniques. For example to 'zoom into'
just the main method, simply drag the mouse over the 'First' and 'Last'
times to select both, right click, and Select Time Range. You can hit
the 'Back' button to undo any changes you made so you can re-select.
Also notice that each text box remembers the last several values of that box, so
you can also 'go back' to particular past values by opening the drop down (small
down arrow to the right of the box) and selecting the desired value.
</p>
<p>
For GUI applications, it is not uncommon to take a trace of the whole run but then
'zoom into' points where the users triggered activity. You can
do this by switching to the '<a href="#CallTreeView">CallTree</a>' tab.
This will show you CPU starting from the process itself. The first line of
the view is 'Process32 tutorial.exe', which is a summary of the CPU time
for the entire process. The '<a href="#WhenColumn">when</a>' column
shows you CPU for the process over time (32 time buckets). In a GUI
application there will be lulls where no CPU was used, followed by bursts of higher
CPU use corresponding to user actions. These show up in the numbers in the 'when'
column. By clicking on a cell in the 'when' column, selecting a range, right
clicking and selecting SetTimeRange (or Alt-R), you can zoom into one of these 'hot
spots' (you may have to zoom in more than once). Now you have
focused in on what you are interested in (you can confirm by looking at the methods
that are called during that time). This is a very useful technique.
</p>
<p>
For managed applications, you will always want to zoom into the main method before
starting your investigation. The reason is that when profile data is collected,
after Main has exited, the runtime spends some time dumping symbolic information
to the ETW log. This is almost never interesting, and you want to ignore
it in your investigation. Zooming into the Main method will do this.
</p>
<h4>
<a id="ResolvingUnmanagedSymbols">Resolving unmanaged symbols</a>
</h4>
After zooming into the region of interest, if you are doing an unmanaged investigation,
you may need to resolve symbols. Unlike managed code, unmanaged
code stores its symbolic information in external PDB files which need to be downloaded
and matched up. Because this can take a while it is not done by default.
Instead you see question marks in the trace, (like ntdll!?) indicating that PerfView
knows the sample came from ntdll, but it can't resolve the name further.
For many DLLs you will never need to resolve these symbols because you simply don't
care (you don't own or call that code). However if you do care,
you can quickly get the symbols. Simply select a cell with a DLL!? in it,
right click, and select 'Lookup Symbols'. PerfView will then look
up the symbols for that DLL and redraw the screen. Try looking up the
symbols for ntdll by selecting the cell
<ul>
<li>OTHER &lt;&lt;ntdll!?&gt;&gt;</li>
</ul>
<p>
Right clicking, and select 'Lookup Symbols'. After looking up the symbols it will
become
</p>
<ul>
<li>OTHER &lt;&lt;ntdll!_RtlUserThreadStart&gt;&gt;</li>
</ul>
<p>
If you are doing an unmanaged investigation there are probably a handful of DLLs
you will need symbols for. A common workflow is to look at the byname view
and while holding down the CTRL key select all the cells that contain dlls with
large CPU time but unresolved symbols. Then right click -> Lookup
Symbols, and PerfView will look them all up in bulk. See <a href="#SymbolResolution">
symbol
resolution
</a> for more details or if lookup symbols fails.
</p>
<h4>
<strong><a id="TutorialBottomUp">A Bottom Up Investigation</a></strong>
</h4>
<p>
PerfView starts you with the '<a href="#ByNameView">ByName view</a>' for
doing a bottom-up analysis (see also <a href="#StartingAnAnalysis">starting an analysis</a>).
In this view you see every method that was involved in a sample (either a sample
occurred in the method or the method called a routine that had a sample).
Samples can either be exclusive (occurred within that method) or inclusive (occurred
in that method or any method that method called). By default the
by name view sorts methods based on their exclusive time (see also <a href="#ColumnSorting">Column Sorting</a>).
This shows you the 'hottest' methods
in your program.
</p>
<p>
Typically the problem with a 'bottom-up' approach is that the 'hot'
methods in your program are
</p>
<ol>
<li>Not very hot (use &lt; 5% of CPU)</li>
<li>
Tend to be 'helper' routines (either in your program or in libraries or
the runtime), that are used 'everywhere' and are already well tuned.
</li>
</ol>
<p>
In both cases, you don't want to see these helper routines, but rather the lowest
'semantically interesting' routine. This is where PerfView's
powerful grouping features comes into play. By default PerfView groups
samples by
</p>
<ol>
<li>
Using the <a href="#GroupPatsTextBox">GroupPats</a> 'Just my code'
pattern to form two groups. The first group is any method in any module that
is in the same directory (recursively) as the 'exe' itself.
This is the 'my code' group and these samples are left alone.
Any sample that is NOT in that first group is in the 'OTHER' group.
These samples are grouped according to the method that was called to enter the group.
</li>
<li>
Using the <a href="#FoldPercentTextBox">Fold %</a> feature. This is
set to 1, which means that any method that has fewer than 1% of the samples (inclusively)
in the 'byname' view (that is, over all the samples indicated in the summary at the top of the view)
is not 'interesting'
and should not be shown. Instead its samples are folded (inlined) into its
caller.
</li>
</ol>
<p>
For example, the top line in the ByName view is
</p>
<ul>
<li>OTHER &lt;&lt;mscorlib!System.DateTime.get_Now()&gt;&gt;</li>
</ul>
<p>
This is an example of an '<a href="#EntryGroups">entry group</a>'.
'OTHER' is the group's name and mscorlib!System.DateTime.get_Now() is
the method that was called that entered the group. From that point on
any methods that get_Now() calls <strong>that are within that group</strong> are
not shown, but rather their time is simply accumulated into this node.
Effectively this grouping says 'I don't want to see the internal workings
of functions that are not my code, but I do want to see the public methods I used to call
that code'. To give you an idea of how useful this feature is,
simply turn it off (by clearing the value in the 'GroupPats' box), and view
the data. You will see many more methods with names of internal functions
used by 'get_Now' which just make your analysis more difficult. (You
can use the 'back' button to quickly restore the previous group pattern).
</p>
<p>
The other feature that helps 'clean up' the bottom-up view is the
<a href="#FoldPercentTextBox">Fold %</a> feature. This feature will
cause all 'small' call tree nodes (less than the given %) to be automatically
folded into their parent. Again you can see how much this feature helps by
clearing the textbox (which means no folding). With that feature off,
you will see many more entries that have 'small' amounts of time.
These small entries again tend to just add 'clutter' and make investigation
harder.
</p>
<h4>More Folding</h4>
Because of the grouping and folding that PerfView did for you, you can quickly see
that 'DateTime.get_Now()' is the 'hot' method (74.6% of all samples).
However also note that PerfView did not do a 'perfect' job.
We notice that the view has the groups &lt;ntdll!?&gt; and &lt;ntoskrnl!?&gt;, which
are two important operating system DLLs that take up 9.5% and 2% of the CPU, and knowing
just that some function in the DLL was called is not terribly useful. We
have two choices:<ol>
<li>
Resolve the symbols for these DLLs so that we have meaningful names.
See <a href="#SymbolResolution">symbol resolution</a> for more.
</li>
<li>Fold these entries away. </li>
</ol>
<p>
A quick way of accomplishing (2) is to add the pattern '!?'. This
pattern says to fold away any nodes that don't have a method name. See
the <a href="#FoldPatsTextBox">foldPats textbox</a> for more. This leaves us with a very
'clean' function view that has only semantically relevant nodes in it.
</p>
<p>
<strong>
Review: what is all this time selection, grouping and folding for?
</strong>
</p>
<p>
The first phase of a perf investigation is forming a 'perf model'.
The goal is to assign times to SEMANTICALLY RELEVANT nodes (things the programmer
understands and can do something about). We do that by either forming
a semantically interesting group and assigning nodes to it, by folding the node
into an existing semantically relevant group, or (most commonly) by leveraging entry
points into large groups (modules and classes) as handy 'pre-made' semantically
relevant nodes. The goal is to group costs into a relatively small number
(&lt; 10) of SEMANTICALLY RELEVANT entries. This allows you to reason about whether
that cost is appropriate or not (which is the second phase of the investigation).
</p>
<h4>Broken Stacks</h4>
<p>
One of the nodes that is left is a node called 'BROKEN'. This is a
special node that represents samples whose stack traces were determined to be incomplete
and therefore cannot be attributed properly. As long as this number
is small (&lt; a few %) then it can simply be ignored. See <a href="#BrokenStacks">broken stacks</a> for more.
</p>
<h4>Time and Percentage</h4>
<p>
PerfView displays both the inclusive and exclusive time as both a metric (msec)
as well as a % because both are useful. The percentage gives you a good
idea of the relative cost of the node, however the absolute value is useful because
it very clearly represents 'clock time' (e.g. 300 samples represent 300
msec of CPU time). The absolute value is also useful because when
the value gets significantly less than 10 it becomes unreliable (when you
have only a handful of samples they might have happened 'by pure chance'
and thus should not be relied upon).
</p>
<h4>
<a id="TutorialTopDown">CallTree View (top-down investigations))</a>
</h4>
<p>
The bottom up view did an excellent job of determining that the get_Now() method
as well as the 'SpinForASecond' method consume the largest amount of time and thus
are worth looking at closely. This corresponds beautifully
to our expectations given the source code in <a href="Tutorial.cs.txt">Tutorial.cs</a>.
However it can also be useful to understand where CPU time was consumed from the
top down. This is what the <a href="#CallTreeView">CallTree view</a> is for.
Simply clicking the 'CallTree' tab of the stack viewer will bring
you to that view. Initially the display only shows the root node, but
you can open the node by clicking on the check box (or hitting the space bar). This
will expand the node. As long as a node only has one child, the child
node is also auto-expanded, to save some clicking. You can also
right click and select 'expand-all' to expand all nodes under the selected
node. Doing this on the root node yields the following display
</p>
<center>
<img src="images/CallTreeView.png" alt="CallTreeView" />
</center>
<p>
Notice how clean the call tree view is, without a lot of 'noise' entries.
In fact this view does a really good job of describing what is going on.
Notice it clearly shows the fact that Main calls 'RecSpin', which runs for 5
seconds (from 894 msec to 5899 msec) consuming 4698 msec of CPU while doing so (the
CPU is not 5000 msec because of the overhead of actually collecting the profile,
and other OS overhead which is not attributed to this process, as well as broken
stacks, which typically run in the 5-10% range. In this case it seems
to be about 6%). The 'When' column also clearly shows how one
instance of RecSpin runs SpinForASecond (for exactly a second) and then calls
RecSpinHelper, which consumes close to 100% of the CPU for the rest of the time.
The call tree is a wonderful top-down synopsis.
</p>
<h4>Getting a 'coarser' view</h4>
<p>
All of the filtering and grouping parameters at the top of the view affect all of
the views (ByName, Caller-Callee or CallTree) equally. We can
use this fact and the 'Fold %' functionality to get an even coarser view
of the 'top' of the call tree. With all nodes expanded, simply
right click on the window and select 'Increase Fold %' (or, easier, hit the
F7 key). This increases the number in the Fold % textbox by 1.6X.
By hitting the F7 key repeatedly you keep trimming down the 'bottoms' of
the stacks until you see only the methods that use a large amount of CPU time.
The following image shows the CallTreeView after hitting F7 seven times.
</p>
<center>
<img src="images/PrunedCallTreeView.png" alt="CallTreeView" />
</center>
<p>
You can restore the previous view by either using the 'Back' button, the
Shift-F7 key (which decreases the Fold%) or by simply selecting 1 in the Fold% box
(e.g. from the drop down menu).
</p>
<h4>The Caller-Callee view</h4>
<p>
Getting a coarse view of the tree is useful but sometimes you just want to restrict
your attention to what is happening at a single node. For example, if
the inclusive time for BROKEN stacks is large, you might want to view the nodes
under 'BROKEN' stacks to get an idea what samples are 'missing'
from their proper position in the call tree. You can do this easily
by viewing the BROKEN node in the Caller-Callee view. To do this right
click on the BROKEN node, and select Goto -> Caller-Callee (or type Alt-C). Because
so few samples in our trace are BROKEN, this node is not very interesting. By
setting Fold % to 0 (blank) you get the following view:
</p>
<center>
<img src="images/CallerCalleeView.png" alt="CallerCalleeView" />
</center>
<p>
The view is broken into three grids. The middle piece shows the 'current
node', in this case 'BROKEN'. The top grid shows all nodes
that call into this focus node. In the case of BROKEN, the nodes are only
on one thread. The bottom grid shows all nodes that are
called by 'BROKEN', sorted by inclusive time. We can see that
most of the broken nodes came from stacks that originated in the 'ntoskrnl'
DLL (this is the Windows OS kernel). To dig in more we would first
need to resolve symbols for this DLL. See <a href="#SymbolResolution">symbol resolution</a>
for more.
</p>
<h4>
<a id="TutorialDrillingIntoGroups">Drilling into Groups (Ungrouping)</a>
</h4>
<p>
While groups are a very powerful feature for understanding the performance of your
program at a 'coarse' level, inevitably, you wish to 'Drill into'
those groups and understand the details of PARTICULAR nodes in detail.
For example, if we were the developer responsible for DateTime.get_Now(),
we would not be interested in the fact that it was called from the 'SpinForASecond'
routine but in what was going on inside it. Moreover we DON'T want to
see samples from other parts of the program 'cluttering' the analysis of
get_Now(). This is what the 'Drill Into' command is for.
If we go back to the 'ByName' view, select the 3792 samples in the 'Inc'
column of 'get_Now', right click, and select 'Drill Into', it
brings up a new window where ONLY THOSE 3792 samples have been extracted.
</p>
<p>
Initially Drilling in does not change any filter/grouping parameters.
However, now that we have isolated the samples of interest, we are free to change
the grouping and folding to understand the data at a new level of abstraction. Typically
this means ungrouping something. In this case we would like to see the detail of
how mscorlib!get_Now() works, so we want to see details inside mscorlib. To do this
we select the 'mscorlib!DateTime.get_Now()' node, right click, and select 'Ungroup
Module'. This indicates that we wish to ungroup any methods that
were in the 'mscorlib' module. This allows you to see the 'inner
structure' of that routine (without ungrouping completely). The result is the
following display:
</p>
<center>
<img src="images/Ungrouped.png" alt="Ungrouped" />
</center>
<p>
At this point we can see that most of the 'get_Now' time is spent in functions
called 'GetUtcOffsetFromUniversalTime' and 'GetDatePart'.
We have the full power of the stack viewer at our disposal: folding, grouping, using
the CallTree or Caller-Callee views to further refine our analysis. Because
the 'Drill Into' window is separate from its parent, you can treat it as
'disposable' and simply discard it when you are finished looking at this
aspect of your program's performance.
</p>
<p>
In the example above we drilled into the inclusive samples of a method. However
you can also do the same thing to drill into exclusive samples.
This is useful when user callbacks or virtual functions are involved.
Take for example a 'sort' routine that has internal helper functions.
In that case it can be useful to segregate those samples that were part of the node's
'internal helpers' (which would be folded up as exclusive samples of 'sort')
from those that were caused by the user 'compare' function (which would
typically not be grouped as exclusive samples because it crossed a module boundary).
By drilling into the exclusive samples of 'sort' and then ungrouping, you
get to see just those samples in 'sort' that were NOT part of the user callback.
Typically this is EXACTLY what the programmer responsible for the 'sort'
routine would want to see.
</p>
<h4>
<a id="GotoSource">Viewing Source (Line level analysis)</a>
</h4>
<p>
Once the analysis has determined methods are potentially inefficient, the next step
is to understand the code enough to make an improvement. PerfView helps with this
by implementing the 'Goto Source' functionality. Simply select a cell with a method
name in it, right click and choose <a href="#SourceCodeLookup">Goto Source</a> (or
use Alt-D (D for definition)). PerfView will then attempt to look up the source code
and if successful will launch a text editor window. For example, if you select the
'SpinForASecond' cell in the ByName view and select Goto Source the following window
is displayed.
</p>
<center>
<img src="images/SourceCode.png" alt="Ungrouped" />
</center>
<p>
As you can see, the particular method is displayed and each line has been prefixed
with the cost (in this case CPU MSec) spent on that line. In this view it shows that
4.9 seconds of CPU time were spent on the first line of the method.
</p>
<h5>
Caveats with Source code
</h5>
<p>
Unfortunately, prior to V4.5 of the .NET Runtime, the runtime did not emit enough
information into the ETL file to resolve a sample down to a line number (only to
a method). As a result while PerfView can bring up the source code, it can't accurately
place samples on particular lines unless the code was running on V4.5 or later.
When PerfView does not have the information it needs it simply attributes all the
cost to the first line of the method. This is in fact what you see in the example
above. If you run your example on a V4.5 runtime, you would get a more interesting
distribution of cost. This problem does not exist for native code (you will get
line level resolution). Even on old runtime versions, however, you at least have
an easy way to navigate to the relevant source.
</p>
<p>
PerfView finds the source code by looking up information in the PDB file associated
with the code. Thus the first step is that PerfView must be able to find the PDB
file. By default most tools will place the complete path of the PDB file inside
the EXE or DLL it builds, which means that if you have not moved the PDB file (and
are on the machine you built on), then PerfView will find the PDB. It then looks
in the PDB file, which contains the full path name of each of the source files, and
again, if you are on the machine that built the binary then PerfView will find the
source. So if you run on the same machine you build on, it 'just works'.
</p>
<p>
However it is common to not run on the machine you built on, in which case PerfView
needs help. PerfView follows the standard conventions for other tools for locating
source code. In particular if the _NT_SYMBOL_PATH variable is set to a semicolon
separated list of paths, it will look in those places for the PDB file. In addition
if _NT_SOURCE_PATH is set to a semicolon separated list of paths, it will search
for the source file in subdirectories of each of the paths. Thus setting these environment
variables will allow PerfView's source code feature to work on 'foreign' machines.
You can also set the _NT_SYMBOL_PATH and _NT_SOURCE_PATH inside the GUI by using
the menu items on the File menu on the stack viewer menu bar.
</p>
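<p>
  For example, on a machine that did not build the binaries, you might set these variables in the
  console before launching PerfView. This is only a sketch; the cache and source directories below
  are placeholders for wherever your PDBs and sources actually live:
</p>
<pre>
REM Look for PDBs in a local cache first, then fall back to the Microsoft public symbol server.
set _NT_SYMBOL_PATH=SRV*C:\Symbols*https://msdl.microsoft.com/download/symbols

REM Search these directories (and their subdirectories) for the source files named in the PDBs.
set _NT_SOURCE_PATH=C:\src\MyProject;C:\src\OtherProject

PerfView.exe
</pre>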
<hr />
<!-- ********************************** -->
<h2>
<a id="TutorialGCHeap">Tutorial for GC Heap Memory Analysis</a>
</h2>
<p>
See Also <a href="#Tutorial">Tutorial of a Time-Based Investigation</a>. While there
currently is no tutorial on doing a GC heap analysis, if you have not walked the
<a href="#Tutorial">time based investigation tutorial</a> you should do so. Many
of the same concepts are used in a memory investigation. You should also take a
look at
</p>
<ul>
<li><a href="#CollectingDataGCHeap">Collecting GC Heap Data</a> </li>
<li><a href="#UnderstandingPerfDataGCHeap">Understanding GC heap data</a> </li>
<li><a href="#StartingAnAnalysisGCHeap">Starting a GC heap analysis</a> </li>
<li><a href="#GCHeapNetMemStacks">Collecting Stacks at GC allocations</a> </li>
</ul>
<p>
<!-- TODO -->
TUTORIAL NOT COMPLETE
</p>
<hr />
<hr />
<!-- *************************************************************************************** -->
<h1>
<a id="BestPracices">Performance Investigation Best Practices</a>
</h1>
<!-- ********************************** -->
<h2>
<a id="InvestigatingTime">Investigating Time</a>
</h2>
<!-- ****************** -->
<h3>
<a id="CollectingData">Collecting Event (Time Based) Profile Data</a>
</h3>
<p>
As mentioned in the <a href="#UsersGuide">introduction</a>, ETW is a lightweight
logging mechanism built into the Windows operating system that can collect a broad
variety of information about what is going on in the machine. There are two
ways PerfView supports collecting ETW profile data.
</p>
<ol>
<li>
The <strong>Collect->Run</strong> (Alt-R) menu item, which prompts for a data file
name to create and a command to run. PerfView turns on profiling,
runs the command, and then turns profiling off. The resulting file is then
displayed in the stack viewer. This is the preferred mechanism when it
is easy to launch the application of interest. If the command produces
output, it will be captured in the log (click the 'Log' button in the lower
right corner of the main view).
</li>
<li>
The <strong>Collect->Collect </strong>(Alt-C) menu item which only prompts for a
data file name to create. After clicking the 'Start Collection' button you are then
free to interact with the machine in any way necessary to capture the activity of interest.
Since profiling is machine wide you are guaranteed to capture it. Once
you have reproduced the problem, you can dismiss the dialog box to stop profiling
and proceed to analyze the data.
</li>
</ol>
<p>
You can also automate the collection of profile data by using <a href="#CommandLineReference">command line options</a>.
See <a href="#CollectingFromCommandLine">collecting data from the command line</a>
for more.
</p>
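<p>
  For example (a minimal sketch; 'tutorial.exe' is just an illustration, and the
  <a href="#CommandLineReference">command line reference</a> has the full list of commands and
  qualifiers):
</p>
<pre>
REM Turn on profiling, run the given command, then turn profiling off.
PerfView run tutorial.exe

REM Collect machine wide until you stop collection (writes PerfViewData.etl by default).
PerfView collect
</pre>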
<p>
<strong>If you intend to do a <a href="#BlockedTimeInvestigation">wall clock time investigation</a></strong>
</p>
<p>
By default PerfView chooses a set of events that does not generate too much data
but is useful for a variety of investigations. However
<a href="#BlockedTimeInvestigation">wall clock investigations</a>
require events that are too voluminous to collect by default. Thus if you wish to
do a wall clock investigation, you need to set the 'Thread Time' checkbox in the
collection dialog.
</p>
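<p>
  The command line equivalent is the /ThreadTime qualifier. As a sketch, with 'tutorial.exe'
  standing in for your own scenario:
</p>
<pre>
REM /ThreadTime turns on the context switch and blocking events needed for wall clock analysis.
PerfView /ThreadTime run tutorial.exe
</pre>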
<p>
<strong>If you intend to copy the ETL file to another machine for analysis</strong>
</p>
<p>
By default to save time PerfView does NOT prepare the ETL file so that it can be
analyzed on a different machine (see <a href="#merging">merging</a>). Moreover,
there is symbolic information (PDBs for NGEN images) that also needs to be included
if the data is to work well on any machine. If you are intending to do this you
need to merge and include the NGEN PDBs by using the 'Zip' option. You can do this
either by
</p>
<ul>
<li>
Checking the 'Zip' checkbox on the data collection dialog box when the data is being
created.
</li>
<li>
Specifying the /Zip qualifier on the command line of PerfView when the data is
being created.
</li>
<li>Right clicking on existing ETL file in the main viewer and selecting the ZIP option.</li>
</ul>
<p>
Once the data has been zipped not only does the file contain all the information
needed to resolve symbolic information, but it also has been compressed for faster
file copies. If you intend to use the data on another machine, please specify the
ZIP option.
</p>
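<p>
  From the command line this looks like the following sketch (combine the qualifier with whichever
  collection command you are using):
</p>
<pre>
REM /Zip merges the ETL file and includes the NGEN PDBs so the data can be analyzed on another machine.
PerfView /Zip collect
</pre>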
<hr />
<!-- ********************************** -->
<h2>
<a id="ViewingStackData">Viewing Stack Data</a>
</h2>
<!-- ****************** -->
<h3>
<a id="SelectProcessDialog">Selecting a Process of Interest</a>
</h3>
<p>
The result of collecting data is an ETL file (and possibly a .kernel.ETL file as
discussed in <a href="#merging">merging</a>). When you double
click on the file in the main viewer it opens up 'children views'
of the data that was collected. One of these items will be the 'CPU
Stacks' view. Double clicking on that will bring up a stack
viewer to view the samples collected. The data in the ETL file
contains CPU information for ALL processes in the system, however most analyses
concentrate on a single process. Because of this, a dialog box to select a
process of interest is displayed before the stack viewer itself.
</p>
<p>
By default, this dialog box contains a list of all processes that were active at
the time the trace was collected sorted by the amount of CPU time each process consumed.
If you are doing a CPU investigation, there is a good chance the process of interest
is near the top of this list. Simply double clicking on the desired process
will bring up the stack viewer filtered to the process you chose.
</p>
<p>
The process view can be sorted by any of the columns by clicking on the column header.
Thus if you wish to find the process that was started most recently you can sort
by start time to find it quickly. If the view is sorted by name, if
you type the first character of the process name it will navigate to the first process
with that name.
</p>
<p>
<a id="ProcessFilterTextBox"><strong>Process Filter Textbox</strong></a> The box just
above the list of processes. If you type text in this box, then only
processes that match this string (PID, process name or command line, case insensitive) will
be displayed. The * character is a wild card. This is a quick
way of finding a particular process.
</p>
<p>
If you wish to see samples for more than one process for your analysis click the
'All Procs' button.
</p>
<p>
Note that the ONLY effect of the process selection dialog box is to add an '<a href="#IncPatsTextBox">Inc Pats</a>' filter that matches the process you
chose. Thus the dialog box is really just a 'friendly interface'
to the more powerful <a href="#FilteringGroupingStackData">filtering options</a>
of the stack viewer. In particular, the stack viewer still has access
to all the samples (even those outside the process you selected); it just
filters them out because of the include pattern that was set by the dialog box.
This means that you can remove or modify this filter at a later point in the analysis.
</p>
<hr />
<!-- ****************** -->
<h3>
<a id="UnderstandingPerfData">Understanding Perf Data</a>
</h3>
<p>
The data shown by default in the PerfView stack viewer are stack traces taken every
millisecond on each processor on the system. Every millisecond, whatever
process is running is stopped and the operating system 'walks the stack'
associated with the running code. What is preserved when taking
a stack trace is the return address of every method on the stack. Stackwalking
may not be perfect. It is possible that the OS can't find the next
frame (leading to <a href="#BrokenStacks">broken stacks</a>) or that an optimizing
compiler has removed a method call (see <a href="#MissingFrames">missing frames</a>),
which can make analysis more difficult. However for the most part the
scheme works well, and has low overhead (typically 10% slowdown), so monitoring
can be done on 'production' systems.
</p>
<p>
On a lightly loaded system, many CPUs are typically in the 'Idle' process
that the OS runs when there is nothing else to do. These samples
are discarded by PerfView because they are almost never interesting.
All other samples are kept however, regardless of what process they were taken from.
Most analyses focus on a single process, and further filter all samples that did
not occur in the process of interest; however, PerfView also allows you to look
at samples from all processes as one large tree. This is useful in scenarios
where more than one process is involved end-to-end, or when you need to run an application
several times to collect enough samples.
</p>
<h4>
<a id="HowManySamples">How many samples do you need?</a>
</h4>
<p>
Because the samples are taken every millisecond per processor, each sample represents
1 millisecond of CPU time. However exactly where the sample is taken
is effectively 'random', and so it is really 'unfair' to 'charge'
the full millisecond to the routine that happened to be running at the time the
sample was taken. While this is true, it is also true that as more samples
are taken this 'unfairness' decreases as the square root of the number of
samples. If a method has just 1 or 2 samples it could be just random
chance that it happened in that particular method, but methods with 10 samples are
likely to have truly used between 7 and 13 samples (30% error). Routines with
100 samples are likely to be within 90 and 110 (10% error). For
'typical' analysis this means you want at least 1000 and preferably more
like 5000 samples (There are diminishing returns after 10K). By collecting
a few thousand samples you ensure that even moderately 'warm' methods will
have at least 10 samples, and 'hot' methods will have hundreds, which
keeps the error acceptably small. Because PerfView does not allow you
to vary the sampling frequency, this means that you need to run the scenario for
at least several seconds (for CPU bound tasks), and 10-20 seconds for less CPU bound
activities.
</p>
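<p>
  As a rough rule of thumb (assuming simple square-root counting statistics), the expected error in
  a count of N samples is about the square root of N, so the relative error shrinks as 1/sqrt(N):
</p>
<pre>
relative error ~ sqrt(N) / N = 1 / sqrt(N)

N =   10  ->  roughly 30% error
N =  100  ->  roughly 10% error
N = 1000  ->  roughly  3% error
</pre>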
<p>
If the program you wish to measure cannot easily be changed to loop for the
required amount of time, you can create a batch file that repeatedly launches the
program and use that to collect data. In this case you will want to view the
CPU samples for all processes, and then use a GroupPat that erases the process
ID (e.g. process {%}=>$1) and thus groups all processes of the same name
together.
</p>
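<p>
  Such a driver can be as simple as the following hypothetical batch file (MyApp.exe and the
  iteration count are placeholders), which you would then pass to the 'Run a command' dialog or to
  'PerfView run':
</p>
<pre>
@echo off
REM Launch the program 20 times so that enough CPU samples accumulate across all the runs.
for /L %%i in (1,1,20) do (
    MyApp.exe
)
</pre>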
<p>
Even with 1000s of samples, there is still 'noise' that is at least in the 3% range (sqrt(1000) ~= 30 = 3%). This error gets larger as the methods / groups being investigated
have fewer samples. This makes it problematic to use sample based profiling
to compare two traces to track down small regressions (say 3%). Noise
is likely to be at least as large as the 'signal' (diff) you are trying
to track down. Increasing the number of samples will help, however you
should always keep in mind the sampling error when comparing small differences between
two traces.
</p>
<h4>Exclusive and Inclusive Metrics</h4>
<p>
Because a stack trace is collected for each sample, every node has both an exclusive
metric (the number of samples that were collected in that particular method) and
an inclusive metric (the number of samples collected in that method or any
method that method called). Typically you are interested
in inclusive time, however it is important to realize that folding (see <a href="#FoldPatsTextBox">FoldPats</a>
and <a href="#FoldPercentTextBox">Fold %</a>) and grouping artificially
increase exclusive time (it is the time in that method (group) and anything folded
into that group). When you wish to see the internals of what was folded
into a node, you <a href="#DrillingIntoGroups">Drill Into</a> the groups to open
a view where the grouping or folding can be undone.
</p>
<hr />
<!-- ****************** -->
<h3>
<a id="StartingAnAnalysis">Starting a CPU Analysis</a>
</h3>
<p>
If you have not done so, consider walking through the <a href="#Tutorial">tutorial</a>
and best practices from <a href="http://msdn.microsoft.com/en-us/magazine/cc500596.aspx">
Measure
Early and Often for Performance
</a>.
</p>
<p>
The default stack viewer in PerfView analyzes CPU usage of your process.
There are three things that you should always do immediately when starting a CPU
analysis of a particular process.
</p>
<ol>
<li>
<strong>Determine that you have at least a few thousand samples</strong> (preferably
over 5000). See <a href="#HowManySamples">how many samples do I need</a>
for more.
</li>
<li>
<strong>Determine that the process is actually CPU bound over the time of interest</strong>.
</li>
<li>
<strong>Ensure that you have the symbolic information you need. </strong> See
<a href="#SymbolResolution">symbol resolution</a> for more.
</li>
</ol>
<p>
If any of the above conditions fail, the rest of your analysis will very likely
be inaccurate. If you don't have enough samples you need to go back
and recollect so that you get more, modifying the program to run longer, or running
the program many times to accumulate more samples. If your program
is running for long enough (typically 5-20 seconds) and you still don't have
at least 1000 samples, it is likely because CPU is NOT the bottleneck.
It is very common in STARTUP scenarios that CPU is NOT the problem but that the
time is being spent fetching data from the disk. It is also possible that
the program is waiting on network I/O (server responses) or responses from other
processes on the local system. In all of these cases the time being
wasted is NOT governed by how much CPU time is used, and thus a CPU analysis is
inappropriate.
</p>
<p>
You can quickly determine if your process is CPU bound by looking at the <a href="#WhenColumn">
'When'
column
</a> for your 'top most' method. If
the When column has lots of 9s or As in it over the time it is active then it is
likely the process was CPU bound during that time. This is the time you can
hope to optimize and if it is not a large fraction of the total time of your app,
then optimizing it will have little overall effect (see <a href="http://en.wikipedia.org/wiki/Amdahl's_law">Amdahl's Law</a>).
Switching to the <a href="#CallTreeView">
CallTree
view
</a> and looking at the 'When' column of some of the top-most
methods in the program is a good way of confirming that your application is actually
CPU bound.
</p>
<p>
Finally you may have enough samples, but you lack the symbolic information to make
sense of them. This will manifest with names with ? in them. By default
.NET code should 'just work'. For unmanaged code you need to tell
PerfView which DLLs you are interested in getting symbols for. See <a href="#SymbolResolution">symbol resolution</a> for more. You should
also quickly check that you don't have many <a href="#BrokenStacks">broken stacks</a>
as this too will interfere with analysis.
</p>
<h4>
<a id="TopDownBottomUpAnalysis">Top-down and Bottom-up Analysis</a>
</h4>
<p>
Once you have determined that CPU is actually important to optimize you have a choice
of how to do your analysis. Performance investigations can either be 'top-down'