FAQ.md
Lines changed: 27 additions & 12 deletions
@@ -817,29 +817,44 @@ Now, stepping back: you may unfortunately be in the case 3 above -- the numerica
817
817
818
818
## Is there a way to get the performance of an Operator
819
819
820
-
Yes, any logging level below or equal to ``PERF`` will do the trick. For example, if you run:
820
+
Yes — any logging level at or below ``PERF`` will display performance-related information. For example, running:
821
+
821
822
```
822
823
DEVITO_LOGGING=PERF python your_code.py
823
824
```
824
-
you will see that Devito emits lots of useful information concerning the performance of an Operator. The following is reported:
825
825
826
-
* the code generation, compilation, and execution times;
827
-
* for each section in the generated code, its execution time, operational intensity (OI), GFlops/s and GPts/s performance;
828
-
* global GFlops/s and GPts/s performance of the Operator (i.e., cumulative across all sections);
829
-
* in the case of an MPI run, per-rank GFlops/s and GPts/s performance.
826
+
will cause Devito to emit detailed performance metrics about the Operator. The output includes:
827
+
828
+
* Time taken for code generation, compilation, and execution;
829
+
* For each section in the generated code: execution time, operational intensity (OI), GFlops/s, and GPts/s;
830
+
* Overall (global) GFlops/s and GPts/s performance of the Operator (i.e., aggregated across all sections);
831
+
* In MPI runs, per-rank GFlops/s and GPts/s performance.
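The same logging level can also be set when launching a script from Python rather than from the shell. The following is a minimal sketch, assuming only the ``DEVITO_LOGGING`` environment variable described above; the helper names ``perf_env`` and ``run_with_perf`` are hypothetical, and ``your_code.py`` is the placeholder script name used in the shell example.

```python
import os
import subprocess
import sys


def perf_env(base=None):
    """Return a copy of an environment mapping with DEVITO_LOGGING=PERF set.

    `base` defaults to the current process environment; it is never mutated.
    """
    env = dict(os.environ if base is None else base)
    env["DEVITO_LOGGING"] = "PERF"
    return env


def run_with_perf(script):
    """Run `script` in a child Python process with PERF-level logging enabled.

    Both output streams are captured so the performance report can be
    inspected afterwards, whichever stream it is written to.
    """
    return subprocess.run(
        [sys.executable, script],
        env=perf_env(),
        capture_output=True,
        text=True,
        check=False,
    )


# Example usage (not executed here):
#   result = run_with_perf("your_code.py")
#   print(result.stdout, result.stderr)
```

Setting the variable on a copy of the environment keeps the change local to the child process, so other code running in the parent is unaffected.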
832
+
833
+
### About the GPts/s Metric
830
834
831
-
About the GPts/s metric, that is number of gigapoints per seconds. The "points" we refer to here are the actual grid points -- so if the grid is an ``N**3`` cube, the number of timesteps is ``T``, and the Operator runs for ``S``secs, then we have ``N**3*T/S GPts/s``. This is the typical metric used for finitedifference codes.
835
+
GPts/s stands for "gigapoints per second." Here, "points" refers to grid points. For an ``N**3`` cubic grid, run for ``T`` timesteps and completed in ``S`` seconds, the throughput is ``N**3 * T / S`` points per second, that is, ``N**3 * T / (S * 10**9)`` GPts/s. This is a common metric for finite-difference workloads.
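As a worked illustration of the metric, the sketch below computes it from a grid shape, a timestep count, and a runtime, dividing by ``10**9`` to convert points to gigapoints. The function name ``gpts_per_second`` and the sample numbers are hypothetical; this is not how Devito computes the value internally.

```python
def gpts_per_second(grid_shape, timesteps, seconds):
    """Gigapoints per second: grid points updated per second, divided by 1e9.

    grid_shape: iterable of grid sizes per dimension, e.g. (N, N, N)
    timesteps:  number of timesteps T
    seconds:    wall-clock runtime S in seconds
    """
    points = 1
    for n in grid_shape:
        points *= n
    return points * timesteps / seconds / 1e9


# Hypothetical run: a 512**3 grid, 100 timesteps, 10 seconds.
rate = gpts_per_second((512, 512, 512), timesteps=100, seconds=10.0)
print(f"{rate:.2f} GPts/s")
```

Because the metric counts grid-point updates rather than floating-point operations, it allows different discretizations of the same problem to be compared on equal footing.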
832
836
833
-
An excerpt of the performance profile emitted by Devito upon running an Operator is provided below. In this case, the Operator has two sections, ``section0`` and ``section1``, and ``section1`` consists of two consecutive 6D iteration spaces whose size is given between angle brackets.
837
+
### Example Output
838
+
839
+
Below is an example of the performance profile Devito emits when an Operator is executed. In this case, the Operator has four sections, with the first two dominating execution time:
834
840
835
841
```default
836
-
Global performance: [OI=0.16, 8.00 GFlops/s, 0.04 GPts/s]
842
+
Operator `MyOperator` ran in 0.68 s
843
+
Global performance: [OI=0.01, 50.59 GFlops/s, 1.16 GPts/s]
844
+
Global performance <w/o setup>: [0.24 s, 3.27 GPts/s]
837
845
Local performance:
838
-
* section0<136,136,136> run in 0.10 s [OI=0.16, 0.14 GFlops/s]
839
-
* section1<<341,16,16,14,14,136>,<341,16,16,8,8,130>> run in 35.91 s [OI=5.36, 8.01 GFlops/s, 0.05 GPts/s]
846
+
* section0 ran in 0.07 s [OI=0.06, 2063.30 GFlops/s, 11.28 GPts/s]
847
+
* section1 ran in 0.17 s [OI=0.01, 4984.02 GFlops/s, 4.60 GPts/s]
848
+
* section2 ran in 0.01 s
849
+
* section3 ran in 0.01 s
840
850
```
841
851
842
-
Note that ``section0`` doesn't show the GPts/s. This is because no TimeFunction is written in this section.
852
+
Notes:
853
+
* The first line shows the total runtime of the Operator, including Python overhead, transition into executing the JIT-compiled binary, and any CPU-GPU data transfers.
854
+
855
+
* The two "Global performance" lines differ as follows: the ``<w/o setup>`` version excludes setup overhead (e.g., CPU-GPU data transfers) and better reflects actual compute performance, since such overheads are easily amortized away in production-grade runs, where a simulation executes over hundreds or thousands of timesteps.
856
+
857
+
* The "Local performance" section provides a breakdown by code section. These sections are determined by the Devito compiler based on internal heuristics. Typically, loops related to finite-difference evaluation are grouped together, while operations such as source injection or receiver interpolation are placed in separate sections.
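To compare sections programmatically, the per-section lines of a captured log can be parsed. The sketch below is a rough illustration that assumes the line layout shown in the example output above; the exact format may differ between Devito versions, and the names ``SECTION_RE`` and ``parse_sections`` are hypothetical.

```python
import re

# Matches lines such as:
#   * section0 ran in 0.07 s [OI=0.06, 2063.30 GFlops/s, 11.28 GPts/s]
#   * section2 ran in 0.01 s
# The bracketed metrics are optional, as in the example output.
SECTION_RE = re.compile(
    r"\*\s*(?P<name>section\d+)\S*\s+ran in\s+(?P<time>[\d.]+)\s*s"
    r"(?:[ \t]*\[(?P<metrics>[^\]]*)\])?"
)


def parse_sections(log_text):
    """Return {section name: metrics dict} for each section line found."""
    sections = {}
    for match in SECTION_RE.finditer(log_text):
        entry = {"time_s": float(match.group("time"))}
        for item in (match.group("metrics") or "").split(","):
            item = item.strip()
            if item.startswith("OI="):
                entry["oi"] = float(item[3:])
            elif item.endswith("GFlops/s"):
                entry["gflops_s"] = float(item.split()[0])
            elif item.endswith("GPts/s"):
                entry["gpts_s"] = float(item.split()[0])
        sections[match.group("name")] = entry
    return sections


SAMPLE = """\
Local performance:
  * section0 ran in 0.07 s [OI=0.06, 2063.30 GFlops/s, 11.28 GPts/s]
  * section1 ran in 0.17 s [OI=0.01, 4984.02 GFlops/s, 4.60 GPts/s]
  * section2 ran in 0.01 s
  * section3 ran in 0.01 s
"""

parsed = parse_sections(SAMPLE)
# The section with the largest runtime is the first candidate for tuning.
slowest = max(parsed, key=lambda name: parsed[name]["time_s"])
```

Sections without bracketed metrics yield only a ``time_s`` entry, mirroring the log lines that report a runtime alone.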