Commit cadca27

misc: Update FAQ.md
1 parent 8d40f4f commit cadca27

1 file changed

Lines changed: 27 additions & 12 deletions

FAQ.md

@@ -817,29 +817,44 @@ Now, stepping back: you may unfortunately be in the case 3 above -- the numerica
817817

818818
## Is there a way to get the performance of an Operator
819819

820-
Yes, any logging level below or equal to ``PERF`` will do the trick. For example, if you run:
820+
Yes — any logging level at or below ``PERF`` will display performance-related information. For example, running:
821+
821822
```
822823
DEVITO_LOGGING=PERF python your_code.py
823824
```
824-
you will see that Devito emits lots of useful information concerning the performance of an Operator. The following is reported:
825825

826-
* the code generation, compilation, and execution times;
827-
* for each section in the generated code, its execution time, operational intensity (OI), GFlops/s and GPts/s performance;
828-
* global GFlops/s and GPts/s performance of the Operator (i.e., cumulative across all sections);
829-
* in the case of an MPI run, per-rank GFlops/s and GPts/s performance.
826+
will cause Devito to emit detailed performance metrics about the Operator. The output includes:
827+
828+
* Time taken for code generation, compilation, and execution;
829+
* For each section in the generated code: execution time, operational intensity (OI), GFlops/s, and GPts/s;
830+
* Overall (global) GFlops/s and GPts/s performance of the Operator (i.e., aggregated across all sections);
831+
* In MPI runs, per-rank GFlops/s and GPts/s performance.
832+
833+
### About the GPts/s Metric
830834

831-
About the GPts/s metric, that is number of gigapoints per seconds. The "points" we refer to here are the actual grid points -- so if the grid is an ``N**3`` cube, the number of timesteps is ``T``, and the Operator runs for ``S`` secs, then we have ``N**3*T/S GPts/s``. This is the typical metric used for finite difference codes.
835+
GPts/s stands for "gigapoints per second." Here, "points" refers to grid points. For a grid shaped like an ``N**3`` cube, run for ``T`` timesteps and completed in ``S`` seconds, the performance in GPts/s is ``N**3 * T / S``. This is a common metric for finite-difference workloads.
832836

833-
An excerpt of the performance profile emitted by Devito upon running an Operator is provided below. In this case, the Operator has two sections, ``section0`` and ``section1``, and ``section1`` consists of two consecutive 6D iteration spaces whose size is given between angle brackets.
837+
### Example Output
838+
839+
Below is an example of the performance profile Devito emits when an Operator is executed. In this case, the Operator has four sections, with the first two dominating execution time:
834840

835841
```default
836-
Global performance: [OI=0.16, 8.00 GFlops/s, 0.04 GPts/s]
842+
Operator `MyOperator` ran in 0.68 s
843+
Global performance: [OI=0.01, 50.59 GFlops/s, 1.16 GPts/s]
844+
Global performance <w/o setup>: [0.24 s, 3.27 GPts/s]
837845
Local performance:
838-
* section0<136,136,136> run in 0.10 s [OI=0.16, 0.14 GFlops/s]
839-
* section1<<341,16,16,14,14,136>,<341,16,16,8,8,130>> run in 35.91 s [OI=5.36, 8.01 GFlops/s, 0.05 GPts/s]
846+
* section0 ran in 0.07 s [OI=0.06, 2063.30 GFlops/s, 11.28 GPts/s]
847+
* section1 ran in 0.17 s [OI=0.01, 4984.02 GFlops/s, 4.60 GPts/s]
848+
* section2 ran in 0.01 s
849+
* section3 ran in 0.01 s
840850
```
841851

842-
Note that ``section0`` doesn't show the GPts/s. This is because no TimeFunction is written in this section.
852+
Notes:
853+
* The first line shows the total runtime of the Operator, including Python overhead, transition into executing the JIT-compiled binary, and any CPU-GPU data transfers.
854+
855+
* The two "Global performance" lines differ as follows: the <w/o setup> version excludes overhead (e.g., CPU-GPU data transfers) and better reflects actual compute performance, since such overheads are easily amortized away in production-grade runs when a simulation is executed over hundreds or thousands of timesteps.
856+
857+
* The "Local performance" section provides a breakdown by code section. These sections are determined by the Devito compiler based on internal heuristics. Typically, loops related to finite-difference evaluation are grouped together, while operations such as source injection or receiver interpolation are placed in separate sections.
843858

844859

845860
[top](#Frequently-Asked-Questions)
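
The GPts/s definition in the updated FAQ text can be checked with a few lines of arithmetic. The sketch below is not Devito code: `gpts_per_second` is a hypothetical helper, the division by `1e9` is an assumption about how the "giga" prefix is applied to ``N**3 * T / S``, and the grid size, timestep count, and runtime in the first call are made-up values. The second part reuses the figures from the sample profile in the diff to show that the two "Global performance" lines describe roughly the same number of grid-point updates.

```python
def gpts_per_second(n: int, timesteps: int, seconds: float) -> float:
    """Gigapoints per second for an n**3 grid advanced `timesteps` times.

    Hypothetical helper for illustration; not part of Devito.
    """
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    # N**3 * T / S gives points per second; divide by 1e9 for gigapoints.
    return n**3 * timesteps / seconds / 1e9


# Made-up run: 512**3 grid, 100 timesteps, 10 seconds of runtime.
rate = gpts_per_second(512, 100, 10.0)
print(f"{rate:.2f} GPts/s")

# Figures taken from the sample profile: rate (GPts/s) times time (s)
# gives the total number of gigapoints processed.
points_with_setup = 1.16 * 0.68      # total runtime, including setup
points_without_setup = 3.27 * 0.24   # <w/o setup> line
print(f"{points_with_setup:.3f} vs {points_without_setup:.3f} gigapoints")
```

The two totals agree to within the rounding of the printed values, which is consistent with the note that the `<w/o setup>` line only changes the time used as the denominator.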
