Commit 2c571ad

Merge branch 'gcarreno:main' into main

2 parents 044e876 + f651a24

2 files changed: 45 additions & 38 deletions

entries/abouchez/README.md

Lines changed: 6 additions & 4 deletions
```diff
@@ -20,7 +20,7 @@ I am very happy to share decades of server-side performance coding techniques us

 Here are the main ideas behind this implementation proposal:

-- **mORMot** makes cross-platform and cross-compiler support simple - e.g. `TMemMap`, `TDynArray.Sort`, `TTextWriter`, `SetThreadCpuAffinity`, `crc32c`, `ConsoleWrite` or command-line parsing;
+- **mORMot** makes cross-platform and cross-compiler support simple - e.g. `TMemMap`, `TDynArray`, `TTextWriter`, `SetThreadCpuAffinity`, `crc32c`, `ConsoleWrite` or command-line parsing;
 - The entire 16GB file is `memmap`ed at once into memory - it won't work on a 32-bit OS, but it avoids any `read` syscall or memory copy;
 - The file is processed in parallel by several threads - configurable via the `-t=` switch, defaulting to the total number of CPUs reported by the OS;
 - Input is fed to each thread in 64MB chunks: because thread scheduling is unbalanced, it is inefficient to pre-divide the whole input file size by the number of threads;
```
```diff
@@ -32,20 +32,22 @@ Here are the main ideas behind this implementation proposal:
 - Parse temperatures with dedicated code (expects single-decimal input values);
 - The station names are stored as UTF-8 pointers to the memmap location where they first appear, in `StationName[]`, to be emitted eventually for the final output, not during temperature parsing;
 - No memory allocation (e.g. no transient `string` or `TBytes`) nor any syscall is done during the parsing process, to reduce contention and ensure the process is only CPU-bound and RAM-bound (we checked this with `strace` on Linux);
-- Pascal code was tuned to generate the best possible asm output on FPC x86_64 (which is our target);
+- Pascal code was tuned to generate the best possible asm output on FPC x86_64 (which is our target) - perhaps making it less readable, because we used pointer arithmetic where it matters (I like to think of such low-level Pascal code as [portable assembly](https://sqlite.org/whyc.html#performance), similar to "unsafe" code in managed languages);
 - Can optionally output timing statistics and the resultset hash value on the console, to debug and refine settings (with the `-v` command line switch);
 - Can optionally set each thread's affinity to a single core (with the `-a` command line switch).

 If you are not convinced by the "perfect hash" trick, you can define the `NOPERFECTHASH` conditional, which forces full name comparison, but is noticeably slower. Our algorithm is safe with the official dataset, and gives the expected final result - which was the goal of this challenge: compute the right data reduction in as little time as possible, with all possible hacks and tricks. A "perfect hash" is a well-known hacking pattern, usable when the dataset is validated in advance. And since our CPUs offer `crc32c`, which is perfect for our dataset... let's use it! https://en.wikipedia.org/wiki/Perfect_hash_function ;)

 ## Why L1 Cache Matters

-Take great care of the "64 bytes cache line" is quite unique among all implementations of the "1brc" I have seen in any language - and it does make a noticeable difference in performance.
+Taking special care of the "64 bytes cache line" is quite unique among all implementations of the "1brc" I have seen in any language - and it does make a noticeable difference in performance.

 The L1 cache is well known in the performance hacking literature to be the main bottleneck of any efficient in-memory process. If you want things to go fast, you should flatter your CPU L1 cache.

 Min/max values are reduced to a 16-bit smallint - resulting in a temperature range of -3276.8..+3276.7, which seems fair on our planet according to the IPCC. ;)

+As a result, each `Station[]` entry takes only 16 bytes, so we can fit exactly 4 entries in a single CPU L1 cache line. To be fair, if we put some more data into the record (e.g. use `Int64` instead of `smallint`/`integer`), the performance degrades only by a few percent. The main factor seems to be that the entry is likely to fit into a single cache line, even if filling two cache lines may sometimes be needed for misaligned data.
+
 In our first attempt (see "Old Version" below), we stored the name in the `Station[]` array, so that each entry was exactly 64 bytes long. But since `crc32c` is a perfect hash function for our dataset, it is enough to store just the 32-bit hash instead of the actual name.

 Note that if we reduce the number of stations from 41343 to 400, the performance is much higher, even with a 16GB file as input. The reason is that since 400x16 = 6400, the dataset can fit entirely in each core's L1 cache. No slower L2/L3 cache is involved, therefore performance is better. Cache memory seems to be the bottleneck of our code. Which is a good sign.
```
````diff
@@ -236,6 +238,6 @@ Benchmark 1: abouchez
 ```

 Experiments show that forcing thread affinity is not a good idea: it is always much better to let a modern Operating System schedule the threads to the CPU cores, because it has much better knowledge of the actual system load and status. Even on a "fair" CPU architecture like AMD Zen. For a "pure CPU" process, affinity may help a very little. But for our "old" process, working outside the L1 cache limits, we had better let the OS decide.

-So with this "old" version, it was decided to use `-t=16`. The "old" version uses a whole cache line (16 bytes) for its `Station[]` record, so it may be responsible for using too much CPU cache, so that more than 16 threads makes no difference with it. Whereas our "new" version, with its `Station[]` of only 16 bytes, can use `-t=32` with benefit. Cache memory access is likely to be the bottleneck from now on.
+So with this "old" version, it was decided to use `-t=16`. The "old" version uses a whole cache line (64 bytes) for its `Station[]` record, so it may be responsible for using too much CPU cache, so that more than 16 threads makes no difference with it. Whereas our "new" version, with its `Station[]` of only 16 bytes, can use `-t=32` with benefit. Cache memory access is likely to be the bottleneck from now on.

 Arnaud :D
````

entries/abouchez/src/brcmormot.lpr

Lines changed: 39 additions & 34 deletions
```diff
@@ -327,29 +327,31 @@ function Average(sum, count: PtrInt): PtrInt;
   //ConsoleWrite([sum / (count * 10), ' ', result / 10]);
 end;

-function ByStationName(const A, B): integer;
+function ByStationName(const A, B): integer; // = StrComp() but ending with ';'
 var
   pa, pb: PByte;
+  c: byte;
 begin
   result := 0;
   pa := pointer(A);
   pb := pointer(B);
-  if pa = pb then
+  dec(pa, {%H-}PtrUInt(pb));
+  if pa = nil then
     exit;
   repeat
-    if pa^ <> pb^ then
+    c := PByteArray(pa)[{%H-}PtrUInt(pb)];
+    if c <> pb^ then
       break
-    else if pa^ = ord(';') then
+    else if c = ord(';') then
       exit; // Str1 = Str2
-    inc(pa);
     inc(pb);
   until false;
-  if pa^ = ord(';') then
+  if (c = ord(';')) or
+     ((pb^ <> ord(';')) and
+      (c < pb^)) then
     result := -1
-  else if pb^ = ord(';') then
-    result := 1
   else
-    result := pa^ - pb^;
+    result := 1;
 end;

 function TBrcMain.SortedText: RawUtf8;
```
```diff
@@ -368,36 +370,39 @@ function TBrcMain.SortedText: RawUtf8;
   assert(c <> 0);
   DynArraySortIndexed(
     pointer(fList.StationName), SizeOf(PUtf8Char), c, ndx, ByStationName);
-  // generate output
-  FastSetString(result, nil, 1200000); // pre-allocate result
-  st := TRawByteStringStream.Create(result);
   try
-    w := TTextWriter.Create(st, @tmp, SizeOf(tmp));
+    // generate output
+    FastSetString(result, nil, 1200000); // pre-allocate result
+    st := TRawByteStringStream.Create(result);
     try
-      w.Add('{');
-      n := ndx.buf;
-      repeat
-        s := @fList.Station[n^];
-        assert(s^.Count <> 0);
-        p := fList.StationName[n^];
-        w.AddNoJsonEscape(p, NameLen(p));
-        AddTemp(w, '=', s^.Min);
-        AddTemp(w, '/', Average(s^.Sum, s^.Count));
-        AddTemp(w, '/', s^.Max);
-        dec(c);
-        if c = 0 then
-          break;
-        w.Add(',', ' ');
-        inc(n);
-      until false;
-      w.Add('}');
-      w.FlushFinal;
-      FakeLength(result, w.WrittenBytes);
+      w := TTextWriter.Create(st, @tmp, SizeOf(tmp));
+      try
+        w.Add('{');
+        n := ndx.buf;
+        repeat
+          s := @fList.Station[n^];
+          assert(s^.Count <> 0);
+          p := fList.StationName[n^];
+          w.AddNoJsonEscape(p, NameLen(p));
+          AddTemp(w, '=', s^.Min);
+          AddTemp(w, '/', Average(s^.Sum, s^.Count));
+          AddTemp(w, '/', s^.Max);
+          dec(c);
+          if c = 0 then
+            break;
+          w.Add(',', ' ');
+          inc(n);
+        until false;
+        w.Add('}');
+        w.FlushFinal;
+        FakeLength(result, w.WrittenBytes);
+      finally
+        w.Free;
+      end;
     finally
-      w.Free;
+      st.Free;
     end;
   finally
-    st.Free;
     ndx.Done;
   end;
 end;
```
