Commit ae53219
qdata-cpp 1.1.0: compatibility limits, depth guard, erased writer, external API support, vendoring, and tests
1 parent 3c78b26 commit ae53219

26 files changed: 1637 additions & 692 deletions

CHANGELOG.md

Lines changed: 32 additions & 0 deletions
```diff
@@ -0,0 +1,32 @@
+## 1.1.0 - 2026-04-08
+
+### Format compatibility and limits
+
+- enforced the R-compatible size limits on both read and write:
+  - vector and list lengths capped to the R-compatible `R_XLEN_T_MAX` range
+  - attribute counts capped to the R-compatible `R_LEN_T_MAX` / `INT_MAX` range
+  - string payload and attribute-name lengths capped to `INT_MAX`
+- added native recursion-depth protection with a configurable `max_depth`
+  parameter and a default of `512`
+- documented that native `qdata-cpp` preserves attributes structurally as
+  `name + object` pairs and does not try to emulate R's special attribute-setter
+  semantics
+
+### Serialization internals
+
+- replaced the old templated in-memory writer core with a shared erased writer
+  path
+- kept the public templated buffer-facing serialize surface on top of that
+  erased writer implementation
+- tightened the installed include tree so the standalone `include/` root is
+  self-contained for downstream consumers
+
+### Testing and vendoring
+
+- added native regression coverage for the compatibility limits and `max_depth`
+  behavior
+- updated the vendored `xxHash` copy from `0.8.2` to `0.8.3`
+
+## 1.0.0 - 2026-04-07 - commit `1d21f34dbcaa`
+
+- initial release
```

CMakeLists.txt

Lines changed: 6 additions & 0 deletions
```diff
@@ -100,6 +100,12 @@ if(QDATA_BUILD_TESTS)
   add_test(NAME qdata_buffer_api COMMAND qdata_buffer_api)
   set_tests_properties(qdata_buffer_api PROPERTIES TIMEOUT 120)
 
+  add_executable(qdata_compat_limits tests/cpp/compat_limits.cpp)
+  target_compile_features(qdata_compat_limits PRIVATE cxx_std_17)
+  target_link_libraries(qdata_compat_limits PRIVATE qdata)
+  add_test(NAME qdata_compat_limits COMMAND qdata_compat_limits)
+  set_tests_properties(qdata_compat_limits PROPERTIES TIMEOUT 120)
+
   find_program(RSCRIPT_EXECUTABLE Rscript REQUIRED)
   execute_process(
     COMMAND ${RSCRIPT_EXECUTABLE} -e
```

docs/experimental/full_graph.md

Lines changed: 0 additions & 32 deletions
This file was deleted.

docs/perf_experiments.md

Lines changed: 0 additions & 105 deletions
This file was deleted.

docs/qdata_spec.md

Lines changed: 23 additions & 1 deletion
```diff
@@ -106,9 +106,31 @@ real_vector rvec_moved = get<real_vector>(std::move(obj)); // moves the real_vec
 - `string_ref` exposes `is_na()` and `view()`, and implicitly converts to `std::string_view`.
 - That implicit conversion is lossy for `NA` and yields an empty view.
 
+## Compatibility Limits
+
+qdata-cpp uses R-compatible size limits for serialized structure sizes:
+
+- vector and list lengths are limited to `R_XLEN_T_MAX` compatibility (`2^52`)
+- attribute counts are limited to `R_LEN_T_MAX` / `INT_MAX`
+- string payload lengths and attribute-name lengths are limited to `INT_MAX`
+
+These limits apply on both read and write. Native qdata-cpp therefore stays within the same intended R-compatible format subset instead of emitting or materializing larger structures that the R layer would later reject.
+
+## Nesting Limit
+
+qdata-cpp recursive read and write traversal uses a `max_depth` budget with a default of `512`.
+
+This applies to nested list structure and recursively nested attribute values. The library rejects deeper inputs or objects instead of relying on unbounded native call-stack recursion.
+
+## Attribute Semantics
+
+Native qdata-cpp preserves attributes structurally as `name + object` pairs on the native object model.
+
+It does not try to emulate R's special attribute setter semantics for attributes such as `dim`, `dimnames`, `class`, `tsp`, `row.names`, or `names`. Those semantics are interpreted in the R layer, not in the native qdata-cpp object model.
+
 ## Write-side traits
 
-`C++ -> qdata` is more permissive than the read interface. It serializes directly from the source object whenever possible, and recurses naturally through nested containers.
+`C++ -> qdata` is more permissive than the read interface. It serializes directly from the source object whenever possible, and recurses naturally through nested containers, within the compatibility and nesting limits described above.
 
 The write side is organized around four ideas:
```
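The `max_depth` budget documented above can be illustrated with a minimal sketch of a decrement-per-level guard. `Node` and `traverse` below are hypothetical stand-ins for illustration, not the actual qdata-cpp API.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical nested structure standing in for the native object model.
struct Node {
    std::vector<Node> children;
};

// Each recursion level consumes one unit of the depth budget; a zero budget
// rejects the input instead of recursing further, so traversal depth stays
// bounded regardless of the input shape.
inline void traverse(const Node& n, std::size_t depth_left) {
    if (depth_left == 0) {
        throw std::runtime_error("max_depth exceeded");
    }
    for (const Node& c : n.children) {
        traverse(c, depth_left - 1);
    }
}
```

With the documented default budget of 512, a chain nested 513 levels deep would be rejected, while arbitrarily wide but shallow structures pass.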

include/io/block_module.h

Lines changed: 17 additions & 6 deletions
```diff
@@ -1,8 +1,8 @@
 #ifndef _QS2_BLOCK_MODULE_H
 #define _QS2_BLOCK_MODULE_H
 
-#include "io/io_common.h"
-#include "io/xxhash_module.h"
+#include "io_common.h"
+#include "xxhash_module.h"
 
 // direct_mem switch does nothing, but is kept for parity with MT code
 template <class stream_writer, class compressor, class hasher, class error_policy, bool direct_mem>
@@ -130,9 +130,13 @@ struct BlockCompressReader {
         if(!ok) {
             cleanup_and_throw("Unexpected end of file while reading next block size");
         }
+        const uint32_t zbytes = compressed_block_size(zsize);
+        if(!compressed_block_size_fits_buffer(zsize)) {
+            cleanup_and_throw("Compressed block size exceeds internal maximum");
+        }
         hp.update(zsize);
-        uint32_t bytes_read = myFile.read(zblock.get(), zsize & (~BLOCK_METADATA));
-        if(bytes_read != (zsize & (~BLOCK_METADATA))) {
+        uint32_t bytes_read = myFile.read(zblock.get(), zbytes);
+        if(bytes_read != zbytes) {
             cleanup_and_throw("Unexpected end of file while reading next block");
         }
         hp.update(zblock.get(), bytes_read);
@@ -145,9 +149,13 @@ struct BlockCompressReader {
         if(!ok) {
             cleanup_and_throw("Unexpected end of file while reading next block size");
         }
+        const uint32_t zbytes = compressed_block_size(zsize);
+        if(!compressed_block_size_fits_buffer(zsize)) {
+            cleanup_and_throw("Compressed block size exceeds internal maximum");
+        }
         hp.update(zsize);
-        uint32_t bytes_read = myFile.read(zblock.get(), zsize & (~BLOCK_METADATA));
-        if(bytes_read != (zsize & (~BLOCK_METADATA))) {
+        uint32_t bytes_read = myFile.read(zblock.get(), zbytes);
+        if(bytes_read != zbytes) {
             cleanup_and_throw("Unexpected end of file while reading next block");
         }
         hp.update(zblock.get(), bytes_read);
@@ -197,6 +205,9 @@ struct BlockCompressReader {
         std::memcpy(outbuffer, block.get()+data_offset, bytes_accounted);
         while(len - bytes_accounted >= MAX_BLOCKSIZE) {
             decompress_direct(outbuffer + bytes_accounted);
+            if(current_blocksize != MAX_BLOCKSIZE) {
+                cleanup_and_throw("Corrupted block data");
+            }
             bytes_accounted += MAX_BLOCKSIZE;
             data_offset = MAX_BLOCKSIZE;
         }
```
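The added `current_blocksize != MAX_BLOCKSIZE` check enforces an invariant of the block format: while a full block's worth of output remains, every decompressed block must be exactly `MAX_BLOCKSIZE` bytes. A simplified sketch of that loop invariant, with hypothetical names (`account_full_blocks`, `kMaxBlocksize`) rather than the actual reader code:

```cpp
#include <cstdint>
#include <stdexcept>

constexpr uint32_t kMaxBlocksize = 1048576u; // 2^20, matching MAX_BLOCKSIZE

// While a full block's worth of output remains, each decompressed block must
// be exactly full; a short block mid-stream means the data is corrupted.
// (Simplified: the real reader refreshes current_blocksize per iteration.)
inline uint64_t account_full_blocks(uint64_t len, uint64_t bytes_accounted,
                                    uint32_t current_blocksize) {
    while (len - bytes_accounted >= kMaxBlocksize) {
        if (current_blocksize != kMaxBlocksize) {
            throw std::runtime_error("Corrupted block data");
        }
        bytes_accounted += kMaxBlocksize;
    }
    return bytes_accounted;
}
```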

include/io/filestream_module.h

Lines changed: 7 additions & 3 deletions
```diff
@@ -2,7 +2,7 @@
 #ifndef _QS2_FILESTREAM_MODULE_H
 #define _QS2_FILESTREAM_MODULE_H
 
-#include "io/io_common.h"
+#include "io_common.h"
 
 // in binary mode, seek/tell should be byte offsets from beginning of the file
 // libstdc++ uses file descriptors under the hood for std::fstream:
@@ -35,7 +35,11 @@ struct IfStreamReader {
 
 struct OfStreamWriter {
     std::ofstream con;
-    OfStreamWriter(const char * const path) : con(path, std::ios::out | std::ios::binary) {}
+    OfStreamWriter(const char * const path) : con(path, std::ios::out | std::ios::binary) {
+        if(con.is_open()) {
+            con.exceptions(std::ios::failbit | std::ios::badbit);
+        }
+    }
     bool isValid() { return con.is_open(); }
     uint32_t write(const char * const ptr, const uint32_t count) {
         con.write(ptr, count);
@@ -49,4 +53,4 @@ struct OfStreamWriter {
     uint64_t tellp() { return con.tellp(); }
 };
 
-#endif
+#endif
```
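The constructor change above arms stream exceptions, so a failed write raises `std::ios_base::failure` instead of silently setting `failbit`. A minimal standalone sketch of the same pattern, independent of qdata-cpp (`open_checked` is a hypothetical helper name):

```cpp
#include <fstream>
#include <ios>

// Open a file for binary output and arm exception reporting on failure:
// after this, any failed write() throws std::ios_base::failure rather than
// silently setting failbit on the stream.
inline std::ofstream open_checked(const char* path) {
    std::ofstream con(path, std::ios::out | std::ios::binary);
    if (con.is_open()) {
        con.exceptions(std::ios::failbit | std::ios::badbit);
    }
    return con;
}
```

Exceptions are armed only when the open succeeded, so callers can still probe `is_open()` without triggering a throw, mirroring the `isValid()` check kept on `OfStreamWriter`.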

include/io/io_common.h

Lines changed: 11 additions & 10 deletions
```diff
@@ -11,31 +11,32 @@
 
 #include "zstd.h"
 #define XXH_INLINE_ALL
-#include "xxhash/xxhash.h"
+#include "../xxhash/xxhash.h"
 #undef XXH_INLINE_ALL
 
-#include "blosc/shuffle_routines.h"
-#include "blosc/unshuffle_routines.h"
+#include "../blosc/shuffle_routines.h"
+#include "../blosc/unshuffle_routines.h"
 
-#ifdef QS2_DYNAMIC_BLOCKSIZE
-static uint64_t MAX_BLOCKSIZE = 1048576ULL;
-static constexpr uint64_t BLOCK_RESERVE = 64ULL;
-static uint64_t MIN_BLOCKSIZE = MAX_BLOCKSIZE - BLOCK_RESERVE; // smallest allowable block size, except for last block
-static uint64_t MAX_ZBLOCKSIZE = ZSTD_compressBound(MAX_BLOCKSIZE);
-#else
 static constexpr uint32_t MAX_BLOCKSIZE = 1048576UL;
 static constexpr uint32_t BLOCK_RESERVE = 64UL;
 static constexpr uint32_t MIN_BLOCKSIZE = MAX_BLOCKSIZE - BLOCK_RESERVE; // smallest allowable block size, except for last block
 static const uint32_t MAX_ZBLOCKSIZE = ZSTD_compressBound(MAX_BLOCKSIZE);
 // 2^20 ... we save blocksize as uint32_t, so the last 12 MSBs can be used to store metadata
 // This blocksize is 2x larger than `qs` and seems to be a better tradeoff overall in benchmarks
-#endif
 
 // 11111111 11110000 00000000 00000000 in binary, First 12 MSBs can be used for metadata in either zblock or block
 // currently only using the first bit for metadata
 static constexpr uint32_t BLOCK_METADATA = 0x80000000; // 10000000 00000000 00000000 00000000
 static constexpr uint32_t SHUFFLE_MASK = (1ULL << 31);
 
+inline constexpr uint32_t compressed_block_size(const uint32_t zsize) noexcept {
+    return zsize & (~BLOCK_METADATA);
+}
+
+inline constexpr bool compressed_block_size_fits_buffer(const uint32_t zsize) noexcept {
+    return static_cast<uint64_t>(compressed_block_size(zsize)) <= MAX_ZBLOCKSIZE;
+}
+
 // MAKE_UNIQUE_BLOCK and MAKE_SHARED_BLOCK macros should be used ONLY in initializer lists
 #if __cplusplus >= 201402L // Check for C++14 or above
 #define MAKE_UNIQUE_BLOCK(SIZE) std::make_unique<char[]>(SIZE)
```
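The new helpers centralize the `zsize & ~BLOCK_METADATA` masking the readers previously repeated inline: the top bit of the stored 32-bit size carries metadata, and the remaining bits are the compressed byte count, which must fit the preallocated buffer. A standalone sketch of the same masking logic, with the metadata constant copied from the header above and an illustrative buffer bound (`fits_buffer` is a hypothetical name; the real code bounds by `ZSTD_compressBound(MAX_BLOCKSIZE)`):

```cpp
#include <cstdint>

// Top bit of the stored 32-bit block size carries metadata (the shuffle flag);
// the remaining 31 bits are the actual compressed byte count.
static constexpr uint32_t BLOCK_METADATA = 0x80000000u;

inline constexpr uint32_t compressed_block_size(uint32_t zsize) noexcept {
    return zsize & ~BLOCK_METADATA;
}

// A size read from the stream must fit the preallocated decompression buffer,
// otherwise the reader rejects the block before attempting to read it.
inline constexpr bool fits_buffer(uint32_t zsize, uint64_t max_zblocksize) noexcept {
    return static_cast<uint64_t>(compressed_block_size(zsize)) <= max_zblocksize;
}
```

Checking the bound before the read is what turns a corrupted or hostile size field into a clean error instead of an oversized read into a fixed buffer.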

include/io/multithreaded_block_module.h

Lines changed: 14 additions & 6 deletions
```diff
@@ -1,9 +1,9 @@
 #ifndef _QIO_MULTITHREADED_BLOCK_MODULE_H
 #define _QIO_MULTITHREADED_BLOCK_MODULE_H
 
-#include "io/io_common.h"
-#include "io/tbb_flow_compat.h"
-#include "io/xxhash_module.h"
+#include "io_common.h"
+#include "tbb_flow_compat.h"
+#include "xxhash_module.h"
 
 #include <atomic>
 #include <string>
@@ -291,11 +291,16 @@ struct BlockCompressReaderMT {
             end_of_file.store(true);
             return false;
         }
+        const uint32_t zbytes = compressed_block_size(zsize);
+        if(!compressed_block_size_fits_buffer(zsize)) {
+            tgc.cancel_group_execution();
+            return false;
+        }
         if(!available_zblocks.try_pop(zblock.block)) {
             zblock.block = MAKE_SHARED_BLOCK_ASSIGNMENT(MAX_ZBLOCKSIZE);
         }
-        uint32_t bytes_read = this->myFile.read(zblock.block.get(), zsize & (~BLOCK_METADATA));
-        if(bytes_read != (zsize & (~BLOCK_METADATA))) {
+        uint32_t bytes_read = this->myFile.read(zblock.block.get(), zbytes);
+        if(bytes_read != zbytes) {
             end_of_file.store(true);
             return false;
         }
@@ -374,7 +379,10 @@ struct BlockCompressReaderMT {
         std::memcpy(outbuffer, current_block.get()+data_offset, bytes_accounted);
         while(len - bytes_accounted >= MAX_BLOCKSIZE) {
             get_new_block();
-            std::memcpy(outbuffer + bytes_accounted, current_block.get(), current_blocksize);
+            if(current_blocksize != MAX_BLOCKSIZE) {
+                cleanup_and_throw("Corrupted block data");
+            }
+            std::memcpy(outbuffer + bytes_accounted, current_block.get(), MAX_BLOCKSIZE);
             bytes_accounted += MAX_BLOCKSIZE;
             data_offset = MAX_BLOCKSIZE;
         }
```

include/io/xxhash_module.h

Lines changed: 1 addition & 1 deletion
```diff
@@ -1,7 +1,7 @@
 #ifndef _QS2_XXHASH_MODULE_H
 #define _QS2_XXHASH_MODULE_H
 
-#include "io/io_common.h"
+#include "io_common.h"
 
 struct xxHashEnv {
     XXH3_state_t* state;
```
