Introduction
In traditional SQL systems, a column’s type is determined when the table is created, and never changes while executing a query. If you create a table with an integer-valued column, the values in that column will always be integers (or possibly NULL
).
Rockset, however, is dynamically typed, which means that we often don’t know the type of a value until we actually execute the query. This is similar to other dynamically typed programming languages, where the same variable may contain values of different types at different points in time:
$ python3
>>> a = 3
>>> type(a)
<class 'int'>
>>> a="foo"
>>> type(a)
<class 'str'>
Rockset’s type system was initially based on JSON, and has since been extended to support other types as well:
bytes
: taking a cue from Python, we distinguish between sequences of valid Unicode characters (string
, which is internally represented as UTF-8) and sequences of arbitrary bytes (bytes
)- date- and time-specific types (
date
,time
,datetime
,timestamp
,microsecond_interval
,month_interval
)
There are other types that we use internally (and are never exposed to our users); also, the type system is extensible, with planned support for decimal
(base-10 floating-point), geometry / geography types, and others.
In the following example, collection ivtest
has documents containing one field a
, which takes a variety of types:
$ rock create collection ivtest
Collection "ivtest" was created successfully in workspace "commons".
$ cat /tmp/a.docs
{"a": 2}
{"a": "hello"}
{"a": null}
{"a": {"b": 10}}
{"a": [2, "foo"]}
$ rock upload ivtest /tmp/a.docs
{
"file_name":"a.docs",
"file_upload_id":"c5ccc261-0096-4a73-8dfe-d6db8b8d130e",
"uploaded_at":"2019-06-05T18:12:46Z"
}
$ rock sql
> select typeof(a), a from ivtest order by a;
+-----------+------------+
| ?typeof | a |
|-----------+------------|
| null_type | <null> |
| int | 2 |
| string | hello |
| array | [2, 'foo'] |
| object | {'b': 10} |
+-----------+------------+
Time: 0.014s
This post shows one of many challenges that we encountered while building a fully dynamically typed SQL database: how we manipulate values of unknown types in our query execution backend (written in C++), while approaching the performance of using native types directly.
At first, we used protocol buffers similar to the definition below (simplified to only show integers, floats, strings, arrays, and objects; the actual oneof
that we use has a few extra fields):
message Value {
oneof value_union {
int64 int_value = 1;
double float_value = 2;
string string_value = 3;
ArrayValue array_value = 4;
ObjectValue object_value = 5;
}
}
message ArrayValue {
repeated Value values = 1;
}
message ObjectValue {
repeated KeyValue kvs = 1;
}
message KeyValue {
string key = 1;
Value value = 2;
}
But we quickly realized that this is inefficient, both in terms of speed and in terms of memory usage. First, protobuf requires a heap memory allocation for every object; creating a Value
that contains an array of 10 integers would perform:
- a memory allocation for the top-level
Value
- an allocation for the
array_value
member - an allocation for the list of values (
ArrayValue.values
, which is aRepeatedPtrField
) - an allocation for each of the 10 values in the array
for a total of 13 memory allocations.
Also, the 10 values in the array are not allocated contiguously in memory, which causes a further decrease in performance due to cache locality.
It was quickly clear that we needed something better, which we called IValue
. Compared to the protobuf version, IValue
is:
- More memory efficient: while not as efficient as using native types directly,
IValue
must be small, and must avoid heap allocations wherever possible.IValue
is always 16 bytes, and does not allocate heap memory for integers, booleans, floating-point numbers, and short strings. - Faster: arrays of scalar
IValue
s are allocated contiguously in memory, leading to better cache locality. This is not as efficient as using native types directly, but it is a significant improvement over protobuf.
Most of Rockset’s query execution engine operates on IValue
s (there are some parts that have specialized implementation for specific types, and this is an area of active improvement).
We’d like to share an overview of the IValue
design. Note that IValue
is optimized for Rockset’s needs and is not meant to be portable — we use Linux and x86_64-specific tricks, and assume a little-endian memory layout.
The idea is in itself not novel; the techniques that we use date back to at least 1993, as surveyed in “Representing Type Information in Dynamically Typed Languages”. We decided to make IValue
128 bits instead of 64, as it allows us to avoid heap allocations in more cases (including all 64-bit integers); using the taxonomy defined in the paper, IValue
is a double-wrapper scheme with qualifiers.
Internally, IValue
is represented as a 128-bit (16-byte) value, consisting of:
- a 64-bit field (called
data
) - a 48-bit field (called
pointer
, as it often, but not always, stores a pointer) - two 8-bit discriminator fields (called
tag0
andtag1
)
tag1
indicates the type of the value. tag0
is usually a subtype, and the meaning of the other two fields changes depending on type. The pointer
field is often a pointer to some other data structure, allocated on the heap, for the cases where heap allocations can’t be avoided; as pointers are only 48 bits on x86_64, we are able to fit a pointer and the two discriminator fields in the same uint64_t
.
We recognize two types of IValue
s:
tag1
has bit 7 clear (tag1 < 0x80
) for all immediate values, and set (tag1 >= 0x80
) for all non-immediate values. This allows us to distinguish between immediate and non-immediate values very quickly, using one simple bit operation. We can then copy, hash, and compare for equality immediate values by treating them as a pair of uint64_t
integers.
Scalar Types
The representation for most scalar types is straightforward: tag0
is usually zero, tag1
identifies the type, pointer
is usually zero, and data
contains the value.
SQL NULL
is all zeros, which is convenient (memset()
ing a chunk of memory to zero makes it NULL
when interpreted as IValue
):
Booleans have data = 0
for false
and data = 1
for true
, tag1 = 0x01
Integers have the value stored in data
(as int64_t
) and tag1 = 0x02
And so on. The layouts for other scalar types (floating point, date / time, etc) are similar.
Strings
We handle character strings and byte strings similarly; the value of tag1
is the only difference. For the rest of the section, we’ll only focus on character strings.
IValue
strings are immutable, maintain the string’s length explicitly, and are not null-terminated. In line with our goal to minimize heap allocations, IValue
doesn’t use any external memory for short strings (less than 16 bytes).
Instead, we implement the small string optimization: we store the string contents (padded with nulls) in the data
, pointer
, and tag0
fields; we store the string length in the tag1
field: tag1
is 0x1n
, where n
is the string’s length.
An empty string has tag1
= 0x10
and all other bytes zero:
And, for example, the 11-byte string “Hello world” has tag1
= 0x1b
(note the little-endian representation; the byte 'H'
is first):
Strings longer than 15 bytes are stored out-of-line: tag1
is 0x80
, pointer
points to the beginning of the string (allocated on the heap using malloc()
), and data
contains the string length. (There is also the possibility of referencing a “foreign” string, where IValue
doesn’t own the memory but points inside a preallocated buffer, but that is beyond the scope of this post.)
For example, the 19-byte string “Rockset is awesome!”:
Vectors
Vectors (which we call “arrays”, adopting JSON’s terminology) are similarly allocated on the heap: they are similar to vectors in most programming languages (including C++’s std::vector
). tag1
is 0x82
, pointer
points to the beginning of the vector (allocated on the heap using malloc()
), and data
contains the vector’s size and capacity (32 bits each). The vector itself is a contiguously allocated block of capacity()
IValue
s (capacity() * 16
bytes); when reallocation is needed, the vector grows exponentially (with a factor that is less than 2, for the reasons described in Facebook’s fbvector
implementation.)
Hash Maps
Maps (which we call “objects”, adopting JSON’s terminology) are also allocated on the heap. We represent objects as open-addressing hash tables with quadratic probing; the size of the table is always a power of two, which simplifies probing. We probe with triangular numbers, just like Google’s sparsehash, which. as Knuth tells us in The Art of Computer Programming (volume 3, chapter 6.4, exercise 20), automatically covers all slots.
Each hash table slot is 32 bytes — two IValue
s, one for the key, one for the value. As is usually the case with open-addressing hash tables, we need two special keys — one to represent empty slots, and one to represent deleted elements (tombstones). We reserve two values of tag1
for that purpose (0x06
and 0x05
, respectively).
The pointer
field points to the beginning of the hash table (a contiguous array of slots, allocated on the heap using malloc()
.) We store the current size of the hash table in the least-significant 32 bits of the data
field. The tag0
field contains the number of allocated slots (as it’s always a power of two, we store log2(number of slots) + 1
, or zero if the table is empty).
The capacity
field (most significant 32 bits of data
) deserves further interest: it is the number of slots available for storing user data. Initially, it is the same as the total number of slots, but, as in all open-addressing hash tables, erasing an element from the table marks the slot as “deleted” and renders it unusable until the next rehash. So erasing an element actually decreases the table’s capacity.
Performance
IValue
gives a substantial performance improvement over the old protobuf-based implementation:
- creating arrays of strings is between 2x and 7x faster (depending on the string size; because of the small-string optimization,
IValue
is significantly faster for small strings) - creating arrays of integers is also 7x faster (because we no longer allocate memory for every individual array element)
- iterating over large arrays of integers is 3x faster (because the values in the array are now allocated contiguously)
Future Work
Even though Rockset documents are allowed to contain data of multiple types in the same field, the situation shown in the introduction is relatively rare. In practice, most of the data is of the same type (or NULL
), and, to recognize this, we are extending IValue
to support homogeneous arrays.
All elements in a homogeneous array are of the same type (or NULL
). The structure is similar to the regular (heterogeneous) arrays (described above), but the pointer
field points directly to an array of the native type (int64_t
for an array of integers, double
for an array of floating-point values, etc). Similar to systems like Apache Arrow, we also maintain an optional bitmap that indicates whether a specific value is NULL
or not.
The query execution code recognizes the common case where it produces a column of values of the same type, in which case it will generate a homogeneous array. We have efficient, vectorized implementations of common database operations on homogeneous arrays, allowing us significant performance improvements in the common case.
This is still an area of active work, and benchmark results are forthcoming.
Conclusion
We hope that you enjoyed a brief look under the hood of Rockset’s engine. In the future, we’ll share more details about our approaches to building a fully dynamically typed SQL database; if you’d like to give us a try, sign up for an account; if you’d like to help build this, we’re hiring!