Next: Introduction, Previous: (dir), Up: (dir) [Contents][Index]
This is the clustered-intset Reference Manual, version 0.0.1, generated automatically by Declt version 4.0 beta 2 "William Riker" on Mon Aug 15 04:16:29 2022 GMT+0.
Next: Systems, Previous: The clustered-intset Reference Manual, Up: The clustered-intset Reference Manual [Contents][Index]
clustered-intset
- a set-of-integers implementation favoring clustered keysThis is a hash-based set implementation supporting arbitrarily large non-negative integer keys, which is moderately well suited for large numbers of clustered key values.
The implementation uses Common Lisp's hash tables such that multiple set members share a hash key and hash value. This provides us the usual O(logN) capabilities of a hash table but reduces the required number of hash table entries by (for reasonably clustered data) by an order of magnitude. Furthermore, it uses only fixnums as hash table values (regardless of the size of integers in the set membership), thereby minimizing memory allocations. The net result is that if all your integer set members are fixnums, there are no extraneous allocations for references to boxed objects from your hash table (related to the intset data, at any rate). You can also have intset members which are larger than fixnums. In those cases only the hash table key is boxed, but not the value.
This release was tested in sbcl 2.1.6 and ABCL 1.8.0.
Assuming this has been provided by quicklisp and/or you've modified your quicklisp local-projects configuration to refer to the git repo cloned from https://github.com/dtenny/clustered-intset.
(ql:quickload :clustered-intset-test)
(clustered-intset-test:run-tests)
When an integer 'k' is added to the set, its hash table key is (/ k fixnum-bits)
, and its hashtable value is a fixnum used as a bitset for
members in the range 0 - fixnum-bits
relative to the base (floor k fixnum-bits)
.
Using SBCL as an example, it has 62 bit fixnums on the machine where this was
developed. Every hash table entry could represent up to 62 integer values in a
single hash table entry. An abusive example such as a dense value interval
[0-1,000,000)
would have 16,129 entries in the hash table, all keys and all
values being fixnums.
This implementation isn't designed for additional dense set techniques such as compression. It's fairly simple, and benchmarks were satisfactory w.r.t. the primary use-case that motivated this implementation. If you need more support for sparse or dense bitsets, see the references at the end of this note, in particular CL-ROARING.
Set members exceeding MOST-POSITIVE-FIXNUM will also work, potentially resulting in non-fixnum keys, but the hash-table values will still be fixnums.
I wanted to do some processing which involved representing many integer primary keys from a database in memory. In my case, I might want to pull in a million primary keys at a time and process them in ascending order.
With respect to Common Lisp tools, a million integers isn't a problem IF THEY START AT ZERO. You could use a bit-vector for that. The problem is when your integer primary key values have a range starting in the millions or billions. Billion element bit-vectors would require 125 megabytes of memory unless you do something smarter, and unfortunately I couldn't find any Common Lisp bit vector tools that were smarter with the possible exception of FSET (see below). Plain hash tables can handle large integers just fine, though you'll find that this package can substantially reduce memory requirements for large cardinality sets.
And so here we are. It isn't elegant, but it seems to do what I wanted it to do.
The focus is on integer set membership. The main things supported in the initial release are:
Hopefully the above actions are reasonably efficient.
The package doesn't presently support any bitmap-smashing types of operations. Certainly UNION (OR), INTERSECTION (AND), XOR and similar types of behaviours could be added, but since I didn't need them I haven't added them.
Feel free so submit a PR for this stuff, or you may be interested in the CL-ROARING package mentioned above and below, which supports a variety of logical bitmap operations.
I ran some very simple benchmarks (see integer-set-testing.lisp
)
of various approaches to integer sets for the element quantity and key ranges
that I required in my use-case. The benchmark may or may not run for you, it isn't
something I plan to keep up to date, was just proof of concept stuff for me.
See the lisp file for full details, here's some highlights:
;;; Using 1,000,000 intset entries, 20% holes, ranging from 1,000,000 to 2,250,239
;;; Duration in seconds, followed by bytes consed (Kilobytes/Megabytes/Gigabytes etc).
;;; Lower numbers are better.
;;;
;;; Ascending
;;; Implementation Loading Membership Traversal
;;; ================= =============== =========== =============
;;; Simple Bit Vector 0.005s, ~281K 0.003s, 0K 0.008s, 0K
;;; Hash Table 0.032s, ~24M 0.032s, 0K 0.299s, ~8M
;;; Clustered Intset 0.046s, ~490K 0.030s, 0K 0.022s, ~161K
;;; CL-Roaring (FFI) 0.014s, N/A 0.024s, N/A 0.011s, N/A
;;; FSET 1.592s, ~1.5G 0.429s, 0K 0.006s, 0K
;;; CL-intset 50.102s, ~406G 0.016s, 0K 0.020s, 0K
;;;
;;; Using 1,000,000 intset entries, 20% holes, ranging from 1,000,000,000 to 1,001,248,861
;;;
;;; Ascending
;;; Implementation Loading Membership Traversal
;;; ================= =============== =========== =============
;;; Simple Bit Vector 0.010s, ~125M 0.004s, 0K 2.542s, 0K
;;; Hash Table 0.032s, ~24M 0.031s, 0K 0.297s, ~8M
;;; Clustered Intset 0.048s, ~490K 0.029s, 0K 0.022s, ~161K
;;; CL-Roaring (FFI) 0.013s, N/A 0.023s, N/A 0.010s, N/A
;;; FSET 1.535s, ~1.5G 0.403s, 0K 0.005s, 0K
;;; CL-intset <cancelled after a couple of hours>
Key takeaways:
I probably did not give FSET a fair shake. I wasn't really looking for immutable semantics, and by the time I finally tried to use it I'd already written this package (CLUSTERED-INTSET) and also the CL-ROARING package. In my quick and dirty benchmark attempt, I also couldn't get FSET to load members efficiently. Probably all bad on my part, if there's something I missed let me know. There's also a question of how much memory it retains once it's loaded.
CL-INTSET uses single integers as bit sets, but that doesn't scale well to large keys such as database primary keys, e.g. set member 4,000,000 would require a four million bit bignum or bit-vector to represent it. It may have other scalability problems too, in benchmarking I had to kill the test where I ran set membership having range starting at one billion. There are a couple of packages that use integers as bitmaps, I'm not sure what the attraction of these implementations is, they are very impractical for large numbers.
In Clojure I've had very good experiences with data.int-map and its int-set capabilities. However aside from being written in a combination of java and clojure, the packaging is solving about 5 problems at once, most of which are more useful for Clojure than Common Lisp. Still, worth a look if you're seeking inspiration while writing another int-set or related class.
Finally, someone on reddit also suggested using FFI to access the roaringbitmap compressing bitmap implementations. In a weird development path, I wrote some of this CLUSTERED-INTSET package, then decided I didn't have anything to compare it to for benchmarking, so I wrote the CL-ROARING package, then I returned to this one.
I have to say using FFI for CL-ROARING was really nice and easy. Give it a look, it provides much more functionality than this lisp-only CLUSTERED-INTSET implementation, but it requires that you have the CRoaring shared library in your path, and it's limited to 32-bit unsigned integers, depending on where/how it was built.
Meanwhile, this package, CLUSTERED-INTSET, is pure lisp, easy to use, works well for large volumes and large values, and both the memory and CPU behavior of the package scale well enough with number of values and range of integers, subject to the limits of CL's hash-table implementation. It certainly works for my limited use case.
Next: Files, Previous: Introduction, Up: The clustered-intset Reference Manual [Contents][Index]
The main system appears first, followed by any subsystem dependency.
Implements a non-negative keyed set of integers favoring clustered keys.
Dave Tenny
MIT
0.0.1
alexandria (system).
Next: Packages, Previous: Systems, Up: The clustered-intset Reference Manual [Contents][Index]
Files are sorted by type and then listed depth-first from the systems components trees.
Next: clustered-intset/package.lisp, Previous: Lisp, Up: Lisp [Contents][Index]
clustered-intset (system).
Next: clustered-intset/clustered-intset.lisp, Previous: clustered-intset/clustered-intset.asd, Up: Lisp [Contents][Index]
clustered-intset (system).
Previous: clustered-intset/package.lisp, Up: Lisp [Contents][Index]
package.lisp (file).
clustered-intset (system).
Next: Definitions, Previous: Files, Up: The clustered-intset Reference Manual [Contents][Index]
Packages are listed by definition order.
Next: clustered-intset-asd, Previous: Packages, Up: Packages [Contents][Index]
Hash based int-set that is slightly more efficient for clustered keys
such as a set of integer primary key values from a database.
All keys must be non-negative integers.
common-lisp.
Previous: clustered-intset, Up: Packages [Contents][Index]
Next: Indexes, Previous: Packages, Up: The clustered-intset Reference Manual [Contents][Index]
Definitions are sorted by export status, category, package, and then by lexicographic order.
Next: Internals, Previous: Definitions, Up: Definitions [Contents][Index]
Next: Structures, Previous: Public Interface, Up: Public Interface [Contents][Index]
Add non-negative integer ’k’ to the the intset previously created with ‘make-intset‘. Return true if ’k’ was added, nil if ’k’ was already present.
Advance an iterator returned by ‘iterator‘. Return the next integer value in the iteration, or nil if there are no more values.
Return T if the non-negative integer ’k’ is a member of the specified intset, NIL otherwise.
Return the count of integers held by the intset. Named for sequence-like compatibiltiy.
Remove non-negative integer ’k’ from the intset previously created with ‘make-intset‘. Return true if ’k’ was removed, nil if ’k’ was already absent.
Return the smallest valued integer in the intset, or NIL if there are no members. This is not an efficient implementation, it is O(n).
Accumulate all members of the intset into a list and return it.
DIRECTION may be one of:
:UNORDERED (the default), which produces unordered results as efficiently as possible.
:ASCENDING, which produces intset members in ascending order.
:DESCENDING, which produces intset members in descending order.
Accumulate all members of the intset into a vector and return it.
VECTOR may be one of:
- nil, in which case a SIMPLE-VECTOR will be allocated, filled, and returned.
- a SIMPLE-VECTOR (i.e. ‘simple-vector-p‘ is true for the vector)
in which case elements are added from slots zero to N-1
where N is the number of elements in the intset. It is an error to specify an array
of insufficient size in this case.
- a non-adjustable vector with a fill-pointer, in which case elements are added as by
‘vector-push‘. Note that vector-push does not yield errors if the array would overflow,
it does nothing.
- a adjustable vector with a fill pointer, in which case elements are added as by
‘vector-push-extend‘.
Note that you can determine the size of the array to preallocate in full by allocating
‘(count intset)‘ elements.
DIRECTION may be one of:
:UNORDERED (the default), which produces unordered results as efficiently as possible.
:ASCENDING, which produces intset members in ascending order.
:DESCENDING, which produces intset members in descending order.
Creates and returns an unidirectional iterator for ordered traversal of the intset.
The iterator is called with ‘(advance <iterator>)‘ until it returns nil.
E.g. (loop with iterator = (iterator intset)
as value = (advance iterator)
while value
do ... something with value ...)
Values returned by ADVANCE will be ascending or descending according to the value of
DESCENDING. By default, iteration proceeds with all intset values in ascending order.
If DESCENDING is true, iterataion proceeds in descending order. As explained below,
DESCENDING affects comparison of STARTING-WITH and ENDING-WITH values.
A single iterator is always unidirectional
STARTING-WITH, if specified, is an inclusive value limiting the iteration to
start to the first value equal to or (> if DESCENDING, < if not DESCENDING) the value.
ENDING-WITH, is specified, is an exclusive value limiting the iteration
range end to the first value (> if DESCENDING, < if not DESCENDING) the value.
If both STARTING-WITH and ENDING-WITH are supplied and overlap in semantically incompatible
ways an error will be signalled, e.g.
‘(iterator intset :starting-with 1 :ending-with 1)‘ => ERROR
The intset from which the iterator was created may be updated while iterating.
It is undefined whether newly added values will be produced by new calls to ADVANCE.
If a member k is deleted from the intset undergoing iteration after the iterator
is created, it will not appear in any ADVANCE calls after deletion.
Create an empty int-set.
If you’d like to create an intset with an input sequence see SEQ->INTSET.
:size - a non-negative integer hint as to how many keys (or key clusters) will
be in the underlying hashtable. Remember that if you have good clustering,
you might have (/ N +FIXNUM-BITS+) keys, so be careful not to overallocate.
You are NOT declaring an equivalent number of keys as you would with a hash-table.
:rehash-size - indicates how to expand the table when it fills up.
If an integer, add
space for that many elements. If a floating point number (which must be
greater than 1.0), multiply the size by that amount.
Call ‘f‘, a function of one argument, with each member of the intset.
The order of element traversal is undefined. If you wanted ascending or descending
traversal, see ‘map-sorted-inset‘.
Behavior is undefined if the function modifies the intset, or if concurrent updates are
made from another thread.
Returns nil.
Call ‘f‘, a function of one argument, with each member of the inset in the ascending
or descending order according to the value of ‘direction‘, which must be :ascending or
:descending.
Note that this function is less efficient than unordered traversals of ‘map-intset‘
as it requires a sort on the intset’s hash table keys (and only those keys).
However the behavior is equally efficient to the unordered traversal within the
range of up to clustered-intset:+fixnum-bits+ values that might be associated with a
hash table key.
Behavior is undefined if the function modifies the intset, or if concurrent updates are
made from another thread.
Returns nil.
For a given non-negative integer k, which does not have to be a member of the intset,
find the next larger member of the intset. Return NIL if there is no appropriate member.
This is not an efficient implementation, this is O(n) and is a misguided convenience function.
Consider using the ITERATOR capability instead if that’s what you’re attempting to do,
or one of the mapping functions.
Create and return an intset populated from the specified sequence (vector or list). Makes a rough guess about the number of hash table keys you need (HT capacity) based on the range, or quantity, of sequence elements.
Previous: Ordinary functions, Up: Public Interface [Contents][Index]
An object representation iteration state over an INTSET
structure-object.
hash-table
clustered-intset::intset
function
(or null vector)
keyword
clustered-intset::key-type
0
clustered-intset::value-type
0
clustered-intset::fix-mod-type
0
Previous: Public Interface, Up: Definitions [Contents][Index]
Next: Ordinary functions, Previous: Constants, Up: Internals [Contents][Index]
Given an intset and a non-negative integer member k in the intset, bind vars in the list vars
as follows:
first var - the hashtable key, effectively (/ k +fixnum-bits+).
second var - the hashtable value for the hashtable key, or zero if it doesn’t exist.
third var - the modulus of (/ k +fixnum-bits+) which is the bit position for k in hash-value.
and execute ’body’ with those bindings.
Next: Structures, Previous: Macros, Up: Internals [Contents][Index]
An an iterator hasn’t been activated yet, activate it and begin looking for keys.
Return the next value, if any, or nil if there isn’t any. Update iterator state accordingly.
Return a function which acts as a filter for hash table keys
to see if they quality for iteration according to ‘iterator‘ semantics below.
We’re setting up for the first HT key in iteration, or moving to a new HT key.
Set the iterator state to reflect this HT key, and find the first bit position
with a relevant value. Assuming the iterator initialization filtered any keys without
any applicable values, we’ll always return a value for a new HT key here.
Does not maintain the remaining-ht-keys slot of the iterator.
Does update the ’current’ values iterator.
Returns the intset element value for the first value associated with the HT key, which should not be nil.
ht.
For an active iterator, if there’s another value to be retrieved for the current hash-table key, advance the iterator to that bit and return the value. Otherwise return nil.
For an active iterator, if there’s another hash key to process, advance to that key and return the value. Otherwise mark the iterator inactive and return nil.
Given a hash-table HT, and a key which may or may not be that hash-table,
return the next key (with multiple values) in the hashtable with a value greater than ‘ht-key‘,
or nil if there is no such key.
HT-KEY may be -1, which is not usually accepted in this module, for a a ’first’ like
behavior that will find the lowest valued key in the intset, there are any keys.
Returns (values nil nil) or (values next-key value-for-key).
Advance an in-progress iterator if we can, returning the next value and updating iterator state. If there are no more values, deactivate the iterator and return nil.
Given a hash-table key, the value from the hash table, and the bit position of some set member
stored in that key/value pair, return the next ascending value available for the
ht key/value pair or NIL if there is no next value.
HT-VALUE may be zero, in which case we assume there are no valid keys for the value and you just
happened to have obtained it via ‘with-hash-kv‘.
BIT-POSITION may be nil if this is the first search for a value in HT-VALUE
otherwise it is deemed to be the position or an intset member for which we want to find
a successor value, i.e. the search starts at (1+ bit-position).
For some intset member K, check to see if there is some value in the hashtable value bits
associated with the hashtable key for K ‘(floor k fixnum-bits)‘, for which
pred is true. For example:
(some-ht-val ht 43 #’<)
For 62 fixnum bits, would check ht key 0 for any ht value bits indicating quantities < 43.
Similarly:
(some-ht-val ht 143 #’<=)
For 62 fixnum bits would check for values associated with that key from (* 2 62) to
(* 3 62) that were <= 143.
If we find a PRED-qualified value, return true, otherwise return nil.
Similar to ‘ht-key-filter‘, only in thise case we’re looking at intset members values
with the optional iteration controls (starting-with, ending-with, descending). Return a
predicate which accepts an intset member value and returns true if an intset member qualifies,
w.r.t. the optional range value, nil if it does not.
Next: Types, Previous: Ordinary functions, Up: Internals [Contents][Index]
structure-object.
integer
0
Previous: Structures, Up: Internals [Contents][Index]
Type of value resulting from ‘(mod k +fixnum-bits+)‘.
Generally used to specify the position of a bit in fixnum used as a bitvector.
Type of keys stored in the hashtable.
Type of values stored in the hashtable.
Previous: Definitions, Up: The clustered-intset Reference Manual [Contents][Index]
Jump to: | %
(
A C D F H I M N S V W |
---|
Jump to: | %
(
A C D F H I M N S V W |
---|
Next: Data types, Previous: Functions, Up: Indexes [Contents][Index]
Jump to: | +
C D E H I R S V |
---|
Jump to: | +
C D E H I R S V |
---|
Jump to: | C F I K P S T V |
---|
Jump to: | C F I K P S T V |
---|