@pmurphy979 pmurphy979 commented Feb 26, 2025

Implements this request: partbyenum for ints

  • Modified the WDB in partbyenum mode to partition by "raw" column values if the single p# column in sort.csv has type short, int, or long. Values are clamped to the range 0 to 2147483647 (i.e. 0Wi). In testing, a table with partition values above 0Wi returned overflowed values (-2147483648) in its int column for some queries.
  • Added a WDB maptoint function to handle symbol and integer encoding and updated the IDB maptoint function with equivalent logic. Ideally the IDB would just get the same function definition from the WDB, but there are minor differences (WDB version uses on-disk sym file and must explicitly cast enumerated symbols to long). Testing shows no major performance change between old and new IDB maptoint functions.
  • Added new WDB tests for partitioning by integer column types and confirmed new and existing tests pass.
  • Updated comments and docs to mention partbyenum can also support an integer column.
  • Fixed some typos.
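
The clamping and type dispatch described in the list above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the TorQ source: the names clampint and maptointsketch are hypothetical, and sym is a small in-memory symbol list standing in for the real sym file.

```q
// Hypothetical helper: clamp integer partition values to the range 0 .. 2147483647
clampint:{0|2147483647&`long$x}

// Hypothetical stand-in for the sym file: a small in-memory symbol list
sym:`TORQNULLSYMBOL`AAPL`GOOG

// Short, int and long inputs (types 5 6 7h) are clamped;
// symbols are looked up in sym, with null symbols mapped to TORQNULLSYMBOL
maptointsketch:{$[(abs type x) in 5 6 7h; clampint x; sym?`TORQNULLSYMBOL^x]}

maptointsketch 0 -5 100 2147483648 9223372036854775807   // 0 0 100 2147483647 2147483647
maptointsketch 100h                                       // 100
maptointsketch `GOOG`AAPL`                                // 2 1 0
```

Negative values collapse into partition 0 and anything above the max int collapses into partition 2147483647, so distinct out-of-range values end up sharing a partition.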

@pmurphy979
Contributor Author

Unit tests

Existing tests

$ bash tests/stp/wdb/run.sh -d
...
q)select count i by file, ok, okms, okbytes, valid from KUTR
file                                                 ok okms okbytes valid| x 
--------------------------------------------------------------------------| --
:/home/pmurphy/git/TorQ/tests/stp/wdb/partbyenum.csv 1  1    1       1    | 13
:/home/pmurphy/git/TorQ/tests/stp/wdb/singlelog.csv  1  1    1       1    | 22
:/home/pmurphy/git/TorQ/tests/stp/wdb/tabperiod.csv  1  1    1       1    | 30
:/home/pmurphy/git/TorQ/tests/stp/wdb/tabular.csv    1  1    1       1    | 30

$ bash tests/stp/idbpartbyenum/run.sh -d
...
q)select count i by file, ok, okms, okbytes, valid from KUTR
file                                                        ok okms okbytes valid| x 
---------------------------------------------------------------------------------| --
:/home/pmurphy/git/TorQ/tests/stp/idbpartbyenum/idbenum.csv 1  1    1       1    | 21

$ bash tests/wdb/nullpartbyenum/run.sh -d
...
q)select count i by file, ok, okms, okbytes, valid from KUTR
file                                                                ok okms okbytes valid| x 
-----------------------------------------------------------------------------------------| --
:/home/pmurphy/git/TorQ/tests/wdb/nullpartbyenum/nullpartbyenum.csv 1  1    1       1    | 14

New tests

$ bash tests/wdb/intpartbyenum/run.sh -d
...
q)select count i by file, ok, okms, okbytes, valid from KUTR
file                                                              ok okms okbytes valid| x 
---------------------------------------------------------------------------------------| --
:/home/pmurphy/git/TorQ/tests/wdb/intpartbyenum/intpartbyenum.csv 1  1    1       1    | 10

@pmurphy979
Contributor Author

Int partition limits

The initial plan was to allow all partition values right up to the max long integer value 9223372036854775807. However, the int column seems to show an integer overflow with certain (and perhaps unrealistic) access patterns.

Clamping at the max int value 2147483647 looks like a safer bet, since this overflow is avoided and 2 billion different partitions is probably more than necessary anyway.

// Function to splay a table with one integer value in an int partition with the same value
q)makepar:{(` sv .Q.par[`:.;x;`testtab],`) set ([]expint:enlist x)}

// Make and load an int partitioned database from long integer values
// Values are in and around the min and max non-negative short, int, and long values
q)system "mkdir longdb"
q)system "cd longdb"
q)string makepar each 0 1 32766 32767 32768 2147483646 2147483647 2147483648 9223372036854775805 9223372036854775806 9223372036854775807
":./0/testtab/"
":./1/testtab/"
":./32766/testtab/"
":./32767/testtab/"
":./32768/testtab/"
":./2147483646/testtab/"
":./2147483647/testtab/"
":./2147483648/testtab/"
":./9223372036854775805/testtab/"
":./9223372036854775806/testtab/"
":./0W/testtab/"
q)\l .
q)meta testtab
c     | t f a
------| -----
int   | j    
expint| j

// int and expected int agree
q)testtab
int                 expint             
---------------------------------------
0                   0                  
1                   1                  
32766               32766              
32767               32767              
32768               32768              
2147483646          2147483646         
2147483647          2147483647         
2147483648          2147483648         
9223372036854775805 9223372036854775805
9223372036854775806 9223372036854775806

// int and expected int still agree
q)select from testtab
int                 expint             
---------------------------------------
0                   0                  
1                   1                  
32766               32766              
32767               32767              
32768               32768              
2147483646          2147483646         
2147483647          2147483647         
2147483648          2147483648         
9223372036854775805 9223372036854775805
9223372036854775806 9223372036854775806
0W                  0W                 

// int and expected int DO NOT agree
q)select int, expint from testtab
int         expint             
-------------------------------
0           0                  
1           1                  
32766       32766              
32767       32767              
32768       32768              
2147483646  2147483646         
2147483647  2147483647         
-2147483648 2147483648         
-2147483648 9223372036854775805
-2147483648 9223372036854775806
-2147483648 0W                 

// Order of columns in query doesn't matter
q)select expint, int from testtab
expint              int        
-------------------------------
0                   0          
1                   1          
32766               32766      
32767               32767      
32768               32768      
2147483646          2147483646 
2147483647          2147483647 
2147483648          -2147483648
9223372036854775805 -2147483648
9223372036854775806 -2147483648
0W                  -2147483648

// Not just a print error - the values are not the same
q)update match:int~'expint from select int, expint from testtab
int         expint              match
-------------------------------------
0           0                   1    
1           1                   1    
32766       32766               1    
32767       32767               1    
32768       32768               1    
2147483646  2147483646          1    
2147483647  2147483647          1    
-2147483648 2147483648          0    
-2147483648 9223372036854775805 0    
-2147483648 9223372036854775806 0    
-2147483648 0W                  0    

// Selecting just int is fine
q)select int from testtab
int                
-------------------
0                  
1                  
32766              
32767              
32768              
2147483646         
2147483647         
2147483648         
9223372036854775805
9223372036854775806
0W                 

// Fine
q)select count expint by int from testtab
int                | expint
-------------------| ------
0                  | 1     
1                  | 1     
32766              | 1     
32767              | 1     
32768              | 1     
2147483646         | 1     
2147483647         | 1     
2147483648         | 1     
9223372036854775805| 1     
9223372036854775806| 1     
0W                 | 1     

// Also fine
q)select first expint by int from testtab
int                | expint             
-------------------| -------------------
0                  | 0                  
1                  | 1                  
32766              | 32766              
32767              | 32767              
32768              | 32768              
2147483646         | 2147483646         
2147483647         | 2147483647         
2147483648         | 2147483648         
9223372036854775805| 9223372036854775805
9223372036854775806| 9223372036854775806
0W                 | 0W

// Not fine
q)select first int by expint from testtab
expint             | int        
-------------------| -----------
0                  | 0          
1                  | 1          
32766              | 32766      
32767              | 32767      
32768              | 32768      
2147483646         | 2147483646 
2147483647         | 2147483647 
2147483648         | -2147483648
9223372036854775805| -2147483648
9223372036854775806| -2147483648
0W                 | -2147483648

// int variable looks fine
q)int
0 1 32766 32767 32768 2147483646 2147483647 2147483648 9223372036854775805 9223372036854775806 0W

// Try similar experiment with int values
q)system "mkdir ../intdb"
q)system "cd ../intdb"
q)string {(` sv .Q.par[`:.;x;`testtab],`) set ([]expint:enlist x)} each 0 1 32766 32767 32768 2147483645 2147483646 2147483647i
":./0/testtab/"
":./1/testtab/"
":./32766/testtab/"
":./32767/testtab/"
":./32768/testtab/"
":./2147483645/testtab/"
":./2147483646/testtab/"
":./0W/testtab/"
q)\l .
q)meta testtab
c     | t f a
------| -----
int   | j    
expint| i    

// Problem still persists for max partition value
q)select from testtab
int        expint    
---------------------
0          0         
1          1         
32766      32766     
32767      32767     
32768      32768     
2147483645 2147483645
2147483646 2147483646
0W         0W        
q)select int, expint from testtab
int         expint    
----------------------
0           0         
1           1         
32766       32766     
32767       32767     
32768       32768     
2147483645  2147483645
2147483646  2147483646
-2147483648 0W        

// all int and expint agree if the max partition is removed
q)system "rm -r 0W"
q)\l .
q)select int, expint from testtab
int        expint    
---------------------
0          0         
1          1         
32766      32766     
32767      32767     
32768      32768     
2147483645 2147483645
2147483646 2147483646

// Repeat the experiment with shorts
q)system "mkdir ../shortdb"
q)system "cd ../shortdb"
q)string {(` sv .Q.par[`:.;x;`testtab],`) set ([]expint:enlist x)} each 0 1 32765 32766 32767h
":./0/testtab/"
":./1/testtab/"
":./32765/testtab/"
":./32766/testtab/"
":./0W/testtab/"
q)\l .
q)meta testtab
c     | t f a
------| -----
int   | j    
expint| h    
q)select int, expint from testtab
int         expint
------------------
0           0     
1           1     
32765       32765 
32766       32766 
-2147483648 0W        
q)system "rm -r 0W"
q)\l .
q)select int, expint from testtab
int   expint
------------
0     0     
1     1     
32765 32765 
32766 32766

// Does deleting the 0W partition solve the issue in the original database?
q)system "cd ../longdb"
q)system "rm -r 0W"
q)\l .

// No
q)select int, expint from testtab
int         expint             
-------------------------------
0           0                  
1           1                  
32766       32766              
32767       32767              
32768       32768              
2147483646  2147483646         
2147483647  2147483647         
-2147483648 2147483648         
-2147483648 9223372036854775805
-2147483648 9223372036854775806
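
Since the int variable itself holds the correct values (as shown earlier in this comment), one way to spot affected partitions before querying is to inspect the partition list directly. A minimal sketch, where overlimit is a hypothetical helper name and not part of TorQ:

```q
// Hypothetical helper: return partition values above the max int 2147483647
overlimit:{x where x>2147483647}

// In a loaded int-partitioned database this would be called as: overlimit int
overlimit 0 1 32766 2147483647 2147483648 9223372036854775806 0W   // 2147483648 9223372036854775806 0W
```

An empty result would suggest every partition value sits within the range that behaved correctly in the tests above.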

@pmurphy979
Contributor Author

maptoint performance testing

// Simulate an IDB written in partbyenum mode

// One date's worth of dummy trade data
q)n:1000000
q)syms:`AAPL`GOOG`IBM`MSFT`YHOO
q)trade:([]time:.z.D+asc n?24:00; sym:n?syms; price:n?100f; size:n?1000i)

// Enumerate table and load sym file
q)trade:.Q.en[`:idb] trade
q)load `:idb/sym

// Write to enumerated partitions
q){(` sv .Q.par[`:idb;`long$sym?x;`trade],`) set select from trade where sym=x} each syms
`:idb/2/trade/`:idb/3/trade/`:idb/0/trade/`:idb/1/trade/`:idb/4/trade/

// Load IDB
q)\l idb
q)select first sym, trades:count sym by int from trade
int| sym  trades
---| -----------
0  | IBM  199827
1  | MSFT 200156
2  | AAPL 199895
3  | GOOG 199878
4  | YHOO 200244

// Current and new versions of maptoint IDB function
q)maptoint:{sym?`TORQNULLSYMBOL^x}
q)maptointnew:{$[(abs type x) in 5 6 7h; 0| 2147483647& `long$ x; sym?`TORQNULLSYMBOL^x]}

// Check all methods return same results
q)count select from trade where sym=`GOOG
199878
q)(select from trade where sym=`GOOG)~select from trade where int=maptoint`GOOG
1b
q)(select from trade where sym=`GOOG)~select from trade where int=maptointnew`GOOG
1b

// Compare performance
q)\ts:1000 select from trade where sym=`GOOG
5036 18877664
q)\ts:1000 select from trade where int=maptoint`GOOG
181 2099216
q)\ts:1000 select from trade where int=maptointnew`GOOG
174 2099216

// Some more testing
q)\ts:1000 select from trade where int=maptoint`GOOG
169 2099216
q)\ts:1000 select from trade where int=maptointnew`GOOG
171 2099216
q)\ts:1000 select from trade where int=maptoint`GOOG
173 2099216
q)\ts:1000 select from trade where int=maptointnew`GOOG
172 2099216
q)\ts:1000 select from trade where int=maptoint`AAPL
176 2099216
q)\ts:1000 select from trade where int=maptointnew`AAPL
170 2099216
q)\ts:1000 select from trade where int=maptointnew`IBM
175 2099216
q)\ts:1000 select from trade where int=maptoint`IBM
192 2099216
q)\ts:100 select first sym, trades:count sym by partition:maptoint sym from trade
692 4206672
q)\ts:100 select first sym, trades:count sym by partition:maptointnew sym from trade
673 4206672

Member

@jonathonmcmurray jonathonmcmurray left a comment

LGTM

@pmurphy979 pmurphy979 merged commit ea7cb8c into master Apr 3, 2025
@pmurphy979 pmurphy979 deleted the partbyenum-for-ints branch April 3, 2025 08:55