diff --git a/code/processes/idb.q b/code/processes/idb.q
index b566da0bc..844a6c4e8 100644
--- a/code/processes/idb.q
+++ b/code/processes/idb.q
@@ -113,3 +113,8 @@ maptoint:{[val]
   /- if using a symbol column, enumerate against the hdb sym file
   sym?`TORQNULLSYMBOL^val]
  };
+
+/- helper function to support queries against the sym column in partbyfirstchar
+mapfctoint:{[val]
+  .Q.an?$[0system"s";
+    [.lg.o[`sortandmerge;"sorting on worker sort", string .z.p];
+    {(neg x)(`.wdb.reloadsymfile;y);(neg x)(::)}[;.Q.dd[hdbsettings `hdbdir;`sym]] each .z.pd[];
+    {[x;compression] setcompression compression;.sort.sorttab x;if[gc;.gc.run[]]}[;hdbsettings`compression] peach tnds];
+    [.lg.o[`sort;"sorting on main sort"];
+    reloadsymfile[.Q.dd[hdbsettings `hdbdir;`sym]];
+    {[x] .sort.sorttab[x];if[gc;.gc.run[]]} each tnds]];
+  .lg.o[`sort;"finished sorting data"];
+  endofdaymerge[dir;pt;tablist;mergelimits;hdbsettings;mergemethod;writedownmode];
+  };
+
 /- end of day sort [depends on writedown mode]
 endofdaysort:{[dir;pt;tablist;writedownmode;mergelimits;hdbsettings;mergemethod]
   /- set compression level (.z.zd)
   setcompression[hdbsettings[`compression]];
   $[writedownmode in partwritemodes;
-    endofdaymerge[dir;pt;tablist;mergelimits;hdbsettings;mergemethod;writedownmode];
+    $[writedownmode~`partbyfirstchar; /-partbyfirstchar will not be sorted by sym within each partition, this needs done first
+      endofdaysortandmerge[dir;pt;tablist;mergelimits;hdbsettings;mergemethod;writedownmode];
+      endofdaymerge[dir;pt;tablist;mergelimits;hdbsettings;mergemethod;writedownmode]];
     endofdaysortdate[dir;pt;key tablist;hdbsettings]
     ];
   /- run steps to rollover idb
@@ -534,7 +563,7 @@ fixpartition:{[subto]
   ];
  }

-/- for writedown modes partbyenum/default we make sure that partition 0/currentpartition has all the tables.
+/- for writedown modes partbyenum/partbyfirstchar/default we make sure that partition 0/currentpartition has all the tables.
 /- In that case we can use .Q.chk later to fill the db making it useable for intraday processes
 /- pt - date; partition for which the function should initialise
 initmissingtables:{[pt]
@@ -550,7 +579,7 @@ filldb:{[pt]

 /- initialises table t in db with its schema in part
 inittable:{[t;pt]
-  tabledir:` sv $[writedownmode~`partbyenum; .Q.par[.Q.dd[hsym savedir;pt];0;t]; .Q.par[hsym savedir;pt;t]],`;
+  tabledir:` sv $[writedownmode in `partbyenum`partbyfirstchar; .Q.par[.Q.dd[hsym savedir;pt];0;t]; .Q.par[hsym savedir;pt;t]],`;
   if[() ~ key tabledir;tabledir set .Q.en[hsym hdbdir;0#value t]];
  }

@@ -590,7 +619,7 @@ getsortparams:{[]
   /- get the attributes csv file
   /-even if running with a sort process should read this file to cope with backups
   .sort.getsortcsv[sortcsv];
-  /- check the sort.csv for parted attributes `p if the writedownmode `partbyattr or `partbyenum is selected
+  /- check the sort.csv for parted attributes `p if the writedownmode `partbyattr, `partbyenum or `partbyfirstchar is selected
   /- if each table does not have at least one `p attribute the process will exit
   if[writedownmode in partwritemodes;
@@ -612,7 +641,7 @@ getsortparams:{[]
 /- If the function is ran on sort process send initmissingtables command to wdbs
 idbreload:{[pt]
   .lg.o[`idb;"starting idb reload"];
-  if[writedownmode in `partbyenum`default;
+  if[writedownmode in `partbyenum`default`partbyfirstchar;
     .lg.o[`eod;"initialising wdbhdb for partition: ",string[pt]];
     $[.proc.proctype~`sort;{[pt]ws:exec w from .servers.getservers[`proctype;wdbtypes;()!();1b;0b];{[ws;pt]ws(`.wdb.initmissingtables;[pt])}[;pt] each ws}[pt];initmissingtables[pt]];
     .lg.o[`eod;"notifying idbs for newly created partition"];
diff --git a/docs/Processes.md b/docs/Processes.md
index b25ea2750..955de3f14 100755
--- a/docs/Processes.md
+++ b/docs/Processes.md
@@ -1035,6 +1035,24 @@ sorting at the end of the day.
 In the above example, the data is parted by sym, and number 456 is the
 order of MSFT_N symbol entry in the HDB sym file.

+- partbyfirstchar - Data is persisted to a partition scheme where the partition
+  is derived from the first character of the sym column present in the sort.csv
+  file. Like partbyenum, this can only be done by one column which has the
+  parted attribute applied to it. It must be a symbol column due to the nature
+  of the character extraction. The numerical value for each character maps
+  to the index of that character in .Q.an. Characters that are not contained
+  within .Q.an map to the count of .Q.an. Partitioning in this way
+  means that the data within each partition is not sorted for the parted
+  attribute to be applied, which means in the EOD process the data needs to be
+  sorted before being merged. This sort happens partition by partition rather
+  than as a whole. The wdb partition scheme is of the form
+  \[wdbdir\]/\[partitiontype\]/\[first char index .Q.an\]/\[table(s)\]/
+  A typical partition directory would be similar to (for ex sym: MSFT_N)
+  wdb/database/2025.11.04/38/trade
+  In the above example, the data is parted by sym, and number 38 is the
+  index position of M in .Q.an.
+
+
 The advantage of partbyenum over partbyattr could be that the directory
 structure it uses represents a HDB that is ready to be loaded intraday.
 At the end of the day the data gets upserted to the HDB the
@@ -1046,7 +1064,10 @@ data sets with a low cardinality (ie. small number of distinct elements)
 the optional method may provide a significant time saving, upwards of
 50%. The optional method should also reduce the memory usage at the end
 of day event, as joining data is generally less memory intensive than
-sorting.
+sorting. The optional partbyfirstchar method provides a way of subdividing
+data with a high cardinality to reduce the number of partitions being
+written to, while reducing the memory footprint of the final sort
+versus default.
@@ -1056,13 +1077,19 @@ Intraday Database (IDB)
 The Intraday Database or IDB is a simple process that allows access to
 data written down intraday. This assumes that there is an existing WDB
 (and HDB) process creating a DB on disk that can be loaded with a simple
-load command. As of now default and partbyenum WDB writedown modes are supported.
-The responsibility of an IDB is therefore:
+load command. As of now default, partbyenum and partbyfirstchar WDB writedown
+modes are supported. The responsibility of an IDB is therefore:
 1. Serving queries. Since partbyenum writedown mode is done by enumerated
    symbol columns a helper function maptoint is implemented to support
    symbol lookup in sym file:
    select from trade where int=maptoint[`MSFT_N]
+   With partbyfirstchar being an alternate approach to creating a
+   numerical partition, there is also a helper function to locate the
+   correct value:
+   select from trade where int=mapfctoint[`MSFT],sym=`MSFT
+   select from trade where int in mapfctoint[`MSFT`AAPL],sym in `MSFT`AAPL
+
 2. Can be triggered for a reload. This is usually done by the WDB
    process periodically.
@@ -1097,6 +1124,10 @@ The IDB can be queried just like any other HDB. If writedown mode partbyenum is
 ```
 neg[gwHandle](`.gw.asyncexec;"select from trade where int=maptoint[`GOOG]";`idb);gwHandle[]
 ```
+Likewise, if partbyfirstchar writedown mode is used there is a "mapfctoint" helper which can be used
+```
+neg[gwHandle](`.gw.asyncexec;"select from trade where int in mapfctoint[`GOOG`MSFT],sym in `GOOG`MSFT";`idb);gwHandle[]
+```

 ### Scalability