@pmurphy979
Contributor

Fixes #692

  • Moved the applysortandattr call from inside the .replay.replaylog loop to immediately after it, so that sorting happens the minimal number of times (once per table partition) rather than once per log file.
  • Also needed to move the .replay.pathlist definition from inside the loop to immediately before it, so it doesn't reset between log files. If TP logs are more granular than the HDB partition type (e.g. hourly vs daily) then .replay.pathlist will have duplicate partition paths per table, but applysortandattr already handles this by doing distinct each pathlist.
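To make the saving concrete, here is a minimal sketch of how the number of sorts changes. It is a toy model under stated assumptions, not the tickerlogreplay code: `tradepath`, `quotepath`, `perlog`, `oldsorts` and `newsorts` are hypothetical names, and each of the 48 hourly logs is assumed to write to exactly one daily partition path.

```q
// Hypothetical partition paths (one daily partition per table)
tradepath:`:hdb2/2025.02.10/trade;
quotepath:`:hdb2/2025.02.10/quote;

// Paths touched by each log: 24 hourly trade logs, then 24 hourly quote logs
perlog:(24#enlist enlist tradepath),24#enlist enlist quotepath;

// Sorting inside the loop: the path list resets per log, so every log triggers a sort
oldsorts:sum count each distinct each perlog;

// Sorting after the loop: accumulate every path, then sort each distinct path once
pathlist:raze perlog;
newsorts:count distinct pathlist;

show oldsorts;   / 48
show newsorts;   / 2
```

The accumulated `pathlist` contains each path 24 times here, which is why taking `distinct` before sorting matters.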

@pmurphy979
Contributor Author

Checking that tickerlog unit tests still pass

Before change: all tests passing

$ bash tests/stp/tickerlog/run.sh -d
...
q)count KUTR
68

q)exec count i by ok, okms, okbytes, valid from KUTR
ok okms okbytes valid|   
---------------------| --
1  1    1       1    | 68

q)exec avg msx from KUTR
6.220588

After change: all tests still passing, slight decrease in average execution time

(these tests only use trade and quote tables with 20 rows, so a better performance test is needed)

$ bash tests/stp/tickerlog/run.sh -d
...
q)count KUTR
68

q)exec count i by ok, okms, okbytes, valid from KUTR
ok okms okbytes valid|   
---------------------| --
1  1    1       1    | 68

q)exec avg msx from KUTR
5.926471

@pmurphy979
Contributor Author

Performance test

About 67% less wall-clock time (roughly 3x faster) on the same set of hourly-segmented log files for 1M trades and 5M quotes:

(see below for script definitions and test data generation)

$ time run_old.sh
real    0m36.950s
user    0m13.341s
sys     0m10.370s

$ time run_new.sh
real    0m12.245s
user    0m10.406s
sys     0m1.283s
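For reference, the headline figure follows from the `real` times above. A quick check in q (the variable names are only for illustration):

```q
old:36.950;             / wall-clock seconds, run_old.sh
new:12.245;             / wall-clock seconds, run_new.sh
reduction:1-new%old;    / fraction of wall-clock time saved, about 0.669
speedup:old%new;        / ratio of old to new run time, about 3.02
show reduction;
show speedup;
```

Most of the difference is in `sys` time (10.370s down to 1.283s), which is consistent with far fewer on-disk sorts, although the timings alone do not prove that.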

run_old.sh - runs old process code

testpath=${KDBTESTS}/tickerlogreplay
${RLWRAP} ${QCMD} ${TORQHOME}/torq.q \
  -proctype tickerlogreplay -procname tplogreplay1 \
  -load ${KDBCODE}/processes/tickerlogreplay_old.q \
  -.replay.schemafile ${testpath}/database.q \
  -.replay.tplogdir ${testpath}/tplogs \
  -.replay.hdbdir ${testpath}/hdb1

run_new.sh - runs new process code and writes to separate HDB

testpath=${KDBTESTS}/tickerlogreplay
${RLWRAP} ${QCMD} ${TORQHOME}/torq.q \
  -proctype tickerlogreplay -procname tplogreplay2 \
  -load ${KDBCODE}/processes/tickerlogreplay.q \
  -.replay.schemafile ${testpath}/database.q \
  -.replay.tplogdir ${testpath}/tplogs \
  -.replay.hdbdir ${testpath}/hdb2

${testpath}/database.q

trade:([]time:`timestamp$(); sym:`symbol$(); price:`float$(); size:`int$())
quote:([]time:`timestamp$(); sym:`symbol$(); bid:`float$(); ask:`float$(); bsize:`int$(); asize:`int$())

${testpath}/tplogs - generated by the following script:

// Make a test directory of trade and quote tickerplant logs, segmented hourly
stpprocname:`stp1  // dummy stp process name to include in log file names
nt:1000000         // number of trades
nq:5000000         // number of quotes
syms:`AAPL`GOOG`IBM`MSFT`YHOO
// One date's worth of dummy trade and quote data
trade:([]time:.z.D+asc nt?24:00; sym:nt?syms; price:nt?100f; size:nt?1000i)
quote:([]time:.z.D+asc nq?24:00; sym:nq?syms; bid:nq?100f; ask:nq?100f; bsize:nq?1000i; asize:nq?1000i)
// Minimal stpmeta columns needed by tickerlogreplay process - this table is filled below
stpmeta:([]logname:`$();tbls:())
// Function to write a single logfile to a tplogs directory (seg is a timestamp, e.g. hour bucket)
writelog:{[tabname;seg;data]
  logfile: hsym `$ "tplogs/", string[stpprocname], "_", string[tabname], -9_ except[;".D:"] string seg;
  logfile set ();
  h:hopen logfile;
  h data;
  hclose h;
  `stpmeta upsert (logfile; enlist tabname);
  logfile
 }
// Write hourly trade and quote log files (2*24=48 log files in total)
writelog[`trade] ./: flip (key;value) @\: exec enlist[`upd;`trade;] each flip (time;sym;price;size) by 0D01:00 xbar time from trade;
writelog[`quote] ./: flip (key;value) @\: exec enlist[`upd;`quote;] each flip (time;sym;bid;ask;bsize;asize) by 0D01:00 xbar time from quote;
// Write stpmeta table to same directory
`:tplogs/stpmeta set stpmeta;
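As a sanity check on the log format used by writelog, the sketch below writes two `upd` messages to a log file and replays them with `-11!`. It assumes `/tmp` is writable; the file name `example_tplog` and the use of `upd:insert` as the replay handler are illustrative choices, not what the tickerlogreplay process does internally.

```q
// Empty table matching the trade schema in database.q
trade:([]time:`timestamp$();sym:`symbol$();price:`float$();size:`int$());

// Hypothetical log file location
logfile:`:/tmp/example_tplog;
logfile set ();                 / initialise an empty log
h:hopen logfile;
// Each message is (function name; table name; row data)
h enlist (`upd;`trade;(2025.02.10D09:00:00.000000000;`AAPL;100f;10i));
h enlist (`upd;`trade;(2025.02.10D09:30:00.000000000;`IBM;50f;20i));
hclose h;

// Replay: -11! evaluates every message in the log, calling upd for each one
upd:insert;
n:-11!logfile;                  / number of messages replayed
show n;
show trade;
```

After the replay, `trade` holds one row per message written to the log.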

Confirming correct sorting and attributes are applied in ${testpath}/hdb2:

(confirmed same results in ${testpath}/hdb1)

$ q hdb2

q)count trade
1000000

q)meta trade
c    | t f a
-----| -----
date | d    
time | p    
sym  | s   p
price| f    
size | I

q)select attr sym by date from trade
date      | sym
----------| ---
2025.02.10| p

q)select time~asc time by date, sym from trade
date       sym | time
---------------| ----
2025.02.10 AAPL| 1   
2025.02.10 GOOG| 1   
2025.02.10 IBM | 1   
2025.02.10 MSFT| 1   
2025.02.10 YHOO| 1

q)count quote
5000000

q)meta quote
c    | t f a
-----| -----
date | d    
time | p    
sym  | s   p
bid  | f    
ask  | f    
bsize| i    
asize| i    

q)select attr sym by date from quote
date      | sym
----------| ---
2025.02.10| p

q)select time~asc time by date, sym from quote
date       sym | time
---------------| ----
2025.02.10 AAPL| 1   
2025.02.10 GOOG| 1   
2025.02.10 IBM | 1   
2025.02.10 MSFT| 1   
2025.02.10 YHOO| 1
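The properties checked above (parted attribute on sym, time ascending within each sym) can be reproduced on a small in-memory table. This is only an illustration of the expected end state, assuming a sort by sym then time; it is not the applysortandattr implementation, and the table `t` is made up.

```q
// Small unsorted table with interleaved syms
t:([]time:2025.02.10D00:00:00+0D00:00:01*til 6; sym:`IBM`AAPL`IBM`AAPL`GOOG`AAPL; price:6?100f);

// Sort by sym then time, then apply the parted attribute to sym
t:update `p#sym from `sym`time xasc t;

// Same style of checks as run against the HDB above
show attr t`sym;
show select time~asc time by sym from t;
```

Applying `p#` requires equal sym values to be contiguous, which the sort guarantees.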

@pmurphy979 pmurphy979 merged commit 1ee63fa into master Feb 12, 2025
@jonathonmcmurray jonathonmcmurray deleted the 692-tplog-replay-sorts-every-period branch February 19, 2025 10:28
Development

Successfully merging this pull request may close these issues.

tplog replay sorts after every period for segmented logs

3 participants