Skip to content

Conversation

@f3sch
Copy link
Collaborator

@f3sch f3sch commented Jan 11, 2024

This patch carefully checks which tracks are associated with a V0 and flags to not be discarded. This is written such that the thinner is unaware of the track-type. Additionally, this patch further generalizes the thinner by making it check manually for the table_name with regex. E.g., previously the v0-name was specifically ‘O2v0_001’, now this has been changed to ’O2v0_00?’. That way, we no longer care about the actual version contained in the table. Hence, new AO2Ds can also be thinned.

Additionally, I added short_options and a flag to protect against overwriting existing output files (mostly for my own sanity). Also I added some printing to compare the file sizes before and after thinning.

I manually sporadically scanned the tree entries of O2v0 and O2trackextra and compared if the prong indices correspond to the same tracks. The thinning w/o TPC tracks was space Saving ~69.1% and w TPC tracks it was ~68%. The timing increased by ~6s so ~10%. I checked one AO2D from the PbPb apass1.

@jgrosseo open to suggestions and to trim this PR down if needed.

@f3sch
Copy link
Collaborator Author

f3sch commented Jan 18, 2024

As additional validation scenarios considered:

  1. Does the PR reproduce the same behaviour as before (compared to dev).
  2. Does the same hold true for new files.
  3. Does the same hold true for all 'older' files but with '-K' running
  4. Does the PR produce valid files AO2Ds w TPC prongs assoc. to V0s including a respective example analysis.

For scenario one, I was kindly provided with one AO2D from the planned 2022 pp dataset (main target of thinning, /alice/data/2022/LHC22f/520473/apass4/1740/o2_ctf_run00520473_orbit0730307502_tf0000000009_epn093/001/AO2D.root). As a first easy check I dumped stdout to log files (% the added output) and diffed, which revealed that there are no changes. Indicating that A. exactly the same number of entries are written for normal trees and B. that this hold true for reduced trees. Additionally, I checked on the tree-level of one arbitrary DataFrame (also for others), here an example comparing the two trees (done also for O2v0_001 and O2track_iu):
image

For scenario two I took one AO2D from the recent pbpb apass1 removal and adjusted the tree names on dev accordingly. The output is also identical (when running w/o '-K').

For scenario three I took the same file as for scenario one but ran with '-K' on this PR. There are no TPC only V0 prongs in this data set, so the expectation is that an identical file w/o '-K' is being created. This is true.

In addition to all these scenarios I investigate more in-depth to ensure proper backwards compatibility by using this script:

#!/usr/bin/env python3

import ROOT
import numpy as np
from tqdm import tqdm
from pprint import pprint

# parse all DFs
dirs = []
f = ROOT.TFile("AO2D.root", "READ")
for key in f.GetListOfKeys():
    if key.GetClassName() == 'TDirectoryFile':
        dirName = key.ReadObj().GetName()
        if dirName.startswith("DF_"):
            dirs.append(dirName)
print(f"Found DFs: {len(dirs)}")
if len(dirs) < 1:
    print("ERROR: No DFs found!")
    exit(-2)

# parse all trees
trees = []
if not f.cd(dirs[0]):
    print(f"ERROR: Cd-ing in {dirs[0]}")
    exit(-3)
for kTree in ROOT.gDirectory.GetListOfKeys():
    if kTree.GetClassName() == "TTree":
        trees.append(kTree.GetName())

print(f"Found Trees: {len(trees)}")


errors_compare= dict.fromkeys(trees, 0)
errors_pyroot = dict.fromkeys(trees, 0)

for (df, tree) in tqdm([(df, tree) for df in dirs for tree in trees]):
    try:
        rdf1 = ROOT.RDataFrame(f"{df}/{tree}", "AO2D_thin_dev.root")
        rdf2 = ROOT.RDataFrame(f"{df}/{tree}", "AO2D_thin_test.root")

        npArr1 = rdf1.AsNumpy()
        npArr2 = rdf2.AsNumpy()

        for k in npArr2.keys():
            if not np.array_equal(npArr1[k], npArr2[k]):
                errors_compare[trees] += 1
    except:
        errors_pyroot[tree] += 1

pprint(errors_pyroot)
pprint(errors_compare)

This scans the original AO2D.root file for all DF and all included Tree names, then it goes through all DFs in the dev-thinned and PR-thinned versions and compares the resulting trees 1:1. Due to a limitation AFAIK due to C Classes to pyroot-numpy arrays this does not always succeed, meaning some trees are incomparable (not meaning they are different on disk, just that this type of comparison does not work for them). The excluded trees are 'O2fwdtrackcov', 'O2ft0', 'O2fv0a', 'O2trackcov_iu' and 'O2trackextra'. The latter one being the most meaningful one but alas nothing to be done... All other trees are IMO validated by this comparison.

The last scenario will follow.

@f3sch
Copy link
Collaborator Author

f3sch commented Jan 20, 2024

Ok last part, I found a better way to compare the resulting AO2Ds in #12039, posting here for due to slight modification and bookkeeping but all credit goes to the author of the mentioned PR.

void GetMinMax(TH1 *h, Double_t &ymin, Double_t &ymax) {
  ymax = -1e30;
  ymin = 1e30;
  for (Int_t i = 1; i <= h->GetNbinsX(); i++) {
    Double_t y = h->GetBinContent(i);
    ymax = TMath::Max(ymax, y);
    if (y > 0)
      ymin = TMath::Min(ymin, y);
  }
}

Bool_t LogScale(TH1 *h) {
  Double_t ymin, ymax;
  GetMinMax(h, ymin, ymax);
  if (ymin > ymax)
    return false;
  return (ymax > ymin * 1000);
}

void CompareTrees(TTree *mine, TTree *dev, TDirectory *ret, TCanvas *canvas,
                  TString df) {

  TDirectory *sub = ret->mkdir(Form("%s_%s", df.Data(), mine->GetName()));

  auto mineL = mine->GetListOfBranches();
  Int_t ncol = mineL->GetEntriesFast() > 10 ? 3 : 2;
  Int_t nrow = (mineL->GetEntriesFast() + ncol - 1) / ncol;
  canvas->Clear();
  canvas->Divide(ncol, nrow);
  canvas->Draw();

  TVirtualPad *pad = 0;
  Int_t ipad = 1;

  Int_t mineN = mine->GetEntries();
  Int_t devN = dev->GetEntries();
  if (mineN != devN) {
    Fatal(mine->GetName(), "Inconsistent selected %d (mine) vs %d (dev)", mineN,
          devN);
  }

  sub->cd();
  for (auto ob : *mineL) {
    TBranch *mineB = static_cast<TBranch *>(ob);

    TObject *db = dev->GetListOfBranches()->FindObject(mineB->GetName());
    if (not db) {
      Fatal(mineB->GetName(), "Branch not in dev %s", dev->GetName());
      continue;
    }

    TBranch *devB = static_cast<TBranch *>(db);

    Info("", " %s/%s", mine->GetName(), mineB->GetName());

    mine->Draw(Form("%s>>hmine", mineB->GetName()), "", "goff");
    dev->Draw(Form("%s>>hdev", devB->GetName()), "", "goff");

    mineN = mine->GetSelectedRows();
    devN = dev->GetSelectedRows();
    Double_t *mineX = mine->GetV1();
    Double_t *devX = dev->GetV1();
    Int_t nDiff = 0;
    Int_t diff1 = -1;
    for (size_t i = 0; i < TMath::Min(mineN, devN); i++) {
      if (TMath::Abs((mineX[i] - devX[i]) / devX[i]) > 0.0001) {
        nDiff++;
        // Info(mineB->GetName(), "    %8d: %g  vs  %g",
        //      i, mineX[i], devX[i]);
        if (diff1 >= 0)
          continue;
        diff1 = i;
      }
    }
    if (nDiff > 0)
      Fatal(mineB->GetName(), "%d values differ, starting at %d", nDiff, diff1);

    TH1 *mineH = static_cast<TH1 *>(sub->Get("hmine"));
    TH1 *devH = static_cast<TH1 *>(sub->Get("hdev"));

    if ((mineH and not devH) or (not mineH and devH)) {
      Fatal(mineB->GetName(), "One histogram is missing %p %p", mineH, devH);
      continue;
    }
    if (not mineH and not devH) {
      // Warning(mineB->GetName(),"No histograms");
      continue;
    }

    mineH->SetName(mineB->GetName());
    mineH->SetTitle("PR");
    mineH->SetLineColor(kRed + 1);
    mineH->SetFillColor(kRed + 1);
    mineH->SetMarkerColor(kRed + 1);
    mineH->SetMarkerStyle(20);
    devH->SetName(devB->GetName());
    devH->SetTitle("Dev");
    devH->SetLineColor(kBlue + 1);
    devH->SetFillColor(kBlue + 1);
    devH->SetMarkerColor(kBlue + 1);
    devH->SetMarkerStyle(25);
    devH->SetMarkerSize(1.2);

    THStack *stack = new THStack(mineB->GetName(), mineB->GetName());
    stack->Add(mineH);
    stack->Add(devH);
    stack->Write();

    pad = canvas->cd(ipad);
    stack->Draw("no stack ep");
    TH1 *h = stack->GetHistogram();
    if (LogScale(mineH) or LogScale(devH))
      pad->SetLogy();

    pad->BuildLegend();
    ipad++;
  }
}

void CompareAO2D() {
  TFile *origFile = TFile::Open("AO2D.root", "READ");
  TFile *mineFile = TFile::Open("AO2D_thin_test.root", "READ");
  TFile *devFile = TFile::Open("AO2D_thin_dev.root", "READ");
  TFile *output = TFile::Open("compare.root", "RECREATE");

  TCanvas *canvas = new TCanvas("compare", "Compare AODs", 600, 800);
  canvas->Print(Form("%s.pdf[", canvas->GetName()));

  int nDF{0};

  for (auto keyDF : *origFile->GetListOfKeys()) {
    TString df = static_cast<TKey *>(keyDF)->ReadObj()->GetName();
    if (!df.BeginsWith("DF_")) {
      continue;
    }
    TDirectory *mineDF = mineFile->GetDirectory(df);
    TDirectory *devDF = devFile->GetDirectory(df);
    if (!mineDF || !devDF) {
      Fatal("", "%s not in one File!", df.Data());
    }
    Printf("Analyzing %s", mineDF->GetName());
    for (auto key : *mineDF->GetListOfKeys()) {
      TObject *keyTree = static_cast<TKey *>(key)->ReadObj();
      if (!keyTree->InheritsFrom(TTree::Class())) {
        continue;
      }
      Info("", "%s", key->GetName());
      TObject *mineO = static_cast<TKey *>(key)->ReadObj();
      if (not mineO) {
        Fatal(key->GetName(), "Tree null");
      }
      TObject *devO = devDF->Get(mineO->GetName());
      if (not devO) {
        Fatal(mineO->GetName(), "Object not found in dev output");
      }
      if (TString name = key->GetName(); name == "O2trackqa"){
        continue; // trackqa is sampled, e.g., not reproducible
      }

      TTree *mineT = static_cast<TTree *>(mineO);
      TTree *devT = static_cast<TTree *>(devO);

      CompareTrees(mineT, devT, output, canvas, df);

      canvas->Print(Form("%s.pdf", canvas->GetName()),
                    Form("Title: %s in %s", mineT->GetName(), df.Data()));
    }
    if (nDF++ > 10) {
      break;
    }
  }
  canvas->Print(Form("%s.pdf]", canvas->GetName()));
  output->Write();
}

The resulting report:
compare.pdf
which confirms that for scenario 1-3 the resulting thinned AO2Ds are indeed identical.

For scenario four I did a test on a MC production with mostly photons (most V0 prongs are TPConly). I ran the analysis on the original-unthinned AO2D and then based on this PR thinned version including TPConly prongs. The results are identical, thus showing that the correct V0 tracks are kept in the AO2D + all original kept tracks.
c1

@f3sch f3sch marked this pull request as ready for review January 20, 2024 14:33
@f3sch f3sch requested a review from a team as a code owner January 20, 2024 14:33
@f3sch f3sch requested a review from jgrosseo January 20, 2024 14:33
Copy link
Collaborator

@jgrosseo jgrosseo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this extensive work and the major testing effort.

Please find my mostly minor comments here.

@f3sch f3sch force-pushed the aod/thinner branch 2 times, most recently from 28e388e to 8ff247a Compare January 22, 2024 10:01
@f3sch
Copy link
Collaborator Author

f3sch commented Jan 22, 2024

Also another question I had; Is it save to make keeping TPC v0 prongs optional, okay in 2022 there are non anyways but if this is ever applied in to future datasets. Then the AO2Ds are not valid in the sense that there will be V0 with TPConly prongs which would be thrown away unless one explicitly specifies '-K'. IMHO, I would drop the flag.

@jgrosseo
Copy link
Collaborator

Also another question I had; Is it save to make keeping TPC v0 prongs optional, okay in 2022 there are non anyways but if this is ever applied in to future datasets. Then the AO2Ds are not valid in the sense that there will be V0 with TPConly prongs which would be thrown away unless one explicitly specifies '-K'. IMHO, I would drop the flag.

Yes, I agree. If we want to keep the flag one would need to remove those V0s which is additional further work. Better to drop the flag for now.

This patch carefully checks which tracks are associated with a V0 and flags to
not be discarded. This is written such that the thinner is unaware of the track-type.
Additionally, this patch further generalizes the thinner by making it check
manually for the table_name with regex. E.g., previously the v0-name was
specifically ‘O2v0_001’, now this has been changed to ’O2v0_00?’. That way, we
no longer care about the actual version contained in the table. Hence, new AO2Ds
can also be thinned.

Signed-off-by: Felix Schlepper <felix.schlepper@cern.ch>
@alibuild
Copy link
Collaborator

Error while checking build/O2/fullCI for 8ff247a at 2024-01-22 15:28:

## sw/BUILD/O2-latest/log
c++: error: unrecognized command-line option '--rtlib=compiler-rt'
c++: error: unrecognized command-line option '--rtlib=compiler-rt'


## sw/BUILD/O2-full-system-test-latest/log
task timeout reached .. killing all processes


## sw/BUILD/o2checkcode-latest/log
--
========== List of errors found ==========
++ GRERR=0
++ grep -v clang-diagnostic-error error-log.txt
++ grep ' error:'
++ GRERR=1
++ [[ 1 == 0 ]]
++ mkdir -p /sw/INSTALLROOT/c7e8ff0847e6bb081bbc8f0ae4d31654b3200124/slc8_x86-64/o2checkcode/1.0-local43/etc/modulefiles
++ cat
--

Full log here.

@f3sch f3sch merged commit 750ec55 into AliceO2Group:dev Jan 22, 2024
andreasmolander pushed a commit to andreasmolander/AliceO2 that referenced this pull request Apr 12, 2024
This patch carefully checks which tracks are associated with a V0 and flags to
not be discarded. This is written such that the thinner is unaware of the track-type.
Additionally, this patch further generalizes the thinner by making it check
manually for the table_name with regex. E.g., previously the v0-name was
specifically ‘O2v0_001’, now this has been changed to ’O2v0_00?’. That way, we
no longer care about the actual version contained in the table. Hence, new AO2Ds
can also be thinned.

Signed-off-by: Felix Schlepper <felix.schlepper@cern.ch>
andreasmolander pushed a commit to andreasmolander/AliceO2 that referenced this pull request Apr 12, 2024
This patch carefully checks which tracks are associated with a V0 and flags to
not be discarded. This is written such that the thinner is unaware of the track-type.
Additionally, this patch further generalizes the thinner by making it check
manually for the table_name with regex. E.g., previously the v0-name was
specifically ‘O2v0_001’, now this has been changed to ’O2v0_00?’. That way, we
no longer care about the actual version contained in the table. Hence, new AO2Ds
can also be thinned.

Signed-off-by: Felix Schlepper <felix.schlepper@cern.ch>
mwinn2 pushed a commit to mwinn2/AliceO2 that referenced this pull request Apr 25, 2024
This patch carefully checks which tracks are associated with a V0 and flags to
not be discarded. This is written such that the thinner is unaware of the track-type.
Additionally, this patch further generalizes the thinner by making it check
manually for the table_name with regex. E.g., previously the v0-name was
specifically ‘O2v0_001’, now this has been changed to ’O2v0_00?’. That way, we
no longer care about the actual version contained in the table. Hence, new AO2Ds
can also be thinned.

Signed-off-by: Felix Schlepper <felix.schlepper@cern.ch>
@f3sch f3sch deleted the aod/thinner branch November 25, 2024 10:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

3 participants