
Rebase AWS dev on top of current ROOT master #3

Open
vepadulano wants to merge 577 commits into CloudPyRDF:AWS-dev from vepadulano:AWS-dev

Conversation

@vepadulano

Lately there have been many changes to DistRDF in master. I suggest rebasing this fork onto upstream. This PR is just a reminder.

alja and others added 30 commits October 12, 2021 12:34
 - Remove unused `import pty`
 - Disable stream capture (currently not working on Windows). The output coming from C++ (ROOT) will only print in the command prompt and not in the notebook (to be investigated)
 - Disable tab completion (currently not working on Windows).
 - Add the `delete=False` argument in `NamedTemporaryFile()` and manually delete the created file to prevent the following error:
    ```
    Error in callback <bound method CaptureDrawnPrimitives._post_execute of <JupyROOT.helpers.utils.CaptureDrawnPrimitives object at 0x0C321C58>> (for post_execute):
    ---------------------------------------------------------------------------
    PermissionError                           Traceback (most recent call last)
    ~\build\release\bin\JupyROOT\helpers\utils.py in _post_execute(self)
        461
        462     def _post_execute(self):
    --> 463         NotebookDraw()
        464
        465     def register(self):

    ~\build\release\bin\JupyROOT\helpers\utils.py in NotebookDraw()
        450 def NotebookDraw():
        451     DrawGeometry()
    --> 452     DrawCanvases()
        453     DrawRCanvases()
        454

    ~\build\release\bin\JupyROOT\helpers\utils.py in DrawCanvases()
        441     drawers = GetCanvasDrawers()
        442     for drawer in drawers:
    --> 443         drawer.Draw()
        444
        445 def DrawRCanvases():

    ~\build\release\bin\JupyROOT\helpers\utils.py in Draw(self)
        598
        599     def Draw(self):
    --> 600         self._display()
        601         return 0
        602

    ~\build\release\bin\JupyROOT\helpers\utils.py in _display(self)
        583             self._jsDisplay()
        584          else:
    --> 585             self._pngDisplay()
        586
        587     def GetDrawableObjects(self):

    ~\build\release\bin\JupyROOT\helpers\utils.py in _pngDisplay(self)
        572
        573     def _pngDisplay(self):
    --> 574         img = self._getPngImage()
        575         IPython.display.display(img)
        576

    ~\build\release\bin\JupyROOT\helpers\utils.py in _getPngImage(self)
        566         with _setIgnoreLevel(ROOT.kError):
        567             self.drawableObject.SaveAs(ofile.name)
    --> 568         img = IPython.display.Image(filename=ofile.name, format='png', embed=True)
        569         #ofile.close()
        570         #os.unlink(ofile.name)

    c:\python38-32\lib\site-packages\IPython\core\display.py in __init__(self, data, url, filename, format, embed, width, height, retina, unconfined, metadata)
       1229         self.retina = retina
       1230         self.unconfined = unconfined
    -> 1231         super(Image, self).__init__(data=data, url=url, filename=filename,
       1232                 metadata=metadata)
       1233

    c:\python38-32\lib\site-packages\IPython\core\display.py in __init__(self, data, url, filename, metadata)
        635             self.metadata = {}
        636
    --> 637         self.reload()
        638         self._check_data()
        639

    c:\python38-32\lib\site-packages\IPython\core\display.py in reload(self)
       1261         """Reload the raw data from file or URL."""
       1262         if self.embed:
    -> 1263             super(Image,self).reload()
       1264             if self.retina:
       1265                 self._retina_shape()

    c:\python38-32\lib\site-packages\IPython\core\display.py in reload(self)
        660         """Reload the raw data from file or URL."""
        661         if self.filename is not None:
    --> 662             with open(self.filename, self._read_flags) as f:
        663                 self.data = f.read()
        664         elif self.url is not None:

    PermissionError: [Errno 13] Permission denied: 'C:\\Users\\sftnight\\AppData\\Local\\Temp\\tmpk49jx00w.png'
    ```
    For the details, see the documentation of `NamedTemporaryFile` in [tempfile](https://docs.python.org/3/library/tempfile.html):
    ```
    Whether the name can be used to open the file a second time, while the named temporary file is still open, varies across platforms (it can be so used on Unix; it cannot on Windows NT or later).
    ```
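The workaround described above can be sketched in plain Python. The function name `save_png_portably` is hypothetical; the real code lives in `JupyROOT.helpers.utils`:

```python
import os
import tempfile

def save_png_portably(data: bytes) -> bytes:
    """Write data to a temp file, reopen it by name, then clean up.

    On Windows, NamedTemporaryFile(delete=True) keeps the file open,
    and opening it a second time by name raises PermissionError.
    The workaround: create with delete=False, close the handle first,
    reopen by name, and remove the file manually.
    """
    ofile = tempfile.NamedTemporaryFile(suffix=".png", delete=False)
    try:
        ofile.write(data)
        ofile.close()  # release the handle before the second open
        # The second open now works on Unix *and* Windows.
        with open(ofile.name, "rb") as f:
            return f.read()
    finally:
        os.unlink(ofile.name)  # manual cleanup replaces delete=True
```

With `delete=True` the `open(ofile.name, ...)` call would fail on Windows NT or later, exactly as the traceback above shows.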
…elements from collections

AddCollectionToRow() was modified so that the first 10 elements of long collections
are printed, followed by "...". Previously, only the first and last
elements of collections of size >= 3 were printed.

The Print() logic is now simplified, since after "..." there is no guarantee
that there is a next element.

An additional parameter, nMaxCollectionElements, was added to the RDisplay()
constructor. It determines the number of elements shown from a long collection;
the default is 10.

Corresponding tests were adapted and additional ones were added to check
the new functionality.

Regarding root-project#6981 - to display the whole collection, please use AsString().

Co-authored-by: Enrico Guiraud <enrico.guiraud@cern.ch>
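The truncation rule described above can be sketched in pure Python; the real logic lives in the C++ RDisplay implementation, and `format_collection` is a hypothetical stand-in:

```python
def format_collection(elements, n_max=10):
    """Render a collection column as described above: show at most
    n_max leading elements, then append "..." if any were omitted.
    (Pure-Python sketch; the actual code is C++.)"""
    shown = [str(e) for e in elements[:n_max]]
    if len(elements) > n_max:
        shown.append("...")
    return "{ " + ", ".join(shown) + " }"
```

Short collections are printed whole; only collections longer than `n_max` get the trailing ellipsis.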
Instead of suppressing the emission of weak symbols that have an existing
in-process definition, remove the definition of the weak symbol from the
module. This prevents relocation "surprises" as seen e.g. on Apple M1, where
weak symbol definitions are expected to be found within the module, and
relocations are thus expected to be local. By removing the definition, the
symbol becomes a reference (declaration) and relocation will work as for any
non-local symbol.

This also reduces the amount of duplicate work, as in-process weak symbols
are now suppressed early.
Fixes test failures (crashes) in
    projectroot.test.test_stressiterators_interpreted
    projectroot.roottest.cling.dict.roottest_cling_dict_assertTmpltDefArgCtor
    projectroot.test.test_stresshistogram_interpreted
    projectroot.roottest.root.meta.roottest_root_meta_runnamespace_auto
This is now hit for variables on Windows (which don't see the GlobalVariable case).
Before the `#ifndef WIN32`, this case was not hit by roottest on Windows,
macOS, or Linux - so the coverage of "make it a declaration" is good.
Co-authored-by: Vincenzo Eduardo Padulano <v.e.padulano@gmail.com>
* Fix "an histogram" typos in TH1 documentation

* Small grammar fixes in TH1 documentation
 - remove useless `ioctlsocket` on a file descriptor which is not a socket. See the Microsoft [_pipe documentation](https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/pipe?view=msvc-160):
   > A pipe resembles a file because it has a file pointer, a file descriptor, or both, and it can be read from or written to by using the Standard Library input and output functions. However, a pipe does not represent a specific file or device. Instead, it represents temporary storage in memory that is independent of the program's own memory and is controlled entirely by the operating system.

 - use a large buffer and use `_fstat` on the file descriptor to check whether there is something to read from the pipe. This prevents blocking. See also the Microsoft [_pipe documentation](https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/pipe?view=msvc-160):
   > All read and write operations on the pipe wait until there is enough data or enough buffer space to complete the I/O request.
Following the last commit, `StreamCapture()` can now be enabled on Windows.
Convert all backslashes into forward slashes. This fixes the following startup exception, due to mixing forward and back slashes:
```
[E 10:34:29.657 NotebookApp] Exception while loading config file C:\Users\sftnight\.rootnb\jupyter_notebook_config.py
    Traceback (most recent call last):
      File "C:\Python38-32\lib\site-packages\traitlets\config\application.py", line 738, in _load_config_files
        config = loader.load_config()
      File "C:\Python38-32\lib\site-packages\traitlets\config\loader.py", line 614, in load_config
        self._read_file_as_dict()
      File "C:\Python38-32\lib\site-packages\traitlets\config\loader.py", line 646, in _read_file_as_dict
        exec(compile(f.read(), conf_filename, 'exec'), namespace, namespace)
      File "C:\Users\sftnight\.rootnb\jupyter_notebook_config.py", line 6
        c.NotebookApp.extra_static_paths = ['C:/Users/sftnight/build/release\js\']
```
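The fix amounts to normalizing the path before it is written into the config file; a minimal sketch (the helper name is hypothetical):

```python
def to_forward_slashes(path: str) -> str:
    """Normalize a Windows path so it can be embedded in a Python
    config file: a trailing backslash or a sequence like '\\j' would
    otherwise break the string literal, as in the traceback above."""
    return path.replace("\\", "/")
```

For example, `C:/Users/sftnight/build/release/js` is a valid path on Windows too, so nothing else needs to change.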
vepadulano and others added 30 commits November 12, 2021 13:05
In df1a5a6, a memory leak when reading RooRealVars from ROOT files was
fixed by introducing a `RooRealVar::_sharedPropList` map that holds
`std::weak_ptr`s to all shared properties objects that were created by
any RooRealVar.

However, a mechanism to clean up the weak pointers from the map once they
had expired was still missing, which caused the
`RooRealVar::_sharedPropList` map to grow without bound,
especially in toy studies where many datasets are created (in the RooFit
dataset constructors, RooAbsReals are cloned, which triggers the
creation of the shared properties object).

In this commit, the `RooRealVar::_sharedPropList` map is cleaned of
expired pointers in the RooRealVar destructor, which should fix a number
of RooFit problems that looked like memory leaks.

To avoid the static initialization order fiasco, the `_sharedPropList`
is now allocated on the heap and only deleted when the process exits,
using the `RooSentinel` mechanism that is already used for the RooFit
memory pools.
The two classes that derive from `RooSharedProperties` are
`RooRealVarSharedProperties` and `RooCategorySharedProperties`.
For both of them, copying is discouraged because it makes no sense to
copy a shared object like this (instead, one should copy
`std::shared_ptr`s to that object).

To warn users who accidentally copied a shared properties object, there
were some printouts in the copy constructors of
`RooRealVarSharedProperties` and `RooCategorySharedProperties`, but it
is better to delete the copy constructor/assignment operator to begin
with. This commit does exactly that.

The commit further simplifies the code by removing the
source files for the shared properties objects and inlining everything
in the header, at the cost of only an added inclusion of `RooAbsBinning.h`
in `RooRealVarSharedProperties.h`. This is acceptable, because
`RooRealVarSharedProperties.h` is an implementation detail of
`RooRealVar` and is only included in `RooRealVar.cxx`.
This should have no effect, as it will be overridden by whatever CodeGen has anyway.
This is hit by stressShapes.cxx in WIN32 where `vshape[ivol-1]` gets accessed.
It is a broken assert, as relocation addresses can point outside the section as
long as the calculated address points inside the section.
This is https://bugs.llvm.org/show_bug.cgi?id=30584

It will be superseded by switching to JITLink.
When determining whether a weak symbol was already emitted (and its
subsequent definition can thus be replaced by a declaration) look
for emitted symbols in the symbol table and the in-process symbols,
instead of possibly generating symbols through the JIT. This prevents
symbols being emitted that refer into
IncrementalExecutor::m_PendingModules elements that have since been
removed.
`RunFunction()` might trigger unloading of `lastT`, so store its setting
for "want value printing" in a local variable before calling RunFunction().

Fixes valgrind complaining about `./stressInterpreter` which is reloading
the "outer" stressInterpreter.cxx from inside `RunFunction()`.
This reverts commit 0cdfa69.
We need to hide json includes: modules cannot cope with them, and it introduces a runtime requirement to find json.
This reverts commit 89c5d0c.
This file does not exist on all platforms for external nlohmann.
Will have to use "proper" fwd decls instead.
This reverts commit 1c38aa0.

We cannot rely on json_fwd (it does not exist for all nlohmann packages).
Will have to use proper fwd decls instead.
This commit implements a new backend for distributed RDataFrame that allows
sending computations to a [Dask](https://dask.org/) cluster. For example:

```python
import ROOT
from dask.distributed import Client
RDataFrame = ROOT.RDF.Experimental.Distributed.Dask.RDataFrame

if __name__ == "__main__":
    client = Client("SCHEDULER_ADDRESS")
    df = RDataFrame("mytree","myfile.root", daskclient=client)
    df.Define("x","someoperation").Histo1D("x")
```

The implementation of the Dask backend uses the `dask.delayed` API to make the
mapper and reducer functions of the `BaseBackend` lazy and builds a MapReduce
computation tree: each lazy mapper is assigned to one DistRDF range, then any
two mappers are reduced into one object until there is only the final result
left. The result of this computation is returned to the user as usual.
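The pairwise reduction tree can be sketched with plain functions standing in for `dask.delayed` objects (a simplification: in the real backend every call is lazy and scheduled by Dask, and `tree_reduce` is a hypothetical name):

```python
def tree_reduce(tasks, reducer):
    """Repeatedly merge any two results until one remains, as
    described above. Each element of `tasks` stands in for the
    (lazy) result of one mapper over a DistRDF range."""
    while len(tasks) > 1:
        # Merge neighbours pairwise; carry an odd leftover forward.
        merged = [reducer(a, b) for a, b in zip(tasks[::2], tasks[1::2])]
        if len(tasks) % 2:
            merged.append(tasks[-1])
        tasks = merged
    return tasks[0]
```

With `dask.delayed`, `reducer` would itself be a delayed call, so the whole tree is built up front and only computed when the final result is requested.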

Similarly to the Spark backend, the user can specify an optional object
representing the connection to the cluster (in this case with the `daskclient`
argument). If it is not given by the user, the dataframe will create a default
one connected to a `LocalCluster` instance with N worker processes, where N is
the total number of logical cores on the user machine.

The Dask backend will try to optimize the `npartitions` argument if it is not
passed by the user, by setting it equal to the total number of cores available
in the Dask cluster at the moment the dataframe is initialized. If this number
is not available, it falls back to the default minimum number of partitions (2).

Processing with Dask always involves spawning a few extra Python threads (used
internally by the tool e.g. for communication). This required changing the
triggering of the RDataFrame computation graph inside each distributed task.
This now happens with the Python GIL released and with
`ROOT.EnableThreadSafety()` called, to make sure that no two tasks can step on
each other (either two C++ tasks or something else like a Dask thread calling
into PyROOT for garbage collection).

Also, `ROOT.gROOT.SetBatch(True)` is now called at the start of each distributed
task, since the distributed workers won't need ROOT graphics while processing.

Using the Dask backend from a Python script prints a progress bar in the
terminal, as happens with Spark. The same functionality cannot be reproduced
in a notebook: the Dask docs advise putting the `progressbar` call last in a
notebook cell, but the `DaskBackend` class wraps this call inside the
`ProcessAndMerge` function, so the progress bar can't be shown for now.

The `mapper` and `reducer` functions have been reworked so that all types of
result pointers coming from an execution of the computation graph can be treated
equally. These are usually `RResultPtr`, but there are special cases for
`AsNumpy` and `Snapshot` which were so far treated with `if else` conditions.
This commit uses `functools.singledispatch` to define a function that behaves
differently based on the type of the result pointer and can thus both return the
correct type of mergeable value and call the proper merge function.

Co-authored-by: Enric Tejedor Saavedra <enric.tejedor.saavedra@cern.ch>
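A minimal sketch of the `functools.singledispatch` approach, with hypothetical stand-in result types in place of `RResultPtr` and the `Snapshot`/`AsNumpy` results:

```python
from functools import singledispatch

class ResultPtr:
    """Stand-in for the generic result pointer (e.g. a histogram)."""
    def __init__(self, value):
        self.value = value

class SnapshotResult:
    """Stand-in for the special Snapshot result (a list of files)."""
    def __init__(self, filenames):
        self.filenames = filenames

@singledispatch
def get_mergeable(result):
    """Dispatch on the result type instead of if/else chains."""
    raise NotImplementedError(f"no mergeable for {type(result).__name__}")

@get_mergeable.register
def _(result: ResultPtr):
    # Generic case: return the value held by the result pointer.
    return result.value

@get_mergeable.register
def _(result: SnapshotResult):
    # Special case: merging snapshots means concatenating file lists.
    return result.filenames
```

Each `register` overload replaces one branch of the old `if else` chain, so adding a new result type means adding one function instead of editing the mapper and reducer.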
A new build option has been added, similarly to what was done for
test_distrdf_pyspark. The option checks at configure time that the `dask` and
`distributed` Python packages are installed on the system and available to the
Python interpreter that is being used for the ROOT build. The two packages have
also been added to the requirements.txt file with the appropriate comment.