Problem.
I have a directory of .root files (the native file format of CERN's ROOT framework). These files contain so-called TTrees, essentially tables whose cells may contain virtually any C++ object or type. In my case, these files contain tables of integers, on which I wish to perform some operations to produce a single, high-precision floating-point value per row. I'm using Boost's cpp_dec_float_100 type (from the boost::multiprecision library) due to my precision requirements. I then write these values to a new .root file.
A step-by-step walk through my code:
First, I read the lines of the file at in_path, each of which contains the path to one of my .root files, into a vector<string>.
ifstream stream(in_path);
string tem;
vector<string> list;
while(getline(stream, tem))
    list.push_back(tem);
Then, I construct an RDataFrame from them. For my purposes, the RDataFrame efficiently combines the files into one large table.
ROOT::RDataFrame df("eventTree", list);
Now, a bit about my data. You can imagine the table as looking something like this:
+-----+-----+-----+
| run |  A  |  B  |
+-----+-----+-----+
| 001 |  35 |   5 |
| 001 |  40 |  10 |
| 001 |  77 |  60 |
|     |     |     |
| ... | ... | ... |
|     |     |     |
| 002 |  42 |  40 |
| 002 |  30 |  28 |
| 002 |  50 |   1 |
| ... | ... | ... |
+-----+-----+-----+
where the dots (...) indicate continuation.
For each run, I want to determine the most common element-wise difference of A and B. This value will factor into my later calculations. For example, for run 002 we have 42 - 40 = 2, 30 - 28 = 2, and 50 - 1 = 49. The most common element (the mode) of 2, 2, and 49 is 2, so the result for run 002 is 2.
A couple of important notes: A - B is guaranteed to be a positive integer for all entries, and there is guaranteed to be a single, unique most common difference for each run.
To do this, I map each run to an inner map from difference value to frequency. At the end, I can simply look up the key with the largest value (i.e. the most common difference).
I iterate over each row, take the difference of A and B, and increment the count stored under that difference in the run's inner map.
map<int, map<int, int>> offset;
df.Foreach([&](ULong64_t A, UInt_t B, Int_t run){
    ++offset[run][A - B];
},{"A","B","run"});
Returning to the previous example, run 002 would be mapped to a map that looks like:
{{2, 2}, {49, 1}}
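The counting step can be sketched without ROOT as follows. The Row struct, the DiffCounts alias, and the countDifferences function are hypothetical names introduced only for illustration; the column types mirror the ones used in my lambda, and the sketch assumes every difference fits in an int.

```cpp
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical stand-in for one row of the table (run, A, B).
struct Row {
    int run;
    std::uint64_t A;
    std::uint32_t B;
};

// run -> (difference -> number of rows with that difference)
using DiffCounts = std::map<int, std::map<int, int>>;

// Build the nested frequency map from a list of rows.
DiffCounts countDifferences(const std::vector<Row>& rows) {
    DiffCounts offset;
    for (const Row& r : rows) {
        // Assumption: A - B is positive and small enough to fit in an int.
        const int diff = static_cast<int>(r.A - r.B);
        ++offset[r.run][diff];   // operator[] creates missing entries as 0
    }
    return offset;
}
```

Fed the three rows shown for run 002, this yields the inner map {{2, 2}, {49, 1}} under key 2.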
Next, I create a new column in the RDataFrame using .Define(), which takes the name of the column (eventTime), and a lambda which returns the value for each row in the new column. It is within this lambda that I perform my calculation.
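df.Define("eventTime", [&](UInt_t timeStamp, UInt_t B, Int_t run){
    // Get most common difference for run (largest count, not largest key), add to B
    const auto& counts = offset[run];
    B += max_element(counts.begin(), counts.end(),
        [](const pair<const int, int>& a, const pair<const int, int>& b){
            return a.second < b.second;
        })->first;
    return Mult_t(B + Mult_t(timeStamp) / 1E+8);
},{"timeStamp","B","run"}).Snapshot("T", out_path, {"eventTime"});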
Finally, .Snapshot saves the new column as a new TTree in a new .root file.
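One detail of the "most common difference" lookup is worth spelling out. std::max_element without a comparator orders the map's pairs by key first, so it returns the entry with the largest difference rather than the largest count; a comparator on the count is needed to get the mode. The sketch below shows that, along with one possible way to compute each run's mode once up front instead of once per row. The names mostCommonDifference and modePerRun are hypothetical helpers, not part of my code.

```cpp
#include <algorithm>
#include <map>
#include <utility>

// Hypothetical helper: return the difference with the highest count.
// Precondition: `counts` is non-empty.
int mostCommonDifference(const std::map<int, int>& counts) {
    auto it = std::max_element(
        counts.begin(), counts.end(),
        [](const std::pair<const int, int>& a,
           const std::pair<const int, int>& b) {
            return a.second < b.second;   // compare counts, not keys
        });
    return it->first;
}

// Hypothetical helper: reduce run -> (difference -> count) to run -> mode,
// so that a per-row lambda only needs a single lookup.
std::map<int, int> modePerRun(
        const std::map<int, std::map<int, int>>& offset) {
    std::map<int, int> modes;
    for (const auto& entry : offset)
        modes[entry.first] = mostCommonDifference(entry.second);
    return modes;
}
```

For the run 002 example, the default ordering would pick 49 (the largest key), while the count comparator picks 2.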
Code.
Dependencies: boost::multiprecision library, and the ROOT framework.
#include <boost/multiprecision/cpp_dec_float.hpp>
using Mult_t = boost::multiprecision::cpp_dec_float_100;
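void writeTimes(const char* in_path, const char* out_path){
    // Read the list of .root file paths, one per line
    ifstream stream(in_path);
    string tem;
    vector<string> list;
    while(getline(stream, tem))
        list.push_back(tem);

    ROOT::RDataFrame df("eventTree", list);

    // run -> (difference A - B -> frequency)
    map<int, map<int, int>> offset;
    df.Foreach([&](ULong64_t A, UInt_t B, Int_t run){
        ++offset[run][A - B];
    },{"A","B","run"});

    df.Define("eventTime", [&](UInt_t timeStamp, UInt_t B, Int_t run){
        // Get most common difference for run (largest count, not largest key), add to B
        const auto& counts = offset[run];
        B += max_element(counts.begin(), counts.end(),
            [](const pair<const int, int>& a, const pair<const int, int>& b){
                return a.second < b.second;
            })->first;
        return Mult_t(B + Mult_t(timeStamp) / 1E+8);
    },{"timeStamp","B","run"}).Snapshot("T", out_path, {"eventTime"});
}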
Goals.
- Readability: I'm always looking to improve the readability and "flow" of my code, which is often destined for use by others.
- Reliability: I find that a second pair of eyes is often useful for pointing out (sometimes obvious) errors and potential bugs in my code. For this code in particular, reliability is vital.
- Performance: A secondary goal is performance. I'm not terribly concerned about optimizing what I have now; however, if there are simple or obvious changes that can or should be made to improve performance, I'm certainly interested.
Note: this code is written to be run or compiled with the ROOT interpreter.