Getting Started

First Steps: Setup

The traditional CMS workflow can be broken into the following stages:

  • Data Discovery, in which lists of ROOT files are associated with given metadata values

  • Object Definition, in which electrons, muons, jets, or other objects are filtered based on given criteria

  • Event Preselection, in which events are filtered based on a fairly loose set of criteria. These criteria are expected to remain fixed throughout the analysis

  • Event Selection, in which events are further filtered with a far stricter set of criteria

  • Plotting, in which a series of histograms are created, filled with various data, and saved

In this framework, the entire analysis can be represented in a single file containing an instance of a class that extends configs.base.AnalysisConfig. This abstract class is meant to represent any given analysis, and it defines the following methods (a minimal sketch follows the list):

  • __init__(name) accepts one parameter, which is the name of the directory in which any results will be saved.

  • get_dataset(xcache_host) returns a dataset definition populated with a list of remote ROOT files, using the given xcache host. By convention, the metadata provided should contain shortName, isData, and optionally nevents and xsec. More info may be found on the datasets page. #TODO

  • define_objects(events) is the analysis’s object definition method. It is expected to return a version of the events parameter after redefining objects as needed. This modification is presumed to be destructive.

  • preselect_objects(events) is the first of two object selection functions. Given events with objects defined as per the above method, it is expected to return a version of the events parameter after event preselection. While intended to be a destructive operation, many preselection functions are non-destructive.

  • select_objects(events) is the second of the two object selection functions. Given events passing preselection and with objects defined as per the above methods, it is expected to return a version of the events parameter after event selection. While intended to be a destructive operation, many selection functions are non-destructive.
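
The sketch below pulls these methods together into a hypothetical configuration. The class name, dataset structure, file path, and cut values are all illustrative assumptions (including the coffea-style fileset dictionary returned by get_dataset); only the method names and their contracts come from this page.

```python
import awkward as ak

from configs.base import AnalysisConfig


class MySimpleAnalysis(AnalysisConfig):
    def __init__(self):
        # Results are written under a directory with this name.
        super().__init__("my_simple_analysis")

    def get_dataset(self, xcache_host):
        # Hypothetical dataset definition; the exact structure expected
        # by the dataset utilities may differ (see the datasets page).
        return {
            "DYJetsToLL": {
                "files": [f"root://{xcache_host}//store/mc/EXAMPLE/nanoAOD.root"],
                "metadata": {"shortName": "DY", "isData": False, "xsec": 6077.22},
            }
        }

    def define_objects(self, events):
        # Destructively redefine the muon collection.
        events["Muon"] = events.Muon[events.Muon.pt > 20]
        return events

    def preselect_objects(self, events):
        # Loose cut expected to stay fixed: at least two selected muons.
        return events[ak.num(events.Muon) >= 2]

    def select_objects(self, events):
        # Stricter, analysis-specific cut on the leading muon.
        return events[events.Muon[:, 0].pt > 30]
```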

To promote code reuse as much as possible, the histogramming and plotting code has been abstracted. The configs.base.ThingToPlot class is the fundamental object in plotting, and it consists of the following methods (a sketch follows the list):

  • __init__(name) creates the object. Though many subclasses take additional or fewer parameters, the name parameter serves both as the key under which the histogram is saved in the output file and as the title of the generated plot.

  • create_histogram is intended to create a hist.Hist object with any number of axes, though it may return any object that can be pickled and reduced.

  • fill_histogram(histogram, events, dataset, weights, **kwargs) is designed to fill the object created in create_histogram with the provided data. This is presumed to be a destructive method on the histogram object, but non-destructive on events, dataset, weights, and other kwargs.

  • plot_histogram(histogram, output_file) is designed to save a copy of the filled and reduced histogram generated above. Note that this is done in addition to a pickled version of the histogram object, making this function ideal for saving large (and recreatable) objects such as graphics.
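
A minimal one-axis example might look like the following. It assumes that the base __init__ stores its argument, that weights behaves like a coffea Weights object, and that hist's matplotlib plotting extras are installed; none of that is guaranteed by this page.

```python
import hist
import matplotlib.pyplot as plt

from configs.base import ThingToPlot


class MuonPtPlot(ThingToPlot):
    def __init__(self):
        # "muon_pt" becomes both the output-file key and the plot title.
        super().__init__("muon_pt")

    def create_histogram(self):
        # Any picklable, reducible object works; hist is the usual choice.
        return hist.Hist.new.Reg(
            50, 0, 200, name="pt", label="Leading muon pT [GeV]"
        ).Weight()

    def fill_histogram(self, histogram, events, dataset, weights, **kwargs):
        # Destructive on the histogram, non-destructive on everything else.
        # Assumes at least one muon per event after selection, and a
        # coffea-style Weights object.
        histogram.fill(pt=events.Muon.pt[:, 0], weight=weights.weight())

    def plot_histogram(self, histogram, output_file):
        # Saved in addition to the pickled histogram, so large graphics
        # belong here rather than in the pickle.
        fig, ax = plt.subplots()
        histogram.plot1d(ax=ax)
        ax.set_title("muon_pt")
        fig.savefig(output_file)
        plt.close(fig)
```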

To interface with the plots being generated, the AnalysisConfig class contains two methods (a short sketch follows the list):

  • augment_events(events) is designed to compute additional information that may be shared across several plots, such as the total number of events or common masks. The result of this function will be passed to fill_histogram as keyword arguments.

  • get_things_to_plot is expected to return a list of ThingToPlot objects.
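
Continuing the hypothetical MySimpleAnalysis sketch from above, these two methods might look like the following class-body excerpt (the returned keys and the dimuon mask are illustrative):

```python
    # Additions to the MySimpleAnalysis class body sketched earlier;
    # `ak` is the awkward import from that sketch.
    def augment_events(self, events):
        # Everything returned here reaches fill_histogram as a keyword
        # argument.
        return {
            "n_events": len(events),                  # a shared scalar
            "dimuon_mask": ak.num(events.Muon) == 2,  # a shared boolean mask
        }

    def get_things_to_plot(self):
        # One entry per histogram; MuonPtPlot is the sketch defined above.
        return [MuonPtPlot()]
```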

Running A Workflow

To run a configuration, it must first be added to the utils class’s get_config method, which will enable its use in command-line utilities.
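
The exact shape of get_config is not documented here, but the registration might look something like this hypothetical mapping from channel name to configuration class:

```python
# Hypothetical excerpt from utils; the real structure may differ.
def get_config(channel):
    from configs.my_analysis import MySimpleAnalysis  # hypothetical module path

    available = {
        "my_simple_analysis": MySimpleAnalysis,
    }
    return available[channel]()
```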

Then, the main.py file can be used to run your analysis. The following CLI parameters may be of interest (an example invocation follows the list):

  • -c, --channel should be set to your analysis config.

  • -w, --max-files-per-category may be used to limit the analysis to a set number of files. This will change nevents for MC files if the dataset utilities are used, but not reweight data files. As such, this should only be used with a dask.distributed.LocalCluster for validating Python code.

  • -s, --no-skim should be set accordingly. While skims are not yet present, and the processor will therefore default to raw files, it is still best practice to include this flag to avoid running on incomplete skims later on.
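
Putting the flags together, a run might look like the following; the config name is the hypothetical one registered in get_config above.

```bash
# Run the analysis on raw files (no skims exist yet); add something
# like "-w 5" only when validating locally, since it changes nevents
# for MC without reweighting data.
python main.py -c my_simple_analysis --no-skim
```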

Once the workflow has completed, it may be necessary to regenerate plots, for example to change the rebinning options. The plotter.py file may be used in this case; it reruns only the plot_histogram function for each ThingToPlot.

Skimming

Adapting an AnalysisConfig for skimming is fairly straightforward. The following modifications will need to be made (a sketch of minify follows the list):

  • define_objects(events) and preselect_objects(events) must be adapted so that they work with dask_awkward.Array objects instead of awkward.Array objects.

  • A new minify(events) function should be introduced to the configuration. This method is expected to trim unneeded branches from the events object.
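
A minimal sketch of minify is shown below; the list of branches to keep is analysis-specific and purely illustrative, and NanoAOD-style top-level field names are assumed.

```python
def minify(self, events):
    # Keep only the branches the analysis actually reads; everything
    # else is dropped from the skim. Selection by a list of field names
    # works for both awkward and dask_awkward arrays.
    keep = ["Muon", "Electron", "Jet", "MET", "run", "luminosityBlock", "event"]
    return events[[field for field in events.fields if field in keep]]
```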

Then, the skim.py file may be used to create skims. Note that the skims are saved on the worker nodes; if the workers do not have access to the local filesystem, the skims will need to be saved to a shared filesystem such as EOS. This is currently non-functional.

Additionally, the skimming code supports making skims from a baseline skim. This is currently non-functional.