Performance of Julia for High Energy Physics Analyses

We argue that the Julia programming language is a compelling alternative to currently more common implementations in Python and C++ for common data analysis workflows in high energy physics. We compare the speed of implementations of different workflows in Julia with those in Python and C++. Our studies show that the Julia implementations are competitive for tasks that are dominated by computational load rather than data access. For work that is dominated by data access, we demonstrate an application with concurrent file reading and parallel data processing.


Introduction
In high energy physics, the programming language of choice for most of the large-scale experiments, like those at the B-Factories, the Tevatron and the LHC, has been C++. It is used for the simulation and reconstruction chain, as well as for data analysis, which also makes heavy use of the Root [2] data analysis framework. Recently, a change in this paradigm has been observed: data analysis has been moving towards Python scripts. Reasons for this move include the faster turn-around of a dynamic scripting language, a lower hurdle for beginners, and the growing versatility of Python thanks to packages like NumPy [18] and SciPy [25], developments like PyRoot, and data science packages like Pandas. Most of the industry-standard deep learning tool kits include Python bindings.
A frequent criticism of languages like Python compared to compiled languages like C++ or Fortran is the performance penalty due to their dynamic character. The Julia programming language [4] originates from the high-performance numerical analysis community and is designed to combine the benefits of dynamic languages like Python with the performance of compiled code by using a just-in-time (JIT) compiler. In this note we demonstrate that Julia exhibits three key features that make it uniquely suitable for data analysis in high energy physics:
1. Julia is fast: We perform benchmarks of data analysis tasks that are typical in high energy physics, implemented in Julia, Python and C++, and compare their performance.
2. Julia is easy to interface with existing C++ libraries: As an example, we have created an interface to the FastJet library [5], the de-facto standard implementation of jet clustering in high energy physics.
3. Julia has support for parallelism and concurrency: We demonstrate how even a trivial implementation of event-level parallelism can accelerate realistic analysis workflows, even if the underlying I/O libraries do not support multi-threading.
Additionally, Julia is interactive, being one of the original languages that integrate with Jupyter [17] notebooks. What is more, the Julia community is developing a growing number of packages to facilitate statistical data analysis and the generation of scientific plots in production quality. All plots in this note were created using Julia packages. This paper is organized as follows: first we give a brief introduction to the features and capabilities of Julia, then we describe the setup and the results of the benchmarks for typical analysis tasks, and finally we summarize our results.

The Julia Language
The Julia language is being used in a growing number of scientific applications. It combines the benefits of a dynamic language like Python with the performance of a compiled language like C++ or Fortran. Julia uses a just-in-time compiler (JIT) to generate the executable code, provides garbage collection and supports distributed parallel as well as multi-threaded execution. It has been demonstrated to perform at peta-scale on a high-performance computing platform [20], and it has strong support for scientific machine learning (see, e.g. [19]). As we show in the following, it is also very well suited for developing software in large distributed collaborations like those for high energy physics collider experiments.

Key Features
As Julia is mainly targeted at scientific applications, it supports arbitrary-precision integers and floats using the GNU Multiple Precision Arithmetic Library (GMP) [10] and the GNU MPFR Library [6]. Complex numbers and an exact treatment of rational numbers are also implemented. The base mathematical library is extensive and includes a linear algebra package with access to specialized libraries such as BLAS [9] or MKL [11] for the computations. The language fully supports Unicode, which allows the user to write mathematical formulae close to how they would be typeset in a book or paper.
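These capabilities can be illustrated with a short sketch; the variable and function names below are our own:

    # arbitrary-precision integers via GMP
    factorial(big(30))        # 265252859812191058636308480000000

    # exact rational arithmetic
    1//3 + 1//6 == 1//2       # true

    # Unicode identifiers let formulas mirror the typeset notation
    σ = 2.0; μ = 1.0
    gauss(x) = exp(-(x - μ)^2 / (2σ^2)) / (σ * √(2π))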
Julia has a builtin package manager, which uses a central repository for registered packages, the so-called Julia registry, but the user has the choice to define their own registry, or to add unregistered packages. Packages with binary dependencies can be built and distributed for a variety of platforms automatically. This frees users from concerns about compiler versions or library incompatibilities. While most of the computing in high energy physics happens on centralized server farms that make code repositories available, students usually do not have access to such farms, and installing different libraries that are being used in high energy physics can be cumbersome and an obstacle to teaching. The package manager in Julia can be used to install arbitrary binaries, regardless of what programming language their code is written in. This simplifies interfacing Julia code with commonly used libraries in other languages.
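As a brief illustration, the package mode of the REPL (entered by typing `]`) and the programmatic Pkg API perform the same operations; the package name here is one introduced later in this note:

    using Pkg
    Pkg.add("LCIO")     # installs the package and its binary dependencies
    Pkg.status()        # lists the installed packages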
The three main ways to execute code in Julia are described in the following.
Scripts: Similar to Python, code is written to a text file, which can be executed by the Julia executable.
REPL: The default Read-Eval-Print Loop (REPL) in Julia is a highly customizable command line interface, with command history and tab completion. In addition, the REPL comes with several modes, such as a shell mode to execute commands in the shell of the operating system, a help mode to inspect Julia code and a package mode to update and maintain the installed packages. Packages can modify the REPL to make additional modes available.
Jupyter notebooks: Jupyter notebooks have become popular in data science applications, where they provide a convenient way to combine code input blocks with graphical output in the same interface. Julia is very well supported in the Jupyter ecosystem.

Language Maturity
The first public release of Julia code was on February 13th, 2013. The language has reached version 1.0 on August 8th, 2018, which comes with a promise of API stability throughout the v1 release cycle. This means that code that works on v1.0 of the language will continue to work without any changes in any upcoming v1.x release. The most recent release as of the writing of this note is v1.3.

Implementation details and code distribution
Our study of using Julia in a typical high energy physics analysis workflow uses the LCIO event data model and persistence format, which has been developed since 2003 [8]. The code is released under the BSD-3-Clause license and hosted on GitHub. The release contains the C++, Fortran, Python, and Go code by default. The generation of Fortran bindings (through C method stubs) can be enabled by a flag at build time. The Python bindings can be enabled at build time, if Root is present. In this case, Root can generate dictionaries for the LCIO classes, which are then made available to Python through the PyRoot mechanism. Additionally, the distribution contains some convenience functions to make the bindings more pythonic.
The Julia bindings for LCIO (LCIO.jl [14]) were created by wrapping the C++ classes using the CxxWrap Julia module and the corresponding C++ library. This module requires declaring each class and method to be exposed on the Julia side to the wrapper generator, which then automatically creates the boilerplate code to translate between Julia and C++. This translation is transparent to the user, and we have kept the LCIO function names intact, which means that users can still take advantage of the documentation of the upstream library to learn about functionality. The workflow for this is similar to that for creating BOOST.Python bindings.
This method allows fine-grained control over how the bindings should behave on the Julia side. For example, it allows us to define a more convenient return type for the three-and four-vectors in LCIO, which are bare pointers in C++, but properly-dimensioned arrays in Julia. Additionally, the Julia bindings allow the use of collections of properly typed objects, where LCIO only provides collections of pointers to a base class. This frees the user from having to look up the class type for each of the collections that will be used in the analysis and is helpful particularly for new users. While the shortcomings in this example are straightforward to address on the C++ side, our work shows that this method for defining the bindings is a convenient and powerful way to improve the usability of existing libraries without having to modify the underlying code.
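As an illustration of how the bindings appear to an analysis user, the following sketch iterates over the events in a file and loops over a collection of reconstructed particles. The file and collection names are placeholders, and the accessor names follow the upstream LCIO API:

    using LCIO

    # "events.slcio" and "PandoraPFOCollection" are placeholder names
    LCIO.open("events.slcio") do reader
        for event in reader
            for particle in LCIO.getCollection(event, "PandoraPFOCollection")
                E = LCIO.getEnergy(particle)   # plain number on the Julia side
            end
        end
    end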
For the deployment, we are using the BinaryBuilder package to automatically build binary distributions for the LCIO library, as well as for the C++ component of the wrapper library. The same goes for the FastJet library and the corresponding glue code. Since our package was added to the general Julia registry, it can simply be installed with the command add LCIO in the package manager. This downloads the correct binaries for the user's environment as well as the glue code between LCIO and Julia. The user does not need to compile the C++ code or worry about compatibility between different library or compiler versions.

The Test Setup
To demonstrate a simple but realistic workflow, we reconstruct the invariant mass of the process e+e− → Z0 → µ+µ−, and we perform simple event-shape and jet clustering analyses on a set of e+e− → Z0 → qq events. The events were generated with the Whizard event generator [16], their interactions with the SiD model were simulated using Geant4, and the particles were reconstructed using the particle flow algorithms of the PandoraPFA package [23]. Events are stored in the LCIO format. To exclude potential OS variations, all results were run on a CentOS Linux 7.7 installation.
We have implemented the analysis code in C++ and Python, as well as in Julia. All three implementations share the same underlying routines for reading LCIO files. For the computation of the invariant mass in C++ and Python, we use the TLorentzVector class of the Root package; this is common practice, particularly among students. For simplicity, we are not implementing a four-vector class in Julia and instead compute the invariant mass from the four-vectors explicitly.
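Computing the invariant mass explicitly from the four-vector components is compact in Julia; a minimal sketch, with field names and example values of our own choosing:

    # invariant mass of a two-particle system from (E, px, py, pz) components
    function invmass(p1, p2)
        E  = p1.E  + p2.E
        px = p1.px + p2.px
        py = p1.py + p2.py
        pz = p1.pz + p2.pz
        sqrt(E^2 - px^2 - py^2 - pz^2)
    end

    # two illustrative back-to-back four-vectors
    invmass((E=45.6, px=0.0, py=0.0, pz=45.5),
            (E=45.6, px=0.0, py=0.0, pz=-45.5))   # ≈ 91.2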
For the three languages, C++, Python and Julia, the following versions were used:

C++: gcc 6.2.0
Python: 2.7.12
Julia:

The Root version used by the C++ and Python programs was 6.08/06. The plots in this note were created using the histogram (OnlineStats and Plots) and fitting (Distributions) packages from the Julia General registry [15].
In order to reduce fluctuations due to external effects like changes in the network or I/O performance, each measurement has been repeated five times. We report here the average of the five runs.

Benchmarking
To compare the performance of Julia with that of other languages like C++ or Python, we implemented a few toy analysis tasks that illustrate the performance differences between them.

Fitting the Invariant Mass
In this simple toy analysis implemented in C++, Julia and Python, we process a series of LCIO files with e+e− → Z0 → µ+µ− events to reconstruct the Z mass using a simplified Gaussian fit. This is mostly an I/O-dominated application. For the small data set of 32400 events, the C++ implementation is the clear winner, approximately twice as fast as Julia and Python (see Tab. 1). Adding the compile time of 1.76 seconds for the first run of the C++ variant does not change the conclusion significantly. This picture changes when one looks at the larger data sets. The throughput of the Python and C++ implementations remains more or less constant, as can be expected for interpreted and ahead-of-time compiled languages, respectively. Julia, on the other hand, shows a trend of growing throughput as the data set size increases. This behavior can be attributed to the overhead of the just-in-time compilation step, which in our tests is most noticeable in the benchmark with the smallest data set. The trends of processing times for the three language implementations are shown in Figure 1.
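The fitting step of such an analysis can be sketched in a few lines with the Distributions package; the toy data below stand in for the reconstructed masses, and the numbers are illustrative only:

    using Distributions

    # toy stand-in for the reconstructed invariant masses (in GeV)
    masses = rand(Normal(91.19, 2.5), 32400)

    # maximum-likelihood fit of a Gaussian to the mass distribution
    fitresult = fit(Normal, masses)
    println("mean = ", fitresult.μ, ", width = ", fitresult.σ)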
The question is now how large the overheads of the just-in-time compiled (Julia) and interpreted (Python) languages are compared to the C++ implementation. To get an estimate for this, the execution time was measured using the Unix time command.

Event Shape Analysis of Hadronic Z Events
While studies of an I/O-limited process are useful tests of a realistic workflow, the advantage of Julia over Python comes with a more processing-intensive algorithm, possibly with multiple nested loops over reconstructed objects. As an example of such an algorithm with a realistic use case, classical event-shape variables like thrust and the Fox-Wolfram moments [7] were implemented following the implementations in Pythia 6.4 and Pythia 8.2 [21,22]. For the analysis, a set of LCIO files with e+e− → Z0 → qq events was used.
The implementation in Julia (EventShapes.jl [12]) is a straightforward translation of the C++ code, which we have verified to produce identical output. This process revealed another important feature of Julia: we found the availability of tools to support the development process to be comparatively better than in C++ or Python. Our first version was several times slower than the C++ version, but a simple annotation with the @btime macro from the BenchmarkTools package helped us realize that this was mainly due to memory allocations in the inner loop. Another frequent cause of slowdown is type instability, which prevents the compiler from generating specialized code. In Julia, it is straightforward to identify this symptom with the @code_warntype macro. While these features aided us in our code translation, they are even more important in cases where no reference code exists and implementations of algorithms have to be developed from scratch. The time for the Julia implementation to process 233857 events was 17 minutes and 37 seconds, while the C++ implementation took 22 minutes and 2 seconds for the same data set.
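The kind of type instability these tools expose can be illustrated with a minimal example of our own; the accumulator below starts as an Int and becomes a Float64 inside the loop, which forces the compiler to generate generic code:

    using BenchmarkTools

    # type-unstable: `acc` changes type from Int to Float64
    function unstable(xs)
        acc = 0
        for x in xs
            acc += x
        end
        acc
    end

    # type-stable variant: `acc` has the element type from the start
    function stable(xs)
        acc = zero(eltype(xs))
        for x in xs
            acc += x
        end
        acc
    end

    xs = rand(10^6)
    @btime unstable($xs)
    @btime stable($xs)
    @code_warntype unstable(xs)   # highlights the unstable `acc`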
The implementation in Python is a straightforward translation of the C++ code. It is understood that this does not result in code that is optimal for performance, but in any case code optimization is an expert-level operation that requires experience in the performance characteristics of supporting libraries, as well as of the underlying hardware. Particularly in student settings, it is important that algorithms can be implemented in a straightforward manner from a scientific paper. Vandewalle, Kovacevic and Vetterli [24] define five levels of reproducibility, with the criterion for the highest and most desirable level being that the research can be re-implemented within 15 minutes or less. The limitations of Python, however, would require a substantial re-write of the algorithm to avoid for loops as much as possible. Some tools are available in the SciPy [25] distribution to speed up for loops by adding a compilation step using a C compiler, but these come with their own limitations and idiosyncrasies when interfacing with Root code, and a thorough comparison of the various ways to speed up Python goes beyond the scope of this note.
Because of the slow default implementation, we have limited the processing to 1% of the number of events that were processed in Julia and C++. The time for computing the Fox-Wolfram moments and thrust in 2339 events (compared to 233857 events for the other two languages) is 36 minutes and 21 seconds. Similarly to our efforts to speed up the Julia implementation, we have applied some straightforward optimization to the Python code. One common strategy in Python, in particular in combination with NumPy, is to use vectorization across the outer dimensions of an array. In our case, however, this increased the run time to 43 minutes and 43 seconds. This supports our point that optimizations in numerical algorithms require excellent tool support and benchmark measurements, and that "common-sense" optimizations are easy to apply prematurely and in an incorrect manner. Further optimizations are certainly possible but would require a more fundamental re-design of the algorithm, or would involve additional languages like cython [3] or numba [1], and are thus beyond the scope of this note.

Jet Clustering
We anticipate that most users will use Julia for data analysis in high energy physics in two ways: either by writing code directly in Julia, for example by translating existing code, as demonstrated in Section 5.2, or by calling into existing C++ libraries. As a demonstration of the latter, we have written simple bindings to the frequently used FastJet package (FastJet.jl [13]) in Julia, using the same CxxWrap package that we used to create the LCIO bindings. Similarly to the LCIO bindings, the code has been added to the Julia registry and can be installed with add FastJet in the Julia package manager. Processing the same 233857 events as in Section 5.2 took 15 minutes and 35 seconds for the Julia implementation and 8 minutes and 10 seconds for the C++ implementation.
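A sketch of what calling the bindings looks like from Julia; the names mirror the FastJet C++ API (PseudoJet, JetDefinition, ClusterSequence), and the input array and parameter values are illustrative:

    using FastJet

    # fourvectors: a placeholder list of (px, py, pz, E) tuples
    particles = [PseudoJet(px, py, pz, E) for (px, py, pz, E) in fourvectors]
    jetdef = JetDefinition(antikt_algorithm, 0.4)   # anti-kt, R = 0.4
    cs = ClusterSequence(particles, jetdef)
    jets = inclusive_jets(cs, 10.0)                 # jets above a pT threshold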

Parallelizing in Julia
As the number of cores on modern processors keeps growing, the event-level parallelism that has been exploited by high energy physics experiments for decades is no longer sufficient to take optimal advantage of the available processing power. Julia has several constructs to support parallel programming. We distinguish here between concurrent and parallel programming. In a concurrent application, different pieces of the code run in different processes, and the OS takes care of scheduling between them. For example, a program could read a file concurrently with, i.e. at the same time as, setting up a canvas for plotting, since I/O usually has a small ratio of CPU time to wall-clock time. This is a straightforward way to speed up programs that wait for I/O tasks to complete, and this level of concurrency is also available in Python, e.g. in the multiprocessing module. In Julia, the Distributed module allows scheduling tasks in different processes, either on the same CPU, or on different CPUs that are connected by a network.
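A minimal sketch of this pattern with the Distributed module; the reader function is a placeholder for the actual file-reading code:

    using Distributed
    addprocs(4)    # start four local worker processes

    # make the function available on all workers
    @everywhere function countevents(fname)
        # placeholder: open fname with the I/O library, return an event count
        return 0
    end

    # schedule the reads concurrently and collect the results
    futures = [@spawnat :any countevents(f) for f in ARGS]
    total = sum(fetch.(futures))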
On the other hand, the level of thread-parallel programming that has been supported in Julia since version 1.3.0 is not available in Python, due to the global interpreter lock (GIL). In this level of parallelism, different parts of the program run in parallel in the same process space. On multiple cores, they can be scheduled such that each thread takes full advantage of a different core. Our implementation of parallel event processing is based on Julia Tasks, and we have implemented concurrent programming concepts on two levels. The communication between different Tasks happens via Channels, which promotes a design similar to that of the Go programming language, for example. We combine data readers in separate concurrent processes (in our case on the same CPU), using the Distributed module, with data processors running in parallel on different threads, using the Threads module. Combining these two levels in Julia is straightforward, as shown below. This allows us to make optimal use of the available processing power given the specific I/O and processing characteristics of our application.

File-parallel Data Access
At the time of this writing, the LCIO implementation is not thread-safe and does not support parallel reading of events. We instead take advantage of the fact that common workflows operate on multiple files concurrently. In our implementation, we achieve this by combining several different LCIO readers, each running in its own independent process. By relegating different instances of the LCIO event readers to separate processes, we do not require the library to be thread-safe. This comes at the cost of additional overhead from inter-process communication, but our approach is applicable to basically any C++ library, regardless of whether it is thread-safe or not. Furthermore, it is straightforward to change the number of concurrently running instances of LCIO to optimize the ratio of event readers to event processors. In our implementation, the different LCIO processes run on the same node, but the extension of this kind of concurrency to multiple connected nodes is straightforward.

Putting It All Together: Processing Event Shapes and Jets in Parallel
Since Julia version 1.3.0, the runtime supports multi-threaded execution, which simplifies the implementation of within-event parallelism tremendously. To demonstrate the achievable speed-up, we process the event thrust and the Fox-Wolfram moments, implemented in Julia, and the jet clustering in C++ FastJet, for each event. Our example executes multiple event analyzers in parallel on different threads. While this paradigm allows for shared-memory parallelism, we have opted here to also use Channels for communication. In the following we sketch the implementation in Julia, showcasing how straightforward it is to set up multi-threaded event processing using the Distributed package. The number of available workers can be defined at Julia startup, e.g. julia -p 8 starts eight worker processes. The number of threads, however, needs to be set via the environment variable JULIA_NUM_THREADS before invoking the Julia executable.
In the Julia main function, several RemoteChannel(()->Channel()) instances are opened to ensure communication between the data readers and the data processors. The code skeleton of the main function is shown in Listing 1, illustrating the usage of channels, the spawning of the data readers on specific processes (using @spawnat), and the spawning of the data processors on different threads within the main process using the Threads.@spawn command.

Listing 1 The code skeleton of the main function for concurrent reading and parallel processing

    # we now have several functions waiting for input on their channels
    # prepare the file names
    for f in ARGS
        put!(fnames, f)
    end
    close(fnames)
    # wait for all readers to be done
    # then we can close the event queue and the processors can finish
    nDone = 0
    nEvents = 0
    while nDone != nworkers()
        nDone += 1
        nEvents += take!(done)
    end
    close(events)
    # wait for the processing to finish
    totalProcessed = 0
    for p in processors
        wait(p)
        totalProcessed += take!(nProcessed)
    end
    close(nProcessed)
    # if all of the channels are closed, the Tasks can finish
    # at this point, we should have: nEvents == totalProcessed
    end

The data reader function is shown in Listing 2. The @everywhere macro instructs Julia to make this function available on all processes. The data reader reads in events from an LCIO file and pushes them into a RemoteChannel().

The data processors that have been spawned using Threads.@spawn are now listening on the RemoteChannel() for events being available for processing, as shown in Listing 3.
Listing 3 The data processor function used to demonstrate multi-threaded processing

    function processEvents(events, nProcessed)
        while true
            try
                collection = take!(events)  # receive next event
                h10, h20, h30, h40 = EventShapes.foxWolframMoments(collection)
                # more analysis and clustering
            catch e
                # store the number of events that have been processed
                put!(nProcessed, iEvents)
                break
            end
        end
    end

The available speed-up using this paradigm is shown in Figure 3. It shows clearly that the processing is mainly I/O-dominated: more data readers give better performance, up to a maximum at 15 readers. The maximum value of 20 readers in our test leads to a decrease in performance, most likely due to contention between the different processes. Additionally, one can see that more data processors do not necessarily lead to a better overall throughput. In fact, in our tests, the highest event throughput was achieved with only five data processors and eight data readers. The optimal values for the numbers of readers and processors, as well as the buffer depth of the Channel between the two, depend on the workload and the details of the hardware configuration. A detailed analysis exceeds the scope of this note, but our chosen paradigm for setting up concurrent processing makes it trivial to evaluate the performance for a given setup and optimize the parameters for larger processing jobs. Figure 4 shows the relative speed-up of the multi-threaded Julia code over a single-threaded C++ application for different numbers of concurrent data readers and data processors. For reference, a single-threaded application processing the events in Julia takes 19 minutes and 1 second, and the C++ processing takes 21 minutes and 53 seconds.

Summary
For many years, the high energy physics community has spent significant resources on adding interactivity to their software to facilitate plotting and data exploration. Data science packages developed by other fields have realized the same need for interactive data access, and the standard packages in this sector are all accessible from the interpreted language Python. In this note, we have demonstrated that the Julia programming language can offer a compelling option for data analysis in high energy physics. The interactivity of the language is on par with Python, with support for Jupyter notebooks and a rich ecosystem for plotting and statistical data analysis.
With a large set of specialized codes in C++ and Fortran, developed over several decades, interoperability with these languages is a key requirement for any data analysis solution in high energy physics. We have shown that it is straightforward to interface Julia with existing libraries. What is more, the package manager in Julia makes it straightforward to install the code, without having to install a compiler for the other languages. The overhead of calling these legacy codes is measurable for simple applications, but negligible for applications with a computation-heavy component.
Fast execution is a requirement for code that has to process millions or even billions of collision events. We have demonstrated that complex algorithms consisting of multiply nested loops implemented in Julia perform on par with, or even better than, an implementation in C++. The language and package ecosystem have strong support for debugging and studying the software performance. Future developments of the underlying compiler are likely to further improve that performance.
The increased use of many-core systems and the growing demand on applications to take advantage of the available hardware through multi-threading position Julia extremely well for a growing role in scientific applications in general. We have demonstrated a straightforward way to speed up the processing of existing algorithms, without requiring that the algorithms themselves be multi-threaded. The composable parallelism in Julia allows combining this level of parallelism with algorithms that are themselves multi-threaded in nature. A growing set of examples of such algorithms is available online. This feature is what makes Julia an excellent choice for an analysis language in high energy physics, as it allows exploring new algorithms directly at the analysis level without having to drop down to the underlying C++ reconstruction framework for want of better performance.
In summary, Julia currently presents a compelling option for data analysis in high energy physics, and its multi-threading capabilities position it well for future developments. A common challenge with summer students is deciding which language to teach them. With C++, significant time is spent teaching the details of memory management and how to avoid segmentation faults, while with Python, different packages use different syntax to avoid the inherent slowness of for loops. Julia advertises itself as solving the two-language problem, where users combine a statically compiled language for performance-critical code with an interpreted language for interactivity. Our studies show that the language keeps this promise and that it is straightforward to write high-performance code. The language enables an interactive exploration of data, facilitates the implementation of complex algorithms, and is thus an excellent partner to the C++ processing frameworks used in high energy physics.