
The performance benchmarks for Julia I have seen so far, such as at http://julialang.org/, compare Julia to pure Python or Python+NumPy. Unlike NumPy, SciPy uses the BLAS and LAPACK libraries, where we get an optimum multi-threaded SIMD implementation. If we assume that Julia and Python performance are the same when calling BLAS and LAPACK functions (under the hood), how does Julia performance compare to CPython when using Numba or NumbaPro for code that doesn't call BLAS or LAPACK functions?

One thing I notice is that Julia is using LLVM v3.3, while Numba uses llvmlite, which is built on LLVM v3.5. Does Julia's old LLVM prevent an optimum SIMD implementation on newer architectures, such as Intel Haswell (AVX2 instructions)?

I am interested in performance comparisons for both spaghetti code and small DSP loops that handle very large vectors. For me, the latter is handled more efficiently by the CPU than the GPU, due to the overhead of moving data in and out of GPU device memory. I am only interested in performance on a single Intel Core i7 CPU, so cluster performance is not important to me. Of particular interest is the ease of, and success in, creating parallelized implementations of DSP functions.

A second part of this question is a comparison of Numba to NumbaPro (ignoring the MKL BLAS). Is NumbaPro's target="parallel" really needed, given the new nogil argument for the @jit decorator in Numba?
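For concreteness, here is a rough sketch of the two mechanisms this last question contrasts. The toy kernels (scale_chunk, scale_ufunc) are hypothetical, and the target='parallel' line assumes a Numba/NumbaPro installation where that vectorize target is available:

import numpy as np
import numba

# nogil=True releases the GIL inside the compiled function, so several
# Python threads can run this kernel concurrently on different chunks.
@numba.jit(nopython=True, nogil=True)
def scale_chunk(x, out, gain):
    for i in range(x.shape[0]):
        out[i] = gain * x[i]

# target='parallel' asks the vectorize machinery itself to split the
# element-wise work across CPU cores, with no explicit threading on our side.
@numba.vectorize(['float64(float64, float64)'], target='parallel')
def scale_ufunc(x, gain):
    return gain * x

x = np.random.rand(10000000)
out = np.empty_like(x)
scale_chunk(x, out, 0.5)    # GIL is released while this runs
y = scale_ufunc(x, 0.5)     # parallelised by the generated ufunc

The first form leaves the chunking and threading to you (the GIL is merely out of the way), while the second lets the compiled ufunc distribute the work itself, which is roughly what NumbaPro's target="parallel" provided.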

  • @user3666197 Flaming responders and espousing conspiracy theories about SO responders engenders little sympathy for your cause. Your answer is verbose and difficult to understand. Your subsequent comments insult the goodwill of Julia users on SO who volunteer their time to answer questions. If you have constructive criticism about Julia performance timings versus Python/Numba, then consider posting a separate question on SO or on a Julia user list. This question by hiccup is not the appropriate avenue. Commented Jan 6, 2017 at 2:19
  • Dear Kevin L. Keys, thanks for responding to the deleted comment. Fact #1: the practice of deleting a post is called censorship, irrespective of the motivation for exercising that kind of power. Fact #2: citing the unfair timing practice documented in the LuaJIT discussion is a citation, not an opinion, much less an insult. Fact #3: a constructive proposal was presented from the first post of the Answer, as a reproducible MCVE, to allow running a coherent experiment, whereas later comments brought only an incoherent test factor (plus new light from a documented principal Lua incident). Commented Jan 6, 2017 at 5:59
  • The beauty and power of scientific critical thinking lies in its ability to repeat tests to confirm or invalidate a theory, model or test. If hiccup has asked about numba-LLVM/JIT-compiled performance, and the published statement says that GIL-stepped interpreted code runs 22x slower, then the experiment proposed below tested the zone of speed expectations for a coherent experiment (one that ought to be run and updated on the side of the language maintainers, with a corrected, fair timing method). Having sent a research proposal in this direction to prof. Sanders (now, MIT Julia Lab), it is fully doable. Commented Jan 6, 2017 at 6:18
  • Last but not least, given that your argumentation strives to protect (cit.:) "... the goodwill of Julia users on SO who volunteer their time to answer questions", let me ask you to kindly pay the same respect to my volunteered time to answer @hiccup's question and my goodwill in communicating the core merit, while being exposed to repetitive censorship and destructive down-voting hysteria. If one considers the Answer below difficult to understand and/or verbose, it strove to cite facts in a repeatable MCVE experiment, to allow those who can and want to re-run it to get results. Commented Jan 6, 2017 at 6:29
  • Given that several previous comments on the influence of the caching hierarchy on tests were deleted, and in the hope that the censors will not delete a link to Jean-François Puget's (IBM France) similarly motivated, thorough experimentation re-testing Sebastian F. Walter's tests, but on realistically sized matrices (where different caching strategies do show their edge): ibm.com/developerworks/community/blogs/jfp/entry/… where SciPy+LAPACK show their remarkable edge on matrix sizes above 1000x1000. Commented Jan 7, 2017 at 5:22

3 Answers


This is a very broad question. Regarding the benchmark requests, you may be best off running a few small benchmarks yourself that match your own needs. To answer one of the questions:

One thing I notice is that Julia is using LLVM v3.3, while Numba uses llvmlite, which is built on LLVM v3.5. Does Julia's old LLVM prevent an optimum SIMD implementation on newer architectures, such as Intel Haswell (AVX2 instructions)?

[2017/01+: The information below no longer applies to current Julia releases]

Julia does turn off AVX2 with LLVM 3.3 because there were some deep bugs on Haswell.

Julia is built with LLVM 3.3 for the current releases and nightlies, but you can build with 3.5, 3.6, and usually svn trunk (if we haven't yet updated for some API change on a given day, please file an issue). To do so, set LLVM_VER=svn (for example) in Make.user and then proceed to follow the build instructions.
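For example, a minimal Make.user might contain nothing more than the line below (a sketch; check the build documentation of your checkout for the exact variable values it supports):

# Make.user, created in the top-level directory of the julia source tree
LLVM_VER=svn

After that, run the normal build (make) from the same directory, as described in the build instructions.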



See here (section 4) for some peer-reviewed benchmarks which I personally worked on. The comparison was between Julia and PyPy.

1 Comment

I excluded PyPy from consideration because it doesn't support SciPy, matplotlib, 64-bit Windows+Python & Python 3.3+. In 2013, when the referenced paper was written, PyPy also didn't support BLAS & LAPACK. For scientific applications, I prefer to compare to CPython+SciPy+LLVM (Numba or NumbaPro).

( Comparing the incomparable is always a double-edged sword.

The text below is presented in the fair belief that LLVM / JIT-powered code benchmarks ought to be compared against other LLVM / JIT-powered alternatives, should any derived conclusion serve as a basis for reasonably supported decisions. )


Intro : ( the numba details and the [us] results come a bit further down the page )

With all due respect, the official site presents a tabulated set of performance tests in which two categories of facts are stated. The first relates to how the performance test was performed (julia, using LLVM-compiled code execution, versus python, remaining a GIL-stepped, interpreted code execution). The second states how much longer the other languages take to complete the same "benchmark-task", using C-compiled code execution as the relative unit of time = 1.0.

The chapter header above the table of results says (cit.:)

High-Performance JIT Compiler
Julia’s LLVM-based just-in-time (JIT) compiler combined with the language’s design allow it to approach and often match the performance of C.

[benchmark table from the julialang.org performance comparison]
I thought it a bit more rigorous to compare apples to apples, so I took just one of the "benchmark-task"s, the one called pi-sum.

This was the second-worst time for interpreted python, presented as running 21.99 times slower than the LLVM/JIT-compiled julia code or the C-compiled alternative.

So the small experiment began.

@numba.jit( JulSUM, nogil = True ):

Let's start by comparing apples to apples. If the julia code is reported to run 22x faster, let's first measure a plain, interpreted python run:

>>> def JulSUM():
...     sum = 0.
...     j   = 0
...     while j < 500:
...           j   += 1
...           sum  = 0.
...           k    = 0
...           while k < 10000:
...                 k   += 1
...                 sum += 1. / ( k * k )
...     return sum
...
>>> from zmq import Stopwatch
>>> aClk = Stopwatch()
>>> aClk.start();_=JulSUM();aClk.stop()
1271963L
1270088L
1279277L
1277371L
1279390L
1274231L

So, the core of the pi-sum loop runs in about 1,27x,xxx [us], i.e. roughly 1.27 ~ 1.28 [s]

Given the table row for pi-sum in the language presentation on the website, the LLVM/JIT-powered julia code execution ought to run about 22x faster, i.e. in under ~ 57.92 [ms]:

>>> 1274231 / 22
57919

So, let's convert oranges to apples, using numba.jit ( v0.24.0 ):

>>> import numba
>>> JIT_JulSUM = numba.jit( JulSUM )
>>> aClk.start();_=JIT_JulSUM();aClk.stop()
1175206L
>>> aClk.start();_=JIT_JulSUM();aClk.stop()
35512L
37193L
37312L
35756L
34710L

So, after the JIT compiler has done its job, numba-LLVM'ed python exhibits benchmark times of about 34.7 ~ 37.3 [ms]

Can we go further?

Sure, we have not done much numba tweaking yet, although with a code example this trivial, no surprising advances should be expected down the road.

First, let's remove the GIL-stepping, which is unnecessary here:

>>> JIT_NOGIL_JulSUM = numba.jit( JulSUM, nogil = True )
>>> aClk.start();_=JIT_NOGIL_JulSUM();aClk.stop()
85795L
>>> aClk.start();_=JIT_NOGIL_JulSUM();aClk.stop()
35526L
35509L
34720L
35906L
35506L

nogil=True does not take the execution much further, but it still shaves off a few more [ms], driving all results under ~ 35.9 [ms]

>>> JIT_NOGIL_NOPYTHON_JulSUM = numba.jit( JulSUM, nogil = True, nopython = True )
>>> aClk.start();_=JIT_NOGIL_NOPYTHON_JulSUM();aClk.stop()
84429L
>>> aClk.start();_=JIT_NOGIL_NOPYTHON_JulSUM();aClk.stop()
35779L
35753L
35515L
35758L
35585L
35859L

nopython=True adds just a final polishing touch, getting all results consistently under ~ 35.86 [ms] ( vs. ~ 57.92 [ms] derived above for the LLVM/JIT-julia code )
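( Side note: more recent numba releases also expose numba.njit as a shorthand for jit with nopython=True forced on; if your installed version has it, the last variant above can be written as: )

>>> JIT_NJIT_JulSUM = numba.njit( JulSUM, nogil = True )   # == numba.jit( JulSUM, nogil = True, nopython = True )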


Epilogue on DSP processing:

For the sake of the OP's question about additional benefits for accelerated DSP processing,
one may also try and test numba + Intel Python (via Anaconda), where Intel has opened a new horizon with binaries optimised for the internals of Intel processors, so code execution may enjoy additional CPU-bound tricks based on Intel's knowledge of the instruction-level parallelism, vectorisation and branch-prediction details that their own CPUs exhibit at runtime. It is worth a test to compare this (plus one may enjoy their non-destructive code-analysis tool integrated into Visual Studio, where in-vitro code-execution hot-spots can be analysed in real time -- something a DSP engineer would just love, wouldn't he/she?).
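To make that concrete for the DSP use case, below is a minimal sketch of the pattern: a hypothetical 31-tap FIR moving-average over a very large vector, compiled with nopython=True and nogil=True so that several Python threads could run it concurrently on (overlapping) chunks of the signal. It is an illustration, not a tuned implementation:

import numpy as np
import numba

# Hypothetical FIR filter over a large 1-D signal.
# nopython=True compiles the whole double loop to machine code;
# nogil=True releases the GIL so chunked, multi-threaded use is possible.
@numba.jit(nopython=True, nogil=True)
def fir_filter(x, taps):
    n = x.shape[0]
    m = taps.shape[0]
    y = np.zeros(n)
    for i in range(m - 1, n):
        acc = 0.0
        for j in range(m):
            acc += taps[j] * x[i - j]
        y[i] = acc
    return y

signal = np.random.rand(50000000)      # a "very large vector" that stays in CPU memory
taps = np.ones(31) / 31.0              # hypothetical moving-average taps
filtered = fir_filter(signal, taps)

On newer numba versions one can also experiment with parallel=True and numba.prange on the outer loop, which plays roughly the role NumbaPro's target="parallel" used to play; whether that beats a hand-threaded nogil kernel on a single Core i7 is exactly the kind of thing worth benchmarking on your own machine.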

14 Comments

  • Did you actually run the Julia code on your own machine? Which exact code? What was the timing? I suggest multiplying the workload by a factor of at least a hundred to have a fairer comparison.
  • ( Yes, the 500x-repeated 10k loop could be run many more times; however, I kept the cited julia-lang site methodology 1:1. )
  • Comparing Julia to numba is both sensible and interesting. But in order to do so, both codes must obviously be run on the same machine.
  • For what it's worth, Julia 0.5 is twice as fast as numba on my machine for this particular micro-benchmark.
  • Here's an example of an alternative approach, where perhaps GitHub is superior to StackOverflow for extended discussions and analysis.