
The performance benchmarks for Julia I have seen so far, such as at http://julialang.org/, compare Julia to pure Python or Python+NumPy. Unlike NumPy, SciPy uses the BLAS and LAPACK libraries, where we get an optimum multi-threaded SIMD implementation. If we assume that Julia and Python performance are the same when calling BLAS and LAPACK functions (under the hood), how does Julia performance compare to CPython when using Numba or NumbaPro for code that doesn't call BLAS or LAPACK functions?

One thing I notice is that Julia is using LLVM v3.3, while Numba uses llvmlite, which is built on LLVM v3.5. Does Julia's old LLVM prevent an optimum SIMD implementation on newer architectures, such as Intel Haswell (AVX2 instructions)?

I am interested in performance comparisons for both spaghetti code and small DSP loops that handle very large vectors. For me, the latter is handled more efficiently by the CPU than the GPU, due to the overhead of moving data in and out of GPU device memory. I am only interested in performance on a single Intel Core i7 CPU, so cluster performance is not important to me. Of particular interest is the ease of, and success in, creating parallelized implementations of DSP functions.

A second part of this question is a comparison of Numba to NumbaPro (ignoring the MKL BLAS). Is NumbaPro's target="parallel" really needed, given the new nogil argument for the @jit decorator in Numba?
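For concreteness, here is a rough sketch of the two mechanisms this last question contrasts. The toy kernels (scale_chunk, scale_ufunc) are hypothetical, and the target='parallel' line assumes a Numba/NumbaPro installation where that vectorize target is available:

import numpy as np
import numba

# nogil=True releases the GIL inside the compiled function, so several
# Python threads can run this kernel concurrently on different chunks.
@numba.jit(nopython=True, nogil=True)
def scale_chunk(x, out, gain):
    for i in range(x.shape[0]):
        out[i] = gain * x[i]

# target='parallel' asks the vectorize machinery itself to split the
# element-wise work across CPU cores, with no explicit threading on our side.
@numba.vectorize(['float64(float64, float64)'], target='parallel')
def scale_ufunc(x, gain):
    return gain * x

x = np.random.rand(10000000)
out = np.empty_like(x)
scale_chunk(x, out, 0.5)    # GIL is released while this runs
y = scale_ufunc(x, 0.5)     # parallelised by the generated ufunc

The first form leaves the chunking and threading to you (the GIL is merely out of the way), while the second lets the compiled ufunc distribute the work itself, which is roughly what NumbaPro's target="parallel" provided.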

  • @user3666197 Flaming responders and espousing conspiracy theories about SO responders engenders little sympathy for your cause. Your answer is verbose and difficult to understand. Your subsequent comments insult the goodwill of Julia users on SO who volunteer their time to answer questions. If you have constructive criticism about Julia performance timings versus Python/Numba, then consider posting a separate question on SO or on a Julia user list. This question by hiccup is not the appropriate avenue. Commented Jan 6, 2017 at 2:19
  • Dear Kevin L. Keys, thanks for responding to the deleted comment. Fact #1: the practice of deleting a post is called censorship, irrespective of the motivation for exercising that kind of power. Fact #2: citing the unfair timing practice documented in the LuaJIT discussion is a citation, not an opinion, much less an insult. Fact #3: a constructive proposal was presented from the first post of the Answer, as a reproducible MCVE, to allow running a coherent experiment, whereas later comments brought only an incoherent test factor (plus new light from a documented principal Lua incident). Commented Jan 6, 2017 at 5:59
  • The beauty and power of scientific critical thinking lies in its ability to repeat tests to confirm or invalidate a theory, model or test. If hiccup has asked about numba-LLVM/JIT-compiled performance, and the published statement says that GIL-stepped interpreted code runs 22x slower, then the experiment proposed below tested the zone of speed expectations for a coherent experiment (one that ought to be run and updated on the side of the language maintainers, with a corrected, fair timing method). Having sent a research proposal in this direction to prof. Sanders (now, MIT Julia Lab), it is fully doable. Commented Jan 6, 2017 at 6:18
  • Last but not least, given that your argumentation strives to protect (cit.:) "... the goodwill of Julia users on SO who volunteer their time to answer questions", let me ask you to kindly pay the same respect to my volunteered time to answer @hiccup's question and my goodwill in communicating the core merit, while being exposed to repetitive censorship and destructive down-voting hysteria. If one considers the Answer below difficult to understand and/or verbose, it strove to cite facts in a repeatable MCVE experiment, to allow those who can and want to re-run it to get results. Commented Jan 6, 2017 at 6:29
  • Given that several previous comments on the influence of the caching hierarchy on tests were deleted, and in the hope that the censors will not delete a link to Jean-François Puget's (IBM France) similarly motivated, thorough experimentation re-testing Sebastian F. Walter's tests, but on realistically sized matrices (where different caching strategies do show their edge): ibm.com/developerworks/community/blogs/jfp/entry/… where SciPy+LAPACK show their remarkable edge on matrix sizes above 1000x1000. Commented Jan 7, 2017 at 5:22

3 Answers


This is a very broad question. Regarding the benchmark requests, you may be best off running a few small benchmarks yourself that match your own needs. To answer one of the questions:

One thing I notice is that Julia is using LLVM v3.3, while Numba uses llvmlite, which is built on LLVM v3.5. Does Julia's old LLVM prevent an optimum SIMD implementation on newer architectures, such as Intel Haswell (AVX2 instructions)?

[2017/01+: The information below no longer applies to current Julia releases]

Julia does turn off AVX2 with LLVM 3.3 because there were some deep bugs on Haswell.

Julia is built with LLVM 3.3 for the current releases and nightlies, but you can build with 3.5, 3.6, and usually svn trunk (if we haven't yet updated for some API change on a given day, please file an issue). To do so, set LLVM_VER=svn (for example) in Make.user and then proceed to follow the build instructions.
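For example, a minimal Make.user might contain nothing more than the line below (a sketch; check the build documentation of your checkout for the exact variable values it supports):

# Make.user, created in the top-level directory of the julia source tree
LLVM_VER=svn

After that, run the normal build (make) from the same directory, as described in the build instructions.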



See here (section 4) for some peer-reviewed benchmarks which I personally worked on. The comparison was between Julia and PyPy.

1 Comment

I excluded PyPy from consideration because it doesn't support SciPy, matplotlib, 64-bit Windows+Python & Python 3.3+. In 2013, when the referenced paper was written, PyPy also didn't support BLAS & LAPACK. For scientific applications, I prefer to compare to CPython+SciPy+LLVM (Numba or NumbaPro).

( Comparing the incomparable is always a double-edged sword.

The text below is presented in the fair belief that LLVM / JIT-powered code benchmarks ought to be compared against other LLVM / JIT-powered alternatives, should any derived conclusion serve as a basis for reasonably supported decisions. )


Intro : ( the numba details and the [us] results come a bit further down the page )

With all due respect, the official site presents a tabulated set of performance tests in which two categories of facts are stated. The first relates to how the performance test was performed (julia, using LLVM-compiled code execution, versus python, remaining a GIL-stepped, interpreted code execution). The second states how much longer the other languages take to complete the same "benchmark-task", using C-compiled code execution as the relative unit of time = 1.0.

The chapter header above the table of results says (cit.:)

High-Performance JIT Compiler
Julia’s LLVM-based just-in-time (JIT) compiler combined with the language’s design allow it to approach and often match the performance of C.

[benchmark table from the julialang.org performance comparison]
I thought it a bit more rigorous to compare apples to apples, so I took just one of the "benchmark-task"s, the one called pi-sum.

This was the second-worst time for interpreted python, presented as running 21.99 times slower than the LLVM/JIT-compiled julia code or the C-compiled alternative.

So the small experiment began.

@numba.jit( JulSUM, nogil = True ):

Let's start by comparing apples to apples. If the julia code is reported to run 22x faster, let's first measure a plain, interpreted python run:

>>> def JulSUM():
...     sum = 0.
...     j   = 0
...     while j < 500:
...           j   += 1
...           sum  = 0.
...           k    = 0
...           while k < 10000:
...                 k   += 1
...                 sum += 1. / ( k * k )
...     return sum
...
>>> from zmq import Stopwatch
>>> aClk = Stopwatch()
>>> aClk.start();_=JulSUM();aClk.stop()
1271963L
1270088L
1279277L
1277371L
1279390L
1274231L

So, the core of the pi-sum loop runs in about 1,27x,xxx [us], i.e. roughly 1.27 ~ 1.28 [s]

Given the table row for pi-sum in the language presentation on the website, the LLVM/JIT-powered julia code execution ought to run about 22x faster, i.e. in under ~ 57.92 [ms]:

>>> 1274231 / 22
57919

So, let's convert oranges to apples, using numba.jit ( v0.24.0 ):

>>> import numba
>>> JIT_JulSUM = numba.jit( JulSUM )
>>> aClk.start();_=JIT_JulSUM();aClk.stop()
1175206L
>>> aClk.start();_=JIT_JulSUM();aClk.stop()
35512L
37193L
37312L
35756L
34710L

So, after the JIT compiler has done its job, numba-LLVM'ed python exhibits benchmark times of about 34.7 ~ 37.3 [ms]

Can we go further?

Sure, we have not done much numba tweaking yet, although with a code example this trivial, no surprising advances should be expected down the road.

First, let's remove the GIL-stepping, which is unnecessary here:

>>> JIT_NOGIL_JulSUM = numba.jit( JulSUM, nogil = True )
>>> aClk.start();_=JIT_NOGIL_JulSUM();aClk.stop()
85795L
>>> aClk.start();_=JIT_NOGIL_JulSUM();aClk.stop()
35526L
35509L
34720L
35906L
35506L

nogil=True does not take the execution much further, but it still shaves off a few more [ms], driving all results under ~ 35.9 [ms]

>>> JIT_NOGIL_NOPYTHON_JulSUM = numba.jit( JulSUM, nogil = True, nopython = True )
>>> aClk.start();_=JIT_NOGIL_NOPYTHON_JulSUM();aClk.stop()
84429L
>>> aClk.start();_=JIT_NOGIL_NOPYTHON_JulSUM();aClk.stop()
35779L
35753L
35515L
35758L
35585L
35859L

nopython=True adds just a final polishing touch, getting all results consistently under ~ 35.86 [ms] ( vs. ~ 57.92 [ms] derived above for the LLVM/JIT-julia code )
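( Side note: more recent numba releases also expose numba.njit as a shorthand for jit with nopython=True forced on; if your installed version has it, the last variant above can be written as: )

>>> JIT_NJIT_JulSUM = numba.njit( JulSUM, nogil = True )   # == numba.jit( JulSUM, nogil = True, nopython = True )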


Epilogue on DSP processing:

For the sake of the OP's question about additional benefits for accelerated DSP processing,
one may also try and test numba + Intel Python (via Anaconda), where Intel has opened a new horizon with binaries optimised for the internals of Intel processors, so code execution may enjoy additional CPU-bound tricks based on Intel's knowledge of the instruction-level parallelism, vectorisation and branch-prediction details that their own CPUs exhibit at runtime. It is worth a test to compare this (plus one may enjoy their non-destructive code-analysis tool integrated into Visual Studio, where in-vitro code-execution hot-spots can be analysed in real time -- something a DSP engineer would just love, wouldn't he/she?).
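To make that concrete for the DSP use case, below is a minimal sketch of the pattern: a hypothetical 31-tap FIR moving-average over a very large vector, compiled with nopython=True and nogil=True so that several Python threads could run it concurrently on (overlapping) chunks of the signal. It is an illustration, not a tuned implementation:

import numpy as np
import numba

# Hypothetical FIR filter over a large 1-D signal.
# nopython=True compiles the whole double loop to machine code;
# nogil=True releases the GIL so chunked, multi-threaded use is possible.
@numba.jit(nopython=True, nogil=True)
def fir_filter(x, taps):
    n = x.shape[0]
    m = taps.shape[0]
    y = np.zeros(n)
    for i in range(m - 1, n):
        acc = 0.0
        for j in range(m):
            acc += taps[j] * x[i - j]
        y[i] = acc
    return y

signal = np.random.rand(50000000)      # a "very large vector" that stays in CPU memory
taps = np.ones(31) / 31.0              # hypothetical moving-average taps
filtered = fir_filter(signal, taps)

On newer numba versions one can also experiment with parallel=True and numba.prange on the outer loop, which plays roughly the role NumbaPro's target="parallel" used to play; whether that beats a hand-threaded nogil kernel on a single Core i7 is exactly the kind of thing worth benchmarking on your own machine.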

14 Comments

  • Did you actually run the Julia code on your own machine? Which exact code? What was the timing? I suggest multiplying the workload by a factor of at least a hundred to have a fairer comparison.
  • ( Yes, the 500x-repeated 10k loop could be run many more times; however, I kept the cited julia-lang site methodology 1:1. )
  • Comparing Julia to numba is both sensible and interesting. But in order to do so, both codes must obviously be run on the same machine.
  • For what it's worth, Julia 0.5 is twice as fast as numba on my machine for this particular micro-benchmark.
  • Here's an example of an alternative approach, where perhaps GitHub is superior to StackOverflow for extended discussions and analysis.