treo's comments | Hacker News



Awesome! I didn't expect you to write that blog post so soon. It will be interesting to find out where the differences (beyond the 'f' ordering thing) are that make Neanderthal so fast; when I originally started the comparison, both basically boiled down to calling MKL through JNI.

For completeness' sake, could you also provide your machine specs and operating system (and, if on Linux, your glibc and kernel versions)?


It looks like, while converting my benchmarking code, you dropped the 'f' when creating the result array.

https://github.com/treo/benchmarking_nd4j/blob/master/src/ma...
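For context, the 'c' vs. 'f' flag controls memory layout, which matters because MKL follows the Fortran (column-major) convention. A minimal sketch in plain Python (no ND4J or numpy involved; the function names are mine, purely for illustration) of how the same 2x3 matrix is laid out flat under each ordering:

```python
def get_c(buf, cols, i, j):
    # Row-major ('c') layout: each row is contiguous in memory.
    return buf[i * cols + j]

def get_f(buf, rows, i, j):
    # Column-major ('f') layout: each column is contiguous.
    # This is the convention Fortran-style BLAS routines (e.g. MKL's gemm)
    # expect, so 'f'-ordered arrays can be handed over without a copy.
    return buf[j * rows + i]

# The matrix [[1, 2, 3], [4, 5, 6]] stored flat under both orderings.
c_buf = [1, 2, 3, 4, 5, 6]
f_buf = [1, 4, 2, 5, 3, 6]

print(get_c(c_buf, 3, 1, 2))  # 6
print(get_f(f_buf, 2, 1, 2))  # 6
```

If the result array is created with 'c' ordering instead, a layout conversion gets added around the native call, which is one plausible source of the difference shown below.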

The difference is rather huge with the newer versions of nd4j.

The following gists do not contain the measurements I took for Neanderthal, but they do contain the numbers I got for ND4J.

Without f ordering: https://gist.github.com/treo/1fab39f213da26255cf4f75e383ff90...

With f ordering: https://gist.github.com/treo/94fe92c9417b5c8b24baa12924a35b0...

As you can see, something happened between the 0.4 release (my comparison point, since that was when I last ran my own benchmarks) and the 0.9.1 release that introduced additional overhead.

Originally I planned to create my own write-up on this, but I wanted to find out first what happened there.

Given that ND4J is mainly used inside DL4J, and the matrix sizes it is used with are usually rather large, the overhead I've observed for tiny multiplications isn't necessarily that bad, since the newer version performs much better on larger matrices.
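To make the "tiny multiplications" point concrete: a fixed per-call cost is amortized against O(n^3) gemm work, so it dominates small matrices and vanishes for large ones. A back-of-the-envelope sketch (the 1 microsecond overhead and 50 GFLOP/s throughput are made-up illustrative values, not measurements from either library):

```python
def overhead_fraction(n, overhead_us=1.0, gflops=50.0):
    """Fraction of an n x n gemm call spent on fixed per-call overhead.

    overhead_us (JNI transition, checks, allocation) and gflops
    (sustained BLAS throughput) are illustrative assumptions only.
    """
    # 2*n^3 flops at gflops * 1e9 flop/s, converted to microseconds.
    work_us = (2.0 * n ** 3) / (gflops * 1e3)
    return overhead_us / (overhead_us + work_us)

for n in (4, 64, 1024):
    print(n, overhead_fraction(n))
```

With these assumptions, a 4x4 multiply spends over 99% of its time in overhead, while a 1024x1024 multiply spends well under 0.01% there, which matches the shape of the benchmark results.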


You're right. In that particular case, ND4J comes close to Neanderthal's speed. But only in that particular case, and even then ND4J is still not faster than Neanderthal. My initial quest was to find out whether ND4J can be faster than Neanderthal, and I still couldn't find a case where it is.

Although, in my defense, the option in question is very poorly documented. I found the ND4J tutorial page where it's mentioned, and even after re-reading the sentence multiple times, I still can't connect its description to what it (seems to) actually do. It also doesn't mention that the option affects computation speed.

Anyway, I'm looking forward to reading your detailed analysis, and especially seeing your Neanderthal numbers.


Do you have any pointers on how you profiled Neanderthal during development?

When I originally set out to compare ND4J and Neanderthal, I ran into the issue that I quickly bottomed out: they both basically call MKL (or OpenBLAS) for BLAS operations.


Fair point, and we are fixing it now: https://github.com/deeplearning4j/deeplearning4j-docs/issues...

We will be sending out a doc with these updates by next week. Thanks a lot for playing ball here.

Beyond that, can you clarify what you mean? Do you mean just the gemm op?

As for that: it's the only case that mattered for us. We will document the what/how/why of this in our docs.

Beyond that, I'm not convinced the libraries are directly comparable when it comes to sheer scope.

You're treating nd4j as a gemm library rather than a fully fledged numpy/tensorflow-style library with hundreds of ops and support for things you would likely have no interest in building.

A big reason I built nd4j was to solve the general use case of building a tensor library for deep learning, not just a gemm library.

Beyond that - I'll give you props for what you built. There's always lessons to learn when comparing libraries and making sure the numbers match.

Our target isn't you, though; it's the likes of Google, Facebook, and co., and tackling the scope of tasks they are.

That being said - could we spend some time on docs? Heck yeah we should. At most, we have Javadoc and examples. We tend to help people as much as we can when they are profiling.

Could we manage it better? Yes, for sure. That's partially why we moved DL4J to the Eclipse Foundation: to get more third-party contributions and build a better governance setup. Will it take time for all of this to evolve? Oh yeah, most definitely.

No project is perfect; every project has things it could improve on.

Anyways - let's be clear here. You're a one-man shop who built an amazingly fast library that scratches your own itch for a very specific set of use cases. We're a company and community tackling a wider breadth of tasks and trying to focus more on serving customers and adding odd things like different kinds of serialization, Spark interop, etc.

We benefit from doing these comparisons; they force us to document things we normally don't pay attention to. This little exercise is good for us. As mentioned, we will document the limitations a bit better, and we will make sure to cover other topics like allocation as well as the BLAS interface.

Positive change has come out of this, and I'd like to thank you for the work you put in. We will make sure to re-run some of the comparisons on our side.


Sure. I agree. You as a company have to look at your bottom line above all. Nothing wrong with that.

Please note that Neanderthal also has hundreds of operations. The set of use cases where it scratches itches might be wider and more general than you think.

The reasons I'm showcasing matrix multiplications are:

1. That's what you used in the comparison.

2. It's a good proxy for overall performance: if matrix multiplication is poor, other operations tend to be even poorer :)

Anyway, as I said, I'll be glad to compare other operations that ND4J excels at, or that anyone thinks are important.

I would also like to see ND4J compared with TensorFlow, NumPy, PyTorch, or JVM-based MXNet.


Yeah, we definitely need to spend more time on benchmarks, after all is said and done.

That being said, while gemm is one op, there's a lot more to the library than the JNI back and forth. What also matters here are things like convolutions, pairwise distance calculations, element-wise ops, etc.

There's nuance there.

There are multiple layers here to consider:

1. The JNI interop managed via JavaCPP (relevant to this discussion)

2. Every op has allocation vs. in-place trade-offs to consider

3. Our Python interface is yet another layer to benchmark (we use pyjnius for Jumpy, the Python interface for nd4j)

4. The op implementations for the CUDA kernels and the custom CPU ops we wrote (that's where our AVX-512 and AVX2 jars matter, for example)

For the subset we are comparing against, it's basically a matter of making sure we wrap the BLAS calls properly. That's definitely something we should be doing.

We've profiled that and chose the pattern you're seeing above with f ordering.

That is where we are fast and what we chose to optimize for. You are faster in those other cases and have laid that out very well.

Again, there's still a lot that was learned here and I will post the doc when we get it out there to make that less painful next time.

You made a great post here and really laid out the trade offs.

I wish we had more time to run benchmarks beyond timing our own use cases; if we had a smaller scope, we would definitely focus on every case you're mentioning here. We will likely revisit this at some point if we find it worthwhile.

In general, our communications and docs can always be improved (especially around our internals, like memory allocation).

Re: your last point, we do do this kind of benchmarking with TensorFlow. For example: https://www.slideshare.net/agibsonccc/deploying-signature-ve... (see slide 3, and the broader deck for an idea of how we profile deep learning apps on the JVM)

We need to do a better job of maintaining these things, though. We don't keep them up to date and don't profile as much as we should; it has diminishing returns after a certain point versus building other features.

I'm hoping a CI build to generate these things is something we get done this year so we can both prevent performance regressions and have consistent numbers we can publish for the docs.

Once the Python interface is done, that will be easier to do and to justify, since most of our "competition" is in Python.


Looks like they are starting to come back up. My VPS is accessible again.


Your tests, or rather the output they produce, look awesome. Can you elaborate on your setup? Just today I was looking for an up-to-date introduction to testing in Django, but could only find some older blog posts, so I would be glad if you could detail your setup a bit.


I'll write it and let you guys know :-)

I've basically been following some Ruby people for the last couple of months - and worked at a Ruby shop myself last year - and they happen to know a lot more about testing than us pythonistas. But hey, it's very good to learn; at least Mock being included in Python 3 is a step in the right direction (well, that depends on what camp you're in).


The first one looks like nose with coverage, and the second like pinocchio with coverage. There are various ways to get those running with Django, but yes, a blog post would be nice.

Just don't get all caught up in 100% test coverage; it's not very useful. Testing edge cases that run the same code is more valuable than testing obvious cases that merely touch new code.
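A toy illustration of that point (the clamp function and its tests are hypothetical, not taken from any of these projects):

```python
import unittest

def clamp(x, lo, hi):
    # Hypothetical function under test: clamp x into [lo, hi].
    return max(lo, min(x, hi))

class TestClamp(unittest.TestCase):
    def test_obvious(self):
        # This single "happy path" case already yields 100% line coverage.
        self.assertEqual(clamp(5, 0, 10), 5)

    def test_edges(self):
        # These add no new coverage, yet they exercise the boundaries
        # where bugs (off-by-one, swapped min/max) actually hide.
        self.assertEqual(clamp(-1, 0, 10), 0)
        self.assertEqual(clamp(11, 0, 10), 10)
        self.assertEqual(clamp(0, 0, 10), 0)
        self.assertEqual(clamp(10, 0, 10), 10)

if __name__ == "__main__":
    # exit=False so the process continues after the test run.
    unittest.main(exit=False, argv=["clamp_test"], verbosity=0)
```

The coverage tool reports the same 100% either way; only the edge-case tests would catch, say, `min` and `max` being swapped.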


The Swiss have a direct democracy and are not a theocracy.



HP is still accumulating endurance cycle data at 10^12 cycles and the retention times are measured in years, he said. For all the people worried about the Semiaccurate article talking about only a billion read/write cycles, this shows that endurance is a very tuneable parameter, just like it is with flash memory. With flash you have some cells that are only good for 5K writes, and you have some that are good for 1M writes. With RRAM it looks like the numbers will be much higher.


Last commit on August 04, 2011. https://github.com/bjpop/berp


There is a commit to the new-identity branch on Oct 19th, 2011.


I would also love a more ebook-reader-friendly format like Mobi or EPUB. I still bought it, because I might be able to convert it into something usable myself, but a version actually made for ebook readers (like the Sony PRS-650 that I have :)) would be even better.


http://translate.google.com/translate?sl=auto&tl=en&...

These are their current goals. They want to allow private copying, so selling a copy would still be illegal.

Concerning patents: "We unanimously reject patents on organisms and genes, on business ideas, and also on software."

