Shared Benchmark suite for Pandas-like projects #19988

Open · mrocklin opened this issue Mar 5, 2018 · 5 comments
Labels: Benchmark (Performance (ASV) benchmarks), Performance (Memory or execution speed performance)


mrocklin commented Mar 5, 2018

It would be valuable to have a benchmark suite for Pandas-like projects. This would help users reasonably compare the performance tradeoffs of different implementations and help developers identify possible performance issues.

There are, I think, a few axes that such a benchmark suite might cover (a parameterized sketch in code follows the list):

  1. Operation type: filters, aggregations, random access, groupby-aggregate, set-index, merge, time series stuff, assignment, uniqueness, ...
  2. Datatype: grouping on ints, floats, strings, categoricals, etc.
  3. Cardinality: Lots of distinct floats, just a few common strings
  4. Data Size: How well do projects scale up? How well do they scale down?
  5. Cluster size: for those projects for which this is appropriate
  6. (probably lots of other things I'm missing)
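These axes compose naturally as benchmark parameters. Here is a minimal sketch of how they could map onto the asv-style layout pandas already uses in asv_bench/benchmarks; the class name, sizes, and dtypes are illustrative assumptions, not an agreed design:

```python
# Sketch of an asv-style parameterized benchmark; the parameter grid
# below covers axes 2-4 (dtype, cardinality, data size). Illustrative only.
import numpy as np
import pandas as pd


class GroupByAggregate:
    params = (["int", "str"], [10, 10_000], [10_000, 1_000_000])
    param_names = ["key_dtype", "n_groups", "n_rows"]

    def setup(self, key_dtype, n_groups, n_rows):
        rng = np.random.RandomState(42)
        keys = rng.randint(0, n_groups, size=n_rows)
        if key_dtype == "str":
            keys = pd.Index(keys).astype(str)
        self.df = pd.DataFrame({"key": keys, "value": rng.randn(n_rows)})

    def time_groupby_sum(self, key_dtype, n_groups, n_rows):
        self.df.groupby("key")["value"].sum()
```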

Additionally, there are a few projects that I think might benefit from such an endeavor:

  1. Pandas itself
  2. Newer pandas developments (whatever gets built on top of Arrow memory), which may have enough API compatibility to take advantage of this?
  3. Pandas on Ray (see this nice blogpost: https://rise.cs.berkeley.edu/blog/pandas-on-ray/)
  4. Dask.dataframe
  5. Spark DataFrames? Only if we can build in API tweaking, which I suspect will be necessary (see the adapter sketch after this list).
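On the API-tweaking point: one way to keep the benchmarks themselves library-agnostic is a thin per-project adapter layer. A minimal sketch, assuming pandas-generated input data and dask.dataframe's public from_pandas/compute API; the adapter class and method names here are hypothetical:

```python
# Hypothetical per-project adapters; PandasTarget/DaskTarget/prepare/run
# are invented names for illustration, not an existing API.
import pandas as pd


class PandasTarget:
    def prepare(self, df: pd.DataFrame):
        return df

    def run(self, frame, op):
        return op(frame)  # pandas is eager; nothing extra to force


class DaskTarget:
    def prepare(self, df: pd.DataFrame):
        import dask.dataframe as dd
        return dd.from_pandas(df, npartitions=8)

    def run(self, frame, op):
        return op(frame).compute()  # force execution of the lazy graph


# The operation is written once against the shared subset of the API:
def groupby_sum(frame):
    return frame.groupby("key")["value"].sum()
```

A Spark adapter would likely need to translate the operation itself rather than just wrap it, since the PySpark DataFrame API diverges more from pandas.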

Some operational questions:

  1. How does one socially organize such a collection of benchmarks in a sensible way? My guess is that no one individual is likely to have time to put this together (though I would love to be proved wrong here). The objectives here are somewhat different from what currently lives in asv_bench/benchmarks.
  2. How does one consistently execute such a benchmark? I was looking at http://pytest-benchmark.readthedocs.io/en/latest (a minimal usage sketch follows this list).
  3. What challenges are we likely to observe due to the differences in each project? How do we reasonably work around them?
  4. How do we avoid developer bias when forming benchmarks?
  5. Does anyone have enthusiasm about working on this?
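On question 2: pytest-benchmark exposes a `benchmark` fixture that calls a function repeatedly and records timing statistics. A minimal sketch (the data shape and test name are illustrative):

```python
# Minimal pytest-benchmark usage sketch; run with `pytest`.
import numpy as np
import pandas as pd


def test_groupby_sum(benchmark):
    rng = np.random.RandomState(0)
    df = pd.DataFrame({"key": rng.randint(0, 100, size=100_000),
                       "value": rng.randn(100_000)})
    # The fixture times the callable over many rounds and reports stats.
    result = benchmark(lambda: df.groupby("key")["value"].sum())
    assert not result.empty
```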

Anyway, those are some thoughts. Please let me know if this is out of scope for this issue tracker.

@TomAugspurger commented:
I'd like to see this, probably under the pandas-dev organization.

re organization: I like declarative files for describing the benchmarks and libraries (benchmarkees?).
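One possible shape for those declarative files, sketched here as plain Python data for concreteness; every field name is invented for illustration, and a real suite might prefer YAML or TOML:

```python
# Hypothetical declarative benchmark/library descriptions; all field
# names are invented for illustration.
BENCHMARKS = {
    "groupby_sum": {
        "operation": "groupby-aggregate",
        "key_dtypes": ["int64", "float64", "str", "category"],
        "n_groups": [10, 10_000],
        "n_rows": [10_000, 1_000_000],
    },
}

LIBRARIES = {  # the "benchmarkees"
    "pandas": {"adapter": "adapters.pandas", "distributed": False},
    "dask": {"adapter": "adapters.dask", "distributed": True},
}
```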

> What challenges are we likely to observe due to the differences in each project?

Possibly benchmarks that are appropriate for one library, but wildly inappropriate for another? I'm not sure.

> How do we avoid developer bias when forming benchmarks?

If everyone's contributing it'll cancel out, right? :)

> Does anyone have enthusiasm about working on this?

I can start something after the pandas 0.23 release in a few weeks.

@TomAugspurger added the Performance (Memory or execution speed performance), Needs Discussion (Requires discussion from core team before further action), and Community labels, and removed the Needs Discussion label on Mar 5, 2018
@TomAugspurger commented:
FYI: https://github.com/mm-mansour/Fast-Pandas

@mm-mansour nice work on that. Do you have any interest in expanding those benchmarks to alternative implementations (ray, dask, spark)? Some of these could complicate the setup code a decent amount, so "no" is a perfectly fine answer. Would you mind including a license in your repo, so that we can re-use pieces with credit back to the original (and because that's good OSS practice)?

ziedbouf commented May 6, 2018

Is any work still going on for this? I am exploring Ray and found that there is a lot of discussion going around benchmarks for several frameworks (Ray, Dask, Spark, etc.).


TomAugspurger commented May 6, 2018 via email


abi-aryan commented May 14, 2018

@mrocklin @ziedbouf I would love to work on it.

@jbrockmendel added the Benchmark (Performance (ASV) benchmarks) label on Jul 25, 2018