Shared Benchmark suite for Pandas-like projects #19988

Open · mrocklin opened this issue Mar 5, 2018 · 5 comments
Labels: Benchmark (Performance (ASV) benchmarks), Performance (Memory or execution speed performance)


mrocklin commented Mar 5, 2018

It would be valuable to have a benchmark suite for Pandas-like projects. This would help users reasonably compare the performance tradeoffs of different implementations and help developers identify possible performance issues.

There are, I think, a few axes that such a benchmark suite might cover (a parameterized sketch in code follows the list):

  1. Operation type: filters, aggregations, random access, groupby-aggregate, set-index, merge, time series stuff, assignment, uniqueness, ...
  2. Datatype: grouping on ints, floats, strings, categoricals, etc.
  3. Cardinality: Lots of distinct floats, just a few common strings
  4. Data Size: How well do projects scale up? How well do they scale down?
  5. Cluster size: for those projects for which this is appropriate
  6. (probably lots of other things I'm missing)
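These axes compose naturally as benchmark parameters. Here is a minimal sketch of how they could map onto the asv-style layout pandas already uses in asv_bench/benchmarks; the class name, sizes, and dtypes are illustrative assumptions, not an agreed design:

```python
# Sketch of an asv-style parameterized benchmark; the parameter grid
# below covers axes 2-4 (dtype, cardinality, data size). Illustrative only.
import numpy as np
import pandas as pd


class GroupByAggregate:
    params = (["int", "str"], [10, 10_000], [10_000, 1_000_000])
    param_names = ["key_dtype", "n_groups", "n_rows"]

    def setup(self, key_dtype, n_groups, n_rows):
        rng = np.random.RandomState(42)
        keys = rng.randint(0, n_groups, size=n_rows)
        if key_dtype == "str":
            keys = pd.Index(keys).astype(str)
        self.df = pd.DataFrame({"key": keys, "value": rng.randn(n_rows)})

    def time_groupby_sum(self, key_dtype, n_groups, n_rows):
        self.df.groupby("key")["value"].sum()
```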

Additionally, there are a few projects that I think might benefit from such an endeavor:

  1. Pandas itself
  2. Newer pandas developments (whatever gets built on top of Arrow memory), which may have enough API compatibility to take advantage of this?
  3. Pandas on Ray (see this nice blogpost: https://rise.cs.berkeley.edu/blog/pandas-on-ray/)
  4. Dask.dataframe
  5. Spark DataFrames? Only if we can build in API tweaking, which I suspect will be necessary (see the adapter sketch after this list).
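On the API-tweaking point: one way to keep the benchmarks themselves library-agnostic is a thin per-project adapter layer. A minimal sketch, assuming pandas-generated input data and dask.dataframe's public from_pandas/compute API; the adapter class and method names here are hypothetical:

```python
# Hypothetical per-project adapters; PandasTarget/DaskTarget/prepare/run
# are invented names for illustration, not an existing API.
import pandas as pd


class PandasTarget:
    def prepare(self, df: pd.DataFrame):
        return df

    def run(self, frame, op):
        return op(frame)  # pandas is eager; nothing extra to force


class DaskTarget:
    def prepare(self, df: pd.DataFrame):
        import dask.dataframe as dd
        return dd.from_pandas(df, npartitions=8)

    def run(self, frame, op):
        return op(frame).compute()  # force execution of the lazy graph


# The operation is written once against the shared subset of the API:
def groupby_sum(frame):
    return frame.groupby("key")["value"].sum()
```

A Spark adapter would likely need to translate the operation itself rather than just wrap it, since the PySpark DataFrame API diverges more from pandas.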

Some operational questions:

  1. How does one socially organize such a collection of benchmarks in a sensible way? My guess is that no one individual is likely to have time to put this together (though I would love to be proved wrong here). The objectives here are somewhat different from what currently lives in asv_bench/benchmarks.
  2. How does one consistently execute such a benchmark? I was looking at http://pytest-benchmark.readthedocs.io/en/latest (a minimal usage sketch follows this list).
  3. What challenges are we likely to observe due to the differences in each project? How do we reasonably work around them?
  4. How do we avoid developer bias when forming benchmarks?
  5. Does anyone have enthusiasm about working on this?
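On question 2: pytest-benchmark exposes a `benchmark` fixture that calls a function repeatedly and records timing statistics. A minimal sketch (the data shape and test name are illustrative):

```python
# Minimal pytest-benchmark usage sketch; run with `pytest`.
import numpy as np
import pandas as pd


def test_groupby_sum(benchmark):
    rng = np.random.RandomState(0)
    df = pd.DataFrame({"key": rng.randint(0, 100, size=100_000),
                       "value": rng.randn(100_000)})
    # The fixture times the callable over many rounds and reports stats.
    result = benchmark(lambda: df.groupby("key")["value"].sum())
    assert not result.empty
```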

Anyway, those are some thoughts. Please let me know if this is out of scope for this issue tracker.

@TomAugspurger commented:
I'd like to see this, probably under the pandas-dev organization.

re organization: I like declarative files for describing the benchmarks and libraries (benchmarkees?).
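One possible shape for those declarative files, sketched here as plain Python data for concreteness; every field name is invented for illustration, and a real suite might prefer YAML or TOML:

```python
# Hypothetical declarative benchmark/library descriptions; all field
# names are invented for illustration.
BENCHMARKS = {
    "groupby_sum": {
        "operation": "groupby-aggregate",
        "key_dtypes": ["int64", "float64", "str", "category"],
        "n_groups": [10, 10_000],
        "n_rows": [10_000, 1_000_000],
    },
}

LIBRARIES = {  # the "benchmarkees"
    "pandas": {"adapter": "adapters.pandas", "distributed": False},
    "dask": {"adapter": "adapters.dask", "distributed": True},
}
```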

> What challenges are we likely to observe due to the differences in each project?

Possibly benchmarks that are appropriate for one library, but wildly inappropriate for another? I'm not sure.

> How do we avoid developer bias when forming benchmarks?

If everyone's contributing it'll cancel out, right? :)

> Does anyone have enthusiasm about working on this?

I can start something after the pandas 0.23 release in a few weeks.

@TomAugspurger added the Performance (Memory or execution speed performance), Needs Discussion (Requires discussion from core team before further action), and Community labels, and removed the Needs Discussion label on Mar 5, 2018
@TomAugspurger commented:
FYI: https://github.com/mm-mansour/Fast-Pandas

@mm-mansour nice work on that. Do you have any interest in expanding those benchmarks to alternative implementations (ray, dask, spark)? Some of these could complicate the setup code a decent amount, so "no" is a perfectly fine answer. Would you mind including a license in your repo, so that we can re-use pieces with credit back to the original (and because that's good OSS practice)?

ziedbouf commented May 6, 2018

Is any work still going on for this? I am exploring Ray and found that there is a lot of discussion going around benchmarks for several frameworks (Ray, Dask, Spark, etc.).


TomAugspurger commented May 6, 2018 via email


abi-aryan commented May 14, 2018

@mrocklin @ziedbouf I would love to work on it.

@jbrockmendel added the Benchmark (Performance (ASV) benchmarks) label on Jul 25, 2018