wip by massich · Pull Request #5 · massich/scikit-learn

massich · 2017-09-14T10:15:22Z

This is a PR regarding this concern by @glemaitre.

The main concern is that checking a sparse matrix currently requires materializing it, and @glemaitre is wondering whether the result could be computed directly from the sparse matrix. The underlying computation is this one:

mode = np.array([stats.mode(pl)[0] for pl in pred_labels], dtype=np.int)

which could be replaced (for the sparse case) by something like this:

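def get_sparse_matrix_mode(x_sparse):
    # Mode (and its count) among the explicitly stored entries only.
    mode, occ = stats.mode(x_sparse.data)
    # Every entry that is not stored is an implicit zero.
    n_of_zero_values = np.prod(x_sparse.shape) - x_sparse.nnz
    return mode if occ > n_of_zero_values else 0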

mode = np.array([get_sparse_matrix_mode(pl) for pl in pred_labels])

The problem arises when the occurrences of 0 need to be weighted, which is the case here. Any feedback is welcome.
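To make the weighted case concrete, here is a minimal sketch (not the code in this PR) of a weighted mode for a single 1 x n sparse row. The name sparse_weighted_mode and its signature are hypothetical. It assumes the weights are a dense vector with one entry per column, so the weight of the implicit zeros is simply whatever weight is left after subtracting the stored entries. Ties are broken towards the smallest value.

```python
import numpy as np
from scipy import sparse


def sparse_weighted_mode(row, weights=None):
    """Weighted mode of a 1 x n sparse row (hypothetical helper).

    `weights` is assumed to be a dense vector with one weight per column;
    if omitted, every entry gets weight 1 (plain mode).
    """
    row = sparse.csr_matrix(row, copy=True)
    row.sum_duplicates()
    if row.shape[0] != 1:
        raise ValueError("expected a single-row sparse matrix")
    n = row.shape[1]
    weights = np.ones(n) if weights is None else np.asarray(weights, dtype=float)

    # Weight accumulated by each distinct stored value.
    stored_w = weights[row.indices]
    values, inverse = np.unique(row.data, return_inverse=True)
    totals = np.bincount(inverse, weights=stored_w, minlength=len(values))

    # Whatever weight is not attached to a stored entry belongs to implicit zeros.
    zero_w = weights.sum() - stored_w.sum()
    zero_pos = np.flatnonzero(values == 0)
    if zero_pos.size:
        # Explicitly stored zeros: merge them with the implicit ones.
        totals[zero_pos[0]] += zero_w
    else:
        values = np.append(values, 0)
        totals = np.append(totals, zero_w)

    # Break ties towards the smallest value.
    return values[totals == totals.max()].min()


row = sparse.csr_matrix(np.array([[0, 2, 2, 0, 0, 3]]))
print(sparse_weighted_mode(row))                                 # zeros win: 3 vs 2 vs 1
print(sparse_weighted_mode(row, [0.1, 1, 1, 0.1, 0.1, 0.5]))     # value 2 wins on weight
```

With unit weights this reduces to the unweighted mode, so a single code path could serve both cases.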

Some notes that I don't know how to fit into the discussion but that are important when making a decision (listed in no particular order):
1 - The case of a sparse matrix with weights is never tested. See this breakpoint: Travis is still all green.

2 - There are two different signatures of self._mode: KNeighborsClassifier::_mode(self, neigh, weights) and RadiusNeighborsClassifier::_mode(self, pred_labels, weights, inliers). Moreover, neither of them uses self. So shouldn't we unify the signature and make it a free function? Or, if it really should be a method, shouldn't both have the same signature and live in a parent class?

3 - In order to unify the call and simplify the code, weights could always be provided (at the expense of some computing time), and we could even add sparse support to sklearn.utils.extmath.weighted_mode:

>>> from scipy import sparse
>>> import numpy as np
>>> from sklearn.utils.extmath import weighted_mode
>>> from scipy.stats import mode

>>> x = np.random.random((1000,))
>>> y = np.ones((1000,))
>>> %timeit weighted_mode(x,y)
30.6 ms ± 1.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit mode(x)
25.1 ms ± 1.86 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

>>> x = np.random.random((100000,))
>>> y = np.ones((100000,))
>>> %timeit weighted_mode(x,y)
30.1 s ± 2.22 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit mode(x)
15.5 s ± 90.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

pep8speaks · 2017-09-14T11:56:09Z

Hello @massich! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on September 14, 2017 at 12:23 Hours UTC

jnothman · 2017-09-14T22:34:29Z

I've not understood your entire discussion in a hurry, but:

In kneighbors the data being operated on is rectangular. In radius_neighbors it is not.
If we keep things sparse in the radius_neighbors case when calculating the mode, it is operating on a CSR or CSC matrix where one dimension is 1.
If the data is binary and the matrix canonically has any zeros eliminated, then the mode can be calculated from shape and nnz alone.
If zeros are not eliminated, you need bincount to calculate the mode (although if it's binary, you can also calculate it from data.mean(), but that's tricky and unreadable).

I don't mind if we explicitly calculate the sparse mode, but as long as avg_neighbours << n_classes, it's not helping much.
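The two binary cases above can be sketched as follows. This is only an illustration: both function names are hypothetical, the data is assumed to contain only 0 and 1, and ties are resolved in favour of 0.

```python
import numpy as np
from scipy import sparse


def binary_mode_canonical(x):
    """Mode of a binary sparse matrix with no explicitly stored zeros."""
    n_total = x.shape[0] * x.shape[1]
    n_ones = x.nnz  # every stored entry is a 1 in the canonical case
    return int(n_ones > n_total - n_ones)


def binary_mode_with_explicit_zeros(x):
    """Mode of a binary sparse matrix that may store zeros explicitly."""
    n_total = x.shape[0] * x.shape[1]
    # Count stored 0s and 1s, then add the implicit (unstored) zeros.
    counts = np.bincount(x.data.astype(np.intp), minlength=2)
    counts[0] += n_total - x.nnz
    return int(np.argmax(counts))


# A 1 x 6 row storing [1, 0, 1, 0, 0] explicitly; the last column is implicit.
data = np.array([1, 0, 1, 0, 0])
cols = np.arange(5)
x = sparse.csr_matrix((data, (np.zeros(5, dtype=int), cols)), shape=(1, 6))

print(binary_mode_with_explicit_zeros(x))  # counts stored and implicit zeros

x_canonical = x.copy()
x_canonical.eliminate_zeros()
print(binary_mode_canonical(x_canonical))  # safe only after eliminate_zeros()
```

Applying the shape-and-nnz shortcut to the matrix with explicit zeros would count the stored zeros as ones, which is why the bincount path is needed there.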

jnothman · 2017-09-14T22:35:16Z

In practice, we have little-to-no testing of scikit-learn with sparse matrices where zeros haven't been eliminated. But I try to keep it in mind anyway...

massich force-pushed the 8057_glemaitre_feedback branch from c54274b to d467359 Compare September 14, 2017 11:56

massich mentioned this pull request Sep 14, 2017

[MRG+1] Added support for sparse multilabel y for Nearest neighbor classifiers scikit-learn/scikit-learn#9059

Open

wip

a03237d

massich force-pushed the 8057_glemaitre_feedback branch from d467359 to a03237d Compare September 14, 2017 12:23

massich force-pushed the 8057 branch 2 times, most recently from 2192eb7 to c4a47e3 Compare September 15, 2017 13:19

massich wants to merge 1 commit into 8057 from 8057_glemaitre_feedback