{"title": "Active Nearest-Neighbor Learning in Metric Spaces", "book": "Advances in Neural Information Processing Systems", "page_first": 856, "page_last": 864, "abstract": "We propose a pool-based non-parametric active learning algorithm for general metric spaces, called MArgin Regularized Metric Active Nearest Neighbor (MARMANN), which outputs a nearest-neighbor classifier. We give prediction error guarantees that depend on the noisy-margin properties of the input sample, and are competitive with those obtained by previously proposed passive learners. We prove that the label complexity of MARMANN is significantly lower than that of any passive learner with similar error guarantees. Our algorithm is based on a generalized sample compression scheme and a new label-efficient active model-selection procedure.", "full_text": "Active Nearest-Neighbor Learning in Metric Spaces\n\nAryeh Kontorovich\n\nDepartment of Computer Science\nBen-Gurion University of the Negev\n\nBeer Sheva 8499000, Israel\n\nSivan Sabato\n\nDepartment of Computer Science\nBen-Gurion University of the Negev\n\nBeer Sheva 8499000, Israel\n\nRuth Urner\n\nMax Planck Institute for Intelligent Systems\n\nDepartment for Empirical Inference\n\nT\u00fcbingen 72076, Germany\n\nAbstract\n\nWe propose a pool-based non-parametric active learning algorithm for gen-\neral metric spaces, called MArgin Regularized Metric Active Nearest Neighbor\n(MARMANN), which outputs a nearest-neighbor classi\ufb01er. We give prediction\nerror guarantees that depend on the noisy-margin properties of the input sample,\nand are competitive with those obtained by previously proposed passive learners.\nWe prove that the label complexity of MARMANN is signi\ufb01cantly lower than\nthat of any passive learner with similar error guarantees. 
Our algorithm is based\non a generalized sample compression scheme and a new label-ef\ufb01cient active\nmodel-selection procedure.\n\n1\n\nIntroduction\n\nIn this paper we propose a non-parametric pool-based active learning algorithm for general metric\nspaces, which outputs a nearest-neighbor classi\ufb01er. The algorithm is named MArgin Regularized\nMetric Active Nearest Neighbor (MARMANN). In pool-based active learning [McCallum and Nigam,\n1998] a collection of random examples is provided, and the algorithm can interactively query an\noracle to label some of the examples. The goal is good prediction accuracy, while keeping the label\ncomplexity (the number of queried labels) low. MARMANN receives a pool of unlabeled examples\nin a general metric space, and outputs a variant of the nearest-neighbor classi\ufb01er. The algorithm\nobtains a prediction error guarantee that depends on a noisy-margin property of the input sample, and\nhas a provably smaller label complexity than any passive learner with a similar guarantee.\nThe theory of active learning has received considerable attention in the past decade [e.g., Dasgupta,\n2004, Balcan et al., 2007, 2009, Hanneke, 2011, Hanneke and Yang, 2015]. Active learning has\nbeen mostly studied in a parametric setting (that is, learning with respect to a \ufb01xed hypothesis class\nwith a bounded capacity). Various strategies have been analyzed for parametric classi\ufb01cation [e.g.,\nDasgupta, 2004, Balcan et al., 2007, Gonen et al., 2013, Balcan et al., 2009, Hanneke, 2011, Awasthi\net al., 2013].An active model selection procedure has also been developed for the parametric setting\nBalcan et al. [2010]. However, the number of labels used there depends quadratically on the number\nof possible model classes, which is prohibitive in our non-parametric setting.\nThe potential bene\ufb01ts of active learning for non-parametric classi\ufb01cation in metric spaces are less well\nunderstood. 
The paradigm of cluster-based active learning [Dasgupta and Hsu, 2008] has been shown\nto provide label savings under some distributional clusterability assumptions [Urner et al., 2013,\nKpotufe et al., 2015]. Certain active learning methods for nearest neighbor classi\ufb01cation are known\nto be Bayes consistent [Dasgupta, 2012], and an active querying rule, based solely on information in\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n\fthe unlabeled data, has been shown to be bene\ufb01cial for nearest neighbors under covariate shift [Berlind\nand Urner, 2015]. Castro and Nowak [2007] analyze minimax rates for a class of distributions in\nEuclidean space, characterized by decision boundary regularity and noise conditions. However, no\nactive non-parametric strategy for general metric spaces, with label complexity guarantees for general\ndistributions, has been proposed so far. Here, we provide the \ufb01rst such algorithm and guarantees.\nThe passive nearest-neighbor classi\ufb01er is popular among theorists and practitioners alike [Fix and\nHodges, 1989, Cover and Hart, 1967, Stone, 1977, Kulkarni and Posner, 1995]. This paradigm is\napplicable in general metric spaces, and its simplicity is an attractive feature for both implementation\nand analysis. When appropriately regularized [e.g. Stone, 1977, Devroye and Gy\u00f6r\ufb01, 1985, von\nLuxburg and Bousquet, 2004, Gottlieb et al., 2010, Kontorovich and Weiss, 2015] this type of learner\ncan be made Bayes consistent. Another desirable property of nearest-neighbor-based methods is\ntheir ability to generalize at a rate that scales with the intrinsic data dimension, which can be much\nlower than that of the ambient space [Kpotufe, 2011, Gottlieb et al., 2014a, 2016a, Chaudhuri and\nDasgupta, 2014]. 
Furthermore, margin-based regularization makes nearest neighbors ideally suited\nfor sample compression, which yields a compact representation, faster classi\ufb01cation runtime, and\nimproved generalization performance [Gottlieb et al., 2014b, Kontorovich and Weiss, 2015]. The\nresulting error guarantees can be stated in terms of the sample\u2019s noisy-margin, which depends on the\ndistances between differently-labeled examples in the input sample.\nOur contribution. We propose MARMANN, a non-parametric pool-based active learning algorithm\nthat obtains an error guarantee competitive with that of a noisy-margin-based passive learner, but\ncan provably use signi\ufb01cantly fewer labels. This is the \ufb01rst non-parametric active learner for general\nmetric spaces that achieves prediction error that is competitive with passive learning for general\ndistributions, and provably improves label complexity.\nOur approach. Previous passive learning approaches to classi\ufb01cation using nearest-neighbor rules\nunder noisy-margin assumptions [Gottlieb et al., 2014b, 2016b] provide statistical guarantees using\nsample compression bounds [Graepel et al., 2005]. The \ufb01nite-sample guarantees depend on the\nnumber of noisy labels relative to an optimal margin scale. A central challenge in the active setting is\nperforming model selection (selecting the margin scale) with a low label complexity. A key insight\nthat we exploit in this work is that by designing a new labeling scheme for the compression set, we\ncan construct the compression set and estimate its error with label-ef\ufb01cient procedures. We obtain\nstatistical guarantees for this approach using a generalized sample compression analysis. We derive a\nlabel-ef\ufb01cient (as well as computationally ef\ufb01cient) active model-selection procedure. This procedure\n\ufb01nds a good scale by estimating the sample error for some scales, using a small number of active\nquerying rounds. 
Crucially, unlike cross-validation, our model-selection procedure does not require a\nnumber of labels that depends on the worst possible scale, nor does it test many scales. This allows\nour label complexity bounds to be low, and to depend only on the \ufb01nal scale selected by the algorithm.\nOur error guarantee is a constant factor over the error guarantee of the passive learner of Gottlieb\net al. [2016b]. An approach similar to Gottlieb et al. [2016b], proposed in Gottlieb et al. [2014a], has\nbeen shown to be Bayes consistent [Kontorovich and Weiss, 2015]. The Bayes-consistency of the\npassive version of our approach is the subject of ongoing work.\nPaper outline. We de\ufb01ne the setting and notations in Section 2. In Section 3 we provide our main\nresult, Theorem 3.2, giving error and label complexity guarantees for MARMANN. Section 4 shows\nhow to set the nearest neighbor rule for a given scale, and Section 5 describes the model selection\nprocedure. Some of the analysis is omitted due to lack of space. The full analysis is available at\nKontorovich et al. [2016].\n\n2 Setting and notations\nWe consider learning in a general metric space (X , \u03c1), where X is a set and \u03c1 is the metric on X .\nOur problem setting is that of classi\ufb01cation of the instance space X into some \ufb01nite label set Y.\nAssume that there is some distribution D over X \u00d7Y, and let S \u223c Dm be a labeled sample of size m,\nwhere m is an integer. Denote the sequence of unlabeled points in S by U(S). We sometimes treat S\nand U(S) as multisets, since the order is unimportant. The error of a classi\ufb01er h : X \u2192 Y on D is\ndenoted err(h,D) := P[h(X) (cid:54)= Y ], where (X, Y ) \u223c D. The empirical error on a labeled sample\nS instantiates to err(h, S) = 1|S|\nas input. An active learner receives the unlabeled part of the sample Uin := U(Sin) as input, and\n\n(cid:80) I[h(X) (cid:54)= Y ]. 
A passive learner receives the labeled sample Sin as input; an active learner receives only Uin, and is allowed to adaptively select examples from Uin and request their labels from Sin. When either learner terminates, it outputs a classifier ĥ : X → Y, with the goal of achieving a low err(ĥ, D). An additional goal of the active learner is to achieve a performance competitive with that of the passive learner, while querying considerably fewer labels.
The diameter of a set A ⊆ X is defined by diam(A) := sup_{a,a′ ∈ A} ρ(a, a′). Denote the index of the closest point in U to x ∈ X by κ(x, U) := argmin_{i : xi ∈ U} ρ(x, xi). We assume here and throughout this work that when there is more than one minimizer of ρ(x, xi), ties are broken arbitrarily (but in a consistent fashion). For a set Z ⊆ X, denote κ(Z, U) := {κ(z, U) | z ∈ Z}. Any labeled sample S = ((xi, yi))_{i ∈ [k]} naturally induces the nearest-neighbor classifier hnn_S : X → Y, via hnn_S(x) := y_{κ(x, U(S))}.
For x ∈ X and t > 0, denote by ball(x, t) the (closed) ball of radius t around x: ball(x, t) := {x′ ∈ X | ρ(x, x′) ≤ t}. The doubling dimension, the effective dimension of the metric space, which controls generalization and runtime performance of nearest neighbors [Kpotufe, 2011, Gottlieb et al., 2014a], is defined as follows. Let λ = λ(X) be the smallest number such that every ball in X can be covered by λ balls of half its radius, where all balls are centered at points of X. Formally, λ(X) := min{λ ∈ N : ∀x ∈ X, r > 0, ∃x1, . . . , xλ ∈ X such that ball(x, r) ⊆ ∪_{i=1}^{λ} ball(xi, r/2)}. Then the doubling dimension of X is defined by ddim(X) := log2 λ.
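The induced nearest-neighbor rule hnn_S and the consistent tie-breaking convention for κ can be sketched as follows. This is an illustrative implementation only, not the authors' code; one fixed convention for consistent tie-breaking, assumed here, is to prefer the smallest index among the minimizers.

```python
def kappa(x, U, rho):
    """kappa(x, U): index of the nearest point to x in the sequence U under
    metric rho; exact ties are broken consistently toward the smallest index."""
    return min(range(len(U)), key=lambda i: (rho(x, U[i]), i))

def nn_classifier(S, rho):
    """S is a labeled sample [(x_i, y_i)]; returns h^nn_S : X -> Y, the
    classifier that copies the label of the nearest sample point."""
    U = [x for x, _ in S]
    labels = [y for _, y in S]
    return lambda x: labels[kappa(x, U, rho)]
```

With ρ the absolute difference on the real line, `nn_classifier([(0, "a"), (10, "b")], rho)` labels every point by its closer anchor, and the exact midpoint 5 goes to the first anchor by the tie-breaking convention.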
In line with modern literature, we work in the low-dimension, big-sample regime, where the doubling dimension is assumed to be constant, and hence sample complexity and algorithmic runtime may depend on it exponentially. This exponential dependence is unavoidable, even under margin assumptions, as previous analysis [Kpotufe, 2011, Gottlieb et al., 2014a] indicates.
A set A ⊆ X is t-separated if inf_{a,a′ ∈ A : a ≠ a′} ρ(a, a′) ≥ t. For A ⊆ B ⊆ X, the set A is a t-net of B if A is t-separated and B ⊆ ∪_{a ∈ A} ball(a, t). Constructing a minimum-size t-net for a general set B is NP-hard [Gottlieb and Krauthgamer, 2010]; however, efficient procedures exist for constructing some t-net [Krauthgamer and Lee, 2004, Gottlieb et al., 2014b]. The size of any t-net is at most 2^{ddim(B)} times the smallest possible size (see Kontorovich et al. [2016]). In addition, the size of any t-net is at most ⌈diam(B)/t⌉^{ddim(X)+1} [Krauthgamer and Lee, 2004]. Throughout the paper, we fix a deterministic procedure for constructing a t-net, and denote its output for a multiset U ⊆ X by Net(U, t). Let Par(U, t) be the partition of X into regions induced by Net(U, t); that is, for Net(U, t) = {x1, . . . , xN}, define Par(U, t) := {P1, . . . , PN}, where Pi = {x ∈ X | κ(x, Net(U, t)) = i}. For t > 0, denote N(t) := |Net(Uin, t)|. For a labeled multiset S ⊆ X × Y and y ∈ Y, denote Sy := {x | (x, y) ∈ S}; in particular, U(S) = ∪_{y ∈ Y} Sy.

3 Main results

Non-parametric binary classification admits performance guarantees that scale with the sample's noisy-margin [von Luxburg and Bousquet, 2004, Gottlieb et al., 2010, 2016b].
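The t-net machinery of Section 2 admits a very simple, if naive, construction: a single greedy pass over the points yields a t-separated subset whose t-balls cover the input. The sketch below is illustrative only (the efficient constructions cited in Section 2 are more involved and exploit the doubling structure); `partition` realizes the induced partition Par(U, t) by assigning each point to its nearest net point, breaking ties toward the smallest index.

```python
def greedy_t_net(U, t, rho):
    """Return a t-net of U: keep a point only if it is t-far from every
    previously kept point. The result is t-separated by construction, and
    every skipped point lies within distance < t of some kept point."""
    net = []
    for x in U:
        if all(rho(x, c) >= t for c in net):
            net.append(x)
    return net

def partition(U, net, rho):
    """Par(U, t): group the points of U by their nearest net point
    (ties broken consistently toward the smallest net index)."""
    regions = [[] for _ in net]
    for x in U:
        i = min(range(len(net)), key=lambda j: (rho(x, net[j]), j))
        regions[i].append(x)
    return regions
```

On the real line with t = 1, the pass over [0, 0.5, 1, 2.5, 3] keeps {0, 1, 2.5}: a 1-separated set whose unit balls cover all five points.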
We say that a labeled\nmultiset S is (\u03bd, t)-separated, for \u03bd \u2208 [0, 1] and t > 0 (representing a margin t with noise \u03bd), if one\ncan remove a \u03bd-fraction of the points in S, and in the resulting multiset, points with different labels\nare at least t-far from each other. Formally, S is (\u03bd, t)-separated if there exists a subsample \u02dcS \u2286 S\nsuch that |S \\ \u02dcS| \u2264 \u03bd|S| and \u2200y1 (cid:54)= y2 \u2208 Y, a \u2208 \u02dcSy1 , b \u2208 \u02dcSy2 , we have \u03c1(a, b) \u2265 t. For a given\nlabeled sample S, denote by \u03bd(t) the smallest value \u03bd such that S is (\u03bd, t)-separated. Gottlieb et al.\n[2016b] propose a passive learner with the following guarantees as a function of the separation of S.\nSetting \u03b1 := m/(m \u2212 N ), de\ufb01ne the following form of a generalization bound:\n\n(cid:115)\n\n+\n\n3\u221a\n2\n\n2\n3\n\n(N + 1) log(mk) + log( 1\n\u03b4 )\n\nGB(\u0001, N, \u03b4, m, k) := \u03b1\u0001 +\nTheorem 3.1 (Gottlieb et al. [2016b]). Let m be an integer, Y = {0, 1}, \u03b4 \u2208 (0, 1). There exists a\n, where Spas \u2286 Sin, such\npassive learning algorithm that returns a nearest-neighbor classi\ufb01er hnn\nSpas\nthat, with probability 1 \u2212 \u03b4,\n\nm \u2212 N\n\nm \u2212 N\n\n\u03b1\u0001((N + 1) log(mk) + log( 1\n\n\u03b4 ))\n\n.\n\nerr(hnn\n\nSpas\n\n,D) \u2264\n\nmin\n\nt>0:N (t) 0 (found by searching over all scales), removing the |Sin|\u03bd(t) points that\n\n3\n\n\fobstruct the t-separation between different labels in Sin, and then selecting a subset of the remaining\nlabeled examples to form Spas, so that the examples are a t-net for Sin. We propose a different\napproach for generating a compression set for a nearest-neighbor rule. This approach, detailed in the\nfollowing sections, does not require \ufb01nding and removing all the obstructing points in Sin, and can\nbe implemented in an active setting using a small number of labels. 
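For reference, the pieces of the displayed bound GB can be assembled into a single formula (with α := m/(m − N)). This reconstruction is consistent with the identity used later in the paper, GB(ε, N(t), δ, m, 1) = (m/(m − N(t))) · G(ε, t) for G(ε, t) = ε + (2/3)φ(t) + (3/√2)√(ε φ(t)):

```latex
% Reconstruction of GB from the surrounding text, with \alpha := m/(m-N).
\[
\mathrm{GB}(\epsilon, N, \delta, m, k)
  := \alpha\epsilon
   + \frac{2}{3}\cdot\frac{(N+1)\log(mk) + \log(1/\delta)}{m-N}
   + \frac{3}{\sqrt{2}}\,
     \sqrt{\frac{\alpha\epsilon\bigl((N+1)\log(mk) + \log(1/\delta)\bigr)}{m-N}}
\]
```

Each term matches the later definition of G term by term: α·φ(t) = ((N+1)log(mk) + log(1/δ))/(m − N), and α·√(εφ(t)) = √(αε · αφ(t)).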
The resulting active learning\nalgorithm, MARMANN, has an error guarantee competitive with that of the passive learner and a\nlabel complexity that can be signi\ufb01cantly lower. Our main result is the following guarantee for\nMARMANN.\nTheorem 3.2. Let Sin \u223c Dm, where m \u2265 max(6,|Y|), \u03b4 \u2208 (0, 1\n4 ). Let \u02c6S be the output of\nMARMANN(Uin, \u03b4), where \u02c6S \u2286 X \u00d7 Y, and let \u02c6N := | \u02c6S|. Let \u02c6h := hnn\nand \u02c6\u0001 := err(\u02c6h, Sin), and\n\u02c6S\ndenote \u02c6G := GB(\u02c6\u0001, \u02c6N , \u03b4, m, 1). With a probability of 1\u2212 \u03b4 over Sin and randomness of MARMANN,\n\n(cid:18)\n\n(cid:19)\n\nerr(\u02c6h,D) \u2264 2 \u02c6G \u2264 O\n\nGB(\u03bd(t),N (t), \u03b4, m, 1)\n\n,\n\n(cid:18)\n\nmin\n\nt>0:N (t) 0 is selected, by calling \u02c6t \u2190 SelectScale(\u03b4),\nwhere SelectScale is our model selection procedure. SelectScale has access to Uin, and queries labels\nfrom Sin as necessary. It estimates the generalization error bound GB for several different scales,\nand executes a procedure similar to binary search to identify a good scale. The binary search keeps\nthe number of estimations (and thus requested labels) small. Crucially, our estimation procedure\nis designed to prevent the search from spending a number of labels that depends on the net size\nof the smallest possible scale t, so that the total label complexity of MARMANN depends only\non error of the selected \u02c6t. Second, the selected scale \u02c6t is used to generate the compression set by\ncalling \u02c6S \u2190 GenerateNNSet(\u02c6t, [N (\u02c6t)], \u03b4), where GenerateNNSet is our compression set generation\nprocedure. 
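The two-stage structure of MARMANN just described can be summarized schematically. The sketch below is only an outline of the control flow, with the procedures supplied as callables; SelectScale and GenerateNNSet themselves are specified in Sections 4 and 5, so this is a stand-in, not the authors' implementation.

```python
def marmann(delta, select_scale, net_size, generate_nn_set, nn_rule):
    """Schematic control flow of MARMANN (illustrative stand-in).
    The callables play the roles of:
      select_scale(delta)          -> chosen scale t_hat (Section 5)
      net_size(t)                  -> N(t), the size of Net(Uin, t)
      generate_nn_set(t, I, delta) -> labeled compression set for indices I (Section 4)
      nn_rule(S)                   -> the nearest-neighbor classifier h^nn_S
    """
    t_hat = select_scale(delta)                # stage 1: active model selection
    I = range(net_size(t_hat))                 # all net indices [N(t_hat)]
    S_hat = generate_nn_set(t_hat, I, delta)   # stage 2: build compression set
    return nn_rule(S_hat)                      # output classifier h^nn_{S_hat}
```

The point of the split is that stage 1 may call stage 2 on small index sets I to estimate errors, while the final call uses the full index set [N(t̂)].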
For clarity of presentation, we \ufb01rst introduce in Section 4 the procedure GenerateNNSet,\nwhich determines the compression set for a given scale, and then in Section 5, we describe how\nSelectScale chooses the appropriate scale.\n\n4 Active nearest-neighbor at a given scale\n\nThe passive learner of Gottlieb et al. [2014a, 2016b] generates a compression set by \ufb01rst \ufb01nding\nand removing from Sin all points that obstruct (\u03bd, t)-separation at a given scale t > 0. We propose\nbelow a different approach for generating a compression set, which seems more conducive to active\nlearning: as we show below, it also generates a low-error nearest neighbor rule, just like the passive\napproach. At the same time, it allows us to estimate the error on many different scales using few\nlabel queries. A small technical difference, which will be evident below, is that in this new approach,\nexamples in the compression set might have a different label than their original label in Sin. Standard\nsample compression analysis [e.g. Graepel et al., 2005] assumes that the classi\ufb01er is determined by a\nsmall number of labeled examples from Sin. This does not allow the examples in the compression set\nto have a different label than their original label in Sin. Therefore, we require a slight generalization\nof previous compression analysis, which allows setting arbitrary labels for examples that are assigned\nto the compression set. The following theorem quanti\ufb01es the effect of this change on generalization.\n\n4\n\n\fTheorem 4.1. Let m \u2265 |Y| be an integer, \u03b4 \u2208 (0, 1\nif there exist N < m and S \u2286 (X \u00d7 Y)N such that U(S) \u2286 Uin and \u0001 := err(hnn\nerr(hnn\n\n4 ). Let Sin \u223c Dm. 
With probability at least 1 \u2212 \u03b4,\n2 , then\n\nS ,D) \u2264 GB(\u0001, N, \u03b4, m,|Y|) \u2264 2GB(\u0001, N, 2\u03b4, m, 1).\n\nS , Sin) \u2264 1\n\nIf the compression set\nThe proof is similar to that of standard sample compression schemes.\nincludes only the original labels, the compression analysis of Gottlieb et al. [2016b] gives the bound\nGB(\u0001, N, \u03b4, m, 1). Thus the effect of allowing the labels to change is only logarithmic in |Y|, and\ndoes not appreciably degrade the prediction error.\nWe now describe the generation of the compression set for a given scale t > 0. Recall that \u03bd(t) is\nthe smallest value for which Sin is (\u03bd, t)-separated. We de\ufb01ne two compression sets. The \ufb01rst one,\ndenoted Sa(t), represents an ideal compression set, which induces an empirical error of at most \u03bd(t),\nbut calculating it might require many labels. The second compression set, denoted \u02c6Sa(t), represents\nan approximation to Sa(t), which can be constructed using a small number of labels, and induces a\nsample error of at most 4\u03bd(t) with high probability. MARMANN constructs only \u02c6Sa(t), while Sa(t)\nis de\ufb01ned for the sake of analysis only.\nWe \ufb01rst de\ufb01ne the ideal set Sa(t) := {(x1, y1), . . . , (xN , yN )}. The examples in Sa(t) are the\npoints in Net(Uin, t/2), and the label of each example is the majority label out of the examples\nin Sin to which xi is closest. Formally, {x1, . . . , xN} := Net(Uin, t/2), and for i \u2208 [N ], yi :=\nargmaxy\u2208Y |Sy \u2229 Pi|, where Pi = {x \u2208 X | \u03ba(x, Net(U, t/2)) = i} \u2208 Par(Uin, t/2). For i \u2208 [N ],\nlet \u039bi := Syi \u2229 Pi. The following lemma bounds the empirical error of hnn\nLemma 4.2. For every t > 0, err(hnn\n\nSa(t).\n\nSa(t), Sin) \u2264 \u03bd(t).\n\nProof. Since Net(Uin, t/2) is a t/2-net, diam(P ) \u2264 t for any P \u2208 Par(Uin, t/2). 
Let \u02dcS \u2286 S\nbe a subsample that witnesses the (\u03bd(t), t)-separation of S, so that | \u02dcS| \u2265 m(1 \u2212 \u03bd(t)), and for\nany two points (x, y), (x(cid:48), y(cid:48)) \u2208 \u02dcS, if \u03c1(x, x(cid:48)) \u2264 t then y = y(cid:48). Denote \u02dcU := U( \u02dcS). Since\nmaxP\u2208Par(Uin,t/2) diam(P ) \u2264 t, for any i \u2208 [N ] all the points in \u02dcU \u2229 Pi must have the same label in\n\u02dcS. Therefore, \u2203y \u2208 Y such that \u02dcU \u2229 Pi \u2286 \u02dcSy \u2229 Pi. Hence | \u02dcU \u2229 Pi| \u2264 |\u039bi|. It follows\n\nSa(t), Sin) \u2264 |S| \u2212 (cid:88)\n\n|\u039bi| \u2264 |S| \u2212 (cid:88)\n\nm \u00b7 err(hnn\n\ni\u2208[N ]\n\ni\u2208[N ]\n\n| \u02dcU \u2229 Pi| = |S| \u2212 | \u02dcS| = m \u00b7 \u03bd(t).\n\nDividing by m we get the statement of the theorem.\n\nNow, calculating Sa(t) requires knowing most of the labels in Sin. MARMANN constructs instead\nan approximation \u02c6Sa(t), in which the examples are the points in Net(Uin, t/2) (so that U( \u02c6Sa(t)) =\nU(Sa(t)) ), but the labels are determined using a bounded number of labels requested from Sin. The\nlabels in \u02c6Sa(t) are calculated by the simple procedure GenerateNNSet given in Alg. 1. The empirical\nerror of the output of GenerateNNSet is bounded in Theorem 4.3 below.1\nA technicality in Alg. 1 requires explanation: In MARMANN, the generation of \u02c6Sa(t) will be split\ninto several calls to GenerateNNSet, so that different calls determine the labels of different points in\n\u02c6Sa(t). Therefore GenerateNNSet has an additional argument I, which speci\ufb01es the indices of the\npoints in Net(Uin, t/2) for which the labels should be returned this time. Crucially, if during the run of\nMARMANN, GenerateNNSet is called again for the same scale t and the same point in Net(Uin, t/2),\nthen GenerateNNSet returns the same label that it returned before, rather than recalculating it using\nfresh labels from Sin. 
This guarantees that despite the randomness in GenerateNNSet, the full\n\u02c6Sa(t) is well-de\ufb01ned within any single run of MARMANN, and is distributed like the output of\nGenerateNNSet(t, [N (t/2)], \u03b4), which is convenient for the analysis.\nTheorem 4.3. Let \u02c6Sa(t) be the output of GenerateNNSet(t, [N (t/2)], \u03b4). With a probability at least\n1 \u2212 \u03b4\n\n2m2 , we have err(hnn\n1In the case of binary labels (|Y| = 2), the problem of estimating Sa(t) can be formulated as a special case\nof the benign noise setting for parametric active learning, for which tight lower and upper bounds are provided\nin Hanneke and Yang [2015]. However, our case is both more general (as we allow multiclass labels) and more\nspeci\ufb01c (as we are dealing with a speci\ufb01c hypothesis class). Thus we provide our own procedure and analysis.\n\nS , Sin) \u2264 4\u03bd(t). Denote this event by E(t).\n\n5\n\n\fAlgorithm 1 GenerateNNSet(t, I, \u03b4)\ninput Scale t > 0, a target set I \u2286 [N (t/2)], con\ufb01dence \u03b4 \u2208 (0, 1).\noutput A labeled set S \u2286 X \u00d7 Y of size |I|\n\n{x1, . . . , xN} \u2190 Net(Uin, t/2), {P1, . . . , PN} \u2190 Par(Uin, t/2), S \u2190 ()\nfor i \u2208 I do\nif \u02c6yi has not already been calculated for Uin with this values of t then\n\nDraw Q :=(cid:6)18 log(2m3/\u03b4)(cid:7) points uniformly at random from Pi and query their labels.\n\nLet \u02c6yi be the majority label observed in these Q queries.\n\nend if\nS \u2190 S \u222a {(xi, \u02c6yi)}.\n\nend for\nOutput S\n\nSa(t), Sin) \u2264 \u03bd(t).\n\nProof. By Lemma 4.2, err(hnn\nIn Sa(t), the labels assigned to each point in\nNet(Uin, t/2) are the majority labels (based on Sin) of the points in the regions in Par(Uin, t/2).\nDenote the majority label for region Pi by yi := argmaxy\u2208Y |Sy \u2229 Pi|. We now compare these\nlabels to the labels \u02c6yi assigned by Alg. 1. 
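Alg. 1 can be sketched in a few lines. The version below is illustrative: the `query_label` oracle and the `cache` dictionary are stand-ins for requesting labels from Sin and for the requirement that repeated calls for the same scale and net point return the same label; the logarithm in Q is taken to be the natural logarithm, and the cache is assumed to be kept per scale t.

```python
import math
import random
from collections import Counter

def generate_nn_set(net, regions, I, m, delta, query_label, cache, rng):
    """Sketch of Alg. 1 (GenerateNNSet) at one scale: for each requested net
    index i, estimate the majority label of region P_i by querying the labels
    of Q points drawn uniformly at random (with replacement) from P_i.
    `cache` persists labels across calls, so a region is never re-estimated."""
    Q = math.ceil(18 * math.log(2 * m**3 / delta))
    S = []
    for i in I:
        if i not in cache:
            sample = [rng.choice(regions[i]) for _ in range(Q)]
            votes = Counter(query_label(x) for x in sample)
            cache[i] = votes.most_common(1)[0][0]   # majority observed label
        S.append((net[i], cache[i]))
    return S
```

A second call with the same cache spends no further label queries, matching the "return the same label that it returned before" behavior described above.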
Let p(i) = |\u039bi|/|Pi| be the fraction of points in Pi which\nare labeled by the majority label yi. Let \u02c6p(i) be the fraction of labels equal to yi out of those queried\nby Alg. 1 in round i. Let \u03b2 := 1/6. By Hoeffding\u2019s inequality and union bounds, we have that with a\nprobability of at least 1 \u2212 N (t/2) exp(\u2212 Q\n2m2 , we have maxi\u2208[N (t/2)] |\u02c6p(i) \u2212 p(i)| \u2264 \u03b2.\nDenote this \u201cgood\u201d event by E(cid:48). We now prove that E(cid:48) \u21d2 E(t). Let J \u2286 [N (t/2)] = {i | \u02c6p(i) > 1\n2}.\nIt can be easily seen that \u02c6yi = yi for all i \u2208 J. Therefore, for all x such that \u03ba(x, U(Sa(t))) \u2208 J,\nSa(t)(x), and hence err(hnn\nS (x) = hnn\nhnn\nSa(t), Uin).\nThe second term is at most \u03bd(t), and it remains to bound the \ufb01rst term, on the condition that E(cid:48) holds.\nWe have PX\u223cU [\u03ba(X, U(Sa(t))) /\u2208 J] = 1\n2 + \u03b2,\ntherefore |Pi| \u2212 |\u039bi| = (1 \u2212 p(i))|Pi| \u2265 ( 1\n|Pi|( 1\n\n18 ) \u2265 1 \u2212 \u03b4\n(cid:80)\ni /\u2208J |Pi|. If E(cid:48) holds, then for any i /\u2208 J, p(i) \u2264 1\n2 \u2212 \u03b2)|Pi|. Therefore\n2 \u2212 \u03b2) = PX\u223cU [\u03ba(X, U(Sa(t))) /\u2208 J]( 1\n(cid:80)\ni\u2208[N (t/2)] |\u039bi| \u2264 \u03bd(t). Thus, under E(cid:48),\n\nS , Uin) \u2264 PX\u223cUin [\u03ba(X, U(Sa(t))) /\u2208 J] + err(hnn\n\n|\u039bi| \u2265 1\nm\n\nOn the other hand, as in the proof of Lemma 4.2, 1 \u2212 1\nPX\u223cU [\u03ba(X, S) /\u2208 J] \u2264 \u03bd(t)\n2\u2212\u03b2\n\n= 3\u03bd(t). 
It follows that under E(cid:48), err(hnn\n\nS , Uin) \u2264 4\u03bd(t).\n\n(cid:88)\n\ni /\u2208J\n\n(cid:88)\n\ni /\u2208J\n\n1 \u2212 1\nm\n\n2 \u2212 \u03b2).\n\nm\n\n1\n\nm\n\n5 Model Selection\n\nWe now show how to select the scale \u02c6t that will be used to generate the output nearest-neighbor rule.\nThe main challenge is to do this with a low label complexity: Generating the full classi\ufb01cation rule\nfor scale t requires a number of labels that depends on N (t), which might be very large. We would\nlike the label complexity of MARMANN to depend only on N (\u02c6t) (where \u02c6t is the selected scale),\nwhich is of the order m \u02c6G. Therefore, during model selection we can only invest a bounded number\nof labels in each tested scale. In addition, to keep the label complexity low, we cannot test all scales.\nFor t > 0, let \u02c6Sa(t) be the model that MARMANN would generate if the selected scale were set to t.\nOur model selection procedure performs a search, similar to binary search, over the possible scales.\nFor each tested scale t, the procedure estimates \u0001(t) := err(hnn\n, S) within a certain accuracy, using\nan estimation procedure we call EstimateErr. EstimateErr outputs an estimate \u02c6\u0001(t) of \u0001(t), up to a\ngiven accuracy \u03b8 > 0, using labels requested from Sin. It draws random examples from Sin, asks for\ntheir label, and calls GenerateNNSet (which also might request labels) to \ufb01nd the prediction error\nof hnn\non these random examples. The estimate \u02c6\u0001(t) is set to this prediction error. The number\nof random examples drawn by EstimateErr is determined based on the accuracy \u03b8, using empirical\nBernstein bounds [Maurer and Pontil, 2009]. Theorem 5.1 gives a guarantee for the accuracy and\nlabel complexity of EstimateErr. The full implementation of EstimateErr and the proof of Theorem\n5.1 can be found in the long version of this paper Kontorovich et al. 
[2016].\n\n\u02c6Sa(t)\n\n\u02c6Sa(t)\n\n6\n\n\fTheorem 5.1. Let t, \u03b8 > 0 and \u03b4 \u2208 (0, 1), and let \u02c6\u0001(t) \u2190 EstimateErr(t, \u03b8, \u03b4). Let Q be as de\ufb01ned\nin Alg. 1. The following properties (which we denote below by V (t)) hold with a probability of\n1 \u2212 \u03b4\n\n2m2 over the randomness of EstimateErr (and conditioned on \u02c6Sa(t)).\n1. If \u02c6\u0001(t) \u2264 \u03b8, then \u0001(t) \u2264 5\u03b8/4. Otherwise, 4\u0001(t)\n\n5 \u2264 \u02c6\u0001(t) \u2264 4\u0001(t)\n3 .\n2. EstimateErr requests at most 520(Q+1) log( 1040m2\n\u03b4\u03c8(cid:48)\n\n)\n\n\u03c8(cid:48)\n\nlabels, where \u03c8(cid:48) := max(\u03b8, \u0001(t)).\n\nThe model selection procedure SelectScale, given in Alg. 2, implements its search based on the guar-\nantees in Theorem 5.1. First, we introduce some notation. Let G\u2217 = mint GB(\u03bd(t),N (t), \u03b4, m, 1).\nWe would like MARMANN to obtain a generalization guarantee that is competitive with G\u2217. Denote\n\u03c6(t) := ((N (t) + 1) log(m) + log( 1\nall \u0001, t,\n\n(cid:112)\u0001\u03c6(t). Note that for\n\n\u03b4 ))/m, and let G(\u0001, t) := \u0001 + 2\n\n3 \u03c6(t) + 3\u221a\n\n2\n\nGB(\u0001,N (t), \u03b4, m, 1) =\n\nm\n\nm \u2212 N (t)\n\nG(\u0001, t).\n\nWhen referring to G(\u03bd(t), t), G(\u0001(t), t), or G(\u02c6\u0001(t), t) we omit the second t for brevity. Instead of\ndirectly optimizing GB, we will select a scale based on our estimate G(\u02c6\u0001(t)) of G(\u0001(t)).\n\nLet Dist denote the set of pairwise distances in the unlabeled dataset Uin (note that |Dist| <(cid:0)m\n\nWe remove from Dist some distances, so that the remaining distances have a net size N (t) that is\nmonotone non-increasing in t. We also remove values with a very large net size. 
Concretely, de\ufb01ne\n\n(cid:1)).\n\n2\n\nDistmon := Dist \\ {t | N (t) + 1 > m/2} \\ {t | \u2203t(cid:48) \u2208 Dist, t(cid:48) < t and N (t(cid:48)) < N (t)}.\n\nThen for all t, t(cid:48) \u2208 Distmon such that t(cid:48) < t, we have N (t(cid:48)) \u2265 N (t). The output of SelectScale is\nalways a value in Distmon. The following lemma shows that it suf\ufb01ces to consider these scales.\nm \u2208 Distmon.\nLemma 5.2. Assume m \u2265 6 and let t\u2217\nm)) \u2264 G\u2217 \u2264\nProof. Assume by way of contradiction that t\u2217\n1/3 we have N (t\u2217\nm) + 1 \u2264 m/2.\nTherefore, by de\ufb01nition of Distmon there exists a t \u2264 t\u2217\nm). Since \u03bd(t) is monotone\nover all of t \u2208 Dist, we also have \u03bd(t) \u2264 \u03bd(t\u2217\nm) together\nimply that G(\u03bd(t)) < G(\u03bd(t\u2217\n\nm \u2208 argmint\u2208Dist G(\u03bd(t)). If G\u2217 \u2264 1/3 then t\u2217\nm \u2208 Dist \\ Distmon. First, since G(\u03bd(t\u2217\n2. Therefore, since m \u2265 6, it is easy to verify N (t\u2217\n\nm with \u03c6(t) < \u03c6(t\u2217\nm \u2208 Distmon.\n\nm)), a contradiction. Hence, t\u2217\n\nm). Now, \u03c6(t) < \u03c6(t\u2217\n\nm) and \u03bd(t) \u2264 \u03bd(t\u2217\n\nm) log(m) \u2264 1\n\nm)+1\nm\u2212N (t\u2217\n\nSelectScale follows a search similar to binary search, however the conditions for going right and for\ngoing left are not complementary. The search ends when either none of these two conditions hold, or\nwhen there is nothing left to try. The \ufb01nal output of the algorithm is based on minimizing G(\u02c6\u0001(t))\nover some of the values tested during search.\nFor c > 0, de\ufb01ne \u03b3(c) := 1 + 2\nimplications\n\n. For all t, \u0001 > 0 we have the\n\nand \u02dc\u03b3(c) := 1\n\n3c + 3\u221a\n\n3 + 3\u221a\n\nc + 2\n\n2c\n\n2c\n\n\u0001 \u2265 c\u03c6(t) \u21d2 \u03b3(c)\u0001 \u2265 G(\u0001, t)\n\nand\n\n\u03c6(t) \u2265 c\u0001 \u21d2 \u02dc\u03b3(c)\u03c6(t) \u2265 G(\u0001, t).\n\n(1)\n\nThe following lemma uses Eq. 
(1) to show that the estimate G(ε̂(t)) is close to the true G(ε(t)).

Lemma 5.3. Let t > 0, δ ∈ (0, 1), and suppose that SelectScale calls ε̂(t) ← EstimateErr(t, φ(t), δ). Suppose that V(t) as defined in Theorem 5.1 holds. Then (1/6)G(ε̂(t)) ≤ G(ε(t)) ≤ 6.5G(ε̂(t)).

Proof. Under V(t), we have that if ε̂(t) < φ(t) then ε(t) ≤ (5/4)φ(t). In this case, G(ε(t)) ≤ γ̃(4/5)φ(t) ≤ 4.3φ(t), by Eq. (1). Therefore G(ε(t)) ≤ ((3 · 4.3)/2)G(ε̂(t)) ≤ 6.5G(ε̂(t)) (from the definition of G), and by Eq. (1) and γ̃(1) ≤ 4, φ(t) ≥ (1/4)G(ε̂(t)). In addition, G(ε(t)) ≥ (2/3)φ(t) ≥ (1/6)G(ε̂(t)). On the other hand, if ε̂(t) ≥ φ(t), then by Theorem 5.1, (4/5)ε(t) ≤ ε̂(t) ≤ (4/3)ε(t). Therefore G(ε̂(t)) ≤ (4/3)G(ε(t)), that is, G(ε(t)) ≥ (3/4)G(ε̂(t)), and G(ε(t)) ≤ (5/4)G(ε̂(t)). Taking the worst case of both possibilities, we get the bounds in the lemma.

The next theorem bounds the label complexity of SelectScale. Let Ttest ⊆ Distmon be the set of scales that are tested during SelectScale (that is, their ε̂(t) was estimated).

Algorithm 2 SelectScale(δ)
input δ ∈ (0, 1)
output Scale t̂
T ← Distmon   # T maintains the current set of possible scales
while T ≠ ∅ do
    t ← the median value in T   # break ties arbitrarily
    ε̂(t) ← EstimateErr(t, φ(t), δ)
    if ε̂(t) < φ(t) then
        T ← T \ [0, t]   # go right in the binary search
    else if ε̂(t) > (11/10)φ(t) then
        T ← T \ [t, ∞)   # go left in the binary search
    else
        t0 ← t, T0 ← {t0}
        break from loop
    end if
end while
if T0 was not set yet then
    If the algorithm ever went to the right, let t0 be the last value for which this happened, and let T0 := {t0}. Otherwise, T0 := ∅.
end if
Let TL be the set of all t that were tested and made the search go left
Output t̂ := argmin_{t∈TL∪T0} G(ε̂(t))

Theorem 5.4. Suppose that the event V(t) defined in Theorem 5.1 holds for all t ∈ Ttest for the calls ε̂(t) ← EstimateErr(t, φ(t), δ). If the output of SelectScale is t̂, then the number of labels requested by SelectScale is at most

(19240|Ttest|(Q + 1)/G(ε(t̂))) log(38480m²/(δG(ε(t̂)))),

where Q is as defined in Alg. 1.

The following theorem provides a competitive error guarantee for the selected scale t̂.
Theorem 5.5. Suppose that V(t) and E(t), defined in Theorem 5.1 and Theorem 4.3, hold for all values t ∈ Ttest, and that G∗ ≤ 1/3.
Then SelectScale outputs t̂ ∈ Distmon such that

GB(ε(t̂), N(t̂), δ, m, 1) ≤ O(G∗),

where O(·) hides numerical constants only.
The idea of the proof is as follows: First, we show (using Lemma 5.3) that it suffices to prove that G(ν(t∗_m)) ≥ O(G(ε̂(t̂))) to derive the bound in the theorem. Now, SelectScale ends in one of two cases: either T0 is set within the loop, or T = ∅ and T0 is set outside the loop. In the first case, neither of the conditions for turning left and turning right holds for t0, so we have ε̂(t0) = Θ(φ(t0)) (where Θ hides numerical constants). We show that in this case, whether t∗_m ≥ t0 or t∗_m ≤ t0, G(ν(t∗_m)) ≥ O(G(ε̂(t0))). In the second case, there exist (except for edge cases, which are also handled) two values t0 ∈ T0 and t1 ∈ TL such that t0 caused the binary search to go right, and t1 caused it to go left, and also t0 ≤ t1, and (t0, t1) ∩ Distmon = ∅. We use these facts to show that for t∗_m ≥ t1, G(ν(t∗_m)) ≥ O(G(ε̂(t1))), and for t∗_m ≤ t0, G(ν(t∗_m)) ≥ O(G(ε̂(t0))). Since t̂ minimizes G(ε̂(t)) over a set that includes t0 and t1, this gives G(ν(t∗_m)) ≥ O(G(ε̂(t̂))) in all cases.
The proof of the main theorem, Theorem 3.2, which gives the guarantee for MARMANN, is almost immediate from Theorem 4.1, Theorem 4.3, Theorem 5.5 and Theorem 5.4.

Acknowledgements

Sivan Sabato was partially supported by the Israel Science Foundation (grant No. 555/15). Aryeh Kontorovich was partially supported by the Israel Science Foundation (grants No. 1141/12 and 755/15) and a Yahoo Faculty award. We thank Lee-Ad Gottlieb and Dana Ron for helpful discussions.

References
P. Awasthi, M.-F. Balcan, and P. M. Long.
The power of localization for efficiently learning linear separators with malicious noise. CoRR, abs/1307.8371, 2013.
M. Balcan, S. Hanneke, and J. W. Vaughan. The true sample complexity of active learning. Machine Learning, 80(2-3):111–139, 2010.
M.-F. Balcan, A. Broder, and T. Zhang. Margin-based active learning. In COLT, 2007.
M.-F. Balcan, A. Beygelzimer, and J. Langford. Agnostic active learning. J. Comput. Syst. Sci., 75(1), 2009.
C. Berlind and R. Urner. Active nearest neighbors in changing environments. In ICML, pages 1870–1879, 2015.
R. M. Castro and R. D. Nowak. Minimax bounds for active learning. In COLT, pages 5–19, 2007.
K. Chaudhuri and S. Dasgupta. Rates of convergence for nearest neighbor classification. In NIPS, 2014.
T. M. Cover and P. E. Hart. Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13:21–27, 1967.
S. Dasgupta. Analysis of a greedy active learning strategy. In NIPS, pages 337–344, 2004.
S. Dasgupta. Consistency of nearest neighbor classification under selective sampling. In COLT, 2012.
S. Dasgupta and D. Hsu. Hierarchical sampling for active learning. In ICML, pages 208–215, 2008.
L. Devroye and L. Györfi. Nonparametric Density Estimation: The L1 View. Wiley Series in Probability and Mathematical Statistics. John Wiley & Sons, New York, 1985.
L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition, volume 31 of Applications of Mathematics. Springer-Verlag, New York, 1996. ISBN 0-387-94618-7.
E. Fix and J. L. Hodges, Jr. Discriminatory analysis.
Nonparametric discrimination: consistency properties. International Statistical Review / Revue Internationale de Statistique, 57(3):238–247, 1989.
A. Gonen, S. Sabato, and S. Shalev-Shwartz. Efficient active learning of halfspaces: an aggressive approach. Journal of Machine Learning Research, 14(1):2583–2615, 2013.
L. Gottlieb and R. Krauthgamer. Proximity algorithms for nearly-doubling spaces. In APPROX-RANDOM, pages 192–204, 2010.
L. Gottlieb, L. Kontorovich, and R. Krauthgamer. Efficient classification for metric data. In COLT, pages 433–440, 2010.
L. Gottlieb, A. Kontorovich, and R. Krauthgamer. Efficient classification for metric data. IEEE Transactions on Information Theory, 60(9):5750–5759, 2014a.
L. Gottlieb, A. Kontorovich, and P. Nisnevitch. Near-optimal sample compression for nearest neighbors. In NIPS, pages 370–378, 2014b.
L. Gottlieb, A. Kontorovich, and R. Krauthgamer. Adaptive metric dimensionality reduction. Theoretical Computer Science, pages 105–118, 2016a.
L. Gottlieb, A. Kontorovich, and P. Nisnevitch. Nearly optimal classification for semimetrics. In AISTATS, 2016b.
T. Graepel, R. Herbrich, and J. Shawe-Taylor. PAC-Bayesian compression bounds on the prediction error of learning algorithms for classification. Machine Learning, 59(1-2):55–76, 2005.
S. Hanneke. Rates of convergence in active learning. The Annals of Statistics, 39(1):333–361, 2011.
S. Hanneke and L. Yang. Minimax analysis of active learning. JMLR, 16:3487–3602, 2015.
A. Kontorovich and R. Weiss. A Bayes consistent 1-NN classifier. In AISTATS, 2015.
A. Kontorovich, S. Sabato, and R. Urner. Active nearest-neighbor learning in metric spaces. CoRR, abs/1605.06792, 2016. URL http://arxiv.org/abs/1605.06792.
S. Kpotufe. k-NN regression adapts to local intrinsic dimension. In NIPS, 2011.
S.
Kpotufe, R. Urner, and S. Ben-David. Hierarchical label queries with data-dependent partitions. In COLT, pages 1176–1189, 2015.
R. Krauthgamer and J. R. Lee. Navigating nets: simple algorithms for proximity search. In 15th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 791–801, 2004.
S. R. Kulkarni and S. E. Posner. Rates of convergence of nearest neighbor estimation under arbitrary sampling. IEEE Transactions on Information Theory, 41(4):1028–1039, 1995.
A. Maurer and M. Pontil. Empirical Bernstein bounds and sample-variance penalization. In COLT, 2009.
A. K. McCallum and K. Nigam. Employing EM and pool-based active learning for text classification. In ICML, 1998.
C. J. Stone. Consistent nonparametric regression. The Annals of Statistics, 5(4):595–620, 1977.
R. Urner, S. Wulff, and S. Ben-David. PLAL: cluster-based active learning. In COLT, pages 376–397, 2013.
U. von Luxburg and O. Bousquet. Distance-based classification with Lipschitz functions. Journal of Machine Learning Research, 5:669–695, 2004.
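For concreteness, the control flow of the SelectScale search (Algorithm 2) can be sketched as follows. This is a minimal structural illustration only, not the authors' implementation: the callables estimate_err, phi, and bound_G stand in for the paper's EstimateErr, the scale penalty φ(t), and the bound G, and are assumed to be supplied by the caller.

```python
# Structural sketch of Algorithm 2 (SelectScale). The go-right and go-left
# conditions are NOT complementary: when phi(t) <= estimate <= 1.1 * phi(t),
# neither holds and the search stops at t. The arguments estimate_err, phi,
# and bound_G are stand-ins for the paper's black boxes, not the real ones.

def select_scale(scales, estimate_err, phi, bound_G):
    """scales: sorted candidate scales (the set Distmon). Returns the scale t-hat."""
    T = list(scales)     # T maintains the current set of possible scales
    went_right = []      # scales whose test made the search go right
    went_left = []       # scales whose test made the search go left
    est = {}             # scale -> estimated error
    t0 = None
    while T:
        t = T[len(T) // 2]           # the median value in T (ties broken arbitrarily)
        est[t] = estimate_err(t)
        if est[t] < phi(t):          # go right: discard all scales <= t
            went_right.append(t)
            T = [s for s in T if s > t]
        elif est[t] > 1.1 * phi(t):  # go left: discard all scales >= t
            went_left.append(t)
            T = [s for s in T if s < t]
        else:                        # neither condition holds: stop at t
            t0 = t
            break
    if t0 is None and went_right:
        t0 = went_right[-1]          # last scale that made the search go right
    candidates = went_left + ([t0] if t0 is not None else [])
    # the output minimizes the bound estimate over the tested candidates
    # (the edge case of no candidates is handled separately in the paper)
    return min(candidates, key=lambda s: bound_G(est[s], s))
```

For instance, if the error estimate at the median scale already falls in the stopping band [φ(t), 1.1φ(t)], the loop halts on its first iteration and that scale is the only candidate.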