For the mathematical analysis, however, parametric and nonparametric approaches fit into the same setting: assuming that the function to be estimated (a data distribution or a regression function) belongs to a set of functions parametrized by a set $\Theta$, one searches for a (measurable) function $\hat{\theta}_n$ that estimates the "true" parameter based on data points $X_1, \dots, X_n$. The key difference between parametric and nonparametric approaches is that in the former $\Theta \subseteq \mathbb{R}^d$ for some $d \in \mathbb{N}$, while in the latter $\Theta$ is typically the set of possible target functions itself, for example the set of continuous functions or of differentiable functions.

Relevant questions in this field concern the construction of reasonable estimators, consistency, rates of convergence and their optimality, and adaptive estimation.[6]
Consistency
As in parametric regression, a desirable property for an estimator $\hat{m}_n$ is that it converges to the target function $m$ as the sample size $n$ goes to infinity, that is, that the approximation error converges to zero. Usually, the approximation is measured in terms of the $L^2$-norm distance between $\hat{m}_n$ and $m$. Since the estimator is a function of the randomly drawn data $(X_1, Y_1), \dots, (X_n, Y_n)$, the approximation error is a random variable as well, and so one distinguishes two different modes of convergence:
- Weak consistency: $\mathbb{E}\left[\lVert \hat{m}_n - m \rVert^2\right] \to 0$.
- Strong consistency: $\lVert \hat{m}_n - m \rVert^2 \to 0$ almost surely.
If an estimator is consistent for all distributions with square-integrable $Y$, then it is called weakly/strongly universally consistent.[5] Many common nonparametric estimators are weakly universally consistent, such as the Nadaraya–Watson estimator, kNN estimators, and certain local polynomial estimators.[5]
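To illustrate what consistency means in practice, the following minimal sketch (a toy example with an assumed regression function, noise level, and bandwidth rule) fits a Nadaraya–Watson estimator with a Gaussian kernel for increasing sample sizes and reports the empirical $L^2$ error on a grid, which typically shrinks as $n$ grows.

```python
import numpy as np

def nadaraya_watson(x_train, y_train, x_eval, bandwidth):
    """Nadaraya-Watson estimator with a Gaussian kernel:
    m_hat(x) = sum_i K((x - X_i)/h) * Y_i / sum_i K((x - X_i)/h)."""
    # Pairwise kernel weights between evaluation and training points.
    diffs = (x_eval[:, None] - x_train[None, :]) / bandwidth
    weights = np.exp(-0.5 * diffs**2)
    return weights @ y_train / weights.sum(axis=1)

rng = np.random.default_rng(0)
m = lambda x: np.sin(2 * np.pi * x)           # true regression function (toy choice)

for n in [100, 1_000, 10_000]:
    X = rng.uniform(0, 1, n)
    Y = m(X) + rng.normal(0, 0.3, n)          # noisy observations
    h = n ** (-1 / 5)                         # bandwidth shrinking with n (assumed rule)
    grid = np.linspace(0, 1, 500)
    m_hat = nadaraya_watson(X, Y, grid, h)
    l2_error = np.sqrt(np.mean((m_hat - m(grid)) ** 2))   # empirical L2 distance
    print(f"n = {n:6d}  empirical L2 error ~ {l2_error:.3f}")
```

The particular rule $h = n^{-1/5}$ is only one possible choice here; for the purpose of this illustration, any bandwidth sequence with $h \to 0$ and $nh \to \infty$ would serve.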
Minimax optimal rates of convergence
A central topic in the statistical analysis of nonparametric estimators is their speed of convergence towards the true target function $f$ and whether this speed is optimal, i.e., whether the convergence is as fast as possible. The most common way to measure the speed of convergence of an estimator is the minimax convergence rate, which considers the expected loss of the estimator in the worst case over the assumed function class. Under certain assumptions on the smoothness of $f$, one can show that there is a minimal convergence rate that no estimator can undercut, and so any estimator achieving this minimal rate is called optimal.
Mathematically speaking, the target function $f$ is assumed to belong to some class of functions $\mathcal{F}$, called the hypothesis class, each element of which induces a distribution $P_f$ on the sample space, and the approximation quality of an estimator $\hat{f}_n$ is measured by some loss function $d(\hat{f}_n, f)$. The minimax convergence rate of the pair $(\mathcal{F}, d)$ is a sequence $(\psi_n)_{n \in \mathbb{N}}$ of real numbers for which it holds that

$$\inf_{\hat{f}_n} \, \sup_{f \in \mathcal{F}} \, \mathbb{E}_f\!\left[ d(\hat{f}_n, f) \right] \asymp \psi_n ,$$

where $\mathbb{E}_f$ indicates that the random variables $X_1, \dots, X_n$, which draw the data points, have distribution $P_f$.
A universal lower bound on estimation for a hypothesis class $\mathcal{F}$ is a sequence $(\psi_n)_{n \in \mathbb{N}}$ for which it holds that

$$\liminf_{n \to \infty} \; \inf_{\hat{f}_n} \, \sup_{f \in \mathcal{F}} \, \frac{\mathbb{E}_f\!\left[ d(\hat{f}_n, f) \right]}{\psi_n} > 0 ,$$

where the infima are taken over all possible estimators $\hat{f}_n$ (that is, measurable functions) based on $n$ observations.
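As an illustration of these definitions, consider a classical example, stated here as a sketch with constants and technical conditions omitted rather than as a precise theorem: for one-dimensional target functions in a Hölder class of smoothness $\beta$ and the pointwise squared error at a fixed point $x_0$, the minimax rate is polynomial in $n$.

```latex
% Sketch: Hoelder class \Sigma(\beta, L) of smoothness \beta > 0, loss
% d(\hat f_n, f) = (\hat f_n(x_0) - f(x_0))^2 at a fixed point x_0.
% The minimax convergence rate then satisfies
\[
  \inf_{\hat f_n}\ \sup_{f \in \Sigma(\beta, L)}
    \mathbb{E}_f\!\left[ \bigl(\hat f_n(x_0) - f(x_0)\bigr)^2 \right]
  \;\asymp\; \psi_n \;=\; n^{-2\beta/(2\beta+1)} .
\]
```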
The detailed analysis of nonparametric estimators then divides into the estimation of probability densities and of regression functions.
Density estimation
The setting of density estimation typically involves a normed space of functions $(\mathcal{G}, \lVert \cdot \rVert)$, a subset $\mathcal{F} \subseteq \mathcal{G}$ of density functions, and independent random variables $X_1, \dots, X_n$ distributed according to the measure with density $f \in \mathcal{F}$, which generates the data.
Minimax lower bounds are known for different pairs of function classes $\mathcal{F}$ and comparison metrics $d$. Common choices for $\mathcal{F}$ are:

- Hölder spaces $C^{\alpha}$: the space of $\lfloor \alpha \rfloor$-times differentiable functions whose highest derivative is $(\alpha - \lfloor \alpha \rfloor)$-Hölder-smooth.
- Sobolev spaces $W^{k,2}$: the space of Sobolev-smooth functions with square-integrable weak derivatives.
- Besov spaces $B^{s}_{p,q}$: the space of Besov-smooth functions.
In fact, the Hölder spaces and the Sobolev spaces are special cases of certain Besov spaces, namely $C^{\alpha} = B^{\alpha}_{\infty,\infty}$ for non-integer $\alpha$ and $W^{k,2} = B^{k}_{2,2}$.[7] Thus, it often suffices to derive lower bounds under Besov-smoothness assumptions.
Common choices for $d$ are:[6]

- $(\hat{f}_n(x_0) - f(x_0))^2$: the pointwise squared error (MSE) at a fixed point $x_0$.
- $\lVert \hat{f}_n - f \rVert_2^2$: the integrated squared error, whose expectation is the mean integrated squared error (MISE).
- $\lVert \hat{f}_n - f \rVert_\infty$: the supremum-norm distance.
- $D_{\mathrm{KL}}$: the Kullback–Leibler divergence of the distributions induced by $\hat{f}_n$ and $f$.
- $d_{\mathrm{TV}}$: the total variation distance of the distributions induced by $\hat{f}_n$ and $f$.
- $W_p$: the Wasserstein-$p$ distance of the distributions induced by $\hat{f}_n$ and $f$.
By Scheffé's theorem, the total variation distance $d_{\mathrm{TV}}$ is equivalent to the $L^1$-distance between $\hat{f}_n$ and $f$; more precisely, $d_{\mathrm{TV}} = \tfrac{1}{2} \lVert \hat{f}_n - f \rVert_1$.
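As a concrete illustration of Scheffé's identity, the following sketch (with two arbitrarily chosen Gaussian densities) approximates the total variation distance numerically as half the $L^1$-distance.

```python
import numpy as np

def tv_distance(f, g, grid):
    """Total variation distance via Scheffe's theorem: TV = 0.5 * ||f - g||_1.

    The integral is approximated by a Riemann sum on an equidistant grid."""
    dx = grid[1] - grid[0]
    return 0.5 * np.sum(np.abs(f(grid) - g(grid))) * dx

# Two example densities: a standard normal and a normal shifted by one.
def normal_density(mu):
    return lambda x: np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2 * np.pi)

grid = np.linspace(-10.0, 10.0, 100_001)
print(tv_distance(normal_density(0.0), normal_density(1.0), grid))  # ~ 0.383
```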
Explicit lower bounds are known for several combinations of these smoothness classes and distances.[8][9][10]
Kernel density estimators, for instance, achieve the lower bound w.r.t. the MISE under a Sobolev hypothesis class for an appropriate bandwidth choice and are thus minimax optimal.[6]
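The following minimal sketch illustrates this bandwidth scaling (assuming a smoothness of $\beta = 2$, i.e. a twice-differentiable density, and ignoring constants): the kernel density estimator uses $h = n^{-1/(2\beta+1)}$, the order under which the MISE decays at the rate $n^{-2\beta/(2\beta+1)}$.

```python
import numpy as np

def kernel_density_estimate(data, x_eval, bandwidth):
    """Gaussian kernel density estimator: f_hat(x) = (1/(n*h)) * sum_i K((x - X_i)/h)."""
    diffs = (x_eval[:, None] - data[None, :]) / bandwidth
    kernel = np.exp(-0.5 * diffs**2) / np.sqrt(2 * np.pi)
    return kernel.mean(axis=1) / bandwidth

rng = np.random.default_rng(1)
beta = 2                                   # assumed smoothness of the target density
true_density = lambda x: np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)

for n in [200, 2_000, 20_000]:
    sample = rng.normal(size=n)
    h = n ** (-1.0 / (2 * beta + 1))       # minimax-order bandwidth (constants ignored)
    grid = np.linspace(-4, 4, 800)
    f_hat = kernel_density_estimate(sample, grid, h)
    ise = np.sum((f_hat - true_density(grid)) ** 2) * (grid[1] - grid[0])
    print(f"n = {n:6d}  integrated squared error ~ {ise:.5f}")
```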
Regression
In the regression setting, the data arises in pairs $(X_1, Y_1), \dots, (X_n, Y_n)$. Assuming that the data is independent and identically distributed and that $\mathbb{E}[Y_i^2] < \infty$, one can always write

$$Y_i = m(X_i) + \varepsilon_i$$

with $m(x) = \mathbb{E}[Y_i \mid X_i = x]$ being the regression function to be estimated and a noise variable $\varepsilon_i$ fulfilling $\mathbb{E}[\varepsilon_i \mid X_i] = 0$ and $\mathbb{E}[\varepsilon_i^2] < \infty$. Typically, the independent variables $X_i$ are assumed to have values in the unit cube $[0,1]^d$ and to be either deterministic points on a grid (deterministic design) or uniformly distributed (random design). Thus, $m \colon [0,1]^d \to \mathbb{R}$.
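For concreteness, a minimal sketch of how data from this model can be generated under the two design assumptions (the regression function and noise level below are arbitrary toy choices):

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 100, 1
m = lambda x: np.cos(2 * np.pi * x).ravel()   # regression function (toy choice)

# Deterministic design: equidistant grid points in [0, 1]^d.
X_det = np.linspace(0, 1, n).reshape(n, d)
# Random design: i.i.d. uniformly distributed points in [0, 1]^d.
X_rand = rng.uniform(0, 1, (n, d))

# Observations Y_i = m(X_i) + eps_i with centred, square-integrable noise.
Y_det = m(X_det) + rng.normal(0, 0.1, n)
Y_rand = m(X_rand) + rng.normal(0, 0.1, n)
```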
The above setting applies to binary classification as well. In that case, the observations take only two values, say 0 and 1, such that $m(x) = \mathbb{E}[Y \mid X = x] = \mathbb{P}(Y = 1 \mid X = x)$, and given an estimator $\hat{m}_n$ of $m$, the classifiers are assumed to have the form $\hat{g}_n(x) = \mathbf{1}\{\hat{m}_n(x) > 1/2\}$, that is, they classify a point as 1 if the estimated probability of $Y = 1$ is greater than $1/2$ (and as 0 otherwise). Indeed, many classification methods are of that form, for example logistic regression, linear discriminant analysis, quadratic discriminant analysis, k-nearest-neighbors, and support vector machines.
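The plug-in structure can be made explicit in a short sketch (a hypothetical toy setup in which $m$ is estimated by a kNN average with a fixed number of neighbors): the estimated conditional probability is simply thresholded at $1/2$.

```python
import numpy as np

def knn_regression(x_train, y_train, x_eval, k):
    """kNN estimate of m(x) = E[Y | X = x]: average the labels of the k nearest neighbors."""
    dists = np.abs(x_eval[:, None] - x_train[None, :])   # one-dimensional design for simplicity
    nearest = np.argsort(dists, axis=1)[:, :k]
    return y_train[nearest].mean(axis=1)

def plug_in_classifier(m_hat):
    """Plug-in rule: classify as 1 whenever the estimated probability exceeds 1/2."""
    return (m_hat > 0.5).astype(int)

rng = np.random.default_rng(2)
n = 2_000
X = rng.uniform(0, 1, n)
m = lambda x: 1 / (1 + np.exp(-10 * (x - 0.5)))    # true conditional probability (toy choice)
Y = (rng.uniform(size=n) < m(X)).astype(int)       # binary labels with P(Y=1|X) = m(X)

grid = np.linspace(0, 1, 200)
m_hat = knn_regression(X, Y, grid, k=50)
labels = plug_in_classifier(m_hat)
print(labels[:5], labels[-5:])   # points left of 0.5 mostly labelled 0, right of 0.5 mostly 1
```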
Then, for the statistical analysis, the hypothesis class is of the form $\mathcal{F} \subseteq \mathcal{G}$ for some normed space of functions $\mathcal{G}$, and expectations $\mathbb{E}_m$ are taken with respect to the joint distribution of $X_1, \dots, X_n$ and $Y_1, \dots, Y_n$ (or just of $Y_1, \dots, Y_n$ if the $X_i$ are deterministic).
In nonparametric regression, common choices for $\mathcal{F}$ are:

- Hölder spaces $C^{\alpha}([0,1]^{d})$: the space of $\lfloor \alpha \rfloor$-times differentiable functions whose highest derivative is $(\alpha - \lfloor \alpha \rfloor)$-Hölder-smooth.
- Sobolev spaces $W^{k,q}([0,1]^{d})$: the space of Sobolev-smooth functions with $q$-integrable weak derivatives.
Common choices for $d$ are:

- $(\hat{m}_n(x_0) - m(x_0))^2$: the pointwise squared error (MSE) at a fixed point $x_0$.
- $\lVert \hat{m}_n - m \rVert_{L^p}$: the $L^p([0,1]^{d})$-norm distance.
- $\lVert \hat{m}_n - m \rVert_\infty$: the supremum-norm distance.
Under certain technical assumptions, lower bounds are known over the Hölder classes $C^{\alpha}([0,1]^{d})$ with respect to the pointwise MSE (for deterministic design),[6] the $L^{p}([0,1]^{d})$-norm[6][11] and the $L^{\infty}([0,1]^{d})$-norm,[6][11] as well as over the Sobolev classes $W^{k,q}([0,1]^{d})$ with respect to the $L^{p}([0,1]^{d})$- and $L^{\infty}([0,1]^{d})$-norms.[11]
Some local polynomial estimators are minimax optimal w.r.t. the pointwise MSE over Hölder classes $C^{\alpha}$ for arbitrary $\alpha > 0$, for an appropriate choice of the bandwidth.[6] kNN estimators are also minimax optimal w.r.t. the MSE, as well as w.r.t. further distances under suitable smoothness assumptions, for an appropriate choice of the number of considered neighbors.[5]
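As a sketch of the last point (assuming a Lipschitz regression function, for which $k \asymp n^{2/(d+2)}$ is the classical rate-optimal order of the number of neighbors, with constants ignored), the choice of $k$ below is scaled with the sample size:

```python
import numpy as np

def knn_regression(x_train, y_train, x_eval, k):
    """kNN regression estimate: average of the responses of the k nearest neighbors."""
    dists = np.linalg.norm(x_eval[:, None, :] - x_train[None, :, :], axis=2)
    nearest = np.argsort(dists, axis=1)[:, :k]
    return y_train[nearest].mean(axis=1)

rng = np.random.default_rng(3)
d = 2
m = lambda x: np.abs(x).sum(axis=-1)           # Lipschitz regression function (toy choice)

for n in [500, 5_000, 20_000]:
    X = rng.uniform(0, 1, (n, d))
    Y = m(X) + rng.normal(0, 0.2, n)
    k = max(1, int(n ** (2 / (d + 2))))        # rate-optimal order of k for Lipschitz m
    grid = rng.uniform(0, 1, (400, d))         # random evaluation points
    m_hat = knn_regression(X, Y, grid, k)
    mse = np.mean((m_hat - m(grid)) ** 2)
    print(f"n = {n:6d}  k = {k:4d}  empirical MSE ~ {mse:.4f}")
```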