Estimation Theory's Relation to Knowledge of Statistical Distributions
This story begins with a comment from sigmaalgebra, an anonymous mathematician and AVC regular. I appreciated Sigma's explanation of the theory enough to reproduce it here, because the discussion gets to the heart of estimation theory and its relationship with knowledge about statistical distributions.
The Question: What is the performance (expected returns) of venture fund X?
Sir sigmaalgebra
Okay, we are an LP; there is venture firm A; and we want to evaluate their performance. What we want to know is, broadly, how much money can firm A make for us over, say, the next 10 years.
Let that amount of money be X. Then with really meager assumptions, X is a real-valued random variable with finite expectation and finite variance. For just what all this means, there are succinct, elegant, and profound details in, say,
Jacques Neveu, 'Mathematical Foundations of the Calculus of Probability', Holden-Day, San Francisco.
Note that I didn't claim that all the details were easy reading.
Well, we know that we can't know the actual value of X now. So, we settle for estimating the expectation of X, that is, E[X]. For fine details on the expectation E[X], Neveu is excellent.
Why the expectation? Because in the long run in practice, from the (weak or strong) law of large numbers, also in Neveu, that is what we will see. So, we settle for the expectation.
Note: If we want to consider utility functions, then we let X be the 'utility' we want and maximize its expectation.
So, how are we to estimate E[X]?
Well, if we can get a 'sample', say, for some positive integer n, random variables Y(1), Y(2), ..., Y(n), that are independent and have the same distribution as X (full details in Neveu), then we can take the average
Z = ( Y(1) + Y(2) + ... + Y(n) ) / n
and use real random variable Z as our estimator of E[X].
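A minimal sketch of this estimator in Python; the fund returns below are invented purely for illustration:

import numpy as np

# Hypothetical realized returns (in $M) from n = 6 comparable funds, treated
# as draws Y(1), ..., Y(n) sharing the distribution of X.
y = np.array([120.0, 45.0, 300.0, 80.0, 15.0, 210.0])

# The estimator Z is just the sample average.
z = y.sum() / len(y)        # equivalently y.mean()
print(z)                    # our estimate of E[X]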
Okay, just why would we want to use Z instead of some 'ranking', etc.? That is, just why is Z the 'estimator' we want?
First, Z is 'unbiased', that is, E[Z] = E[X]. That is, in the long run, Z will give us the right answer instead of something off to one side.
Second, among unbiased estimators, Z is the most accurate, that is, has the least variance. Details are in, say, the classic:
Paul R. Halmos, "The Theory of Unbiased Estimation", 'Annals of Mathematical Statistics', Volume 17, Number 1, pages 34-43, 1946.
So, with Z we get the world's only minimum variance, unbiased estimator.
And that's why we use Z and why we use averages of past data, and some assumptions, to estimate future returns.
And that's why we don't take seriously 'heuristics' such as ad hoc ranks.
And that's why we don't like results that come from just 'algorithms' without some solid reasons to take the calculations seriously.
And that's why we laugh at much of 'data mining, machine learning, big data, heuristics, and artificial intelligence'.
And that's why we want some solid reasons for any data analysis we do and not just some intuitive suggestions.
And that's why math with theorems and proofs is so powerful.
And, Mr. P. Thiel, that's some of why we go to college.
Thus endeth the first lecture in Data Handling 101.
My response was to bring up the common fallacies of data mining and to discuss the limitations of estimation theory when the underlying distributions are unknown. I also wanted to point sigmaalgebra at a recent find by Jeff Jonas arguing for more data over more complex statistical apparatus.
Me
1) Unbiased estimators depend on an underlying assumption about statistical distributions. If your model is off, your optimization is as well. Also, there are a number of documented cases where L1 (absolute value) outperforms L2 (least squares), on which much of the classical theory relies.
2) "And that's why we laugh at much of 'data mining, machine learning, big data, heuristics, and artificial intelligence'." I agree that there are common fallacies, but I think you'll enjoy Jeff Jonas' counterpoint and humorous title, "data beats math".
Mr. Jonas' findings:
"My gripe, if any, is that way too many people are chipping away at hard problems and making no material gains in decades (e.g., entity extraction and classification) … when what they actually need is more data. Not more of same data, by the way. No, they more likely need orthogonal data – data from a different sensor sharing some of the same domain, entities and features (e.g., name and driver’s license number)."
This morning I awoke to an in-depth reply and schooling in linear algebra, which brought back fond memories of subspace theory in graduate school with Professor Zemanian. The central limit theorem crept in (justified, but based on another assumption), as well as a practical statement of the statistical assumptions. In particular, Sigma made it clear that the variables Y(i) are assumed to share the distribution of the fund in question, X.
Sir sigmaalgebra
Mark, for your:
"Unbiased estimators depend on an underlying assumption about statistical distributions."
but on this I assumed merely that
"random variables Y(1), Y(2), ..., Y(n), ... have the same distribution as X"
and that is sufficient.
Or, for algebraic details, which are easy enough, since, for i = 1, 2, ..., n, each Y(i) has the same distribution as X, we have that E[Y(i)] = E[X]. Then from our estimator
Z = ( Y(1) + Y(2) + ... + Y(n) ) / n
using that expectation is linear (Neveu), we have:
E[Z] = E[( Y(1) + Y(2) + ... + Y(n) ) / n]
= ( E[Y(1)] + E[Y(2)] + ... + E[Y(n)] ) / n
= ( E[X] + E[X] + ... + E[X] ) / n
= ( n E[X] ) / n
= E[X]
so that
E[Z] = E[X]
as desired.
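The same claim checked numerically, with an illustrative lognormal distribution standing in for X (the choice of distribution is an assumption of the sketch, not of the argument):

import numpy as np

rng = np.random.default_rng(0)
n, reps = 25, 100_000

# Stand-in distribution for X: lognormal(0, 1), whose expectation is known.
true_mean = np.exp(0.5)                       # E[X] = exp(mu + sigma^2 / 2)

# Draw many independent samples Y(1), ..., Y(n) and form Z for each.
z = rng.lognormal(0.0, 1.0, size=(reps, n)).mean(axis=1)

print(true_mean, z.mean())                    # the two agree closely: E[Z] = E[X]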
The minimum variance part from the Halmos paper is tricky to see but, still, makes only meager assumptions about the distribution of the data.
For your:
"Also there are a number of documented where cases L1 (absolute value) outperforms L2 (least squares) which much of the classical theory relies on."
Note: In L^1 and L^2, etc., the L abbreviates H. Lebesgue. He was an E. Borel student about 100 years ago and redid the integration in calculus. One result was 'measure theory' which via Kolmogorov became the foundations of 'modern probability' as in Neveu.
L^1, L^2 are 'norms', essentially definitions of distance, on 'vector spaces'. In the case of a random variable such as X, the L^1 norm is
|| X ||_1 = E[ |X| ]
On the left, the double vertical bars are common norm notation (in D. Knuth's TeX they look better). On the right, the single vertical bars are just absolute value. Here I borrow from TeX and let _1 denote a subscript.
The set of all X with finite L^1 forms a vector space. It turns out, as in Neveu, it's 'complete' and, hence, a Banach space. 'Complete' is much as in the real numbers, that is, for a sequence that appears to converge, there is something there for it to converge to. The rationals are not complete because rationals can converge to, e.g., pi which is not a rational. One joke is, "Calculus is the elementary properties of the completeness property of the real number system."
The L^2 norm of X is
|| X ||_2 = E[ X^2 ]^(1/2)
Similarly the set of all X with finite L^2 forms a vector space. It turns out (Neveu again), it's 'complete' and, hence, a Banach space.
But also important is, for this vector space, we can define the 'inner product'
(X,Y) = E[ XY ]
Then the norm is just from the inner product:
|| X ||_2 = (X,X)^(1/2)
With this inner product, we have a Hilbert space, which is a good thing because we can do projections.
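Empirically these quantities are just averages; a short Monte Carlo sketch with two illustrative random variables (the distributions are arbitrary choices for the demonstration):

import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000

# Two illustrative random variables on the same probability space.
x = rng.normal(0.0, 2.0, n)
y = x + rng.normal(0.0, 1.0, n)

l1_norm = np.mean(np.abs(x))         # || X ||_1 = E[ |X| ]        ~ 1.60
l2_norm = np.sqrt(np.mean(x**2))     # || X ||_2 = E[ X^2 ]^(1/2)  ~ 2.00
inner   = np.mean(x * y)             # (X, Y)    = E[ XY ]         ~ 4.00
print(l1_norm, l2_norm, inner)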
Completeness for the random variables under the L^1 and L^2 norms is a bit amazing.
It appears that Hilbert space is an abstraction of several important examples from the 19th and 20th centuries, especially Fourier theory, various orthogonal polynomials, spherical harmonics, various parts of differential equations, etc. So, get to do the derivations once, just from the axioms, and apply them many times.
Apparently Hilbert space was a von Neumann idea; there's a claim that once he had to explain what he meant to Hilbert!
Back to our estimator Z: When we have n samples, let's call our estimator Z(n). Then, as n grows large, we would like Z(n) to be a better estimator of E[X].
Well, in the L^2 norm, our error squared is:
|| Z(n) - E[X] ||_2^2 = E[ (Z(n) - E[X])^2 ]
= E[ (Z(n) - E[Z(n)])^2 ]
= Var( Z(n) )
or the variance of Z(n). Such math looks MUCH better in TeX!
But from the definition of Z(n) and our assumptions and our use of Y(1), Y(2), ..., Y(n),
Var( Z(n) ) = Var( ( Y(1) + Y(2) + ... + Y(n) ) / n )
= ( 1/n^2 ) Var( ( Y(1) + Y(2) + ... + Y(n) ) )
= ( 1/n^2 ) Var( Y(1) ) n
= Var( Z(1) ) / n
So, as n grows, our error
|| Z(n) - E[X] ||_2 = Std( Z(1) ) / n^(1/2)
where Std is the standard deviation, that is, the square root of the variance.
So for large n, our L^2 error converges to 0. So, our estimator Z(n) converges to E[X] in the L^2 norm. Then (Neveu) some subsequence of our sequence Z(n) must converge to E[X] with probability 1, that is, 'almost surely', and that's the best we can hope for. In practice, or with more analysis or assumptions, Z(n) will converge to E[X] almost surely without the issue of subsequences.
For practice, convergence in L^2 is good enough to take to the bank.
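The 1/n^(1/2) decay is easy to see numerically; a sketch, again with an arbitrary illustrative distribution (standard exponential, so Std( Z(1) ) = 1):

import numpy as np

rng = np.random.default_rng(2)
reps = 20_000

# For each n, form Z(n) many times and look at its standard deviation.
for n in (1, 25, 100, 400):
    z = rng.exponential(scale=1.0, size=(reps, n)).mean(axis=1)
    print(n, z.std())        # close to Std( Z(1) ) / n^(1/2) = 1 / sqrt(n)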
If we were to work with L^1 instead, then things would be more difficult. Also we might end up estimating the median of X and, thus, have a biased estimator of the expectation of X.
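To make that L^1 point concrete: minimizing E[ |X - c| ] over c picks out the median, and for a skewed distribution the median and the expectation differ. A sketch with an illustrative lognormal sample:

import numpy as np

rng = np.random.default_rng(3)
x = rng.lognormal(0.0, 1.0, 100_000)     # skewed illustrative sample

# Search a grid for the c minimizing the L^1 cost E[ |X - c| ].
grid = np.linspace(0.5, 2.5, 201)
l1_cost = [np.mean(np.abs(x - c)) for c in grid]
print(grid[np.argmin(l1_cost)])          # ~1.00, the median of this lognormal
print(np.mean(x))                        # ~1.65, the expectation exp(1/2)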
Also, we can know more than just that Z(n) converges to E[X]; in addition we can find 'confidence' intervals. From the sum of the Y(i) and the central limit theorem, for n over a few dozen, the distribution of Z(n) will be close to Gaussian. So, we can use a t-test to get a confidence interval. If we want to be still more careful, then there are some simple, somewhat 'computer intensive', 'distribution-free' (where we make no assumptions about the distribution of X) methods for getting a confidence interval on our estimate.
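A sketch of both routes, assuming scipy is available for the t quantile and using a plain percentile bootstrap as the distribution-free alternative; the sample itself is simulated for illustration:

import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
y = rng.lognormal(0.0, 1.0, 50)          # illustrative sample of n = 50 returns
n, z = len(y), y.mean()

# t-based 95% confidence interval, leaning on approximate normality of Z(n).
half = stats.t.ppf(0.975, df=n - 1) * y.std(ddof=1) / np.sqrt(n)
print(z - half, z + half)

# Distribution-free alternative: percentile bootstrap of the sample mean.
boot = rng.choice(y, size=(10_000, n), replace=True).mean(axis=1)
print(np.percentile(boot, [2.5, 97.5]))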
On the value of 'big data', that's questionable. I would recommend being careful about what we know about the data we seek and use. The criterion of "orthogonal" is not so good; likely he meant independent.
One warning about 'big data' is in our:
|| Z(n) - E[X] ||_2 = Std( Z(1) ) / n^(1/2)
So if we multiply n by 100, then we divide the right side by the square root of 100, that is, 10 and, thus, get just one more significant digit of accuracy in our estimate. So, getting three significant digits is common; getting four is usually a strain; getting eight or more is nearly absurd.
Commonly getting three significant digits is so easy that struggling with 'big data' to get, say, six significant digits is not worth the effort.
One of the problems in practice, e.g., in 'data mining' and 'machine learning', is just 'fitting models'. There can be some value there, but there are also some serious pitfalls: E.g., for some positive integer n and for numerical data (x(i), y(i)), for i = 1, 2, ..., n, if the x(i) are distinct then we can just write down a polynomial p of degree n-1 so that, for all i, p(x(i)) = y(i). This is called Lagrange interpolation and has a richly deserved reputation for just absurd results. Fit? Yes. Useful? No! For something less absurd, we can also use various splines, least-squares splines, and versions for functions of several variables. Still, getting 'causality', 'meaning', or good predictions from such efforts is not easy.
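A sketch of that pitfall: interpolate 15 equally spaced samples of a smooth, bounded function with the degree-14 polynomial through them; the fit is exact at the data points yet oscillates far outside the function's range in between (the classic Runge phenomenon; the function and node count are arbitrary choices for the demonstration):

import numpy as np

# 15 equally spaced samples of a smooth function with |f| <= 1 on [-1, 1].
x = np.linspace(-1.0, 1.0, 15)
y = 1.0 / (1.0 + 25.0 * x**2)

# The degree n-1 polynomial through all n points (the Lagrange interpolant).
coeffs = np.polyfit(x, y, deg=len(x) - 1)

xx = np.linspace(-1.0, 1.0, 2001)
print(np.max(np.abs(np.polyval(coeffs, x) - y)))    # ~0: exact at the data points
print(np.max(np.abs(np.polyval(coeffs, xx))))       # several times max|f| = 1 in between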
My estimator of the coveted E[X] was from samples Y(i), but the thread assumed some additional data about first rounds and follow-on rounds. So, with this additional data, might a more accurate estimate be possible? In principle, yes.
One approach, the most powerful possible in the L^2 sense given sufficient data, is just old cross tabulation. That is, as an easy exercise starting with Neveu, if we have some data Y and want to estimate X, then we can take the conditional expectation of X given Y, that is, E[X|Y]. Then there is some function f so that E[X|Y] = f(Y), and this function f minimizes
|| X - f(Y) ||_2
Well, cross tabulation can be used to give an estimate of a discrete approximation to f.
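A sketch of the idea, with simulated paired data standing in for the real (Y, X) pairs: bin Y, then average X within each bin; the bin averages are a discrete approximation to f(Y) = E[X|Y].

import numpy as np

rng = np.random.default_rng(5)
n = 100_000

# Simulated paired data: Y is the observable, X the quantity of interest.
y = rng.uniform(0.0, 10.0, n)
x = 2.0 * y + rng.normal(0.0, 3.0, n)

# Cross tabulation: 10 bins on Y, average of X within each bin.
edges = np.linspace(0.0, 10.0, 11)
which = np.digitize(y, edges[1:-1])                  # bin index 0..9 for each Y
f_hat = np.array([x[which == b].mean() for b in range(10)])
print(f_hat)      # roughly 2 * (bin midpoints): 1, 3, 5, ..., 19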
Another approach, that can work with less data, is to assume a 'model' with some parameters and then determine the parameters by fitting to the available real data. Pursuing this direction is beyond the scope of this post now!
This would be, what, a Q&A session after the first lecture in Data Handling 101?
Neveu? An elegant, powerful, crown jewel of civilization.
I found an hour this morning to carefully read through and think about Sigma's reply:
Me
Thank you for the in-depth reply and review of subspace theory. Your careful description brought back fond memories of graduate school, and your passion for the subject is obvious. Let's focus on a couple of assumptions that appear minor but could lead to surprising and unexpected results.
"random variables Y(1), Y(2), ..., Y(n), ... have the same distribution as X"
Do you believe (would you wager?) that there is a distribution for X that bounds the behavior of the venture fund in question? Even if there is such a distribution and X is historically well behaved, how do you account for singularities?
----------------------------------------------------------------------
Also in this step:
|| Z(n) - E[X] ||_2^2 = E[ (Z(n) - E[X])^2 ]
= E[ (Z(n) - E[Z(n)])^2 ]
= Var( Z(n) )
you replace E[X] with E[Z(n)] in order to later describe how Z(n) converges to E[X] for large n.
|| Z(n) - E[X] ||_2 = Std( Z(1) ) / n^(1/2)
That appears to be using the result of the derivation to prove the derivation.
----------------------------------------------------------------------
With regards to the central limit theorem:
"From the sum of the Y(i) and the central limit theorem, for n over a few dozen, the distribution of Z(n) will be close to Gaussian."With this statement are you pegging E[X] as Gaussian? That can't be correct. Once you assume E[X] is Gaussian you're working on a different problem, and no longer trying to predict an individual venture fund's returns. Are you stating Fred and the team at USV are properly characterized by a Gaussian random variable :)? Of course you could fit a Gaussian curve through their portfolio returns, but doing so doesn't make company returns adhere to the distribution. Fred has described fund performance as a power law curve in the past here.
----------------------------------------------------------------------
On big data: (I wish I could annotate your comment to split off better side discussions). The limitation you describe on significant digits is predicated on your model: || Z(n) - E[X] ||_2 = Std( Z(1) ) / n^(1/2). What Jeff described was that by using simpler models and more data he achieved better results than by applying more complex models to far less data.
Is there a large difference between model fitting (agreed, there are good and poor techniques) and assumed distributions? I could represent unknown regions of measured data with a placeholder distribution that describes local behavior instead of generating presumed values (splines/interpolation/least squares).
I didn't have time to dig in and understand your mention of cross tabulation, but the relationship E[X|Y] = f(Y), where f minimizes || X - f(Y) ||_2, has me interested. The best estimator of current fund performance should weight previous performance; I just don't know how that would look (environment and macroeconomic trends impact performance as much as the selection of the investors).
I've relied on Taylor series expansions a number of times in physics while chasing dominant relationships. I feel comfortable pursuing dominant relationships with limited parameter spaces. Allow the number of parameters to grow and your fits keep on improving, so always penalize additional degrees of freedom when scoring.
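One common way to apply that penalty, sketched with AIC under a Gaussian-error assumption (BIC or cross-validation would serve equally well); the data and the true degree here are invented for the illustration:

import numpy as np

rng = np.random.default_rng(7)
n = 60
x = np.linspace(-1.0, 1.0, n)
y = 1.0 + 2.0 * x - 1.5 * x**2 + rng.normal(0.0, 0.3, n)   # true model: degree 2

for deg in range(1, 9):
    coeffs = np.polyfit(x, y, deg)
    rss = np.sum((np.polyval(coeffs, x) - y) ** 2)
    k = deg + 1                                   # number of fitted coefficients
    aic = n * np.log(rss / n) + 2 * k             # Gaussian-error AIC, up to a constant
    print(deg, round(aic, 1))                     # typically minimized at or near degree 2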