# Thread: Can someone check my Stats?

1. ## Can someone check my Stats?

I've forgotten all my stat-math.

Here's my problem:

In a population of 20 million people I know that 53 people have the surname 'X'. (I need to know random values, ignoring family and tribal connections)

If I reach into the people-bucket and pluck out one person, then I calculate the chances of that person being X are (20,000,000 / 53) = 370,000:1 against. To be statistically confident of plucking out X I would have to scoop out 370,000 people at once.

But if I can only scoop out 20,000 per dip, then what are the odds of 'X' being in that 20K scoop? I work this out at

A: (370,000 / 20,000) = 18.5

Now what are the chances of a random 20K scoop containing two 'X's?

B: (18.5 x 18.5) = 340:1 or 6.8 million people.

Now what are the chances of them both having the same initial?

C: (340 x 26 x 26) = 230,000:1 or 4.6 billion people.

I would have to scoop the bucket 230,000 times (with 20K scoops) or I would need a population of 4.6 billion people to be confident that two 'X's have the same initial?

Is this correct?

Cheers Greg

2. As a general principle, remember to distinguish "A to 1 against" odds from "B for 1" odds (1 chance in B). B = A + 1.

Another useful idea is to be precise: 20000000/53 is over 377,000, not 370,000. Approximations are fine for many purposes, but why impose unnecessary burdens when seeking help?

Originally Posted by Grendel
If I reach into the people-bucket and pluck out one person, then I calculate the chances of that person being X are (20,000,000 / 53) = 370,000:1 against.

[X]: To be statistically confident of plucking out X I would have to scoop out 370,000 people at once.

But if I can only scoop out 20,000 per dip, then what are the odds of 'X' being in that 20K scoop? I work this out at

A: (370,000 / 20,000) = 18.5

Now what are the chances of a random 20K scoop containing two 'X's?

B: (18.5 x 18.5) = 340:1 or 6.8 million people.

[C]: Now what are the chances of them both having the same initial?...
[X] Don't be too confident! You'll fail to find X in that sample 36.8% of the time. (That's the reciprocal of Euler's Number.)
Intuition: The expected (average) number of X you'll find is 1.0 exactly. But sometimes you'll find more than 1; therefore sometimes you'll find less.
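
A quick numeric check of the 36.8% figure, as a minimal Python sketch. It assumes sampling with replacement (each pick independent), which is only an approximation for a scoop this large; the variable names are mine.

```python
import math

POPULATION = 20_000_000   # total people, from the original post
X_COUNT = 53              # people with surname X
p = X_COUNT / POPULATION  # chance that one random pick is an X

# Scoop size at which the expected number of X found is about 1.
scoop = round(POPULATION / X_COUNT)

# Assumption: sampling with replacement, so every pick misses X independently.
p_none = (1 - p) ** scoop

print(f"scoop size       = {scoop}")
print(f"expected X found = {scoop * p:.4f}")
print(f"P(no X at all)   = {p_none:.4f}")
print(f"1/e              = {math.exp(-1):.4f}")
```

The last two printed values should agree closely, which is where the reciprocal of Euler's Number comes from.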

[A] Your calculation yields 18.87 for 1 (or 17.87 to 1), not 18.5 to 1. But the calculation would be incorrect anyway, for reasons related to [X].
A better approximation(*) is that success occurs with probability 1 - exp(20000* ln(1 - 53/20000000)) = 5.16%
By coincidence this is expressed as 18.4 to 1 — almost the figure you give. Sometimes two wrongs do make a right!
(* - this formula works for sampling with replacement. It's good enough here because X's density, starting at 53 units, gets up only to 53.053 units after 19,999 non-X withdrawals.)
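
For anyone who wants to reproduce the figures in [A], here is a small Python sketch of the same with-replacement formula, plus the conversion between "for 1" and "to 1" odds. The constant and variable names are illustrative only.

```python
import math

POPULATION = 20_000_000
X_COUNT = 53
SCOOP = 20_000

p = X_COUNT / POPULATION  # chance that a single pick is an X

# P(at least one X in the scoop), sampling with replacement.
# log1p(-p) is ln(1 - p), computed accurately for tiny p.
p_hit = 1 - math.exp(SCOOP * math.log1p(-p))

one_in = 1 / p_hit          # "B for 1": one success per B scoops
odds_against = one_in - 1   # "A to 1 against", so B = A + 1

# The division used in the question, for comparison.
naive_one_in = (POPULATION / X_COUNT) / SCOOP

print(f"P(at least one X) = {p_hit:.4%}")
print(f"odds              = {one_in:.2f} for 1, or {odds_against:.2f} to 1 against")
print(f"question's method = {naive_one_in:.2f} for 1")
```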

[B] Repeating this calculation your way, but with the corrected numbers just shown, produces 1 chance in 375.3, written 374.3 : 1.
But your approach is incorrect; more accurate odds would be 737 : 1.
Why? Your multiplication assumes that the chance an unknown group will have an X is identical to the chance that a group known to have at least one X (though which element is X is unknown) will have a second X. But in fact these X-nesses are not independent.
I don't think it's a coincidence that the correct odds are almost exactly half what your incorrect calculation gives. (The correct odds of THREE or more X's would be 1/6 of what your calculation produces. 6 = 3! I'll leave a demonstration of this as an exercise! ... An exercise for myself; I've forgotten much of what I once knew. )
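
The numbers in [B] can be checked with a short Python sketch. It uses a binomial model, i.e. it assumes sampling with replacement as in [A]; the names are mine.

```python
POPULATION = 20_000_000
X_COUNT = 53
SCOOP = 20_000
p = X_COUNT / POPULATION

# Binomial model (assumption: sampling with replacement).
p0 = (1 - p) ** SCOOP                     # no X in the scoop
p1 = SCOOP * p * (1 - p) ** (SCOOP - 1)   # exactly one X
p_two_plus = 1 - p0 - p1                  # two or more X

# The approach from the question: square the chance of at least one X.
naive = (1 - p0) ** 2

print(f"P(two or more X) = {p_two_plus:.6f}, about 1 in {1 / p_two_plus:.0f}")
print(f"naive square     = {naive:.6f}, about 1 in {1 / naive:.0f}")
print(f"ratio            = {p_two_plus / naive:.3f}")
```

The ratio printed on the last line shows how far the squared figure overstates the chance of a second X.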

[C] Do you think the 26 letters are equally likely as middle initials? I don't ... and suggest you first refine your approaches for (A) and (B) anyway.

3. Originally Posted by Swammerdami
[A] Your calculation yields 18.87 for 1 (or 17.87 to 1), not 18.5 to 1. But the calculation would be incorrect anyway, for reasons related to [X].
A better approximation(*) is that success occurs with probability 1 - exp(20000* ln(1 - 53/20000000)) = 5.16%
By coincidence this is expressed as 18.4 to 1 — almost the figure you give. Sometimes two wrongs do make a right!
(* - this formula works for sampling with replacement. It's good enough here because X's density, starting at 53 units, gets up only to 53.053 units after 19,999 non-X withdrawals.)
The calculation for sampling without replacement is not too difficult, and it's good to see how much that changes the answer. I get 1 - C(19999947,20000)/C(20000000,20000) = 0.0516452 for the probability without replacement and 1 - (19999947/20000000)^20000 = 0.0516201 for the probability with replacement. That is the difference between odds of 18.36:1 and 18.37:1.
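
One way to get both numbers without ever forming the huge binomial coefficients, sketched in Python. It relies on the identity C(s-53, w)/C(s, w) = product over i = 0..52 of (s - w - i)/(s - i): no X is scooped exactly when all 53 X's are among the s - w people left behind. The names are illustrative.

```python
POPULATION = 20_000_000
X_COUNT = 53
SCOOP = 20_000

# Without replacement: multiply 53 simple ratios instead of evaluating C(,).
p_none_without = 1.0
for i in range(X_COUNT):
    p_none_without *= (POPULATION - SCOOP - i) / (POPULATION - i)

# With replacement: 20,000 independent picks, each missing X.
p_none_with = ((POPULATION - X_COUNT) / POPULATION) ** SCOOP

for label, p_none in (("without", p_none_without), ("with", p_none_with)):
    p_hit = 1 - p_none
    print(f"{label:>7} replacement: P = {p_hit:.7f}, odds {1 / p_hit - 1:.2f}:1")
```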

[B] Repeating this calculation your way, but with the corrected numbers just shown, produces 1 chance in 375.3, written 374.3 : 1.
But your approach is incorrect; more accurate odds would be 737 : 1.

Why? Your multiplication assumes that the chance an unknown group will have an X is identical to the chance that a group known to have at least one X (though which element is X is unknown) will have a second X. But in fact these X-nesses are not independent.
For the sample without replacement, the probability is 1 - (C(19999947,20000) + C(53,1)C(19999947,19999))/C(20000000,20000) = 0.00133195, giving odds of around 750:1.
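
The same figure can be reached without large C(,) values. This Python sketch starts from P(0) and steps to P(1) with the ratio of consecutive hypergeometric terms, P(k+1)/P(k) = (53-k)(w-k) / ((k+1)(s-53-w+k+1)). The function name `hypergeom_pmf` is my own, not a library call.

```python
def hypergeom_pmf(population, marked, sample, k_max):
    """Return [P(exactly k marked in the sample) for k = 0..k_max], without replacement."""
    # P(0): every marked item falls outside the sample.
    p = 1.0
    for i in range(marked):
        p *= (population - sample - i) / (population - i)
    probs = [p]
    # Step from P(k) to P(k+1) using the ratio of consecutive terms.
    for k in range(k_max):
        p *= (marked - k) * (sample - k) / ((k + 1) * (population - marked - sample + k + 1))
        probs.append(p)
    return probs


p0, p1 = hypergeom_pmf(20_000_000, 53, 20_000, 1)
p_two_plus = 1 - p0 - p1

print(f"P(0) = {p0:.7f}, P(1) = {p1:.7f}")
print(f"P(two or more X) = {p_two_plus:.8f}, about 1 in {1 / p_two_plus:.0f}")
```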

I don't think it's a coincidence that the correct odds are almost exactly half what your incorrect calculation gives. (The correct odds of THREE or more X's would be 1/6 of what your calculation produces. 6 = 3! I'll leave a demonstration of this as an exercise! ... An exercise for myself; I've forgotten much of what I once knew. )
I think it might actually just be a coincidence. Try doing the calculations with different numbers.

4. Originally Posted by beero1000

The calculation for sampling without replacement is not too difficult ...
I don't think it's a coincidence that the correct odds are almost exactly half what your incorrect calculation gives. (The correct odds of THREE or more X's would be 1/6 of what your calculation produces. 6 = 3! I'll leave a demonstration of this as an exercise! ... An exercise for myself; I've forgotten much of what I once knew. )
I think it might actually just be a coincidence. Try doing the calculations with different numbers.
The formula isn't difficult. I just didn't know a fast way to calculate large c(,) or factorial on my machine and was too lazy/groggy to try to apply Stirling's approximation.
I see that Wolfram Alpha will calculate those large c(,), though not very quickly. How do you do it?

Yes, I could have divided all the large numbers by 10 — then my machine would handle them, though still slowly. But as I said, and you agreed, the with-replacement approximation was good enough here.
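
On the question of computing large c(,) quickly: one common approach (offered as a suggestion, not as what anyone in this thread actually used) is to work with logarithms via the log-gamma function, so the huge numbers never appear. A minimal Python sketch, with `log_comb` being my own helper name:

```python
import math


def log_comb(n, k):
    """Natural log of C(n, k), computed with the log-gamma function."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


s, w, x = 20_000_000, 20_000, 53

# log of C(s - 53, w) / C(s, w), i.e. log P(no X) without replacement.
log_ratio = log_comb(s - x, w) - log_comb(s, w)
p_at_least_one = 1 - math.exp(log_ratio)

print(f"P(at least one X) = {p_at_least_one:.6f}")
```

Each log-gamma value here is of order 10^8, so floating-point rounding limits the result to roughly six significant figures. That is plenty for odds quoted to two decimals.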

As for p2 ~= .5 * (1-p0)^2 when 20,000,000 >> 20,000 >> 53, where ">>" denotes "MUCH greater than" and pk is probability of exactly k hits, I did, just now, succeed in proving this ... though with great effort(*). Effort so great that I won't attempt to prove the conjecture pk ~= (1/k!) * (1-p0)^k

I suspect that if/when my brain unfogs I'll stumble on a familiar asymptotic formula with these approximations readily derived.

(* - Not "effort" in the sense of a mathematical challenge. Just effort in doing routine but tedious algebraic manipulations. I used to be better than this ... really! :-) )

5. Originally Posted by Swammerdami
... when 20,000,000 >> 20,000 >> 53 >> 1, where ">>" denotes "MUCH greater than" and pk is probability of exactly k hits, ... I won't attempt to prove the conjecture pk ~= (1/k!) * p0 * (1-p0)^k
[Note the corrections: the added ">> 1" and the added factor p0.]
And as soon as I stepped away from keyboard, proof became trivial!

Let's substitute s = 20 million; w = 20 thousand. We'll leave 53 alone.
Given s >> w >> 53 >> k (and 53 >> 1), we seek to prove that
pk = C(53,k) C(s-53,w-k) / C(s,w)
is approximated with
pk ~= p0 (1 - p0) ^ k / k!
Change the C(.) to factorials:
pk = [53! (s-53)! w! (s-w)!] / [(w-k)! (s-53-w+k)! s! k! (53-k)!]
Change a! / (a-b)! to the approximation a^b whenever a >> b; and rearrange a bit to get
k! pk ~= [53^k w^k (s-w)^53] / [s^53 (s-w)^k]
Set k = 0 to get p0 ~= (s-w)^53 / s^53 = (1 - w/s)^53 and recall that, when s >> w >> 53, p0 ~= 1 - 53w/s or
1 - p0 ~= 53w/s
Observing that (s/(s-w))^k ~= 1, substitutions now produce
k! pk ~= p0 (1-p0)^k
Q.E.D.
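
As a sanity check on the approximation, this Python sketch compares it against the exact hypergeometric values for the thread's numbers. The exact terms are built from p0 using the ratio of consecutive terms; the variable names are mine.

```python
import math

s, w, x = 20_000_000, 20_000, 53

# Exact p0 = C(s-53, w) / C(s, w), written as a product of 53 ratios.
exact = [1.0]
for i in range(x):
    exact[0] *= (s - w - i) / (s - i)

# Exact p1..p3 from p_(k+1)/p_k = (53-k)(w-k) / ((k+1)(s-53-w+k+1)).
for k in range(3):
    exact.append(exact[k] * (x - k) * (w - k) / ((k + 1) * (s - x - w + k + 1)))

# The approximation proved above: pk ~= p0 (1 - p0)^k / k!
p0 = exact[0]
approx = [p0 * (1 - p0) ** k / math.factorial(k) for k in range(4)]

for k in range(4):
    print(f"k={k}: exact={exact[k]:.3e}  approx={approx[k]:.3e}  "
          f"ratio={approx[k] / exact[k]:.3f}")
```

With these numbers the ratios should land within a few percent of 1. The gap reflects that 53w/s is small but not negligible.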

6. ummmmm ....

So, what are the odds of two people, surname X, initial A,
occurring in a random 20k sample,
when the surname X accounts for only 53 names in 20 million

?

Greg

7. Originally Posted by Grendel
ummmmm ....

So, what are the odds of two people, surname X, initial A,
occurring in a random 20k sample,
when the surname X accounts for only 53 names in 20 million

?

Greg
Without knowing the frequency of initial A?

8. 1 in 26 letters of the alphabet?

9. Most popular boy/girl baby names in 1950:

1. James / Linda
2. Robert / Mary
3. John / Patricia
4. Michael / Barbara
5. David / Susan
6. William / Nancy
7. Richard / Deborah
8. Thomas / Sandra
9. Charles / Carol
10. Gary / Kathleen

Here's the list from last year:

1. Jacob / Emily
2. Michael / Isabella
3. Ethan / Emma
4. Joshua / Ava
5. Daniel / Madison
6. Christopher / Sophia
7. Anthony / Olivia
8. William / Abigail
9. Matthew / Hannah
10. Andrew / Elizabeth

Wait - I forgot Ahmed, Abdel, Ali, Ashraqat, Aya...

10. Originally Posted by Lion IRC
Most popular boy/girl baby names in 1950:

Wait - I forgot Ahmed, Abdel, Ali, Ashraqat, Aya...