Any statisticians out there in TubeNet land?

SRanney · Post by **SRanney** » Fri Nov 19, 2010 3:45 pm

(After several fruitless days of searching for a plain-speak (read: not statistics jargon) answer, I come asking the TubeNet Freak Jury: are any of you statisticians?)

I have data from some experimental work I did this summer that looks like:

level %surv
3 0.686440678
3 0.125
3 0.873873874
4 0.461538462
4 0.569105691
4 0.85046729
6 0.549450549
6 0.49382716
6 0.123076923
6 0.720930233
6 0.22972973
6 0.93
6 0.448275862
6 0.0
6 0.601851852

where %surv = the observed proportion of larval fish that survived in each replicate at each level (the parameter is dissolved oxygen in mg/L). I can calculate "normal" means and 95% confidence intervals by

mean ± 1.96*(SD)

but in many cases, that gives me an upper CI bound > 1.0. Unfortunately, % survival > 1.0 is impossible. Is there a published method that constrains confidence intervals to some a priori bounds? I could easily just chop off the CI bands at 0 and 1, but I'm certain that isn't the same.
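To make the problem concrete, here is a minimal Python sketch that applies the formula above to the replicate proportions listed in this post. It assumes the sample standard deviation (n - 1 denominator), and the names `data` and `naive_interval` are illustrative only.

```python
from statistics import mean, stdev

# Replicate survival proportions from the post, keyed by dissolved oxygen level (mg/L).
data = {
    3: [0.686440678, 0.125, 0.873873874],
    4: [0.461538462, 0.569105691, 0.85046729],
    6: [0.549450549, 0.49382716, 0.123076923, 0.720930233, 0.22972973,
        0.93, 0.448275862, 0.0, 0.601851852],
}

def naive_interval(values, z=1.96):
    """Return (mean, lower, upper) using mean +/- z * SD, as in the post."""
    m = mean(values)
    half_width = z * stdev(values)  # sample SD, n - 1 denominator (an assumption)
    return m, m - half_width, m + half_width

intervals = {level: naive_interval(values) for level, values in data.items()}
for level, (m, lower, upper) in intervals.items():
    flag = "outside [0, 1]" if lower < 0 or upper > 1 else "ok"
    print(f"level {level}: mean={m:.3f}, interval=({lower:.3f}, {upper:.3f}) {flag}")
```

With these data the level 3 interval runs below 0 and above 1, which is exactly the situation described above.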

Any help (or citations) anyone could provide would be appreciated. I've waded through several statistical texts (including Krebs, Zar, Sokal & Rohlf) and suffered through many papers to no avail. Perhaps I've overlooked something?

Thanks -

Steven

SRanney · Post by **SRanney** » Fri Nov 19, 2010 5:40 pm

the elephant wrote:Good luck, sir.

You never know. I've always been amazed at the expertise that exists here on TubeNet. I just might get lucky. Otherwise, I'll continue to dig into the stats literature.

Donn · Post by **Donn** » Sat Nov 20, 2010 2:51 am

I don't have the knowledge of statistics to be anything but a Tubenet commentator, but ...

Consider that the survival percentage is really a measure of the number of surviving individuals at a certain point. If you had extended the observation period, the survival percentage would be lower, etc. The `population' as it were, the individual larvae each have an actual survival span that is shorter, or potentially longer, than the observation period, and might in the simplest case follow a normal distribution. I have enjoyed quite a variety of alcoholic beverages this evening and become confused when I try to contemplate the relationship between this distribution, and the set of survival percentages you're working with, but at any rate I think it is fairly reasonable for an interval statistical measure based on normal distributions, to find an upper bound survival percentage larger than 1.0, when that 1.0 is just about some arbitrary sampling moment. My facile explanation doesn't account for a lower bound less than 0.0 - I wouldn't be certain you can't just chop it off, but that seems like such a common constraint of natural distributions, that there may be some more suitable statistical measures for that kind of data?

bort · Post by **bort** » Sat Nov 20, 2010 12:04 pm

I have a BS in Mathematics, but took exactly zero Statistics classes. Not sure how that happened. Wish I could help!

TubaRay · Post by **TubaRay** » Sat Nov 20, 2010 12:40 pm

Since I have 4 sem. hrs. of statistics, I believe I qualify as an expert. Here is my expert opinion: 76.3947% of all statistics are totally fabricated.

I'm just sayin'....

rocksanddirt · Post by **rocksanddirt** » Sun Nov 21, 2010 5:35 pm

Well....not an expert, but do have some familiarity with stats.

I think what you want is a confidence interval of the mean, yes?

That is a bit more complicated than the formula you posted. If you have/use MS Excel, there is a 'statistics' extension package that adds to the normal formulas present; you can use that to get better formulas.

SRanney · Post by **SRanney** » Sun Nov 21, 2010 7:35 pm

rocksanddirt wrote:Well....not an expert, but do have some familiarity with stats.

I think what you want is a confidence interval of the mean, yes?

That is a bit more complicated than the formula you posted. If you have/use MS Excel, there is a 'statistics' extension package that adds to the normal formulas present; you can use that to get better formulas.

Yes, a bounded (by 0 and 1) confidence interval of the mean, by level. While I'm not a statistics expert either, unless we're talking about two different things, for a normal distribution, calculating a confidence interval of the mean is as simple as the formula I provided. The confidence interval of a proportion is quite different, but as I'm calculating a mean of proportions, I think this is close to the formula I need.

I'm familiar with Excel as an analysis tool and have several additional "add-ins" added in (including PopTools and the Analysis ToolPak), but none help me calculate a bounded CI on the mean. Any chance you could be more specific about the "statistics" package for Excel that you've mentioned? Excel generally has pretty good documentation/citations for its formulae. For statistical analysis, I generally use R.

Ultimately, these data will be used to test for differences in the means. The simplest way to test for differences would be using some form of the linear model (either regression or ANOVA), but now I'm thinking of pooling all data and eventually using a Z-test or Fisher's Exact Test to determine the statistical significance level (at alpha = 0.05/# of paired comparisons) between each pair of means.
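As a rough illustration of the pairwise approach described above, here is a Python sketch of a two-sided Fisher's exact test on 2x2 tables, with the alpha = 0.05/# of paired comparisons adjustment. The pooled counts in `counts` are hypothetical placeholders (the post only lists proportions), and the function name is made up for this example.

```python
from itertools import combinations
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, total = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is no more likely than the observed one.
    p_value = sum(prob(x) for x in range(lowest, highest + 1)
                  if prob(x) <= p_observed * (1 + 1e-7))
    return min(1.0, p_value)

# HYPOTHETICAL pooled (alive, dead) counts per level; not taken from the post.
counts = {3: (60, 40), 4: (65, 35), 6: (45, 55)}

pairs = list(combinations(counts, 2))
alpha_adjusted = 0.05 / len(pairs)  # 0.05 / number of paired comparisons

for first, second in pairs:
    p = fisher_exact_two_sided(*counts[first], *counts[second])
    verdict = "significant" if p < alpha_adjusted else "not significant"
    print(f"levels {first} vs {second}: p={p:.4f} ({verdict} at alpha={alpha_adjusted:.4f})")
```

With three levels there are three paired comparisons, so each test is judged against 0.05/3.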

Thanks -

Steven

tbn.al · Post by **tbn.al** » Mon Nov 22, 2010 11:23 am

If I ever questioned your intelligence, I apologize. I have absolutely no clue as to what you are even asking, but I thought I might quiz our resident clarinetist and actuary, Jim Brooks. He declined to make an attempt but referred it to his son who works with a bunch of stat guys. For what it is worth here are two possible answers:

In a survey of two of the statisticians here, one says to just truncate the confidence intervals at 1. For something more fancy, the other recommends:

1) Construct a confidence interval for b = log(p/(1-p)), where p is the observed proportion. In jargon, you are constructing a confidence interval for the log odds.

2) Let (b_1, b_2) be the confidence interval for the log odds. Transform your confidence interval back to a probability interval with (exp(b_1)/(1+exp(b_1)), exp(b_2)/(1+exp(b_2))). These limits will always be between 0 and 1.
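The two steps above can be sketched in Python roughly as follows. The reply does not say how to build the interval for the log odds, so this sketch assumes the common large-sample standard error sqrt(1/survivors + 1/deaths); the counts and the function name `logit_interval` are hypothetical.

```python
from math import exp, log, sqrt

def logit_interval(survivors, total, z=1.96):
    """Approximate CI for a proportion, built on the log-odds scale."""
    if not 0 < survivors < total:
        raise ValueError("log odds are undefined when the proportion is 0 or 1")
    p = survivors / total
    b = log(p / (1 - p))  # step 1: log odds of the observed proportion
    # Large-sample standard error of the log odds (an assumption of this sketch).
    se = sqrt(1 / survivors + 1 / (total - survivors))
    b_1, b_2 = b - z * se, b + z * se  # step 2: interval for the log odds
    # Back-transform each limit to the probability scale.
    return exp(b_1) / (1 + exp(b_1)), exp(b_2) / (1 + exp(b_2))

# HYPOTHETICAL counts: 93 survivors out of 100 larvae.
lower, upper = logit_interval(93, 100)
print(f"p = 0.93, interval = ({lower:.3f}, {upper:.3f})")
```

A replicate with a proportion of exactly 0 or 1 has no finite log odds, so the sketch raises an error in that case rather than guessing at a correction.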

Rick Denney · Post by **Rick Denney** » Mon Nov 22, 2010 11:36 am

Have you tested your data to be sure that it is normally distributed? If one population can never be greater than another, then the two distributions are not independent and the difference between them cannot be normally distributed. And if those differences are not normally distributed, then you can't use confidence intervals based on standard deviation, because standard deviation assumes normally distributed data.

Instead of comparing percentage survival, try comparing raw survival numbers. So, instead of writing it down as 0.90, try writing it as 900 survivors out of 1000. Then, do your statistics on the 900 number. You'll be comparing your before distribution to your after distribution, and for that I would recommend a non-parametric test like chi-square rather than a parametric test like mean analysis.

I run into it in traffic analysis when characterizing the time headway from one car to the next leaving a queue when the light turns green. That headway can never be less than about a quarter of a second because if it were, the vehicles would be touching, and two objects can't occupy the same space at the same time. So I don't assume they are normally distributed. I end up with a distribution curve that is not symmetrical (as the normal distribution is--the classic symmetrical bell curve). My "bell" is tilted--skewed--to one side. The best distribution for that data turns out to be a negative exponential distribution that is shifted by the mean.

You could also compare the distributions directly without characterizing them by just comparing their discrete shapes. Chi-square is one way, Kolmogorov-Smirnov is another, and both of these non-parametric tests are in any statistics book. They keep you from having to characterize parameters such as the mean.
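For what it's worth, the core of a two-sample Kolmogorov-Smirnov comparison is small enough to sketch in Python. This only computes the test statistic (the largest gap between the two empirical distribution functions), not a p-value, and applying it to the level 3 and level 6 replicates from the first post is purely illustrative; the function name is made up here.

```python
def ks_statistic(x, y):
    """Largest absolute gap between the empirical CDFs of two samples."""
    def ecdf(sample, t):
        # Fraction of the sample that is less than or equal to t.
        return sum(v <= t for v in sample) / len(sample)

    # The gap can only change at observed values, so checking those is enough.
    pooled = sorted(set(x) | set(y))
    return max(abs(ecdf(x, t) - ecdf(y, t)) for t in pooled)

# Replicate proportions from the first post, levels 3 and 6.
level_3 = [0.686440678, 0.125, 0.873873874]
level_6 = [0.549450549, 0.49382716, 0.123076923, 0.720930233, 0.22972973,
           0.93, 0.448275862, 0.0, 0.601851852]

d = ks_statistic(level_3, level_6)
print(f"KS statistic D = {d:.3f}")
```

Turning D into a significance level needs a table or a statistics package, and with only three replicates at one level the comparison has very little power.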

I don't know your data, and my knowledge of statistics is rather limited to my own domain, so that's about as far as I can go. Just remember that the tails of the normal distribution are asymptotic, and thus those tails can be infinitely high and low at small enough likelihoods. That's why it should only be used to compare means of independent populations.

Rick "thinking there might also be a failure of heteroskedasticity here, which also undermines the assumption of normal distribution, but you didn't want stat-speak--the log conversion suggested above addresses that possible issue, but not my issue" Denney

SRanney · Post by **SRanney** » Mon Nov 22, 2010 2:06 pm

Al and Rick -

Thanks for your thoughts. Regarding construction of a bounded CI of the mean, I'm thinking that either A) truncation is/will be fine, no matter how much I don't think it is, or B) pooling my replicate data and constructing a CI on the percent survival of the pooled data by level. Al, I considered the first option you provided which is similar to my B above. The second I hadn't considered, but will look into. It looks like a logit transformation that I'm familiar with, but will require more reading. However, it all may be moot based upon some other discussions I've had recently. Rick, regarding normality, when dealing with proportions, unless the data are skewed toward zero or one, the assumption is that no data transformation is necessary and parametric statistics can be used.

However, that said, in consultation with a fishing acquaintance of mine (who also happens to be a professor of biostatistics at a well known university in Atlanta), I think what I'm going to do is analyze the data categorically (dead vs. alive) in each of three categories (i.e., "level") rather than equal-interval continuously (percentage/proportion). Analysis, then--rather than t-test/ANOVA/regression--will be Chi^2 or Fisher's Exact Test (which is very similar to Rick's suggestion). The argument he offered was that as soon as I moved away from the number of larvae dead vs. the number of larvae alive in a given replicate/treatment combination, I transformed my data to meet the preconceptions that most experimenters have. (In reality, 90% of the stats that fisheries biologists use are t-test/ANOVA/regression. As a result, we conceptualize our experiments to fit that simple statistics mold.) His suggestion, then, allows me to treat my data as binomial (dead or alive) instead of continuous.
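A minimal Python sketch of the categorical analysis described above, using a Pearson chi-square test on a 2x3 table of dead vs. alive counts by level. The counts are hypothetical placeholders (the thread only gives proportions), and the function name `chi_square_2x3` is invented for this example.

```python
from math import exp

def chi_square_2x3(alive, dead):
    """Pearson chi-square test of independence for a 2x3 table of counts."""
    if len(alive) != 3 or len(dead) != 3:
        raise ValueError("expected counts for exactly three levels")
    rows = [alive, dead]
    row_totals = [sum(row) for row in rows]
    col_totals = [a + d for a, d in zip(alive, dead)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for r, row in enumerate(rows):
        for c, observed in enumerate(row):
            # Expected count if survival were unrelated to level.
            expected = row_totals[r] * col_totals[c] / grand_total
            statistic += (observed - expected) ** 2 / expected
    # With (2 - 1) * (3 - 1) = 2 degrees of freedom, the chi-square
    # upper-tail probability has the closed form exp(-x / 2).
    p_value = exp(-statistic / 2)
    return statistic, p_value

# HYPOTHETICAL pooled counts for levels 3, 4 and 6; not taken from the thread.
alive = [60, 65, 45]
dead = [40, 35, 55]
statistic, p_value = chi_square_2x3(alive, dead)
print(f"chi-square = {statistic:.3f}, p = {p_value:.4f}")
```

The closed-form p-value only holds for 2 degrees of freedom, which is why the sketch insists on exactly three levels.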

Rick Denney wrote:Rick "... but you didn't want stat-speak-..." Denney

Stat-speak I don't mind. I find it difficult wading through much of the symbolic language used in statistics journals. That statistics language was what I was trying to avoid...

Thanks for your thoughts!

Steven

Rick Denney · Post by **Rick Denney** » Mon Nov 22, 2010 4:11 pm

SRanney wrote:However, that said, in consultation with a fishing acquaintance of mine (who also happens to be a professor of biostatistics at a well known university in Atlanta), I think what I'm going to do is analyze the data categorically (dead vs. alive) in each of three categories (i.e., "level") rather than equal-interval continuously (percentage/proportion). Analysis, then--rather than t-test/ANOVA/regression--will be Chi^2 or Fisher's Exact Test (which is very similar to Rick's suggestion). The argument he offered was that as soon as I moved away from the number of larvae dead vs. the number of larvae alive in a given replicate/treatment combination, I transformed my data to meet the preconceptions that most experimenters have. (In reality, 90% of the stats that fisheries biologists use are t-test/ANOVA/regression. As a result, we conceptualize our experiments to fit that simple statistics mold.) His suggestion, then, allows me to treat my data as binomial (dead or alive) instead of continuous.

His suggestion is not just similar to what I said, it's exactly what I said. My only addition was that the transformation might undermine your assumptions of normal distribution. The transformation is dividing one set of data (survivors) into another (total population). Both have to be normally distributed for the result to be normally distributed, which is why you claim you can make the assumption without a skew to zero or one, which it won't be if both are normally distributed. But my point is that there is another test, too, and that's that the two sets of data have to be independent in addition to both being normal. Since number of survivors can never exceed total, they are not independent. That is not the same problem as a lack of heteroskedasticity, which is the problem exposed by being skewed.

Rick "who feels so validated" Denney

Uncle Buck · Post by **Uncle Buck** » Mon Nov 22, 2010 5:27 pm

4, 8, 15, 16, 23, 42

elimia · Post by **elimia** » Mon Nov 22, 2010 6:26 pm

The binary comparison is a good tip; it simplifies things. I too work in natural sciences (freshwater mussels and fishes) and would agree that most of what we use are simple linear models, so I tend to jump to t-test, regression, best of fit models. A tip a professor once told me, with continuous data, is to look at the mean vs. the median. If they are close, you can probably bet the data are normally distributed. I think the Mantel test is handy for looking at group normality.

A stats book I REALLY like is 'Analysis of Ecological Communities' by McCune and Grace. It is more geared to multivariate stats (species space vs environmental space concepts) but has some excellent writing on stats that is understandable. For questions like these, another book is by Robert Stoecker (sp?) - 'environmental analysis for non-scientists' or something like that. I am a scientist but it certainly helps refresh many of the concepts that I don't use everyday.

Stats are tough, it is good to have someone to consult with. Unless you are at a university or work for USGS, it is rare to find statisticians on staff.
