Statistics Module 5 Hypothesis Testing: Introduction to the t-Distribution for Unknown Population σ
Fig: Middle Road OPC Pvt Ltd
This read discusses the concept of hypothesis testing using the t-distribution. Check out the tutorial on statistical inference.
This note explains how to conduct hypothesis tests about a population mean when the population standard deviation σ is not known. The t-distribution is introduced here as a continuous probability distribution. Until now, hypothesis testing has been covered for a population mean when the population standard deviation is known. The types of test are shown above. When the population standard deviation σ is not known, a sample is drawn from the population and used to estimate both the mean and the standard deviation (the sample mean x̄ and the sample standard deviation s). For a hypothesis test with known σ, the standard normal curve and the z statistic are used. For unknown σ, a new distribution, Student's t-distribution, is used instead. The t-distribution is also helpful when the sample size is small. Student's t-distribution (the basis of the popular t-test) is used for hypothesis testing when the population standard deviation σ is unknown and the sampling distribution is approximately normal. The process is similar to the one followed when σ is known, except that we now use the t statistic and a different way of computing p-values.
Test statistic: t = (x̄ − μ0) / (s / sqrt(n)), with n − 1 degrees of freedom, where x̄ = sample mean, μ0 = hypothesized population mean, s = sample standard deviation, and n = number of observations (size of the sample).
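The test statistic and its degrees of freedom can be sketched in a few lines of Python. This is only an illustration: the function names and the input values (a sample mean of 52, a hypothesized mean of 50, s = 6, n = 16) are made up for the example and do not come from the questions in this note.

```python
import math

def t_statistic(sample_mean, mu0, s, n):
    """One-sample t statistic: (sample mean - hypothesized mean) / standard error."""
    standard_error = s / math.sqrt(n)
    return (sample_mean - mu0) / standard_error

def degrees_of_freedom(n):
    """Degrees of freedom for a one-sample t-test."""
    return n - 1

# Hypothetical values, chosen only to illustrate the arithmetic
t = t_statistic(sample_mean=52.0, mu0=50.0, s=6.0, n=16)
df = degrees_of_freedom(16)
print(round(t, 4), df)  # 1.3333 15
```

The standard error s / sqrt(n) is computed separately because it is the quantity that replaces σ / sqrt(n) from the z-test.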
Both one-tailed and two-tailed tests are used with the t-distribution, depending on the hypothesis being tested. The calculation of the p-value changes accordingly, using a different table or Excel function. This section also covers how to make t-distribution graphs (probability density function and cumulative distribution function) in Excel using the function T.DIST, and elaborates on two more functions, T.DIST.2T and T.DIST.RT.
Both the normal and Student's t-distribution are examples of continuous probability distributions, in which the variable can take any value within an interval. Random variables are numerical values which can have a finite or infinite set of values. Until now, only continuous probability distributions such as the normal distribution have been discussed. Discrete probability distributions like the Binomial and Poisson, along with Bayes' theorem, will be covered separately under the probability section on The Middle Road. A brief module on random variables will be coming up as well.
# What is Student t-distribution?
Student's t-distribution is similar to a standard normal curve: it has a mean of zero and is symmetrical about the mean. It is bell shaped and close to normal. Compared to a standard normal curve, the distribution has longer, heavier tails and is shorter in the center.
Degrees of freedom determine the shape of the t-distribution, with each curve unique to its degrees of freedom. As the degrees of freedom increase, the t-distribution becomes more like a standard normal curve. Degrees of freedom is a parameter that counts the independent observations involved in calculating the estimate or statistic. The t-distribution is particularly useful for small data sets, since for large sample sizes it approaches the normal distribution. The test assumes that the observations are randomly selected and that the distribution is close to normal. If the data set is highly skewed, it's important to have a sample size close to 50. [1]
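The shape comparison can be checked numerically. The sketch below uses only the Python standard library and the textbook density formula for the t-distribution; the helper names `t_pdf` and `normal_pdf` are chosen for this example. It prints the density at the center (x = 0) and in the tail (x = 3) for several degrees of freedom, next to the standard normal values.

```python
import math

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_coef = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_coef - (df + 1) / 2 * math.log(1 + x * x / df))

def normal_pdf(x):
    """Density of the standard normal distribution."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

# Density at the center and in the tail for increasing degrees of freedom
for df in (2, 5, 30, 1000):
    print(df, round(t_pdf(0, df), 4), round(t_pdf(3, df), 4))
print("normal", round(normal_pdf(0), 4), round(normal_pdf(3), 4))
```

For small degrees of freedom the t density is lower at the center and higher in the tail than the normal density, and the gap shrinks as the degrees of freedom grow.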
# Student t-distribution Cumulative Distribution Function
The graph of the cumulative distribution function is plotted for 30 degrees of freedom over a range of x values, using the T.DIST function. The function T.DIST(x, degrees of freedom, cumulative) gives the cumulative probability to the left of the t-value for the given degrees of freedom. Setting the last argument to TRUE or 1 gives the cumulative distribution function, while FALSE or 0 gives the probability density function (PDF). A tutorial on how to make t-distribution graphs will also be uploaded going forward.
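For readers without Excel, the behaviour of T.DIST can be approximated in Python. The sketch below is a stand-in, not Excel's implementation: `t_dist` is a hypothetical helper that mirrors the (x, degrees of freedom, cumulative) arguments, and the cumulative value is obtained by numerically integrating the density with Simpson's rule. With SciPy installed, `scipy.stats.t.cdf` and `scipy.stats.t.pdf` would be the usual choice.

```python
import math

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_coef = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_coef - (df + 1) / 2 * math.log(1 + x * x / df))

def t_cdf(x, df, steps=2000):
    """P(T <= x): Simpson's rule on [0, |x|] plus symmetry about zero."""
    a = abs(x)
    if a == 0:
        return 0.5
    h = a / steps  # steps must be even for Simpson's rule
    total = t_pdf(0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if x > 0 else 0.5 - area

def t_dist(x, df, cumulative):
    """Hypothetical stand-in for Excel's T.DIST(x, deg_freedom, cumulative)."""
    return t_cdf(x, df) if cumulative else t_pdf(x, df)

# Points for a CDF and PDF plot with 30 degrees of freedom
for x in (-2, -1, 0, 1, 2):
    print(x, round(t_dist(x, 30, True), 4), round(t_dist(x, 30, False), 4))
```

Plotting the second column against x gives the S-shaped cumulative curve, and the third column gives the bell-shaped density.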
# Is the t-model better at explaining outliers?
The t-model allows for higher tail risk. A paper by C. Sumarni, K. Sadik, K. A. Notodiputro and B. Sartono (Department of Statistics, Bogor Agricultural University, Indonesia, and Statistics Indonesia, Bali Province), titled “Robustness of location estimators under t-distributions: a literature review”, shows that the t-model is a better fit for explaining outliers in data that otherwise follow a normal distribution. In the paper, the team published findings from tests on how to model the effect of solid pesticides on onion production when the data contain outliers. The assumption behind the test was that the quality of onion production depends on the treatment given to the plants, for example solid pesticides.
The group used regression analysis to explain the effect of adding solid pesticides on the increase in production of onions (the dependent variable), with data from 67 onion farmers in Central Java province between May and August 2013. Of the 67 data points, one farmer is an outlier: this farmer produced 6.5 tons of onion using 2.6 kg of solid pesticides, while most farmers produced 1 ton of onion using 1 kg of solid pesticides. Fitting both the normal and the t-model, the results take an interesting turn because of the outlier. The normal model and the t-model (df = 4) were equally good without the outlier, but the normal model gave a higher estimate of the increase in onion production per additional 1 kg of pesticide. The normal model also gave a higher estimate of the intercept, but the value is not significant at the 90 percent confidence level. The p-value is high because the outlier inflates the estimated standard error of the regression coefficients in the normal model. The t-model turned out to be better in terms of goodness of fit and, keeping other things constant, the group concluded that the t-model is better at coping with an outlier.
# Questions with Solutions
Q1: The Coca-Cola Company reported that the mean per capita annual sales of its beverages in the United States was 423 eight-ounce servings (Coca-Cola Company website, Feb 3, 2009). Suppose you are curious whether the consumption of Coca-Cola beverages is higher in Atlanta, Georgia, the location of Coca-Cola headquarters. A sample of 36 individuals from the Atlanta area showed a sample mean annual consumption of 460.4 eight-ounce servings with a standard deviation of s = 101.9 ounces. Using α = 0.05, do the sample results support the conclusion that mean annual consumption of Coca-Cola beverage products is higher in Atlanta?
(This question is from the book Statistics for Business & Economics (Anderson, Sweeney, Williams, Camm, Cochran), Chapter 9.)
Soln: This is a one-tailed (upper-tail) test, since the question asks whether the “mean annual consumption of Coca-Cola beverage products is higher in Atlanta”. The hypotheses are H0: μ ≤ 423 and Ha: μ > 423.
The test statistic is t = (460.4 − 423) / (101.9 / sqrt(36)) = 37.4 / 16.98 = 2.20 (approx.), with 36 − 1 = 35 degrees of freedom.
Use the T.DIST.RT function, the Excel t-distribution function for a right-tailed test, with the arguments (t-value, degrees of freedom). This function directly returns the probability to the right of the t-value, which is what you want. The p-value = 0.017.
Since the p-value < 0.05, reject the null. The sample supports the alternative hypothesis that the mean annual consumption of Coca-Cola beverage products is higher in Atlanta.
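The Q1 calculation can be reproduced outside Excel. The sketch below uses the figures from the question (n = 36, sample mean 460.4, s = 101.9, hypothesized mean 423) and a hypothetical helper `t_right_tail` that plays the role of T.DIST.RT by numerically integrating the t density with Simpson's rule. It is an approximation written for this note, not Excel's own algorithm; with SciPy installed, `scipy.stats.t.sf` returns the same right-tail probability.

```python
import math

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_coef = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_coef - (df + 1) / 2 * math.log(1 + x * x / df))

def t_right_tail(t, df, steps=2000):
    """P(T > t): 0.5 minus the Simpson's-rule area between 0 and t."""
    h = t / steps  # steps must be even for Simpson's rule
    total = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3

# Figures from Q1
n, sample_mean, s, mu0, alpha = 36, 460.4, 101.9, 423, 0.05

t_stat = (sample_mean - mu0) / (s / math.sqrt(n))
p_value = t_right_tail(t_stat, n - 1)

print("t =", round(t_stat, 2), "df =", n - 1)
print("p-value =", round(p_value, 3))
print("reject H0" if p_value < alpha else "do not reject H0")
```

Because the alternative hypothesis is "higher", only the area to the right of the t-value counts toward the p-value.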
Q2: Joan’s Nursery specializes in custom-designed landscaping for residential areas. The estimated labor cost associated with a particular landscaping proposal is based on the number of plantings of trees, shrubs and so on to be used for the project. For cost-estimating purposes, managers use two hours of labor time for the planting of a medium-sized tree. Actual times from a sample of 10 plantings during the past month are as follows. With a 0.05 level of significance, test to see whether the mean tree planting time differs from two hours. (This question is from the book Statistics for Business & Economics (Anderson, Sweeney, Williams, Camm, Cochran), Chapter 9.)
In this question, the population standard deviation is unknown and the sample has only 10 observations, so the t-distribution must be applied.
1. State the null and alternative hypotheses.
This is a two-tailed test since the question asks whether “the mean tree planting time differs from two hours”. The mean can be lower or higher than two hours, so H0: μ = 2 and Ha: μ ≠ 2.
2. Compute the sample mean.
Add all the time values (in hours) and divide by 10, or use the AVERAGE function in Excel. You will get 2.2.
3. Compute the sample standard deviation.
Use the function STDEV.S(time in hours column). Use the sample standard deviation, since only a sample of data points is available.
The answer is 0.5164.
4. What is the p-value?
t-statistic = (2.2 − 2) / (0.5164 / sqrt(10)) = 0.2 / 0.1633 = 1.22 (approx. to 2 decimal places), where s = 0.5164 is the sample standard deviation.
The best function to use is T.DIST.2T, where 2T stands for two-tailed. It returns the two-tailed probability of Student's t-distribution for a given t-value and degrees of freedom. The degrees of freedom are n − 1 = 9. The p-value is about 0.25 (0.2535 using t = 1.22), which is greater than 0.05, so do not reject the null. There is not enough evidence to conclude that the mean tree planting time differs from two hours.
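The two-tailed calculation for Q2 can be sketched the same way, starting from the summary figures above (n = 10, sample mean 2.2, s = 0.5164), since the individual planting times are not reproduced here. The helper `t_two_tailed` is a hypothetical stand-in for T.DIST.2T that doubles a numerically integrated right-tail area. It uses the unrounded t-value, so the result comes out slightly below the 0.2535 obtained with t rounded to 1.22.

```python
import math

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_coef = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_coef - (df + 1) / 2 * math.log(1 + x * x / df))

def t_two_tailed(t, df, steps=2000):
    """P(|T| > |t|): twice the right-tail area beyond |t| (Simpson's rule)."""
    a = abs(t)
    h = a / steps  # steps must be even for Simpson's rule
    total = t_pdf(0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 2 * (0.5 - total * h / 3)

# Summary figures from Q2
n, sample_mean, s, mu0, alpha = 10, 2.2, 0.5164, 2.0, 0.05

t_stat = (sample_mean - mu0) / (s / math.sqrt(n))
p_value = t_two_tailed(t_stat, n - 1)

print("t =", round(t_stat, 2), "df =", n - 1)
print("p-value =", round(p_value, 2))
print("reject H0" if p_value < alpha else "do not reject H0")
```

Taking the absolute value of t before integrating makes the function give the same answer whether the sample mean falls above or below the hypothesized value.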
- Statistics for Business & Economics (Anderson, Sweeney, Williams, Camm, Cochran)
- C. Sumarni, K. Sadik, K. A. Notodiputro and B. Sartono, “Robustness of location estimators under t-distributions: a literature review” (Department of Statistics, Bogor Agricultural University, Indonesia; Statistics Indonesia, Bali Province)