In statistics, a population refers to the entire group of individuals or objects under study, while a sample is a subset of the population. For example, if we're studying the height of all students in a school, the population would be all students, while a sample might be 100 randomly selected students.
Note
It's often impractical or impossible to study an entire population, which is why we use samples to make inferences about the population.
A random sample is one where each member of the population has an equal chance of being selected. This is crucial for ensuring that the sample is representative of the population.
Common Mistake
Many people assume that any sample is a random sample, but this isn't true. For example, surveying only your friends about a political issue isn't a random sample of the population.
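As a minimal sketch of simple random sampling, the snippet below draws 100 students from a hypothetical school using Python's standard `random.sample`, which selects without replacement so that every member is equally likely to be chosen. The population size, the ID numbers, and the seed are illustrative assumptions.

```python
import random

# Hypothetical population: ID numbers for 1,200 students in a school.
population = list(range(1, 1201))

# Fixed seed (an arbitrary choice) so the sketch is reproducible.
rng = random.Random(42)

# Draw 100 distinct students; each ID is equally likely to be selected.
sample = rng.sample(population, 100)

print(len(sample), "students sampled; first five IDs:", sorted(sample)[:5])
```

Surveying only your friends would correspond to hand-picking the IDs rather than letting the random generator choose them.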
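Data can be classified as discrete or continuous. Discrete data can take only separate, countable values (for example, the number of students in a class), while continuous data can take any value within a range (for example, height).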
The reliability of data sources and potential bias in sampling are critical considerations in statistics. Bias can occur when the sample doesn't accurately represent the population.
Example
If we're studying the average income of a city but only sample people from wealthy neighborhoods, our results will be biased and not representative of the entire city.
Outliers are data points that differ significantly from other observations. They can have a substantial impact on statistical analyses and should be carefully considered.
Tip
When encountering outliers, don't automatically discard them. Investigate why they exist and consider their impact on your analysis.
A frequency distribution shows how often each value occurs in a dataset. It can be presented in a table or graphically.
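A frequency table can be built in a few lines. The sketch below uses `collections.Counter` on a small hypothetical dataset (the number of siblings reported by ten students); the data are made up for illustration.

```python
from collections import Counter

# Hypothetical dataset: number of siblings reported by 10 students.
data = [0, 1, 1, 2, 2, 2, 3, 1, 0, 2]

# Count how often each value occurs.
frequency = Counter(data)

# Print the distribution as a simple two-column table.
print("value  frequency")
for value in sorted(frequency):
    print(f"{value:>5}  {frequency[value]:>9}")
```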
Histograms display the frequency distribution of continuous data using adjacent bars with no gaps between them. The x-axis represents the data values, grouped into intervals, and the y-axis shows the frequency of each interval.
Cumulative frequency graphs show the running total of frequencies up to each interval. They're useful for estimating medians and percentiles.
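The running totals behind a cumulative frequency graph can be sketched as follows. The class intervals and frequencies are hypothetical, and locating the median class at half the total frequency is one common convention rather than the only one.

```python
from itertools import accumulate

# Hypothetical grouped data: upper boundary of each interval and its frequency.
upper_bounds = [10, 20, 30, 40, 50]
frequencies = [3, 7, 12, 6, 2]

# Running total of the frequencies.
cumulative = list(accumulate(frequencies))

for bound, total in zip(upper_bounds, cumulative):
    print(f"up to {bound}: {total}")

# The median is estimated where the cumulative frequency reaches half the total.
half_total = cumulative[-1] / 2
median_class_index = next(
    i for i, total in enumerate(cumulative) if total >= half_total
)
print("median lies in the interval ending at", upper_bounds[median_class_index])
```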
Box-and-whisker diagrams, also known as box plots, provide a visual summary of the distribution of data, showing the median, quartiles, and potential outliers.
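The values a box plot draws (minimum, lower quartile, median, upper quartile, maximum) can be computed directly. This sketch uses an illustrative dataset and the "exclusive" method of `statistics.quantiles`; quartile conventions differ between textbooks and software, so other tools may give slightly different quartiles for the same data.

```python
import statistics

# Illustrative dataset, already sorted.
data = [2, 3, 3, 4, 5, 5, 6, 7, 8]

# Quartiles under the "exclusive" convention (the default of statistics.quantiles).
q1, q2, q3 = statistics.quantiles(data, n=4, method="exclusive")

five_number_summary = (min(data), q1, q2, q3, max(data))
print(five_number_summary)  # (2, 3.0, 5.0, 6.5, 8)
```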
Example
For the dataset 2, 3, 3, 4, 5, 5, 6, 7, 8:
Mean = (2 + 3 + 3 + 4 + 5 + 5 + 6 + 7 + 8) / 9 ≈ 4.78
Median = 5
Mode = 3 and 5 (bimodal)
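The same three measures can be checked with Python's standard `statistics` module. This is only a quick verification sketch; `multimode` is used because it returns every most-common value rather than just one.

```python
import statistics

data = [2, 3, 3, 4, 5, 5, 6, 7, 8]

mean = statistics.mean(data)
median = statistics.median(data)
modes = statistics.multimode(data)  # every value tied for most frequent

print(round(mean, 2), median, modes)  # 4.78 5 [3, 5]
```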
Note
Standard deviation, the square root of the variance, is often preferred over variance because it's in the same units as the original data, whereas variance is in squared units.
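To make the point about units concrete, the sketch below computes both measures for a hypothetical set of heights. It assumes the data are the whole population, so it uses `pvariance` and `pstdev`; for a sample, `statistics.variance` and `statistics.stdev` would be the usual choice.

```python
import statistics

# Hypothetical heights in centimetres, treated as a complete population.
heights_cm = [160, 165, 170, 175, 180]

variance = statistics.pvariance(heights_cm)  # units: cm squared
std_dev = statistics.pstdev(heights_cm)      # units: cm

print(f"variance = {variance} cm^2, standard deviation = {std_dev:.2f} cm")
```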
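Correlation measures the strength and direction of the linear relationship between two variables. The correlation coefficient, r, ranges from -1 to 1: values near 1 indicate a strong positive linear relationship, values near -1 a strong negative one, and values near 0 little or no linear relationship.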
Linear regression finds the best-fitting straight line through a set of points. The equation is typically in the form y = mx + c, where m is the slope and c is the y-intercept.
Example
If we have data on students' study time in hours (x) and their test scores (y), we might find a regression line like:
$y = 2x + 60$
This suggests that each additional hour of study is associated with a predicted increase of 2 points in the test score, with a predicted score of 60 for no study time.
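The correlation coefficient and the regression line can both be computed from sums of squares about the means. The sketch below assumes the "best-fitting" line means the least-squares line, and the five data points are hypothetical.

```python
import math

# Hypothetical data: hours studied (x) and test scores (y) for five students.
hours = [1, 2, 3, 4, 5]
scores = [63, 63, 66, 69, 69]

n = len(hours)
mean_x = sum(hours) / n
mean_y = sum(scores) / n

# Sums of squares and cross-products about the means.
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(hours, scores))
s_xx = sum((x - mean_x) ** 2 for x in hours)
s_yy = sum((y - mean_y) ** 2 for y in scores)

# Pearson correlation coefficient.
r = s_xy / math.sqrt(s_xx * s_yy)

# Least-squares slope (m) and intercept (c) for y = mx + c.
m = s_xy / s_xx
c = mean_y - m * mean_x

print(f"r = {r:.3f}, line: y = {m:.2f}x + {c:.2f}")
```

For these made-up points the fitted line is close to, but not exactly, y = 2x + 60, which is typical when real data scatter around a trend.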
Various diagrams, such as tree diagrams, can be used to calculate probabilities.
Example
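Consider a tree diagram for flipping a fair coin twice: the first flip branches into heads and tails, each with probability 0.5, and each of those branches splits the same way for the second flip. Multiplying along the heads-heads path, the probability of getting two heads is 0.5 × 0.5 = 0.25.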
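The paths of a two-flip tree can also be enumerated in code. This sketch multiplies the branch probabilities along each path, using exact fractions so that the results are not affected by rounding.

```python
from fractions import Fraction
from itertools import product

# Probability of each outcome on a single fair flip.
flip = {"H": Fraction(1, 2), "T": Fraction(1, 2)}

# Each path through the tree is one (first flip, second flip) pair;
# its probability is the product of the branch probabilities.
paths = {
    first + second: flip[first] * flip[second]
    for first, second in product(flip, repeat=2)
}

print(paths["HH"])          # 1/4
print(sum(paths.values()))  # 1
```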
Conditional probability is the probability of an event occurring given that another event has already occurred. It's denoted as P(A|B), read as "the probability of A given B".
$P(A|B) = \frac{P(A \cap B)}{P(B)}$
Common Mistake
People often confuse P(A|B) with P(B|A). These are generally not the same!
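A small die example shows how different the two conditional probabilities can be. In this sketch, A is "the roll is a 6" and B is "the roll is even"; the events and helper names are chosen purely for illustration.

```python
from fractions import Fraction

outcomes = range(1, 7)  # one roll of a fair six-sided die


def prob(event):
    """Probability of an event, given as a predicate on a single outcome."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))


def is_six(o):   # event A
    return o == 6


def is_even(o):  # event B
    return o % 2 == 0


def both(o):     # event A and B
    return is_six(o) and is_even(o)


p_six_given_even = prob(both) / prob(is_even)  # P(A|B)
p_even_given_six = prob(both) / prob(is_six)   # P(B|A)

print(p_six_given_even, p_even_given_six)  # 1/3 1
```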
A discrete random variable is a variable that can only take specific values. Its probability distribution gives the probability for each possible value.
The expected value (E(X)) of a discrete random variable X is the sum of each possible value multiplied by its probability:
$E(X) = \sum x_i P(X = x_i)$
Example
For a fair six-sided die, E(X) = 1(1/6) + 2(1/6) + 3(1/6) + 4(1/6) + 5(1/6) + 6(1/6) = 3.5
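The same sum can be written as a short check in Python. The distribution is stored as a dictionary from value to probability, an arbitrary representation chosen for this sketch.

```python
from fractions import Fraction

# Probability distribution of a fair six-sided die.
distribution = {face: Fraction(1, 6) for face in range(1, 7)}

# E(X) is the sum of each value multiplied by its probability.
expected_value = sum(x * p for x, p in distribution.items())

print(expected_value, float(expected_value))  # 7/2 3.5
```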
The binomial distribution models the number of successes in a fixed number of independent trials, each with the same probability of success.
If X ~ B(n, p), where n is the number of trials and p is the probability of success on each trial:
$P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}$
Example
If you flip a fair coin 10 times, the probability of getting exactly 7 heads is:
$P(X = 7) = \binom{10}{7} (0.5)^7 (0.5)^3 \approx 0.1172$
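The binomial formula translates almost directly into code. In this sketch, `binomial_pmf` is a hypothetical helper name, and `math.comb` supplies the binomial coefficient.

```python
from math import comb


def binomial_pmf(k, n, p):
    """P(X = k) for X ~ B(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


p_seven_heads = binomial_pmf(7, 10, 0.5)
print(round(p_seven_heads, 4))  # 0.1172
```

Summing `binomial_pmf(k, 10, 0.5)` over k = 0 to 10 gives 1, a quick sanity check that the probabilities form a valid distribution.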
The normal distribution is a continuous probability distribution with a bell-shaped curve. It's characterized by its mean (μ) and standard deviation (σ).
The standard normal distribution has μ = 0 and σ = 1. We can convert any normal distribution to standard normal using the z-score:
$z = \frac{x - \mu}{\sigma}$
Note
The z-score tells us how many standard deviations a value is from the mean.
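To find probabilities for normal distributions, we typically convert the value to a z-score and then find the corresponding probability from a standard normal table or a calculator.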
Example
If heights in a population are normally distributed with μ = 170 cm and σ = 10 cm, what's the probability of a person being taller than 185 cm?
$z = \frac{185 - 170}{10} = 1.5$
$P(Z > 1.5) = 1 - P(Z \le 1.5) \approx 1 - 0.9332 = 0.0668$
So, about 6.68% of the population is taller than 185 cm.
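This result can be reproduced without a table using `statistics.NormalDist` from Python's standard library. The sketch computes the probability both directly and via the z-score to show that the two routes agree.

```python
from statistics import NormalDist

mu, sigma = 170, 10  # mean and standard deviation of heights in cm
x = 185

# Standardize the value.
z = (x - mu) / sigma

# Upper-tail probability, computed directly and via the standard normal.
p_taller = 1 - NormalDist(mu, sigma).cdf(x)
p_from_z = 1 - NormalDist().cdf(z)

print(z, round(p_taller, 4))  # 1.5 0.0668
```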
Bayes' theorem relates conditional probabilities:
$P(A|B) = \frac{P(B|A)P(A)}{P(B)}$
This is particularly useful when we want to update probabilities based on new evidence.
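As an illustration of updating a probability with new evidence, the sketch below applies Bayes' theorem to a hypothetical medical test. All three input probabilities are invented for the example, and P(B) is obtained with the law of total probability, which the text above does not state explicitly.

```python
from fractions import Fraction

# Hypothetical numbers chosen only for illustration.
p_disease = Fraction(1, 100)             # P(A): prior probability of disease
p_pos_given_disease = Fraction(95, 100)  # P(B|A): positive test if diseased
p_pos_given_healthy = Fraction(5, 100)   # P(B|not A): false positive rate

# P(B) by the law of total probability.
p_pos = (
    p_pos_given_disease * p_disease
    + p_pos_given_healthy * (1 - p_disease)
)

# Bayes' theorem: P(A|B) = P(B|A) P(A) / P(B).
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos

print(p_disease_given_pos, round(float(p_disease_given_pos), 3))  # 19/118 0.161
```

Under these assumed numbers, a positive result raises the probability of disease from 1% to about 16%, far below the 95% that confusing P(A|B) with P(B|A) would suggest.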
For continuous random variables, we use probability density functions (PDFs) instead of probability mass functions. The probability of a specific value is always 0; we find probabilities for ranges of values by integrating the PDF.
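The idea of integrating a PDF can be sketched numerically. The example below uses a normal distribution with an illustrative mean of 170 and standard deviation of 10, approximates the area under its PDF between 160 and 180 with a simple midpoint rule, and compares the result with the difference of CDF values. The helper `integrate` is a hypothetical name written for this sketch.

```python
from statistics import NormalDist

dist = NormalDist(mu=170, sigma=10)  # illustrative parameters


def integrate(f, a, b, steps=10_000):
    """Midpoint-rule approximation of the integral of f from a to b."""
    width = (b - a) / steps
    return sum(f(a + (i + 0.5) * width) for i in range(steps)) * width


# Probability that X lies between 160 and 180.
area = integrate(dist.pdf, 160, 180)
exact = dist.cdf(180) - dist.cdf(160)
print(round(area, 4), round(exact, 4))  # 0.6827 0.6827

# A range of zero width has zero area, so P(X = 170) is 0.
print(integrate(dist.pdf, 170, 170))  # 0.0
```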
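For a continuous random variable X with PDF f(x), the probability that X lies between a and b is the area under the PDF over that interval:
$P(a \le X \le b) = \int_a^b f(x)\,dx$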
If Y = aX + b, where X is a random variable and a and b are constants:
$E(Y) = aE(X) + b$
$\operatorname{Var}(Y) = a^2 \operatorname{Var}(X)$
Tip
This is particularly useful when standardizing normal distributions!
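The standard identities for a linear transformation, E(aX + b) = aE(X) + b and Var(aX + b) = a²Var(X), can be checked on a fair die. The constants a = 2 and b = 3 and the helper names are arbitrary choices for this sketch.

```python
from fractions import Fraction

# Distribution of a fair six-sided die X.
die = {face: Fraction(1, 6) for face in range(1, 7)}


def expectation(dist):
    return sum(x * p for x, p in dist.items())


def variance(dist):
    mean = expectation(dist)
    return sum((x - mean) ** 2 * p for x, p in dist.items())


a, b = 2, 3  # arbitrary constants, with a nonzero so values stay distinct

# Distribution of Y = aX + b: same probabilities, transformed values.
transformed = {a * x + b: p for x, p in die.items()}

print(expectation(transformed), a * expectation(die) + b)  # 10 10
print(variance(transformed), a**2 * variance(die))         # 35/3 35/3
```

Standardizing is the special case a = 1/σ and b = −μ/σ, which gives a mean of 0 and a variance of 1.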
Note
Throughout your study of Statistics & Probability, remember that while calculations are important, interpreting results in context is crucial. Always consider what your statistical findings mean in the real world.