
Quantitative Data Analysis and Visualisation — Comprehensive Study Notes

These study notes provide a concise summary of quantitative data analysis and visualisation, covering key concepts, definitions, and worked examples for quick and effective review.


📊 Descriptive statistics — key concepts

Descriptive statistics summarise numerical data using measures of location and dispersion. Common summaries: mean, median, mode, variance, standard deviation, minimum, maximum, range, quartiles. Visual summaries include histograms and boxplots.

➗ Measures of central tendency and dispersion

  • Sample mean: $\bar{x}=\frac{1}{n}\sum_{i=1}^n x_i$. This is the arithmetic average and is sensitive to outliers.
  • Median: the middle value when data are ordered. For even $n$, average the two middle values.
  • Mode: the most frequent value(s).
  • Sample variance: $S^2=\frac{1}{n-1}\sum_{i=1}^n (x_i-\bar{x})^2$. Use $n-1$ for an unbiased estimator of the population variance.
  • Sample standard deviation: $S=\sqrt{S^2}$.
  • Range: $\text{range}=\max-\min$. Gives the overall spread but is sensitive to extremes.
  • Quartiles (Q1, Q2, Q3) split the ordered data into four equal parts; Q2 is the median. These are used in boxplots.
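These summaries can all be computed with Python's standard library; a minimal sketch, using an illustrative made-up sample:

```python
import statistics

data = [4, 8, 6, 5, 3, 8, 9, 4, 8, 7]  # illustrative sample

mean = statistics.mean(data)      # arithmetic average
median = statistics.median(data)  # middle value of the ordered data
mode = statistics.mode(data)      # most frequent value
var = statistics.variance(data)   # sample variance (n-1 denominator)
sd = statistics.stdev(data)       # sample standard deviation
rng = max(data) - min(data)       # range = max - min

# Quartiles Q1, Q2, Q3 (default 'exclusive' method; Q2 equals the median)
q1, q2, q3 = statistics.quantiles(data, n=4)

print(mean, median, mode, var, rng, (q1, q2, q3))
```

Note that `statistics.variance` uses the $n-1$ denominator from the formula above; `statistics.pvariance` would give the population version with denominator $n$.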

📈 Boxplots and histograms — interpretation

  • Histogram: shows the distribution shape and preserves the original data bins. Useful for seeing modality and approximate distribution form.
  • Boxplot: highlights median, quartiles, whiskers (approximate spread), and outliers. Good for comparisons across groups.
  • Skewness from plots: A longer left whisker or long left tail indicates left-skew, a longer right whisker/right tail indicates right-skew, and symmetric whiskers indicate approximate symmetry.
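Boxplot whiskers and outlier points are conventionally based on the 1.5×IQR (Tukey fence) rule; a minimal sketch with illustrative data containing one extreme value:

```python
import statistics

data = [1, 2, 3, 4, 5, 100]  # illustrative data with one extreme value

# Quartiles via the 'inclusive' method (interpolates between observed values)
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Tukey fences: points beyond 1.5 * IQR from the quartiles are flagged as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print((q1, q2, q3), outliers)
```

This is the same rule most plotting libraries use to decide where the whiskers end and which points to draw individually.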

🧪 Hypothesis testing for difference of two means (large samples)

When comparing two independent groups with large sample sizes ($n\ge 30$), a z-test for the difference of means can be used (using the sample standard deviations as estimates of the population standard deviations).

  • Null hypothesis: $H_0:\ \mu_1-\mu_2=0$ (no difference).
  • Alternative (one-sided or two-sided) depending on the research question (e.g. $H_1:\ \mu_1-\mu_2>0$ if expecting group 1 to be larger).
  • Test statistic: $z=\dfrac{(\bar{x}_1-\bar{x}_2)-(\mu_1-\mu_2)}{\sqrt{S_1^2/n_1+S_2^2/n_2}}$. Under $H_0$, take $\mu_1-\mu_2=0$.
  • Decision: compare $z$ to the critical value $z_{\alpha}$ (one-sided) or use the p-value from the standard normal distribution.

Example (numbers from a practice problem): $n_1=n_2=40$, $\bar{x}_1=85$, $\bar{x}_2=78$, $S_1=12$, $S_2=10$. Compute $z=\frac{85-78}{\sqrt{12^2/40+10^2/40}}\approx 2.834$. For a one-sided test at $\alpha=0.05$, the critical value is $z\approx 1.645$, so reject $H_0$ and conclude that group 1 has the larger mean.
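The worked example above can be checked in Python with the standard library (`NormalDist` plays the role of R's `qnorm`/`pnorm`; the numbers are those from the practice problem):

```python
import math
from statistics import NormalDist

# Summary statistics from the worked example
n1 = n2 = 40
xbar1, xbar2 = 85, 78
s1, s2 = 12, 10

# z statistic for the difference of two means (large samples)
se = math.sqrt(s1**2 / n1 + s2**2 / n2)
z = (xbar1 - xbar2) / se

# One-sided critical value and p-value from the standard normal
z_crit = NormalDist().inv_cdf(0.95)  # approx 1.645
p_value = 1 - NormalDist().cdf(z)

print(round(z, 3), z > z_crit)  # reject H0 when z exceeds the critical value
```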

⚖️ Interpreting test results

  • If $|z|>z_{\alpha/2}$ (two-sided) or $z>z_{\alpha}$ (one-sided), reject $H_0$ at level $\alpha$.
  • Report conclusion in context (e.g. "statistically significant evidence that young professionals spend more on groceries than retired individuals at the 5% level").

🔢 Chi-squared test for independence (contingency tables)

Use this to test association between two categorical variables (e.g. packaging type and spoilage).

  • Observed table: cells $O_{ij}$, with row totals and column totals.
  • Expected counts under independence: $E_{ij}=\frac{(\text{row total}_i)(\text{col total}_j)}{n}$.
  • Chi-squared statistic: $\chi^2=\sum_{i}\sum_{j} \frac{(O_{ij}-E_{ij})^2}{E_{ij}}$.
  • Degrees of freedom: $(r-1)(c-1)$ for an $r\times c$ table.
  • Decision: compare $\chi^2$ to the critical value from $\chi^2_{\nu}$ or use the p-value.

Example (numbers from the exam): Observed: Cardboard spoiled 100, not spoiled 119 (row total 219); Plastic spoiled 200, not spoiled 153 (row total 353); column totals: spoiled 300, not spoiled 272; total $n=572$. Expected example: $E(\text{Cardboard, spoiled})=\frac{219\times 300}{572}\approx 114.86$. Calculated $\chi^2\approx 6.55$ with $df=1$. The critical value at $\alpha=0.05$ is $\approx 3.841$, so reject independence and conclude there is an association between packaging and spoilage.
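The expected counts and the chi-squared statistic for this table can be computed from scratch in a few lines; a minimal sketch using the exam numbers:

```python
# Observed counts: rows = packaging (cardboard, plastic), cols = spoiled / not spoiled
observed = [[100, 119],
            [200, 153]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected counts under independence, accumulated into the chi-squared statistic
chi2 = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / n  # (row total)(col total)/n
        chi2 += (o - e) ** 2 / e

df = (len(observed) - 1) * (len(observed[0]) - 1)
print(round(chi2, 2), df)  # compare chi2 to the critical value 3.841 (df=1, alpha=0.05)
```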

📉 Multiple linear regression — interpretation of R output

  • Model form: $E(Y)=\beta_0+\beta_1 X_1+\beta_2 X_2+\dots$
  • Coefficient interpretation: each coefficient is the expected change in $Y$ for a one-unit increase in that predictor, holding the others constant.
  • From the example R output: estimated model $E(\text{sales})=2.938889+0.045765(\text{TV})+0.188530(\text{radio})-0.001037(\text{newspaper})$.
  • Significance: use the t-statistics and p-values for each coefficient. Very small p-values (e.g. < 0.001) indicate strong evidence that the coefficient differs from zero.
  • Prediction: plug predictor values into the fitted equation for a point prediction. Example: for TV=10, radio=10, newspaper=10, $E(\text{sales})=2.938889+0.045765\times10+0.188530\times10-0.001037\times10\approx 5.271469$ (thousand units).
  • Residual: observed minus predicted. If observed sales = 10 (thousand), residual $=10-5.271469\approx 4.728531$ (thousand).
  • R-squared: proportion of variance explained by the model. A higher $R^2$ indicates a better fit, but beware of overfitting. Prefer simpler models if they explain similar variance (parsimony).
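The point prediction and residual from the fitted equation can be reproduced directly (the coefficients are those quoted from the R output; `predict_sales` is a hypothetical helper name used for illustration):

```python
# Fitted coefficients quoted from the R output above
b0, b_tv, b_radio, b_news = 2.938889, 0.045765, 0.188530, -0.001037

def predict_sales(tv, radio, newspaper):
    """Point prediction from the fitted multiple regression equation."""
    return b0 + b_tv * tv + b_radio * radio + b_news * newspaper

pred = predict_sales(10, 10, 10)  # each advertising budget set to 10 (thousand)
residual = 10 - pred              # observed sales of 10 (thousand) minus predicted

print(round(pred, 6), round(residual, 6))
```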

Model selection example: $R^2$ for TV+radio+newspaper = 0.8972 and for TV+radio = 0.8972. The additional predictor (newspaper) gives no improvement in $R^2$, so prefer the simpler TV+radio model.

🧮 Degrees of freedom and sample size from regression output

  • In the linear model output, residual degrees of freedom $=n-k$, where $k$ is the number of parameters (including the intercept). So $n=\text{residual df}+k$.
  • Example: residual df = 196, $k=4$ (intercept + 3 predictors), so $n=196+4=200$ regions.

🧾 Useful formulas (formula-sheet style)

  • Standard error of the sample mean: $\text{SE}(\bar{X})=\sqrt{S^2/n}$.
  • Standard error for the difference of two means: $\sqrt{S_1^2/n_1+S_2^2/n_2}$.
  • Large-sample $100(1-\alpha)\%$ CI for the mean: $\left(\bar{X}-z_{\alpha/2}\frac{S}{\sqrt{n}},\ \bar{X}+z_{\alpha/2}\frac{S}{\sqrt{n}}\right)$.
  • Hypothesis test for a proportion uses standard error $\sqrt{\pi_0(1-\pi_0)/n}$ under $H_0$.
  • Chi-squared statistic: $\chi^2=\sum\frac{(O_k-E_k)^2}{E_k}$.
  • Logistic regression form: $\log\left(\frac{P(Y=1)}{P(Y=0)}\right)=\beta_0+\beta_1X_1+\dots+\beta_pX_p$.
  • Poisson regression (with offset $t$): $\log(E(Y))=\log(t)+\beta_0+\beta_1X_1+\dots+\beta_pX_p$.
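As a check on the CI formula, a large-sample 95% interval for a mean can be computed with the standard library (the summary numbers are illustrative, reusing the earlier example's mean 85, s.d. 12, $n=40$):

```python
import math
from statistics import NormalDist

# Illustrative summary statistics: sample mean 85, sample s.d. 12, n = 40
xbar, s, n = 85, 12, 40
alpha = 0.05

z = NormalDist().inv_cdf(1 - alpha / 2)  # approx 1.959964, same as qnorm(0.975) in R
se = s / math.sqrt(n)                    # standard error of the sample mean
ci = (xbar - z * se, xbar + z * se)      # large-sample 95% confidence interval

print(round(ci[0], 3), round(ci[1], 3))
```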

🔎 Reference z / t values (common quantiles)

  • One-sided $\alpha=0.05$ critical value: $z\approx 1.645$.
  • Two-sided $\alpha=0.05$ critical value: $z_{0.975}\approx 1.959964$.
  • Useful R-derived values: qnorm(0.95) $\approx 1.644854$, qnorm(0.975) $\approx 1.959964$.

✅ Practical tips for exam-style problems

  • Always state hypotheses in context and specify whether the test is one-sided or two-sided.
  • Show formula, substitute numbers, compute the test statistic, state critical value or p-value, then give a contextual conclusion.
  • For contingency tables, check expected counts (all should be reasonably large for the Chi-squared approximation to hold).
  • For regression interpretation, comment on significance (p-values), the sign and magnitude of the coefficients, goodness-of-fit ($R^2$), and practical uncertainty (residuals and prediction intervals, not just point predictions).

These notes summarise the core tools used in the sample exam: descriptive summaries and plots, z-tests for mean differences, Chi-squared tests for independence, and interpretation of multiple linear regression output.
