A Data Analysis Study of Temperature Based on CALCOFI Data.

Model Accuracy (R²)

85.6%

Variation in temperature explained by our final model.

Margin of Error

± 1.16°C

Average prediction error.

Primary Driver

Depth

The dominant structural filter of the ocean.

1. Surface Level: Geographic Spread & Seasons

Geography and time of year dictate the initial thermal behavior and "noise" of the water mass at the surface.

Visualization 1: A Clear Thermal Gradient

Water in the southern regions begins with significantly higher surface temperatures (20-25°C) than in the highly stable northern regions (10-15°C). This elevated heat comes with far more surface variability, which we call "atmospheric noise".

Interact with the map to explore individual station temperature averages.

Seasonal Impact on Temperature vs. Salinity — Faceted plot showing the relationship between Temperature and Salinity across seasons.

Visualization 2: Does the time of year change the rules?

The time of year changes the rules only at the surface. Notice the shallow water data clusters (right); they clearly shift across the seasons (Fall, Spring, Summer, Winter).

However, look at the deep ocean (>200m) on the left. The trend lines overlap almost exactly across all four seasons. This is strong evidence that the deep ocean is "timeless": thermally isolated from seasonal solar heating.

2. The Depth Divide: Two Ocean Worlds

The ocean is not one continuous gradient; the data split into two distinct zones at the 200-meter mark.

Visualization 3: Average Temp by Depth

The shallow ocean layer (0-200m) is an open system, almost twice as warm as the deeper layer.

Our ANOVA (F-value: 715,208, p < 2e-16) shows that mean temperature differs sharply between the layers above and below 200m. Once past this line, the ocean behaves like a closed system governed by density.
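For readers who want to see how an F-value of this kind is built, here is a minimal sketch of a one-way ANOVA between a shallow and a deep group. The analysis itself used R; this Python version is only illustrative, the function name `one_way_anova_f` is hypothetical, and the six temperatures are made up rather than taken from CALCOFI.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of numeric groups."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    # Between-group sum of squares: distance of each group mean from the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: spread of each observation around its own group mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Made-up temperatures in °C (assumption, NOT CALCOFI measurements).
shallow = [14.0, 15.0, 16.0]  # 0-200 m
deep = [6.0, 7.0, 8.0]        # below 200 m

f_value = one_way_anova_f([shallow, deep])
print(f_value)
```

A large F means the gap between the group means is large relative to the scatter inside each group. With very large samples, even modest gaps produce very large F-values, so the size of the temperature difference itself is worth reporting alongside the test.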

Visualization 4: The Flaw in Simple Linear Models

Temp vs Depth Underfitting — A single variable linear regression (red line) fails to capture the true ocean curve.

If we try to predict temperature from depth alone, the model underfits. The ocean's temperature drops steeply near the surface and then levels out, a curve no straight line can follow, which is why we need a multi-variable approach.
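The underfitting can be reproduced on a toy profile. The sketch below fits an ordinary least-squares line to a synthetic temperature curve that decays with depth and then inspects the residuals. The exponential profile and its constants are assumptions chosen only to mimic the shape described here; they are not fitted to CALCOFI data.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Synthetic profile (assumption): warm surface, fast drop, flat deep water.
depths = list(range(0, 1001, 50))
temps = [4.0 + 14.0 * math.exp(-d / 150.0) for d in depths]

intercept, slope = fit_line(depths, temps)
residuals = [t - (intercept + slope * d) for d, t in zip(depths, temps)]

for d, r in zip(depths, residuals):
    print(f"{d:5d} m  residual {r:+.2f} °C")
```

The residuals are positive at both ends of the profile and negative in between. That systematic pattern, rather than random scatter, is the signature of a line forced onto a curve.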

Visualization 5: Salinity Stabilization

Temperature vs Salinity by Depth — The chaotic surface vs. the strict laws of the deep.

In shallow water (right), temperature and salinity form a scattered cloud—decoupled by atmospheric noise. Below 200m (left), depth acts as a quality-control filter, forcing water to be consistently cold and salty.

3. Chemical Correlation & Predictive Modeling

Combining all variables to hunt for the optimal predictive equation.

Correlation Heatmap — Strong interdependencies among oceanic predictors.

Visualization 6: The Multicollinearity Trap

Nutrients (Phosphate, Nitrate, Silicate) exhibit near-perfect correlations with each other (0.97 to 0.99) and an inverse relationship with Oxygen (-0.98).
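To put these correlations in terms of the variance inflation factor (VIF), note that a predictor's VIF is 1 / (1 - R²), where R² comes from regressing it on the other predictors. With a single other predictor, R² equals the squared pairwise correlation, so 1 / (1 - r²) is a lower bound on the VIF once more predictors are included. The sketch below is illustrative Python, not the R code used in the study, and `pairwise_vif_floor` is a hypothetical helper name.

```python
def pairwise_vif_floor(r):
    """Lower bound on a predictor's VIF implied by one pairwise correlation r."""
    return 1.0 / (1.0 - r ** 2)


# Correlation magnitudes quoted in the text above.
for r in (0.97, 0.98, 0.99):
    print(f"|r| = {r:.2f}  ->  VIF of at least {pairwise_vif_floor(r):.1f}")
```

Common rules of thumb treat VIF values above 5 or 10 as a warning sign, so correlations of this size are well inside the range where coefficient estimates become unstable.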

To combat this redundancy, we tested Ridge Regression against Standard Linear Regression. Surprisingly, the Standard model had the lower error on unseen test data (MAE: 0.899 vs 0.962), which suggests the relationships among these chemicals are stable enough that the Ridge penalty brought no benefit.
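The comparison rests on mean absolute error (MAE) computed on held-out rows. A minimal sketch of that metric follows. The temperatures and both sets of predictions are invented for illustration and do not reproduce the 0.899 and 0.962 figures; the study computed its metrics in R.

```python
def mean_absolute_error(actual, predicted):
    """Average size of the prediction error, in the units of the target."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Invented held-out temperatures and predictions in °C (assumptions).
actual = [10.0, 12.0, 15.0, 8.0]
pred_standard = [10.5, 11.5, 14.0, 8.0]
pred_ridge = [11.0, 11.0, 14.0, 8.5]

mae_standard = mean_absolute_error(actual, pred_standard)
mae_ridge = mean_absolute_error(actual, pred_ridge)
print(f"standard: {mae_standard:.3f}  ridge: {mae_ridge:.3f}")
```

Because both models are scored on the same unseen rows, the lower MAE identifies the model that generalizes better on this data, independent of how well either fit the training set.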

Visualization 7: Actual vs. Predicted Results

The Final Predictive Equation:
Temp = -53.61 - (0.014*depth) + (2.26*salinity) - (5.49*phosphate) - (0.13*nitrate) + (0.13*silicate) - (0.04*o2sat)
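The equation can be evaluated directly. The sketch below wraps it in a Python function using the coefficients exactly as printed; the function name, the argument units noted in the comments, and the sample input values are assumptions made for illustration.

```python
def predict_temp(depth, salinity, phosphate, nitrate, silicate, o2sat):
    """Evaluate the final equation with the coefficients as printed above.

    Units are assumed, not stated in the source: depth in meters,
    o2sat as percent saturation, nutrients in their CALCOFI units.
    """
    return (
        -53.61
        - 0.014 * depth
        + 2.26 * salinity
        - 5.49 * phosphate
        - 0.13 * nitrate
        + 0.13 * silicate
        - 0.04 * o2sat
    )


# Hypothetical sample (assumption, not a CALCOFI record).
example = predict_temp(
    depth=100, salinity=33.5, phosphate=1.0, nitrate=10.0, silicate=10.0, o2sat=80.0
)
print(f"{example:.2f} °C")  # 12.01 °C
```

Working through one row by hand like this is also a quick check on the signs: deeper water and higher phosphate pull the prediction down, while higher salinity pushes it up.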

This scatterplot visualizes the model's fit (R² = 85.6%). For cooler temperatures (5°C to 15°C), predictions track the actual values closely. Above 15°C, however, the model under-predicts the heat, struggling to capture the chaotic sun-driven spikes of the shallow surface water.
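Both observations, the overall R² and the under-prediction in warm water, can be checked numerically. The sketch below computes R² and the mean signed error within a temperature band. All values are invented to mimic the pattern described and are not the study's predictions; `mean_bias` is a hypothetical helper.

```python
def r_squared(actual, predicted):
    """Share of variance in actual explained by predicted."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def mean_bias(actual, predicted, low, high):
    """Mean of (predicted - actual) for rows with low <= actual < high."""
    diffs = [p - a for a, p in zip(actual, predicted) if low <= a < high]
    return sum(diffs) / len(diffs)


# Invented temperatures in °C (assumptions): close fit when cool, low when warm.
actual = [6.0, 9.0, 12.0, 14.0, 17.0, 20.0, 23.0]
predicted = [6.2, 8.9, 12.1, 13.8, 16.0, 18.5, 21.0]

r2 = r_squared(actual, predicted)
cool_bias = mean_bias(actual, predicted, 5.0, 15.0)
warm_bias = mean_bias(actual, predicted, 15.0, 30.0)
print(f"R2 = {r2:.3f}, cool bias = {cool_bias:+.2f}, warm bias = {warm_bias:+.2f}")
```

A negative bias in the warm band means predictions sit below the observed values there, which is what the points falling under the diagonal in an actual-vs-predicted plot represent.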

Conclusion

Ultimately, this research confirms that data-driven analysis is one of the best approaches to studying temperature. From observing the Pacific Current to analyzing deep-sea shifts, this study validates the power of modeling oceanic variables together.

Packages & Technical Methodology

Database queries, encoding fixes, and the software ecosystem utilized.

Software Ecosystem: Packages & Modules

MSQL Modules Utilized:

Module name	Why do we use it
Task	Imports files into the database as tables quickly.
Queries	Join queries that combine the imported tables.
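As an illustration of the join step, the sketch below builds two tiny tables and joins bottle rows to their parent cast. SQLite (through Python's standard `sqlite3` module) stands in for the database actually used, and every table name, column name, and value is hypothetical.

```python
import sqlite3

# In-memory SQLite database as a stand-in (assumption) for the project database.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE cast_data (cst_cnt INTEGER PRIMARY KEY, cruise_id TEXT, lat_dec REAL);
    CREATE TABLE bottle (btl_cnt INTEGER PRIMARY KEY, cst_cnt INTEGER,
                         depth_m REAL, temp_c REAL);
    INSERT INTO cast_data VALUES (1, 'cruise_A', 33.5), (2, 'cruise_B', 36.1);
    INSERT INTO bottle VALUES (10, 1, 0, 18.2), (11, 1, 250, 8.1), (12, 2, 0, 13.4);
    """
)

# Each bottle sample inherits the cruise and latitude of the cast it came from.
rows = conn.execute(
    """
    SELECT b.btl_cnt, c.cruise_id, c.lat_dec, b.depth_m, b.temp_c
    FROM bottle AS b
    JOIN cast_data AS c ON c.cst_cnt = b.cst_cnt
    ORDER BY b.btl_cnt
    """
).fetchall()
conn.close()

for row in rows:
    print(row)
```

The result has one row per bottle sample, which is the shape a merged dataset needs before per-sample modeling.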

R Packages Utilized:

Package	Why do we use it
tidyverse	Reading CSV, cleaning, plotting, tables
janitor	Standardize messy column names
stringi	Safely convert strange characters to UTF-8
rsample	Train/Test split, including grouped split by cruise
yardstick	Metrics (R², RMSE, MAE) to compare models fairly
broom	Tidy model output for reporting
ppcor	Partial correlation (controls for depth confounding)
car	VIF for multicollinearity
glmnet	Ridge/Lasso regression (stable with overlapping predictors)
Matrix	Sparse model matrices
leaflet	Interactive Geographic data mapping
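The grouped split by cruise listed for rsample keeps every row from a given cruise on the same side of the train/test boundary, so the test set contains only cruises the model has never seen. A minimal Python sketch of that idea follows; it is not rsample's implementation, and `grouped_split`, the field names, and the toy rows are all hypothetical.

```python
import random


def grouped_split(rows, group_key, test_fraction=0.25, seed=42):
    """Split rows so that each group lands entirely in train or entirely in test."""
    groups = sorted({row[group_key] for row in rows})
    rng = random.Random(seed)
    rng.shuffle(groups)
    # Hold out whole groups, at least one.
    n_test = max(1, round(len(groups) * test_fraction))
    test_groups = set(groups[:n_test])
    train = [r for r in rows if r[group_key] not in test_groups]
    test = [r for r in rows if r[group_key] in test_groups]
    return train, test


# Toy data (assumption): four cruises, three depths each.
rows = [{"cruise": c, "depth_m": d} for c in ("A", "B", "C", "D") for d in (0, 100, 300)]
train, test = grouped_split(rows, "cruise")

print(sorted({r["cruise"] for r in train}), sorted({r["cruise"] for r in test}))
```

Splitting by row instead would let samples from the same cruise appear on both sides, which tends to make test error look better than it would be on a genuinely new cruise.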

Project Files & Appendix

Reproduce these findings by downloading the datasets, scripts, and documentation below.

Data Sets For this Analysis

The CALCOFI raw bottle data, raw cast data, and merged datasets are provided as zipped files via the link below, because they are too large to upload to the GitHub repo.

Download the Zipped Files

Data CodeBook

CALCOFI variable definitions.

Download Excel

DataSet Readme instructions

CALCOFI DataSet Readme instructions

Download pdf

Technical Documentation

CALCOFI Technical Documentation Step by Step