A Data Analysis Study of Temperature Based on CalCOFI Data

Afsana Mimi

Professor Howard Everson

Data Analysis and Visualization, The Graduate Center

Decoding the California Current

Can we accurately predict ocean temperatures based entirely on chemical and physical markers? This dashboard explores the CalCOFI dataset to tell the story of the California Current System.

Moving beyond standard single-variable analysis, we bridge the gap between raw, chaotic oceanographic data and actionable structural insights, showing that the ocean behaves not as one random body of water but as distinctly organized chemical layers.

Model Accuracy (R²)

85.6%

Variation in temperature explained by our final model.

Margin of Error

± 1.16°C

Average prediction error.

Primary Driver

Depth

The dominant structural filter of the ocean.

1. Surface Level: Geographic Spread & Seasons

Geography and time of year dictate the initial thermal behavior and "noise" of the water mass at the surface.

Visualization 1: A Clear Thermal Gradient

Water in the southern regions starts with significantly higher surface temperatures (20-25°C). This added heat makes surface readings far noisier ("atmospheric noise") than in the highly stable northern regions (10-15°C).

Interact with the map to explore individual station temperature averages.

Seasonal Impact on Temperature vs. Salinity
Faceted plot showing the relationship between Temperature and Salinity across seasons.

Visualization 2: Does the time of year change the rules?

The time of year only changes the rules at the surface. Notice the shallow water data clusters (right); they clearly shift across the seasons (Fall, Spring, Summer, Winter).

However, look at the deep ocean (>200m) on the left. The trend lines overlap almost exactly across all four seasons. This indicates that the deep ocean is effectively "timeless": thermally isolated from seasonal solar heating.

2. The Depth Divide: Two Ocean Worlds

In this dataset, the ocean does not behave as one continuous gradient; it separates into two distinct zones around the 200-meter mark.

Visualization 3: Average Temp by Depth

The shallow ocean layer (0-200m) is an open system, almost twice as warm as the deeper layer.

Our ANOVA (F-value: 715,208, p < 2e-16) indicates a large, statistically significant difference in mean temperature between the two layers, consistent with a structural shift at 200m. Once past this line, the ocean behaves like a closed system governed by density.
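As a rough illustration of what the F-value measures, the following Python sketch computes a one-way ANOVA F statistic by hand: the variance between the depth-layer means divided by the variance within the layers. The temperature values are hypothetical placeholders, not CalCOFI measurements, and the function name is an assumption made for this example.

```python
from statistics import mean

def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of numeric groups."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    k = len(groups)        # number of groups (here: depth layers)
    n = len(all_values)    # total number of observations
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within

# Hypothetical temperatures in °C, for illustration only.
shallow = [14.2, 15.8, 13.1, 16.4, 12.9]   # 0-200 m
deep = [6.1, 7.3, 5.4, 6.8, 5.9]           # below 200 m
f_value = one_way_anova_f([shallow, deep])
print(f"F = {f_value:.1f}")
```

A large F means the gap between layer means is large relative to the scatter inside each layer. With very many observations, even modest differences produce enormous F-values, so the effect size (here, the difference in mean temperature) is worth reading alongside it.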

Average Temperature by Depth Layer

Visualization 4: The Flaw in Simple Linear Models

Temp vs Depth Underfitting
A single-variable linear regression (red line) fails to capture the true ocean curve.

If we try to predict temperature from depth alone, the model underfits. The ocean's temperature drops steeply through the upper layer and then levels out, a curve that a straight line cannot follow. This motivates a multi-variable approach.
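The underfitting can be reproduced on synthetic data. The sketch below fits a straight line to a made-up temperature profile that decays exponentially with depth; the profile shape and its constants are assumptions chosen only to mimic a steep drop followed by a plateau, not values taken from CalCOFI.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def r_squared(actual, predicted):
    """Share of variance in `actual` explained by `predicted`."""
    mean_a = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Synthetic profile (assumption): 18 °C at the surface, decaying toward 4 °C.
depths = list(range(0, 1001, 50))
temps = [4 + 14 * math.exp(-d / 150) for d in depths]

slope, intercept = fit_line(depths, temps)
predicted = [intercept + slope * d for d in depths]
residuals = [t - p for t, p in zip(temps, predicted)]
line_r2 = r_squared(temps, predicted)

print(f"slope = {slope:.4f} °C/m, R² = {line_r2:.2f}")
print(f"residual at surface = {residuals[0]:+.2f} °C")
```

On this synthetic curve the line's R² stays well below 1 even though the data contain no noise at all, and the residuals change sign systematically along the profile, which is the signature of a model with the wrong shape rather than one that is merely imprecise.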

Visualization 5: Salinity Stabilization

Temperature vs Salinity by Depth
The chaotic surface vs. the strict laws of the deep.

In shallow water (right), temperature and salinity form a scattered cloud, decoupled by atmospheric noise. Below 200m (left), depth acts like a quality-control filter: the water is consistently cold and salty.

3. Chemical Correlation & Predictive Modeling

Combining all variables to hunt for the optimal predictive equation.

Correlation Heatmap
Strong interdependencies among oceanic predictors.

Visualization 6: The Multicollinearity Trap

Nutrients (Phosphate, Nitrate, Silicate) exhibit near-perfect correlations with each other (0.97 to 0.99) and an inverse relationship with Oxygen (-0.98).
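To get a feel for how severe correlations of this size are, the sketch below converts a pairwise correlation into a variance inflation factor (VIF) using the two-predictor formula VIF = 1 / (1 - r²). This is a simplification: with more than two predictors the VIF is computed from a regression on all the others and can only be as large or larger. The function name is hypothetical.

```python
def vif_two_predictors(r):
    """VIF of a predictor whose only companion predictor has correlation r."""
    return 1.0 / (1.0 - r ** 2)

# Correlation magnitudes in the range reported for the nutrient variables.
for r in (0.97, 0.98, 0.99):
    print(f"|r| = {r:.2f} -> VIF = {vif_two_predictors(r):.1f}")
```

A correlation of 0.97 already gives a VIF of roughly 17, and 0.99 gives roughly 50, well above the value of 10 that is often used as a rule-of-thumb warning level.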

To combat this redundancy, we compared ridge regression with standard linear regression. Surprisingly, the standard model performed better on unseen test data (MAE: 0.899 vs 0.962), which suggests that the relationships among these chemical variables are stable enough that regularization does not help.
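The comparison can be sketched in miniature. The Python example below fits two nearly collinear predictors with and without a ridge penalty and scores both fits on a small held-out set using mean absolute error. All numbers are hypothetical, the predictors are not standardized, and the penalty `lam` is not on the same scale as the lambda used by glmnet, so this only illustrates the mechanics.

```python
def center(values):
    """Return the mean-centered values and their mean."""
    m = sum(values) / len(values)
    return [v - m for v in values], m

def fit_two_predictors(x1, x2, y, lam=0.0):
    """Least squares with an optional ridge penalty `lam` on the two slopes."""
    c1, m1 = center(x1)
    c2, m2 = center(x2)
    cy, my = center(y)
    s11 = sum(u * u for u in c1) + lam      # penalty is added on the diagonal
    s22 = sum(v * v for v in c2) + lam
    s12 = sum(u * v for u, v in zip(c1, c2))
    t1 = sum(u * w for u, w in zip(c1, cy))
    t2 = sum(v * w for v, w in zip(c2, cy))
    det = s11 * s22 - s12 * s12
    b1 = (t1 * s22 - s12 * t2) / det        # Cramer's rule for the 2x2 system
    b2 = (s11 * t2 - s12 * t1) / det
    return my - b1 * m1 - b2 * m2, b1, b2   # intercept is not penalized

def mean_absolute_error(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def predict(coefs, x1, x2):
    b0, b1, b2 = coefs
    return [b0 + b1 * u + b2 * v for u, v in zip(x1, x2)]

# Hypothetical training data: two nearly collinear predictors and a response.
train_x1 = [0.5, 1.0, 1.6, 2.1, 2.4, 3.0]
train_x2 = [0.6, 0.9, 1.7, 2.0, 2.5, 2.9]
train_y = [16.0, 13.8, 11.9, 9.7, 9.0, 6.5]
# Hypothetical held-out data.
test_x1 = [0.8, 1.9, 2.7]
test_x2 = [0.7, 2.0, 2.8]
test_y = [14.9, 10.6, 7.6]

ols = fit_two_predictors(train_x1, train_x2, train_y)
ridge = fit_two_predictors(train_x1, train_x2, train_y, lam=1.0)

for name, coefs in (("standard", ols), ("ridge", ridge)):
    mae = mean_absolute_error(test_y, predict(coefs, test_x1, test_x2))
    print(f"{name}: slopes = ({coefs[1]:.2f}, {coefs[2]:.2f}), test MAE = {mae:.3f}")
```

Ridge always pulls the slope vector toward zero, which stabilizes coefficients when predictors overlap. Whether that trade improves held-out error depends on the data, which is why the comparison has to be made on a test set.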

Visualization 7: Actual vs. Predicted Results

The Final Predictive Equation:
Temp = -53.61 - (0.014*depth) + (2.26*salinity) - (5.49*phosphate) - (0.13*nitrate) + (0.13*silicate) - (0.04*o2sat)
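For readers who want to try the equation, here is a minimal Python sketch that evaluates it for one water sample. The coefficients are copied from the equation above; the function name, the sample values, and the units in the comments (meters, practical salinity, micromoles per liter, percent saturation) are assumptions and should be checked against the data codebook.

```python
def predict_temp(depth, salinity, phosphate, nitrate, silicate, o2sat):
    """Evaluate the reported linear model; returns temperature in °C."""
    return (
        -53.61
        - 0.014 * depth       # assumed meters
        + 2.26 * salinity     # assumed practical salinity
        - 5.49 * phosphate    # assumed micromoles per liter
        - 0.13 * nitrate      # assumed micromoles per liter
        + 0.13 * silicate     # assumed micromoles per liter
        - 0.04 * o2sat        # assumed percent saturation
    )

# Hypothetical near-surface sample, for illustration only.
sample_temp = predict_temp(
    depth=10, salinity=33.5, phosphate=0.4,
    nitrate=1.0, silicate=3.0, o2sat=100,
)
print(f"Predicted temperature: {sample_temp:.2f} °C")
```

Because the model is linear, each coefficient can be read as the change in predicted temperature per unit change in that variable with the others held fixed, although the strong correlations among the nutrients make individual coefficients hard to interpret in isolation.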

This scatterplot shows how closely predictions track observations (R² = 85.6%). For cooler temperatures (5°C to 15°C), the model is highly precise. Above 15°C, however, it under-predicts, struggling to capture the sun-driven variability of shallow surface water.
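One way to quantify this kind of pattern is to average the residuals (actual minus predicted) separately on each side of the 15°C mark. The sketch below does this in Python on hypothetical actual/predicted pairs that were invented to mimic the described behavior; they are not the model's real outputs.

```python
def mean_residual_by_range(actual, predicted, threshold=15.0):
    """Mean of (actual - predicted) at or below and above `threshold`."""
    cool = [a - p for a, p in zip(actual, predicted) if a <= threshold]
    warm = [a - p for a, p in zip(actual, predicted) if a > threshold]
    mean_cool = sum(cool) / len(cool) if cool else None
    mean_warm = sum(warm) / len(warm) if warm else None
    return mean_cool, mean_warm

# Hypothetical temperatures in °C, for illustration only.
actual = [6.0, 8.5, 11.0, 14.0, 17.0, 20.0, 23.0]
predicted = [6.2, 8.3, 11.1, 13.8, 16.0, 18.2, 20.1]

cool_bias, warm_bias = mean_residual_by_range(actual, predicted)
print(f"mean residual at or below 15 °C: {cool_bias:+.2f} °C")
print(f"mean residual above 15 °C:       {warm_bias:+.2f} °C")
```

A mean residual near zero indicates no systematic bias in that range, while a clearly positive value means the model predicts temperatures that are too low there.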

Actual vs Predicted Temperature

Conclusion

Ultimately, this study shows that a data-driven, multi-variable approach is an effective way to study ocean temperature. From mapping surface patterns in the California Current to analyzing the shift at 200m, the results support integrated modeling of oceanic variables.

Packages & Technical Methodology

Database queries, encoding fixes, and the software ecosystem utilized.

Software Ecosystem: Packages & Modules

MSQL Modules Utilized:

Module name | Why we use it
Task | Imports files into the database as tables faster.
Queries | Join queries to combine the tables.

R Packages Utilized:

Package | Why we use it
tidyverse | Reading CSV, cleaning, plotting, tables
janitor | Standardize messy column names
stringi | Safely convert strange characters to UTF-8
rsample | Train/test split, including grouped split by cruise
yardstick | Metrics (R², RMSE, MAE) to compare models fairly
broom | Tidy model output for reporting
ppcor | Partial correlation (controls for depth confounding)
car | VIF for multicollinearity
glmnet | Ridge/Lasso regression (stable with overlapping predictors)
Matrix | Sparse model matrices
leaflet | Interactive geographic data mapping
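The grouped split by cruise deserves a brief illustration, because it changes what "unseen test data" means: every sample from a given cruise lands entirely in either the training set or the test set, so the model is never tested on a cruise it has already seen. The Python sketch below shows the idea with the standard library. It is not the rsample implementation, and the field names (`cruise_id`, `depth_m`, `temp_c`) and rows are hypothetical.

```python
import random

def grouped_split(rows, group_key, test_fraction=0.2, seed=42):
    """Split rows so that each group falls entirely in train or in test."""
    groups = sorted({row[group_key] for row in rows})
    rng = random.Random(seed)          # fixed seed for a reproducible split
    rng.shuffle(groups)
    n_test = max(1, round(len(groups) * test_fraction))
    test_groups = set(groups[:n_test])
    train = [r for r in rows if r[group_key] not in test_groups]
    test = [r for r in rows if r[group_key] in test_groups]
    return train, test

# Hypothetical rows: 10 cruises with 3 samples each.
rows = [
    {"cruise_id": f"C{c}", "depth_m": d, "temp_c": 15.0 - 0.01 * d}
    for c in range(1, 11)
    for d in (0, 100, 300)
]

train, test = grouped_split(rows, "cruise_id")
train_cruises = {r["cruise_id"] for r in train}
test_cruises = {r["cruise_id"] for r in test}
print(f"train: {len(train)} rows from {len(train_cruises)} cruises")
print(f"test:  {len(test)} rows from {len(test_cruises)} cruises")
```

A purely random row-level split would scatter samples from the same cast across both sets, which tends to make test error look better than it would be on a genuinely new cruise.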

Project Files & Appendix

Reproduce these findings by downloading the datasets, scripts, and documentation below.

Datasets for This Analysis

The CalCOFI raw bottle data, raw cast data, and merged datasets are too large to upload to the GitHub repo, so they are provided as zipped files via this link.

Download the Zipped Files

Data Codebook

CalCOFI variable definitions.

Download Excel

Dataset README Instructions

CalCOFI dataset README instructions.

Download PDF

Technical Documentation

Step-by-step CalCOFI technical documentation.

Download PDF

R Analysis Script

Full ML and statistical breakdown.

Download R Script

SQL Database Script

Merges and initial fast-filtering.

Download SQL

Leaflet Map Data

Aggregated UI geographic JSON.

Download JSON