Technical case study brief

Student Outcome Intelligence Platform

A concise, formula-led answer: governed data spine, point-in-time features, capacity-based risk bands, fairness checks, and advisor action.


Formula spine

$r_{i,t} = P(Y_{i,t+h} = 1 \mid X_{i,t})$
Risk is estimated from features available at the scoring date.

$\text{usable} = \mathbf{1}[\text{event\_time} \le t \,\land\, \text{available\_at} \le t]$
Historical simulation blocks late-arriving or future records.

$\tau = \mathrm{quantile}(r,\ 1 - \text{capacity}/N)$
Thresholds are tied to real advisor capacity.

Run summary

student-risk-platform v0.1

Pilot-term scoring — two faculties, full-cohort fairness review.

cohort = spring 2026  ·  horizon h = 4 weeks
Cohort: 35,000 students · 8 faculties
Red band: 220 students (τ_red ≈ 0.82)
Amber band: 380 students (τ_amber ≈ 0.62)
ROC-AUC: 0.958 (holdout, autumn 2025)
Recall@C: 0.68 (population, capacity-anchored)
Fairness gap: 0.13 (review) · international students 0.58

Solution Map

Unify signals: SIS, LMS, ERP, and campus records land with source metadata and freshness checks.
Preserve history: Identity mapping and type-2 student status prevent current-state leakage.
Score safely: Point-in-time features feed calibrated Azure ML risk bands sized to advisor capacity.
Act and audit: Power BI shows reasons, trends, RLS-scoped students, intervention status, and access logs.

Risk
$r_{i,t} = P(Y_{i,t+h} = 1 \mid X_{i,t})$

Probability of next-term non-continuation.

Leakage
$\text{event\_time} \le t \,\land\, \text{available\_at} \le t$

Only historically available records enter features.

Capacity
$\tau_{\text{red}} = \mathrm{quantile}(r,\ 1 - C_{\text{red}}/N)$

Red threshold matches advisor capacity.

Fairness
$\text{gap} = \max_g(\text{metric}_g) - \min_g(\text{metric}_g)$

Audit recall, FPR, calibration, and flag rate by group.


Student Outcome Intelligence Platform

Concise case-study answer: build an Azure-based student-risk platform, not just a dropout model. The deliverable is a point-in-time, auditable advisor queue that turns SIS, LMS, ERP, and campus signals into explainable support actions.

This long-form solution mirrors the presentation deck chapter-for-chapter, with deeper background, alternatives considered, and the reasoning behind each choice. Every diagram, table, and formula in the slides reappears here, surrounded by the prose context that did not fit on the slide. The six chapters track the deck's bottom progress bar — Brief & thesis, Architecture, Data model, Model & math, Action & ethics, Deployment — and each subsection title matches a slide title in that chapter.

Brief And Thesis

This chapter sets up the problem and the design stance. It restates what the university actually asked for, names the structural reasons the work is harder than a single ML model, declares the thesis (most of the value lives in the trustworthy data spine, not in model choice), and locks in the assumptions that make the target predictable, auditable, and safe to act on. Everything in the later chapters is a direct consequence of the four moves made here.

Five Steps From Raw Signals To Advisor Action

Before the architecture, it is worth stating the end-to-end shape of the platform in five steps so each later chapter has a clear place in the pipeline.

Step | What happens | Where it lives in this document
Acquire | SIS, LMS, ERP, and campus signals are pulled by Azure Data Factory and Event Hubs into ADLS Gen2 bronze landings with full source metadata. | Architecture chapter.
Govern | Canonical student identity, type-2 history of mutable status, and Purview lineage and data contracts make the silver layer trustworthy. | Data model chapter.
Model | A calibrated, interpretable risk score with capacity-anchored bands and fairness checks runs in Azure ML. | Model & math chapter.
Deploy | Azure ML registry releases the model; Power BI / Fabric delivers the advisor queue with row-level security and reason codes. | Action & ethics chapter.
Operate | Drift, freshness, and fairness monitors plus an immutable audit log keep the score defensible across releases. | Deployment chapter.

The five steps are deliberately sequential. Skipping an earlier step in the chain (for example, modelling before identity is canonical) does not save time; it only moves the cost into incident response after launch.

The Brief Is A Platform Design Problem, Not A Model-Only Task

The university has 35,000 students across 8 faculties and wants to identify students at risk of dropping out before grades, withdrawal, or status records make the problem visible. The signals already exist, but they are split across four independent systems with different owners, schemas, and update rhythms.

Source | Signal | Design implication
Student Information System | Identity, enrolment, programme, grades, status, graduation, leave, withdrawal history. | Preserve effective-dated student history; never rely only on current state.
Learning Management System | Logins, course access, submissions, forums, video watch time, quizzes. | Normalize high-frequency behavior by course, week, and study mode.
ERP | Financial aid, tuition status, scholarships, balances, overdue payments. | Join through canonical identity and keep available_at timestamps.
Campus systems | Library access, WiFi presence, building entry. | Aggregate to privacy-aware engagement signals and calibrate by study mode.

The deliverable described in the brief is not a binary classifier. It is a decision support system that has to be reproducible, explainable, fair, and operable inside a real advisor workflow with finite capacity. A model that cannot show its working, or that quietly leaks future information into training, is not just lower quality — it is unusable in a regulated context. That is why the chapter title from the deck deliberately frames this as a platform design problem.

The Hard Part Is Trustworthy History, Not The Model

In the deck this slide reframes effort allocation: the industry expectation is roughly 20 percent data, 80 percent model; the production reality on a problem like this is the inverse. Identity resolution, type-2 history, point-in-time features, lineage, and contracts are where most of the design risk lives. The model itself is a small head sitting on top of that spine.

Friction | What goes wrong | Design response
Inconsistent IDs | The same student is 123 in SIS, s.last@univ in LMS, and 0001234 in ERP. Naive joins lose people. | Canonical identity_map with match_confidence and validity windows.
Different update rhythms | LMS streams hourly, finance posts weekly, status changes are entered by hand. Joining on "today" mixes truths from different clocks. | event_time and available_at on every row, plus weekly point-in-time snapshots.
Current-state source tables | SIS overwrites status when a student withdraws, so last term's "active" student now reads "withdrawn". History is silently destroyed. | Type-2 student_status_history materialized in silver before any feature is computed.
Advisor capacity is fixed | A score with no operating budget produces a queue no one works. | Bands sized by quantile of the score distribution against advisor headcount, not a fixed cutoff.

The thesis has a sharp consequence: if the foundation is wrong, every model retrained on top of it inherits the same blind spots. Calibration drifts because the historical labels are wrong. Fairness audits show "no problem" because the protected-attribute history was already overwritten. Explainability collapses because a feature that looked like prior-term GPA was actually next-term GPA leaking back through a late update. The early chapters of the deck and this document are therefore disproportionately about boring infrastructure — that is where the leverage is.

One-line Answer

Create a governed Microsoft Azure and Fabric data spine that resolves identity, stores mutable student records as history, builds leakage-safe weekly feature snapshots, predicts next-term non-continuation risk, and delivers risk bands plus reason codes to advisors.

Scope The Target So The Model Can Be Defended

Six assumptions box the problem so that what gets built is small enough to defend and large enough to be useful. Each one shows up later in the design as a specific control.

Area | Assumption | Design consequence
Outcome | Risk means next eligible term non-continuation, excluding graduation, exchange, approved leave, and administrative corrections. | Labels need exclusions and a closed observation window.
Decision use | The score is advisor decision support only. | No automated punitive, academic, financial, or disciplinary action.
Timing | Advisors need enough lead time before final grades or official withdrawal. | Score weekly during the term after early engagement signals exist.
Privacy | The platform handles personal data and may be challenged. | Build DPIA, minimization, lineage, access audit, model cards, and explanation support from day one.
Fairness | Demographics are needed to detect bias but should not rank students. | Use protected attributes for audit only; never as advisor reason codes.
Capacity | Advisor headcount per faculty is the binding operational constraint. | Bands are sized by quantile against advisor capacity, not a detached score cutoff.

Two of these are worth lingering on, because they generate most downstream behaviour. Outcome scope is what the deck assumption card calls "term-end attrition only — exclusions defined upfront." Without that exclusion list the model will learn that exchange semesters look like dropout, and the queue will fill with students who are abroad on programme. Capacity-anchored bands is the operational rule that the model serves a finite advisor team — the threshold lives downstream of staffing, not upstream of it. If next term's capacity drops, the same model still produces a workable queue by sliding τ_red right; if a fixed score cutoff is used instead, the queue silently overshoots the team.

The remaining four assumptions are guardrails against well-known failure modes. Treating the score as advisor decision support means a wrong prediction never directly harms a student — a human is always in the loop. Weekly cadence concentrates the score on the window where intervention is still useful (mid-term), not after grades are filed. The privacy posture is conservative on purpose: a DPIA, minimization, and immutable access audit are cheaper to build at week 0 than retrofitted at week 30. Fairness for audit only is the explicit answer to a common confusion — you cannot prove fairness without measuring outcomes by group, but those same attributes have no business ranking a student.

Architecture

This chapter describes the Azure shape of the system: which services do which job, where data lives at each stage, what governs and audits the flow, and why a lakehouse with bronze, silver, and gold layers is the right backbone for an advisor decision-support tool that has to remain explainable and reproducible. The single architecture diagram from the deck reappears here, expanded into a service-by-service decision log so that each component can be defended individually.

Azure Lakehouse Plus Governed ML Workflow

Reference architecture diagram

Layer | Azure decision | Purpose | Output
Sources | SIS, LMS, ERP, campus systems | Source-owned files, APIs, database extracts, and events. | Raw operational evidence.
Ingestion | Azure Data Factory / Fabric Data Factory, Azure Event Hubs, managed identities. | Scheduled extraction, event capture, schema checks, freshness logging. | Landing records with source metadata.
Raw lake | ADLS Gen2, Event Hubs Capture, immutable folders. | Preserve original records exactly as received. | Replayable bronze evidence.
Curated lakehouse | Fabric OneLake Lakehouses with Delta tables. | Identity resolution, type-2 history, quality contracts, deduplication. | Trusted silver tables.
Feature products | Fabric SQL endpoint, notebooks, warehouse where useful. | Student-term facts, point-in-time weekly features, closed labels. | Gold training and scoring tables.
ML lifecycle | Azure ML registry, pipelines, Responsible AI dashboard. | Training, validation, calibration, fairness, model cards, thresholds. | Versioned model release package.
Advisor delivery | Power BI / Fabric app, Teams notification, row-level security. | Risk bands, reason codes, score trend, intervention workflow. | Audited support queue.
Control plane | Purview, Entra ID, Key Vault, Monitor, Log Analytics, Azure Policy. | Catalog, lineage, access, secrets, alerts, drift, approvals. | Evidence for operations and audit.

The architecture is intentionally a lakehouse, not a classical data warehouse. The reason is the mix of structured records (SIS, ERP) and high-frequency, semi-structured event streams (LMS clicks, campus access). A warehouse-first design forces the streaming side into nightly extracts and loses event-level signal. A pure data-lake design makes governance, contracts, and SQL access painful. OneLake with Delta tables and a SQL endpoint sits between these, keeping bronze immutability for audit and silver/gold relationality for analytics.

A second deliberate choice is splitting the ingestion plane between Azure Data Factory and Azure Event Hubs. Data Factory handles the slow, scheduled, well-typed feeds (SIS daily extracts, ERP weekly postings); Event Hubs handles the high-frequency campus and LMS events. Both write to ADLS Gen2 in raw form before any transformation. That split is what allows a single feed to slow down (e.g., the SIS export runs late on a Sunday) without dragging the rest of the pipeline with it, and it lets the team retire one ingestion path without touching the other.

Service Choices

Need | Azure service | Why
Mixed files, APIs, events, update frequencies. | Data Factory / Fabric Data Factory plus Event Hubs. | Supports batch extracts and high-frequency streams without forcing one ingestion pattern.
Replayable raw evidence. | ADLS Gen2 with lifecycle policies. | Keeps immutable source records for audit, backfill, and investigation.
Governed analytics layer. | Fabric OneLake lakehouses with bronze, silver, gold layers. | Separates raw evidence, trusted history, and publishable feature products.
Lineage and discovery. | Microsoft Purview. | Makes source, transformation, and data-product lineage searchable.
Model release and review. | Azure ML plus Responsible AI dashboard. | Versions model, thresholds, fairness review, and explanation artifacts together.
Advisor access. | Power BI / Fabric app with Entra groups and RLS. | Lets advisors see only assigned students or faculties.

Every feed lands with source_system, source_record_id, event_time, available_at, ingested_at, schema version, source URI, pipeline run ID, and Purview asset reference. Breaking schema changes are quarantined before silver promotion. This metadata is what makes a prediction reproducible six months later — without it, "what did the system know on the day this score was generated" becomes impossible to answer, which is the same as having no audit at all.
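The landing metadata listed above could travel with each record as one small immutable structure. The following is a minimal sketch, assuming Python dataclasses; the class name LandingMetadata, the field types, the method visible_at, and all example values are illustrative assumptions, not the platform's actual schema.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class LandingMetadata:
    """Metadata carried by every landed record (field names follow the text)."""
    source_system: str
    source_record_id: str
    event_time: datetime      # when the event happened in the source
    available_at: datetime    # when the record became visible to the platform
    ingested_at: datetime     # when the pipeline landed it in bronze
    schema_version: str
    source_uri: str
    pipeline_run_id: str
    purview_asset_ref: str

    def visible_at(self, t: datetime) -> bool:
        """Could a score generated at time t have used this record?"""
        return self.event_time <= t and self.available_at <= t


# Hypothetical example: a grade dated 1 Feb that only became visible on 22 Feb.
meta = LandingMetadata(
    source_system="SIS",
    source_record_id="123",
    event_time=datetime(2026, 2, 1),
    available_at=datetime(2026, 2, 22),
    ingested_at=datetime(2026, 2, 22, 2, 0),
    schema_version="v1",
    source_uri="bronze/sis/2026-02-22/part-0.json",
    pipeline_run_id="run-001",
    purview_asset_ref="asset-001",
)
print(meta.visible_at(datetime(2026, 2, 8)))   # False
print(meta.visible_at(datetime(2026, 2, 22)))  # True
```

Freezing the dataclass mirrors the immutability of bronze: once landed, the metadata is never edited, only superseded by a new record.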

A few alternatives were rejected on purpose. Synapse dedicated SQL pools were considered for the gold layer but added a second compute engine, a second governance surface, and a hard split between batch and lakehouse data; Fabric SQL endpoint over the same Delta tables removes that split. Databricks was considered for ingestion and feature engineering but would have duplicated tooling that Fabric already provides natively, and the Purview lineage story is cleaner end-to-end inside Azure-first services. Push-only event ingestion from each source was considered and rejected because not all source systems can push reliably — Data Factory's pull model survives source outages without losing events.

Data Model

This chapter is the operational heart of the platform: how raw payloads from four very different source systems become a trustworthy, point-in-time feature table that a model can be trained on without leakage. It walks through the bronze→silver lineage, the silver→gold feature products, the catalogue of tables that hold history, and the explicit admission rule that keeps future and late-arriving records out of the snapshot. If any single chapter has to be right for the rest of the platform to work, it is this one.

Bronze To Silver: Raw Payloads To Cleansed Tables

Bronze to Silver lineage

The bronze layer is intentionally close to dumb. Files and events land in ADLS Gen2 partitioned by source, ingestion date, and pipeline run, with the original payload preserved. Nothing in bronze is corrected, deduplicated, or joined. The reason is that audit, backfill, and investigation all depend on being able to replay the exact bytes the platform received. When a number on a Power BI tile is questioned six months later, the bronze layer is what proves what the source actually said on that day.

The promotion to silver is where most of the data engineering work lives. Each source gets a staging model that handles three jobs: cleansing (typing, null handling, encoding), conforming (mapping into the canonical column names and units used downstream), and contract enforcement (Great Expectations or equivalent rules block obviously broken loads from reaching silver). For mutable entities, especially student status, programme, and faculty, silver materialises a Type-2 slowly-changing-dimension with valid_from, valid_to, and recorded_at. For event streams, silver collapses obvious duplicates and applies watermarking so a late-arriving event does not silently rewrite history.
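To make the Type-2 idea concrete, the sketch below turns a list of status changes into versions with valid_from, valid_to, and recorded_at, then answers "what status did the platform know at date t". It is an illustration under stated assumptions: the names StatusVersion, build_type2, and status_as_of are hypothetical, and a production silver layer would do this in Delta tables rather than in-memory Python.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class StatusVersion:
    student_id: str
    status: str
    valid_from: date
    valid_to: Optional[date]   # None means the version is still open
    recorded_at: date          # when the change was entered in the source


def build_type2(changes):
    """changes: (student_id, status, effective_date, recorded_at) tuples.
    Each version is closed by the effective date of the next change."""
    by_student = {}
    for student_id, status, effective, recorded in sorted(
            changes, key=lambda c: (c[0], c[2])):
        by_student.setdefault(student_id, []).append((status, effective, recorded))
    versions = []
    for student_id, rows in by_student.items():
        for idx, (status, effective, recorded) in enumerate(rows):
            valid_to = rows[idx + 1][1] if idx + 1 < len(rows) else None
            versions.append(
                StatusVersion(student_id, status, effective, valid_to, recorded))
    return versions


def status_as_of(versions, student_id, t):
    """Latest status that was both effective and already recorded at date t."""
    known = [v for v in versions
             if v.student_id == student_id
             and v.valid_from <= t and v.recorded_at <= t]
    return max(known, key=lambda v: v.valid_from).status if known else None


# Hypothetical student: withdrawal effective 1 March, entered by hand on 10 March.
changes = [
    ("S1", "active", date(2026, 1, 10), date(2026, 1, 10)),
    ("S1", "withdrawn", date(2026, 3, 1), date(2026, 3, 10)),
]
versions = build_type2(changes)
print(status_as_of(versions, "S1", date(2026, 2, 15)))  # active
print(status_as_of(versions, "S1", date(2026, 3, 5)))   # active
print(status_as_of(versions, "S1", date(2026, 3, 10)))  # withdrawn
```

The middle lookup is the important one: on 5 March the withdrawal had happened but had not been recorded, so a snapshot for that date must still see "active".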

The end state of silver is a small set of trusted, conformed tables that everything downstream is allowed to read. Anything that needs raw bronze data has to do so through an explicit, audited path — by design, an analyst building a feature in the gold layer should never need to touch bronze.

Silver To Gold: Features And Scored Predictions

Silver to Gold lineage

Table | Key columns | Role
identity_map | canonical_student_id, source_system, source_person_id, valid_from, valid_to, match_confidence | Joins fragmented systems safely.
student_status_history | status, faculty, programme, valid_from, valid_to, recorded_at | Prevents current-state leakage.
ingestion_audit | pipeline_run_id, source_system, schema_version, record_count, watermark, purview_asset_id | Links data products back to pipeline evidence.
fact_enrollment_term | term_id, credits_registered, prior_gpa, academic_standing | Academic baseline.
fact_lms_activity_daily | activity_date, login_count, course_views, assignment_due, assignment_submitted | Digital engagement.
fact_financial_snapshot | status_date, outstanding_balance_nok, payment_overdue_days, aid_status | Financial friction.
fact_campus_activity_daily | event_date, building_entry_count, library_entry_count, wifi_minutes | Aggregated physical engagement.
feature_student_week | as_of_date, term_week, feature columns, feature_snapshot_hash | Model input.
risk_prediction | prediction_id, model_version, risk_score, risk_band, top_reasons | Advisor output.
access_audit | viewer_user_id, purpose, timestamp, prediction_id, fields_returned | Accountability.

feature_student_week is the table that the model actually consumes, and it deserves a closer look. Each row is keyed on (canonical_student_id, as_of_date) and is built by joining the silver fact tables through identity_map, filtered by the point-in-time admission rule below. The feature_snapshot_hash column is the cheapest reproducibility tool in the design: it is a stable hash of the feature vector that ends up in the prediction record, so a future audit can confirm that the prediction it reproduces from the data spine matches the prediction the advisor saw.
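One way a stable feature_snapshot_hash could be computed is sketched below. The source does not specify the hash function or encoding, so SHA-256 over canonical JSON, the float rounding, and the function signature are all assumptions made for illustration.

```python
import hashlib
import json


def feature_snapshot_hash(student_id, as_of_date, features):
    """SHA-256 over a canonical JSON encoding of one feature row."""
    payload = {
        "canonical_student_id": student_id,
        "as_of_date": as_of_date,  # ISO date string, e.g. "2026-03-01"
        # Round floats so representation noise does not change the hash.
        "features": {
            name: round(value, 6) if isinstance(value, float) else value
            for name, value in features.items()
        },
    }
    # sort_keys and fixed separators make the encoding independent of dict order.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical feature rows: same values, different key order.
row_a = {"lms_logins_14d": 6, "missing_assignments": 2, "prior_gpa": 3.1}
row_b = {"prior_gpa": 3.1, "missing_assignments": 2, "lms_logins_14d": 6}
h_a = feature_snapshot_hash("C-0001", "2026-03-01", row_a)
h_b = feature_snapshot_hash("C-0001", "2026-03-01", row_b)
print(h_a == h_b)  # True
```

An audit that rebuilds the row from the data spine and gets a different hash has found either a changed source record or a changed feature definition, which is exactly the question the hash exists to answer.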

risk_prediction is the publication boundary. Once a row lands here, it is immutable. New predictions for the same student-week create a new row keyed by prediction_id and model_version. This append-only stance is what lets the platform answer questions like "which model version produced the score this advisor saw, and why is today's score different" without ambiguity.

Every Row Has An As-Of Date

Every row in feature_student_week is built as of t = as_of_date. The point-in-time admission rule is the single most important control in the platform — it is what turns a pile of source records into a trustworthy training set.

Historical Availability

$\text{usable}(\text{record}, t) = \mathbf{1}[\text{event\_time} \le t \,\land\, \text{available\_at} \le t]$

Future records and late-arriving records are blocked during both training and scoring.

Digital Engagement

$\text{lms\_logins\_14d}(i,t) = \sum_{d \in [t-13,\ t]} \text{login\_count}_{i,d}$

Recent LMS activity is measured only from dates already visible at the scoring date.

Assessment Gap

$\text{missing\_assignments}(i,t) = \max(\text{due\_to\_date}_{i,t} - \text{submitted\_to\_date}_{i,t},\ 0)$

The feature compares due work and submitted work as of the snapshot date.

Financial Signal

$\text{overdue\_flag}(i,t) = \mathbf{1}[\text{payment\_overdue\_days}_{i,t} > 0]$

Finance data is joined through effective-dated snapshots and canonical identity.
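The three feature definitions in this section translate almost directly into code. The sketch below assumes the inputs have already passed the point-in-time admission rule; the function signatures and the sample values are illustrative.

```python
from datetime import date, timedelta


def lms_logins_14d(daily_logins, t):
    """Sum login_count over the 14 days d in [t - 13, t].
    daily_logins maps activity date -> login_count for one student."""
    start = t - timedelta(days=13)
    return sum(count for d, count in daily_logins.items() if start <= d <= t)


def missing_assignments(due_to_date, submitted_to_date):
    """Assignments due but not submitted as of the snapshot, floored at zero."""
    return max(due_to_date - submitted_to_date, 0)


def overdue_flag(payment_overdue_days):
    """1 if any payment is overdue at the snapshot date, else 0."""
    return 1 if payment_overdue_days > 0 else 0


# Hypothetical student, scored as of 14 Feb 2026.
t = date(2026, 2, 14)
daily_logins = {
    date(2026, 1, 31): 5,   # one day before the window, ignored
    date(2026, 2, 1): 3,    # first day of the window (t - 13)
    date(2026, 2, 10): 2,
    date(2026, 2, 14): 1,   # the scoring date itself
    date(2026, 2, 15): 4,   # after t, ignored
}
print(lms_logins_14d(daily_logins, t))   # 6
print(missing_assignments(5, 3))         # 2
print(overdue_flag(12))                  # 1
```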

Diagram 1: Point-in-time admission rule for source records. Timeline from t − 4w to t + 2w around the scoring date t. Admissible records satisfy event_time ≤ t AND available_at ≤ t. Prior-term GPA: usable. Tuition payment: usable. Late-recorded grade: blocked (late, available_at after t). Final exam grade: blocked (future, event_time after t).

A record enters the feature snapshot only if the event happened at or before t and was visible to the platform at or before t. The late-recorded grade is real history but its available_at is in the future, so it is rejected by the second predicate. Reproducing a prediction means rebuilding both clocks at t, not just the calendar one.

The two-clock rule (event_time <= t AND available_at <= t) is the difference between a leakage-safe pipeline and one that quietly cheats. The first predicate keeps future records out — easy. The second predicate keeps records out that are real history but were not yet visible to the platform at t, e.g., a grade entered three weeks late by a faculty office. A naive design would let those records into a training row dated t because their event_time is before t; the model would then learn a feature that is impossible to compute at scoring time. The two-clock rule makes that impossible.
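The four records from Diagram 1 make a compact test of the rule. In the sketch below the dates are invented to reproduce the diagram's layout, and the record layout (a dict with event_time and available_at) is an assumption for illustration.

```python
from datetime import date


def usable(record, t):
    """Two-clock rule: the event happened by t AND was visible by t."""
    return record["event_time"] <= t and record["available_at"] <= t


t = date(2026, 3, 1)  # hypothetical scoring date
records = [
    {"name": "prior_term_gpa",
     "event_time": date(2026, 1, 15), "available_at": date(2026, 1, 20)},
    {"name": "tuition_payment",
     "event_time": date(2026, 2, 20), "available_at": date(2026, 2, 21)},
    {"name": "late_recorded_grade",
     "event_time": date(2026, 2, 10), "available_at": date(2026, 3, 8)},
    {"name": "final_exam_grade",
     "event_time": date(2026, 3, 12), "available_at": date(2026, 3, 14)},
]

admitted = [r["name"] for r in records if usable(r, t)]
# A one-clock filter on event_time alone lets the late grade leak in.
naive = [r["name"] for r in records if r["event_time"] <= t]

print(admitted)  # ['prior_term_gpa', 'tuition_payment']
print(naive)     # ['prior_term_gpa', 'tuition_payment', 'late_recorded_grade']
```

The difference between the two lists is the leakage: the late-recorded grade is the only record the naive filter admits and the two-clock rule rejects.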

Labels are kept separate from features until the outcome window closes. Each feature snapshot is stored with its hash so predictions can be reproduced exactly. This is the single hardest discipline to maintain in practice — the temptation to "just add one more recent variable" to features is constant — and the design enforces it through table-level separation rather than relying on reviewer attention.

Model And Math

This chapter is about the model itself, but the framing is deliberate: most of the design effort already happened upstream. The model is a small head sitting on top of a trustworthy data spine, and its job is to be calibrated, defensible, and easy to recalibrate when capacity or population changes. The chapter covers the formula spine the platform has to defend, the choice of an interpretable baseline, the capacity-anchored thresholds that turn a probability into a queue, and the validation metrics that tell you whether the queue is actually working.

The Math The Platform Has To Defend

Start with calibrated logistic regression or explainable gradient boosted trees in Azure ML. The goal is useful advisor prioritization, not only aggregate accuracy.

Risk Score

$r_{i,t} = P(Y_{i,t+h} = 1 \mid X_{i,t})$

Risk is the probability of next-term non-continuation using only features available at scoring date t.

Interpretable Baseline

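$$
\begin{aligned}
\operatorname{logit}(r_{i,t}) ={}& \beta_0 \\
&+ \beta_1\,\text{prior\_gpa}_{i,t} \\
&+ \beta_2\,\text{missing\_assignments}_{i,t} \\
&+ \beta_3\,\text{days\_since\_lms}_{i,t} \\
&+ \beta_4\,\text{payment\_overdue\_days}_{i,t} \\
&+ \beta_5\,\text{campus\_drop}_{i,t} \\
&+ \beta_{6..k}\,\text{academic\_context}_{i,t}
\end{aligned}
$$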

Capacity Threshold

$\tau_{\text{red}} = \mathrm{quantile}(r_{i,t},\ 1 - C_{\text{red}}/N)$
$\tau_{\text{amber}} = \mathrm{quantile}(r_{i,t},\ 1 - (C_{\text{red}} + C_{\text{amber}})/N)$

Thresholds are tied to real advisor capacity rather than a detached score cutoff.
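An empirical version of the quantile rule is to take the score of the C-th highest student as the threshold. The sketch below uses the cohort size and band capacities from the run summary, but the score distribution is synthetic and the function names are hypothetical.

```python
def capacity_threshold(scores, capacity):
    """Score of the capacity-th highest student. Flagging score >= tau then
    selects exactly `capacity` students when there are no ties at tau."""
    if not 0 < capacity <= len(scores):
        raise ValueError("capacity must be between 1 and N")
    return sorted(scores, reverse=True)[capacity - 1]


def assign_band(score, tau_red, tau_amber):
    if score >= tau_red:
        return "red"
    if score >= tau_amber:
        return "amber"
    return "green"


N = 35_000
C_RED, C_AMBER = 220, 380
# Synthetic, strictly increasing scores in [0, 1); not real model output.
scores = [(i / N) ** 3 for i in range(N)]

tau_red = capacity_threshold(scores, C_RED)
tau_amber = capacity_threshold(scores, C_RED + C_AMBER)
bands = [assign_band(s, tau_red, tau_amber) for s in scores]
print(bands.count("red"), bands.count("amber"))  # 220 380

# If red capacity drops to 100 advisors' worth of slots, the threshold rises.
print(capacity_threshold(scores, 100) > tau_red)  # True
```

Because the threshold is recomputed from capacity each run, a staffing change alters the queue size without retraining the model.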

Calibrated logistic regression is the recommended baseline because every coefficient maps directly to a reason code. When β₃ on days_since_lms is positive and material, "you have not signed in for 11 days" can appear in the advisor's UI as an honest explanation. With a tree-ensemble black box, the same statement is at best an approximation, and at worst a post-hoc justification. Gradient boosted trees with monotonic constraints and SHAP explanations are an acceptable second choice when the baseline plateaus; deep models are not used because the marginal AUC gain does not pay for the explainability and audit cost.
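To show how the baseline turns features into a probability, here is a minimal scoring sketch. The coefficient values are invented for illustration only (real values come from training), the academic_context terms are omitted for brevity, and the function name risk_score is hypothetical.

```python
import math

# Hypothetical coefficients; signs chosen so that higher prior GPA lowers risk
# and every friction signal raises it.
COEFFICIENTS = {
    "intercept": -3.0,
    "prior_gpa": -0.6,
    "missing_assignments": 0.35,
    "days_since_lms": 0.08,
    "payment_overdue_days": 0.02,
    "campus_drop": 0.9,
}


def risk_score(features):
    """r = sigmoid(beta_0 + sum_j beta_j * x_j)."""
    z = COEFFICIENTS["intercept"] + sum(
        COEFFICIENTS[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical student-week.
student = {
    "prior_gpa": 2.5,
    "missing_assignments": 3,
    "days_since_lms": 11,
    "payment_overdue_days": 0,
    "campus_drop": 0.5,
}
r = risk_score(student)

# Eleven more days without an LMS sign-in raises the score, all else equal.
r_later = risk_score(dict(student, days_since_lms=22))
print(r < r_later)  # True
```

Each term in the sum is readable on its own, which is what lets a coefficient map to a reason code without a separate explanation model.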

Calibration matters as much as discrimination here. A score of 0.7 has to actually mean "70 percent of these students will not continue," because otherwise capacity-anchored bands lose their meaning and reason codes mislead advisors. Platt scaling or isotonic regression on a held-out term takes calibration from "approximately right" to "audit-defensible." Expected Calibration Error (ECE) is reported alongside Recall@C in the validation table for exactly this reason.
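Expected Calibration Error can be computed with equal-width score bins, following the formula ECE = sum_b (n_b / N) * abs(mean(Y_b) - mean(r_b)). The sketch below is a plain-Python illustration on invented data; the bin count of 10 is an assumption, and the calibration step itself (Platt scaling or isotonic regression) is not shown.

```python
def expected_calibration_error(scores, labels, n_bins=10):
    """Weighted mean gap between observed rate and mean score per bin."""
    bins = [[] for _ in range(n_bins)]
    for r, y in zip(scores, labels):
        idx = min(int(r * n_bins), n_bins - 1)  # a score of 1.0 goes in the last bin
        bins[idx].append((r, y))
    n = len(scores)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_r = sum(r for r, _ in members) / len(members)
        mean_y = sum(y for _, y in members) / len(members)
        ece += (len(members) / n) * abs(mean_y - mean_r)
    return ece


# Invented data: the 0.25 group is perfectly calibrated (1 of 4 positive),
# the 0.75 group is over-confident (2 of 4 positive, observed rate 0.5).
scores = [0.25, 0.25, 0.25, 0.25, 0.75, 0.75, 0.75, 0.75]
labels = [1, 0, 0, 0, 1, 1, 0, 0]
print(expected_calibration_error(scores, labels))  # 0.125
```

The result is 0.5 × 0 + 0.5 × 0.25 = 0.125: half the cohort sits in a bin whose scores overstate the observed rate by 25 points.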

Start Interpretable, Calibrated, And Capacity-Aware

Diagram 2: Capacity-anchored thresholds on the score distribution. Score distribution across N ≈ 35,000 students on the risk score axis r_{i,t} (0.0 to 1.0), with thresholds placed by quantile: τ_amber ≈ 0.62, τ_red ≈ 0.82. Green (monitor): remaining cohort. Amber (outreach): next 380 students. Red (advisor): top 220 students.

Bands are sized to advisor reality, not to a detached score cutoff. If next term's capacity drops, the same model produces a smaller red band by sliding τ_red right, so the queue never silently overshoots staffing.

The capacity-anchored design is what stops the queue from being noise. A common failure mode of risk models is to ship a "0.5 cutoff" by default; in this domain that produces queues of several thousand students that no advisor team can work, the queue gets ignored, and the platform quietly dies. By tying τ_red to the top C_red students by quantile, the queue is always exactly the size the team can act on. The cost is that absolute risk levels can drift between terms — a "red" student in a calmer term is genuinely lower-risk than a "red" student in a hard term — but that drift is visible in the score itself and can be reported alongside the band.

Check | Formula | Why it matters
Precision at capacity | Precision@C = true_positives_in_top_C / C | Are advisor slots used well?
Recall at capacity | Recall@C = true_positives_in_top_C / all_actual_positives | How much actual risk does the queue catch?
Lead time | lead_time_i = outcome_date_i - first_red_or_amber_score_date_i | Is the signal early enough to act?
Calibration | ECE = sum_b (n_b / N) * abs(mean(Y_b) - mean(r_b)) | Do predicted probabilities mean what they claim?
Fairness gap | gap = max(metric_g) - min(metric_g) | Detect material group differences in recall, FPR, FNR, ECE, and flag rate.

Confusion matrix at τred

The threshold is set so the at-risk count fits advisor capacity. Recall and lead time are reported alongside; precision is informative, not the optimization target.

Actual / predicted | Predicted no-risk | Predicted at-risk
Actual no-risk | TN ≈ 78% | FP ≈ 5%
Actual at-risk | FN ≈ 8% | TP ≈ 9%

Cells are illustrative pilot-term shares of the 35,000-student population. The model card publishes the same matrix split by faculty, gender, age band, international status, and first-generation status. Any group whose FN rate diverges materially from the population is flagged for review before release.

The reason precision is "informative, not the optimization target" is operational. Once the queue is sized to advisor capacity, precision is bounded by base rate; chasing precision means under-flagging and missing students who should have been called. Recall@C and lead time tell the operationally honest story: of the students who actually did not continue, what share landed in the queue, and how many weeks ahead of the outcome did the queue first surface them?
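Precision@C, Recall@C, and lead time from the validation table can be computed in a few lines. The sketch below runs on a six-student toy cohort with invented scores, outcomes, and dates; the function names are hypothetical.

```python
from datetime import date


def top_c(scores, c):
    """Student ids with the c highest scores: the queue at capacity c."""
    return sorted(scores, key=scores.get, reverse=True)[:c]


def precision_recall_at_c(scores, positives, c):
    """Precision@C = hits / C; Recall@C = hits / all actual positives."""
    hits = len(set(top_c(scores, c)) & positives)
    return hits / c, hits / len(positives)


def lead_time_days(outcome_date, first_flag_date):
    """Days between the first red or amber score and the outcome."""
    return (outcome_date - first_flag_date).days


# Toy cohort: three students did not continue (a, c, e); capacity is 3.
scores = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.4, "e": 0.3, "f": 0.1}
positives = {"a", "c", "e"}
precision, recall = precision_recall_at_c(scores, positives, c=3)
print(top_c(scores, 3))                           # ['a', 'b', 'c']
print(round(precision, 3), round(recall, 3))      # 0.667 0.667
print(lead_time_days(date(2026, 6, 15), date(2026, 3, 2)))  # 105
```

In the toy cohort, student e is a non-continuer the queue never surfaces; that miss is what Recall@C makes visible and what precision alone would hide.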

Action And Ethics

A score that reaches an advisor without context is just noise; a score that reaches an advisor with the wrong group hidden in its error structure is harm. This chapter covers the two halves of how the platform makes itself accountable: the advisor-facing UI that turns a probability into a decision, and the fairness audit that runs every model release before any prediction reaches a human. Both are explicit answers to the assumption that the model is decision support only, never autonomous action.

Make The Prediction Actionable

The advisor view is intentionally narrow. It shows only what an advisor needs to make a defensible decision about a specific student in the next two weeks, and nothing else. The deck demonstration screen has three rows — green, amber, red — each with the student, programme, the top reasons, and a recommended support route.

  • Student, programme, risk band, score range, and score trend.
  • Top three actionable reasons with source freshness.
  • Recommended support route: academic check-in, financial-aid referral, study-skills support, or wellbeing referral.
  • Contact status, notes, intervention outcome, and "not relevant" feedback.
  • Access evidence showing who viewed which prediction, when, and for what purpose.

Two design choices in the advisor view are worth defending explicitly. First, reason codes are bounded to actionable signals. The model may use 30+ features, but the advisor only sees the three that drove the score most for this student-week, and only ever from the actionable set (assignments, LMS activity, finance, campus). Demographic features — even when they carry signal — are never shown as reason codes because they are not legitimate grounds for a support call. Second, score trend matters more than the absolute score. A student whose risk has climbed two bands in three weeks is operationally more interesting than a student who has been red and stable; the UI surfaces the trend prominently for that reason.
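For a linear baseline, one plausible way to pick the top three reasons is to rank actionable features by coefficient times deviation from a cohort baseline. That ranking rule is an assumption of this sketch, not something the source specifies; the coefficients, baseline values, and the international feature name are invented for illustration.

```python
# Only these features may ever be shown to an advisor as a reason code.
ACTIONABLE = {"missing_assignments", "days_since_lms",
              "payment_overdue_days", "campus_drop"}


def top_reasons(coefficients, features, baseline, k=3):
    """Rank actionable features by how much they raise the logit relative
    to the cohort baseline; return at most k names with positive impact."""
    contributions = {
        name: coefficients[name] * (features[name] - baseline[name])
        for name in ACTIONABLE if name in features
    }
    ranked = sorted(contributions.items(), key=lambda item: item[1], reverse=True)
    return [name for name, value in ranked[:k] if value > 0]


# Hypothetical values. "international" carries signal in this toy model
# but is excluded from reason codes by construction.
coefficients = {"prior_gpa": -0.6, "missing_assignments": 0.35,
                "days_since_lms": 0.08, "payment_overdue_days": 0.02,
                "campus_drop": 0.9, "international": 0.4}
features = {"prior_gpa": 2.5, "missing_assignments": 3, "days_since_lms": 11,
            "payment_overdue_days": 0, "campus_drop": 0.5, "international": 1}
baseline = {"prior_gpa": 3.0, "missing_assignments": 1, "days_since_lms": 3,
            "payment_overdue_days": 5, "campus_drop": 0.1, "international": 0}

reasons = top_reasons(coefficients, features, baseline)
print(reasons)  # ['missing_assignments', 'days_since_lms', 'campus_drop']
```

Filtering by the allow-list before ranking, rather than after, means a non-actionable feature can never crowd an actionable one out of the top three.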

The "not relevant" feedback channel and intervention-outcome capture are the unglamorous half of the workflow but they are what closes the loop. Advisor labels become the next training signal, both for the model (was this prediction useful?) and for fairness monitoring (is the queue under-serving a particular group?). Without that channel, the platform is write-only and gets stale fast.

Engagement Patterns Differ For Legitimate Reasons

Diagram 3: Recall@C and calibration gap by audit group. Recall@C (share of true non-continuers in the advisor queue) by audit group, pilot term, N = 35,000. Population: 0.68. Faculty A (Engineering): 0.71. Faculty B (Humanities): 0.66. Female: 0.70. International students: 0.58. First-generation: 0.65. Gap = 0.13, flagged for review.

Audit before release. The international-student row is the binding constraint — the queue catches a smaller share of their actual non-continuers, so the model card must explain why and the threshold review must decide whether a group-aware adjustment is justified before the model is approved for advisor use.

The fairness audit measures four metrics — recall, FPR, calibration error, and flag rate — across faculty, gender, age band, international status, first-generation status, and study mode. Each metric has a different failure mode: low recall in a group means the queue under-serves them; high FPR means a group bears advisor attention they did not need; calibration drift means the score has different meaning across groups; flag rate compared to base rate detects over- or under-flagging. The diagram above shows the recall view in the pilot term; the international-student row is the binding constraint and triggers the model card review.
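The gap reported for the pilot term can be reproduced from the per-group Recall@C values in Diagram 3. In the sketch below the group values come from the diagram, while the review threshold of 0.10 is a hypothetical policy value that the source does not state.

```python
def fairness_gap(metric_by_group):
    """gap = max(metric_g) - min(metric_g) across audit groups."""
    values = metric_by_group.values()
    return max(values) - min(values)


# Recall@C by audit group, pilot term (values from Diagram 3).
recall_at_c = {
    "Faculty A (Engineering)": 0.71,
    "Faculty B (Humanities)": 0.66,
    "Female": 0.70,
    "International students": 0.58,
    "First-generation": 0.65,
}

REVIEW_THRESHOLD = 0.10  # hypothetical; set by the release policy, not the source

gap = fairness_gap(recall_at_c)
worst_group = min(recall_at_c, key=recall_at_c.get)
print(f"{gap:.2f}")            # 0.13
print(worst_group)             # International students
print(gap > REVIEW_THRESHOLD)  # True
```

The same function applies unchanged to FPR, calibration error, and flag rate; only the input dictionary differs.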

The deck slide title, "engagement patterns differ for legitimate reasons", is deliberately not a defence of the gap. International students do have legitimately different LMS and campus patterns (they may study on different rhythms, use campus facilities differently, or study privately off-platform). That is exactly why the audit exists: to detect when those legitimate patterns translate into the model under-serving the group, and to force a documented decision (group-aware threshold, recalibration, or accepting the gap) before release rather than after.

Protected attributes are retained for audit, not shown as advisor reason codes. Audit by gender, age band, international status, first-generation status, faculty, programme, and study mode happens at every model release and is recorded in the model card alongside the metrics table.

Deployment

This last chapter covers what happens after the model is good enough to release: the production controls that keep it good, the trade-offs the design accepts on purpose, the thirty-week roadmap that gets the platform from week 0 to a hardened pilot, and the open questions for the university whose answers shape thresholds and scope. It is the smallest chapter in design surface but the largest in operational lifetime — most of the platform's time is spent here, not in building the first model.

Production Controls

| Control | Production rule |
| --- | --- |
| Freshness | Alert when a Data Factory/Fabric pipeline misses SLA or an Event Hubs stream falls behind. |
| Completeness | Compare expected vs received enrollment and event counts by faculty and term. |
| Identity | Track unmatched IDs and low-confidence matches; route exceptions to data stewardship. |
| Schema drift | Block breaking column, type, or semantic changes before silver promotion. |
| Score drift | Monitor score distribution, feature drift, and reason-code mix by term, faculty, programme, and study mode. |
| Governance | Register assets and lineage in Purview; version model, features, data, thresholds, reason-code logic, and approvals. |
| Security | Use pseudonymous modeling, restricted PII, encryption, private access where needed, Entra groups, Key Vault secrets, and Power BI/Fabric RLS. |
| Audit | Store advisor access, data lineage, model release approval, and threshold approval in immutable audit tables. |

The controls split cleanly into two groups. The first five (freshness, completeness, identity, schema drift, score drift) are about catching silent data degradation early — most production failures of risk models are not model failures but quiet upstream changes (a renamed LMS column, a new ERP code) that the model swallows without complaint. The last three (governance, security, audit) are the regulatory surface: when the platform is challenged, these controls produce the evidence trail that shows what was known, who approved it, and who saw what.
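The renamed-column case can be caught with a very small gate before silver promotion. The sketch below assumes each source contract is available as a `{column: type}` mapping; the column names and the helper functions are hypothetical. It covers only column and type changes, since semantic changes (same name, new meaning) need value-level checks that a schema comparison cannot provide.

```python
def schema_drift(expected, observed):
    """Compare two {column: type} mappings and classify the differences."""
    shared = set(expected) & set(observed)
    return {
        "missing": sorted(set(expected) - set(observed)),
        "added": sorted(set(observed) - set(expected)),
        "retyped": sorted(c for c in shared if expected[c] != observed[c]),
    }


def can_promote(drift):
    """Missing or retyped columns are breaking; new columns only warn."""
    return not drift["missing"] and not drift["retyped"]


# Hypothetical LMS contract and a delivery in which a column was renamed.
contract = {"student_id": "string", "submitted_at": "timestamp", "score": "double"}
delivery = {"student_id": "string", "submission_time": "timestamp", "score": "double"}

drift = schema_drift(contract, delivery)
promote = can_promote(drift)
```

Here the rename shows up as one missing and one added column, so promotion is blocked instead of the model silently receiving nulls.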

Two of the controls deserve a closer look because they are commonly skipped. Score drift monitoring by group is what catches fairness regressions between releases — the model card records the fairness gap at release, and weekly monitoring against that baseline is how a creeping gap becomes visible before it has done damage. Reason-code mix monitoring is the operational version: if a particular reason code suddenly explains 60 percent of red predictions in one faculty, that is almost certainly a data issue (the LMS in that faculty is exporting differently) rather than a real signal change.
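One way to operationalise the reason-code mix check is sketched below. It assumes each red prediction carries a faculty and a single top reason code; the 0.6 share comes from the example above, while `min_count`, the function name, and the reason-code labels are illustrative assumptions. A fuller monitor would presumably compare against the share recorded at release rather than a fixed cut-off.

```python
from collections import Counter, defaultdict


def reason_code_alerts(red_predictions, max_share=0.6, min_count=20):
    """red_predictions: iterable of (faculty, top_reason_code).

    Returns [(faculty, reason_code, share)] for faculties where a single
    code explains at least max_share of the red predictions.
    """
    by_faculty = defaultdict(Counter)
    for faculty, code in red_predictions:
        by_faculty[faculty][code] += 1

    alerts = []
    for faculty, counts in sorted(by_faculty.items()):
        total = sum(counts.values())
        if total < min_count:
            continue  # too few red predictions for a stable share
        code, n = counts.most_common(1)[0]
        share = n / total
        if share >= max_share:
            alerts.append((faculty, code, share))
    return alerts


# Synthetic example: one code dominates in ENG, the mix in HUM is balanced.
sample = (
    [("ENG", "low_lms_activity")] * 15
    + [("ENG", "missed_submission")] * 5
    + [("HUM", "low_lms_activity")] * 8
    + [("HUM", "missed_submission")] * 6
    + [("HUM", "fee_hold")] * 6
)
alerts = reason_code_alerts(sample)
```

An alert from this check is a prompt to inspect the upstream export for that faculty first, before treating the shift as a real change in student behaviour.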

Key Decisions I Would Defend

| Decision | Recommendation | Why | Accepted cost |
| --- | --- | --- | --- |
| Batch vs streaming | Daily ingestion plus weekly scoring; Event Hubs capture for high-frequency streams. | Good enough for advisor intervention and easier to govern. | Hours of staleness in exchange for governable lineage. |
| Model complexity | Interpretable first. | Easier to defend, calibrate, explain, and approve. | A few AUC points to keep reason codes honest. |
| Global vs local model | One global model with faculty/programme/study-mode context. | More stable at launch; local models need more data. | Slightly weaker per-faculty fit; revisit once two terms of pilot data exist. |
| Raw vs aggregate exposure | Keep raw restricted, expose advisor-useful aggregates. | Preserves audit while reducing privacy risk. | Two storage tiers and a feature contract instead of one flat surface. |
| Protected attributes | Use for audit only. | Needed to detect bias; not appropriate as reason codes. | Storing demographics under audit, never as advisor reason codes. |

Each of these trade-offs has a plausible counter-argument. Real-time scoring would let the platform react to a sudden engagement collapse the day it happens; the design rejects it because advisor intervention has a multi-day cycle anyway, and the operational and governance cost of a streaming feature store is large. A single global model is weaker than per-faculty models for the larger faculties; the design accepts that for launch because the smaller faculties simply do not have enough non-continuation events to fit a stable per-faculty model, and a heterogeneous quality story across faculties is harder to defend than a single calibrated global model. The intent is to revisit local models once two pilot terms have shipped.

Thirty-Week Rollout In Five Phases

Diagram 4: Thirty-week delivery plan with phase swimlanes and milestones
Delivery phases, W0 to W30 (one global model, two pilot faculties): 1. Governance & foundations; 2. Ingestion & history; 3. Features & first model; 4. Advisor pilot; 5. RAI review & hardening. Milestones: DPIA approved, data spine GA, model v1, pilot live, RAI sign-off.

Five sequential phases with diamonds at the only hard gates. Governance review, monitoring, and RAI evidence-gathering keep running once the pilot is live — Phase 5 is the catch-up window for the parallel work that always bleeds in. If Phase 2 slips by a week, Phases 3 to 5 slide by a week and the gates simply move with them.

  1. Weeks 0-4: DPIA, lawful basis, Azure landing zone, residency decision, outcome definition, access model, data contracts.
  2. Weeks 4-10: configure ADLS Gen2/Fabric workspaces, ingest SIS/LMS/ERP/campus, identity map, type-2 history.
  3. Weeks 10-16: point-in-time Fabric feature products, labels, validation, first Azure ML model.
  4. Weeks 16-22: Power BI/Fabric advisor pilot with two faculties, RLS, feedback capture, intervention outcomes.
  5. Weeks 22-30: Responsible AI review, calibration, Azure Monitor alerts, Purview lineage, model registry, operational hardening.
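The slip rule for the sequential phases can be checked with a few lines of arithmetic. The sketch below takes the week ranges from the list above and the phase names from the delivery diagram; `apply_slip` is a hypothetical helper, not part of the prototype.

```python
# (phase name, start week, end week) from the rollout plan.
PHASES = [
    ("Governance & foundations", 0, 4),
    ("Ingestion & history", 4, 10),
    ("Features & first model", 10, 16),
    ("Advisor pilot", 16, 22),
    ("RAI review & hardening", 22, 30),
]


def apply_slip(phases, slipped_phase, weeks):
    """Extend one phase (1-based) by `weeks` and slide every later phase."""
    shifted = []
    for number, (name, start, end) in enumerate(phases, start=1):
        if number < slipped_phase:
            shifted.append((name, start, end))
        elif number == slipped_phase:
            shifted.append((name, start, end + weeks))
        else:
            shifted.append((name, start + weeks, end + weeks))
    return shifted


# Phase 2 slips by one week: phases 3 to 5 move with it.
replanned = apply_slip(PHASES, slipped_phase=2, weeks=1)
```

With a one-week slip in Phase 2, the final phase runs from week 23 to week 31 and every gate after ingestion moves by the same week.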

The roadmap puts governance first deliberately. A common failure mode in projects like this is to rush ingestion, build a model, and only then discover that the lawful basis or the access model was never actually agreed, at which point months of work have to be unwound. Putting the DPIA and access model in weeks 0–4 means the rest of the build happens against a fixed legal frame, not a moving one. The roadmap also places RAI review last on purpose: there is nothing meaningful to audit until the pilot has produced predictions and intervention outcomes.

Run the prototype:

python .\student_outcome_platform_demo.py --out outputs --students 1200 --seed 42

Brief Coverage And References

| Brief requirement | Where the solution addresses it |
| --- | --- |
| Fragmented systems with different formats, update frequencies, APIs, files, and events. | Mixed Data Factory/Fabric ingestion, Event Hubs, source contracts, bronze metadata, schema drift controls. |
| Inconsistent identifiers. | Canonical identity_map with source IDs, validity windows, and match confidence. |
| Student status changes over time. | Type-2 student_status_history, effective-dated finance snapshots, and point-in-time joins. |
| Train only on available information. | event_time <= t, available_at <= t, feature snapshots, and closed label windows. |
| Diverse population. | Segmented baselines, protected-attribute audit, calibration checks, and reason-code monitoring. |
| Explain and audit. | Risk band, score range, top reasons, source freshness, feature hash, model version, and access_audit. |
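The two point-in-time rules in the table (`event_time <= t`, `available_at <= t`, and type-2 status history) can be sketched as follows. The record layout is an assumption: `valid_from` and `valid_to` are hypothetical column names for the validity window, with `valid_to` exclusive and `None` meaning the row is still open.

```python
from datetime import datetime


def usable(record, t):
    """usable = 1[event_time <= t and available_at <= t]."""
    return record["event_time"] <= t and record["available_at"] <= t


def status_as_of(history, student_id, t):
    """Return the type-2 status row that was valid at time t, or None."""
    for row in history:
        if row["student_id"] != student_id:
            continue
        still_open = row["valid_to"] is None or t < row["valid_to"]
        if row["valid_from"] <= t and still_open:
            return row
    return None


t = datetime(2026, 3, 1)  # scoring date

events = [
    # Happened and landed before t: usable.
    {"id": 1, "event_time": datetime(2026, 2, 20), "available_at": datetime(2026, 2, 21)},
    # Happened before t but arrived late: blocked in historical simulation.
    {"id": 2, "event_time": datetime(2026, 2, 28), "available_at": datetime(2026, 3, 5)},
    # Happened after t: future record, blocked.
    {"id": 3, "event_time": datetime(2026, 3, 3), "available_at": datetime(2026, 3, 3)},
]
usable_ids = [e["id"] for e in events if usable(e, t)]

student_status_history = [
    {"student_id": "S1", "status": "active",
     "valid_from": datetime(2025, 9, 1), "valid_to": datetime(2026, 2, 15)},
    {"student_id": "S1", "status": "leave",
     "valid_from": datetime(2026, 2, 15), "valid_to": None},
]
status_now = status_as_of(student_status_history, "S1", t)["status"]
status_january = status_as_of(student_status_history, "S1", datetime(2026, 1, 10))["status"]
```

The second event is the case that current-state joins get wrong: it describes the period before the scoring date, but the platform did not know about it yet, so a historical simulation must not see it.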

Reference basis: Microsoft Fabric overview, medallion lakehouse architecture, Azure Event Hubs Capture, Azure Data Factory to Purview lineage, Azure ML Responsible AI dashboard, and Power BI row-level security.

Open Questions For The University

A short list of items the design intentionally leaves to discovery, because the answers shape thresholds and scope rather than the platform shape.

| Topic | Question | Why it matters |
| --- | --- | --- |
| Outcome definition | Are exchange semesters, approved leave, and programme transfers all excluded from the non-continuation label, and how is that recorded? | Mislabelled positives inflate recall and disguise model error. |
| Advisor capacity | What is the actual weekly capacity by faculty, including peak weeks (mid-term and exam season)? | Capacity sets τred and τamber; a guess produces an unbacked queue. |
| Source freshness SLAs | What is the worst acceptable lag for SIS status, LMS submissions, and ERP balances? | Determines whether weekly scoring catches the signal in time. |
| Privacy boundaries | Is wellbeing or counselling data in scope, and under which lawful basis? | Scope creep here turns the platform into a different DPIA. |
| Evidence for explanation | Will advisors share specific reason codes with students, and in what register? | Drives the reason-code style guide and the support-route catalogue. |

These five questions are the only items the design refuses to guess on. Everything else (service choices, table shapes, validation metrics, fairness thresholds) is decided in the document above. These five are decided with the university because the right answers depend on policy and capacity that live outside the engineering team, and a platform that pretends otherwise has to be reworked the first time a real DPIA reviewer or faculty operations lead asks the question.
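To show why the advisor-capacity answer matters, the sketch below turns capacity into thresholds using the brief's rule τ = quantile(r, 1 − capacity / N). It uses the run-summary figures (N = 35,000, red band 220, amber band 380) and assumes the amber band is the next 380 students after the red band. The scores are an evenly spaced placeholder, not model output, so the resulting thresholds are not the pilot's values.

```python
def capacity_threshold(scores, capacity):
    """Score of the capacity-th highest student, i.e. the empirical
    quantile(r, 1 - capacity / N). Ties at the cut can exceed capacity."""
    if not 0 < capacity <= len(scores):
        raise ValueError("capacity must be between 1 and N")
    return sorted(scores, reverse=True)[capacity - 1]


def assign_band(score, tau_red, tau_amber):
    """Map a risk score to an advisor band."""
    if score >= tau_red:
        return "red"
    if score >= tau_amber:
        return "amber"
    return "green"


N = 35_000
C_RED, C_AMBER = 220, 380  # advisor capacity per band (run summary)

# Placeholder scores: evenly spaced and distinct, NOT real model output.
scores = [i / N for i in range(N)]

tau_red = capacity_threshold(scores, C_RED)
tau_amber = capacity_threshold(scores, C_RED + C_AMBER)
bands = [assign_band(s, tau_red, tau_amber) for s in scores]
band_sizes = {b: bands.count(b) for b in ("red", "amber", "green")}
```

Because the thresholds are derived from capacity rather than chosen by hand, a change in the university's answer (for example lower capacity in exam weeks) moves τred and τamber automatically while keeping the queue size equal to what advisors can actually handle.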