# Student Outcome Intelligence Platform

Concise case-study answer: build an Azure-based student-risk platform, not just a dropout model. The core deliverable is an auditable advisor queue, built on point-in-time features, that turns SIS, LMS, ERP, and campus signals into explainable support actions.

## Case Summary

The case is a 20-30 minute design presentation for a large Norwegian university with 35,000 students across 8 faculties. The university wants to identify students at risk of dropping out early enough for academic advisors to intervene, before the issue appears in official grade, withdrawal, or status records.

The hard problem is that the signals are split across four systems that were built independently:

- Student Information System: identity, enrolment, programme, faculty, grades, graduation, leave, withdrawal, and status history.
- Learning Management System: logins, course access, submissions, forums, video watch time, quizzes, and other high-frequency learning behavior.
- ERP: financial aid, tuition payment status, scholarships, balances, and overdue-payment signals.
- Physical campus systems: library access, WiFi presence, and building-entry activity.

The platform must handle inconsistent identifiers, changing student status, late-arriving records, and schema changes, while meeting requirements for diverse engagement patterns, personal-data protection, explainability, fairness, and auditability. The proposed solution is an Azure and Microsoft Fabric data spine that makes these records trustworthy, builds point-in-time features, scores dropout risk, and delivers an advisor queue with reason codes and access audit.

## Executive Answer

Create a Microsoft Azure and Fabric platform that resolves student identity, stores mutable SIS records as history, builds leakage-safe feature snapshots, predicts non-continuation risk during the term, and gives advisors risk bands plus reason codes.

Use Azure Data Factory or Fabric Data Factory for scheduled extracts, Azure Event Hubs for high-frequency LMS and campus events, ADLS Gen2 and OneLake for the medallion lakehouse, Microsoft Purview for catalog and lineage, Azure Machine Learning for model lifecycle, and Power BI or a Fabric app for advisor delivery.

Launch with daily ingestion and weekly scoring. Use streaming only where faster data changes an advisor decision; otherwise capture events into the lake and govern them like any other source.

## Assumptions And Scope

| Area | Assumption | Design consequence |
| --- | --- | --- |
| University size | 35,000 students across 8 faculties, with mixed full-time, part-time, international, and remote/hybrid study patterns. | Segment validation by faculty, programme, study mode, and term week instead of using one unqualified engagement baseline. |
| Decision use | The score supports advisor prioritization only. | No automated academic, disciplinary, or financial action is allowed from the score. |
| Outcome | Risk means next eligible term non-continuation, excluding graduation, exchange completion, approved leave, and administrative data corrections. | Labels need an exclusion table and a closed observation window before training. |
| Timing | Advisors need enough lead time to intervene before final grades or official withdrawal records. | Score in-semester on a weekly cadence after early engagement signals exist. |
| Data law and ethics | The platform handles personal data and may be challenged by students or regulators. | Build DPIA, purpose limitation, minimization, lineage, access audit, model cards, and appeal/explanation support from the start. |
| Cloud target | The university can deploy to an Azure tenant with Norwegian/EU residency controls and Microsoft 365 identity already in use. | Prefer Microsoft Fabric, ADLS Gen2, Azure ML, Purview, Entra ID, Key Vault, Monitor, and Power BI rather than a cloud-neutral stack. |
| Preparation time | The interview asks for a 20-30 minute presentation, not a full implementation plan. | Focus on defensible architecture choices, trade-offs, formulas, controls, and pilot path. |

## Formula Sheet

<div class="equation-stack">
  <section class="equation-panel">
    <h3>Target</h3>
    <div class="math-expr">Y<sub>i,t+h</sub> = 1</div>
    <p>Student <span class="math-note">i</span> does not continue in the next eligible term, excluding graduation, approved leave, exchange completion, and administrative data corrections.</p>
  </section>
  <section class="equation-panel">
    <h3>Risk Score</h3>
    <div class="math-expr">r<sub>i,t</sub> = <span class="math-fn">P</span>(Y<sub>i,t+h</sub> = 1 | X<sub>i,t</sub>)</div>
    <p>Risk is the probability of next-term non-continuation using only features available at scoring date <span class="math-note">t</span>.</p>
  </section>
  <section class="equation-panel">
    <h3>Interpretable Baseline</h3>
    <div class="equation-lines">
      <div class="equation-line"><span class="indent"></span><span><span class="math-fn">logit</span>(r<sub>i,t</sub>) = &beta;<sub>0</sub></span></div>
      <div class="equation-line"><span class="indent">+</span><span>&beta;<sub>1</sub> prior_gpa<sub>i,t</sub></span></div>
      <div class="equation-line"><span class="indent">+</span><span>&beta;<sub>2</sub> missing_assignments<sub>i,t</sub></span></div>
      <div class="equation-line"><span class="indent">+</span><span>&beta;<sub>3</sub> days_since_lms<sub>i,t</sub></span></div>
      <div class="equation-line"><span class="indent">+</span><span>&beta;<sub>4</sub> payment_overdue_days<sub>i,t</sub></span></div>
      <div class="equation-line"><span class="indent">+</span><span>&beta;<sub>5</sub> campus_drop<sub>i,t</sub></span></div>
      <div class="equation-line"><span class="indent">+</span><span>&beta;<sub>6..k</sub> academic_context<sub>i,t</sub></span></div>
    </div>
  </section>
  <section class="equation-panel">
    <h3>Advisor Capacity Threshold</h3>
    <div class="equation-lines">
      <div class="equation-line"><span class="indent"></span><span>&tau;<sub>red</sub> = <span class="math-fn">quantile</span>(r<sub>i,t</sub>, 1 - C<sub>red</sub> / N)</span></div>
      <div class="equation-line"><span class="indent"></span><span>&tau;<sub>amber</sub> = <span class="math-fn">quantile</span>(r<sub>i,t</sub>, 1 - (C<sub>red</sub> + C<sub>amber</sub>) / N)</span></div>
    </div>
    <p>Thresholds are tied to real advisor capacity rather than a detached score cutoff.</p>
  </section>
  <section class="equation-panel">
    <h3>Risk Band</h3>
    <div class="equation-rule">
      <div class="risk-band-rule"><strong>Red</strong><span>r<sub>i,t</sub> &ge; &tau;<sub>red</sub></span></div>
      <div class="risk-band-rule"><strong>Amber</strong><span>&tau;<sub>amber</sub> &le; r<sub>i,t</sub> &lt; &tau;<sub>red</sub></span></div>
      <div class="risk-band-rule"><strong>Green</strong><span>otherwise</span></div>
    </div>
  </section>
</div>
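
A minimal sketch of the capacity-based thresholding above, assuming risk scores are already computed. The function names, the synthetic beta-distributed scores, and the capacity numbers (300 red, 900 amber) are illustrative and not taken from the demo script.

```python
import numpy as np

def capacity_thresholds(scores: np.ndarray, c_red: int, c_amber: int):
    """Turn advisor capacity into score thresholds, per the formulas above."""
    n = len(scores)
    tau_red = np.quantile(scores, 1 - c_red / n)
    tau_amber = np.quantile(scores, 1 - (c_red + c_amber) / n)
    return tau_red, tau_amber

def risk_band(score: float, tau_red: float, tau_amber: float) -> str:
    """Apply the red/amber/green band rule above to one score."""
    if score >= tau_red:
        return "red"
    if score >= tau_amber:
        return "amber"
    return "green"

# Example: 35,000 students, capacity for 300 red and 900 amber follow-ups.
rng = np.random.default_rng(42)
scores = rng.beta(2, 20, size=35_000)  # synthetic risk scores in [0, 1]
tau_red, tau_amber = capacity_thresholds(scores, c_red=300, c_amber=900)
bands = [risk_band(s, tau_red, tau_amber) for s in scores]
```

Because the thresholds are quantiles of the live score distribution, the queue length tracks advisor capacity each week instead of drifting with model recalibration.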

## Reference Architecture

![Reference architecture diagram](architecture_reference.svg)

| Layer | Azure decision | Purpose | Output |
| --- | --- | --- | --- |
| Sources | SIS, LMS, ERP, campus systems | Source-owned files, APIs, database extracts, and events | Raw operational evidence |
| Ingestion | Azure Data Factory / Fabric Data Factory, Azure Event Hubs, managed identities | Scheduled extraction, event capture, schema checks, freshness logging | Landing records with source metadata |
| Raw lake | ADLS Gen2, Event Hubs Capture, immutable folders | Preserve original records exactly as received | Replayable bronze evidence |
| Curated lakehouse | Microsoft Fabric OneLake Lakehouses with Delta tables | Identity resolution, type-2 history, quality contracts, deduplication | Trusted silver tables |
| Feature products | Fabric SQL analytics endpoint, notebooks, data warehouse where useful | Student-term facts, point-in-time weekly features, closed outcome labels | Gold training and scoring tables |
| ML lifecycle | Azure Machine Learning registry, pipelines, Responsible AI dashboard | Training, validation, calibration, fairness analysis, model cards, threshold release | Versioned model and threshold package |
| Advisor delivery | Power BI / Fabric app, Teams notification, row-level security | Risk bands, reason codes, score trend, intervention workflow | Audited support queue |
| Control plane | Microsoft Purview, Entra ID, Key Vault, Azure Monitor, Log Analytics, Azure Policy | Catalog, lineage, access, secrets, alerts, drift, deployment governance | Evidence for operations and audit |

## Azure Service Blueprint

| Need from the brief | Azure service choice | Why it fits |
| --- | --- | --- |
| Four systems with different files, APIs, events, and update frequencies | Azure Data Factory / Fabric Data Factory plus Azure Event Hubs | Handles scheduled copy, metadata activities, API/file extraction, and high-frequency streams without forcing one ingestion pattern. |
| Raw evidence that can be replayed | ADLS Gen2 with hierarchical namespace and lifecycle policies | Provides scalable, low-cost lake storage with directory semantics, file-level security, and retention tiers. |
| Governed analytics layer | Microsoft Fabric OneLake Lakehouses using bronze, silver, and gold layers | Keeps lakehouse layers separate, uses Delta tables for curated data, and exposes SQL/Power BI access for analytics. |
| Lineage and data discovery | Microsoft Purview Data Map and Unified Catalog | Captures metadata across disparate sources and gives data stewards a searchable governance layer. |
| Model training, release, and fairness review | Azure Machine Learning with Responsible AI dashboard | Supports model versioning and gives reviewers error analysis, fairness metrics, feature importance, and cohort analysis. |
| Advisor-facing restricted view | Power BI/Fabric app with Entra groups and row-level security | Lets advisors see only assigned students or faculties while preserving semantic-model governance. |
| Secrets, encryption, and access control | Microsoft Entra ID, managed identities, Key Vault, private endpoints where required | Reduces password handling and centralizes role-based access, credentials, and keys. |
| Production monitoring | Azure Monitor, Log Analytics, Application Insights, alert rules | Tracks pipeline health, source freshness, schema failures, endpoint latency, score drift, and access events. |

## Microsoft Reference Basis

| Design point | Microsoft reference used | How it informs this answer |
| --- | --- | --- |
| Unified analytics foundation | [Microsoft Fabric overview](https://learn.microsoft.com/en-us/fabric/fundamentals/microsoft-fabric-overview) | Supports the OneLake/Fabric lakehouse choice for a university-wide data spine. |
| Bronze, silver, and gold data layers | [Medallion lakehouse architecture](https://learn.microsoft.com/en-us/azure/databricks/lakehouse/medallion) | Supports preserving raw records, promoting validated history, and publishing gold feature products. |
| Stream capture for LMS and campus signals | [Azure Event Hubs Capture](https://learn.microsoft.com/en-us/azure/event-hubs/event-hubs-capture-overview) | Supports capturing high-frequency events into Blob Storage or ADLS for long-term retention and batch processing. |
| Pipeline lineage and auditability | [Connect Azure Data Factory to Microsoft Purview](https://learn.microsoft.com/en-us/azure/purview/how-to-link-azure-data-factory) | Supports linking ingestion and transformation evidence into Purview lineage. |
| Responsible model review | [Azure ML Responsible AI dashboard](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-responsible-ai-dashboard) | Supports attaching interpretability, error analysis, and review artifacts to registered models. |
| Advisor-scoped access | [Power BI row-level security](https://learn.microsoft.com/en-us/power-bi/enterprise/service-admin-rls) | Supports Entra-backed row-level restrictions in the advisor queue. |

## Source System Integration

| Source | Typical content | Ingestion pattern | Key risk | Control |
| --- | --- | --- | --- | --- |
| SIS | Enrolment status, programme, faculty, grades, personal details, graduation, withdrawal, leave history. | Azure Data Factory connector, database extract, or secure file landing into ADLS Gen2 bronze. | Current-state fields overwrite historical truth. | Type-2 status and programme history with `valid_from`, `valid_to`, and `recorded_at`. |
| LMS | Logins, course access, assignment submissions, forum activity, video watch time, quiz attempts. | Event Hubs for activity streams where available; otherwise Fabric/Data Factory incremental pulls normalized to daily facts. | Noisy behavior differs by course design and study mode. | Course/week normalization, source freshness checks, and missingness indicators. |
| ERP | Financial aid, tuition payment status, scholarships, balances, overdue days. | Daily vendor snapshot through Data Factory with managed identity or Key Vault-backed credential. | Different identifiers and delayed payment updates. | Identity map, effective-dated snapshots, and `available_at` timestamps. |
| Campus | Library access, WiFi presence, building entry events. | Event Hubs Capture to ADLS Gen2, or daily aggregated privacy-preserving file extract. | Physical presence is not equally meaningful for remote, commuter, placement, or part-time students. | Aggregate to low-granularity engagement features and calibrate by study mode/programme. |

Each feed lands with `source_system`, `source_record_id`, `event_time`, `available_at`, `ingested_at`, schema version, source URI, pipeline run ID, and Purview asset reference. Breaking schema changes are quarantined; non-breaking changes are profiled before promotion to trusted tables.
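
As an illustration of that contract, one landing record's metadata could look like the following; every value here is invented, and the field names simply mirror the list above.

```python
# Hypothetical bronze landing metadata for one LMS extract; all values are
# illustrative, not from a real pipeline run.
landing_metadata = {
    "source_system": "lms",
    "source_record_id": "lms-activity-2026-02-14-000123",
    "event_time": "2026-02-14T09:31:52Z",    # when the event happened
    "available_at": "2026-02-14T10:05:00Z",  # when the source exposed it
    "ingested_at": "2026-02-15T02:10:41Z",   # when the pipeline landed it
    "schema_version": "lms_activity_v3",
    "source_uri": "https://lms.example.edu/api/v1/activity",  # placeholder
    "pipeline_run_id": "adf-run-7f3c",       # Data Factory run reference
    "purview_asset_id": "purview-asset-001", # catalog reference
}
```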

## Data Model

| Table | Key columns | Role |
| --- | --- | --- |
| `identity_map` | `canonical_student_id`, `source_system`, `source_person_id`, `valid_from`, `valid_to`, `match_confidence` | Joins fragmented systems safely. |
| `student_status_history` | `status`, `faculty`, `programme`, `valid_from`, `valid_to`, `recorded_at` | Prevents current-state leakage. |
| `ingestion_audit` | `pipeline_run_id`, `source_system`, `source_uri`, `schema_version`, `record_count`, `watermark`, `purview_asset_id` | Links data products back to Azure pipeline evidence. |
| `fact_enrollment_term` | `term_id`, `credits_registered`, `prior_gpa`, `academic_standing` | Academic baseline. |
| `fact_lms_activity_daily` | `activity_date`, `login_count`, `course_views`, `assignment_due`, `assignment_submitted` | Digital engagement. |
| `fact_financial_snapshot` | `status_date`, `outstanding_balance_nok`, `payment_overdue_days`, `aid_status` | Financial friction. |
| `fact_campus_activity_daily` | `event_date`, `building_entry_count`, `library_entry_count`, `wifi_minutes` | Aggregated physical engagement. |
| `feature_student_week` | `as_of_date`, `term_week`, feature columns, `feature_snapshot_hash` | Model input. |
| `risk_prediction` | `prediction_id`, `model_version`, `risk_score`, `risk_band`, `top_reasons` | Advisor output. |
| `access_audit` | `viewer_user_id`, `purpose`, `timestamp`, `prediction_id`, `fields_returned` | Accountability. |
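
A sketch of a point-in-time lookup against the type-2 history table, assuming a pandas frame with the columns listed above; the helper name `status_as_of` is hypothetical.

```python
import pandas as pd

def status_as_of(history: pd.DataFrame, student_id: str, as_of: pd.Timestamp) -> str:
    """Look up a student's status as of a date from the type-2 history table.

    Assumes `history` has canonical_student_id, status, valid_from, and
    valid_to columns (valid_to is NaT for the open row), matching
    student_status_history above.
    """
    rows = history[
        (history["canonical_student_id"] == student_id)
        & (history["valid_from"] <= as_of)
        & (history["valid_to"].isna() | (history["valid_to"] > as_of))
    ]
    # At most one row should be valid at any instant; fail loudly otherwise.
    if len(rows) > 1:
        raise ValueError(f"overlapping history rows for {student_id}")
    return rows["status"].iloc[0] if len(rows) == 1 else "unknown"
```

This is the join discipline that prevents a current status such as "withdrawn" from leaking into features built for an earlier scoring date.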

## Point-In-Time Features

Every row is built as of `t = as_of_date`.

<div class="equation-stack">
  <section class="equation-panel">
    <h3>Leakage Rule</h3>
    <div class="math-expr">usable(record, t) = 1[event_time &le; t and available_at &le; t]</div>
  </section>
  <section class="equation-panel">
    <h3>14-Day LMS Activity</h3>
    <div class="math-expr">lms_logins_14d(i,t) = &sum; login_count<sub>i,d</sub> for d in [t - 13, t]</div>
  </section>
  <section class="equation-panel">
    <h3>Days Since LMS</h3>
    <div class="math-expr">days_since_lms(i,t) = t - <span class="math-fn">max</span>(activity_date<sub>i</sub> where login_count<sub>i,d</sub> &gt; 0)</div>
  </section>
  <section class="equation-panel">
    <h3>Assessment Gap</h3>
    <div class="math-expr">missing_assignments(i,t) = <span class="math-fn">max</span>(due_to_date<sub>i,t</sub> - submitted_to_date<sub>i,t</sub>, 0)</div>
  </section>
  <section class="equation-panel">
    <h3>Submission Rate</h3>
    <div class="math-expr">submission_rate(i,t) = submitted_to_date<sub>i,t</sub> / <span class="math-fn">max</span>(due_to_date<sub>i,t</sub>, 1)</div>
  </section>
  <section class="equation-panel">
    <h3>Campus Drop</h3>
    <div class="math-expr">campus_drop(i,t) = campus_days_first_14d<sub>i</sub> - campus_days_14d<sub>i,t</sub></div>
  </section>
  <section class="equation-panel">
    <h3>Financial Signal</h3>
    <div class="math-expr">overdue_flag(i,t) = 1[payment_overdue_days<sub>i,t</sub> &gt; 0]</div>
  </section>
</div>

Keep labels separate until the outcome window closes. Store each feature snapshot so predictions can be reproduced exactly.
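
A sketch of the leakage rule and two of the LMS features above in pandas, assuming a frame shaped like `fact_lms_activity_daily` plus an `available_at` column; column and function names are illustrative.

```python
import pandas as pd

def lms_features(activity: pd.DataFrame, as_of: pd.Timestamp) -> pd.DataFrame:
    """Point-in-time LMS features per the formulas above.

    Assumes `activity` matches fact_lms_activity_daily with an extra
    available_at column for when each row became visible downstream.
    """
    # Leakage rule: keep only rows visible at the scoring date t.
    usable = activity[
        (activity["activity_date"] <= as_of) & (activity["available_at"] <= as_of)
    ]

    # 14-day login count: sum over the window [t - 13, t].
    window = usable[usable["activity_date"] >= as_of - pd.Timedelta(days=13)]
    logins_14d = window.groupby("canonical_student_id")["login_count"].sum()

    # Days since last LMS login, using only usable rows.
    active = usable[usable["login_count"] > 0]
    last_login = active.groupby("canonical_student_id")["activity_date"].max()
    days_since = (as_of - last_login).dt.days

    # NaN here means no usable activity for that student; handle explicitly
    # downstream with a missingness indicator rather than silent imputation.
    return pd.DataFrame({"lms_logins_14d": logins_14d, "days_since_lms": days_since})
```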

## Model And Thresholding

Start in Azure Machine Learning with calibrated logistic regression or explainable gradient boosted trees. Optimize for useful advisor prioritization, not only aggregate accuracy.

<div class="equation-stack">
  <section class="equation-panel">
    <h3>Precision At Capacity</h3>
    <div class="math-expr">Precision@C = true_positives_in_top_C / C</div>
  </section>
  <section class="equation-panel">
    <h3>Recall At Capacity</h3>
    <div class="math-expr">Recall@C = true_positives_in_top_C / all_actual_positives</div>
  </section>
  <section class="equation-panel">
    <h3>Lead Time</h3>
    <div class="math-expr">lead_time<sub>i</sub> = outcome_date<sub>i</sub> - first_red_or_amber_score_date<sub>i</sub></div>
  </section>
  <section class="equation-panel">
    <h3>Calibration Bin Error</h3>
    <div class="math-expr">cal_error<sub>b</sub> = | <span class="math-fn">mean</span>(Y<sub>i</sub> in bin b) - <span class="math-fn">mean</span>(r<sub>i,t</sub> in bin b) |</div>
  </section>
  <section class="equation-panel">
    <h3>Expected Calibration Error</h3>
    <div class="math-expr">ECE = &sum;<sub>b</sub> (n<sub>b</sub> / N) cal_error<sub>b</sub></div>
  </section>
</div>

Use thresholds tied to advisor capacity, for example top 5 percent red and next 15 percent amber. Register each model and threshold package together in Azure ML, then review thresholds each term before release.
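
A sketch of Precision@C, Recall@C, and ECE as defined above; the array names are illustrative, and the ECE binning is the simple equal-width scheme from the formula.

```python
import numpy as np

def precision_recall_at_capacity(y_true: np.ndarray, scores: np.ndarray, c: int):
    """Precision@C and Recall@C per the formulas above."""
    top_c = np.argsort(scores)[::-1][:c]  # indices of the highest-risk C students
    tp_top = y_true[top_c].sum()
    return tp_top / c, tp_top / max(y_true.sum(), 1)

def expected_calibration_error(y_true: np.ndarray, scores: np.ndarray, bins: int = 10):
    """ECE as the count-weighted mean absolute gap per score bin."""
    edges = np.linspace(0, 1, bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (scores >= lo) & (scores < hi)
        if mask.any():
            # mask.mean() is n_b / N; the abs term is cal_error_b.
            ece += mask.mean() * abs(y_true[mask].mean() - scores[mask].mean())
    return ece
```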

## Fairness And Calibration

Protected attributes are retained for audit, not shown as advisor reason codes.

<div class="equation-stack">
  <section class="equation-panel">
    <h3>Flag Rate By Group</h3>
    <div class="math-expr">flag_rate<sub>g</sub> = count(r<sub>i,t</sub> &ge; &tau;<sub>amber</sub> and group<sub>i</sub> = g) / count(group<sub>i</sub> = g)</div>
  </section>
  <section class="equation-panel">
    <h3>False Positive Rate</h3>
    <div class="math-expr">FPR<sub>g</sub> = FP<sub>g</sub> / (FP<sub>g</sub> + TN<sub>g</sub>)</div>
  </section>
  <section class="equation-panel">
    <h3>False Negative Rate</h3>
    <div class="math-expr">FNR<sub>g</sub> = FN<sub>g</sub> / (FN<sub>g</sub> + TP<sub>g</sub>)</div>
  </section>
  <section class="equation-panel">
    <h3>Recall Parity Gap</h3>
    <div class="math-expr">gap_recall = <span class="math-fn">max</span><sub>g</sub>(recall<sub>g</sub>) - <span class="math-fn">min</span><sub>g</sub>(recall<sub>g</sub>)</div>
  </section>
  <section class="equation-panel">
    <h3>Calibration Gap</h3>
    <div class="math-expr">gap_calibration = <span class="math-fn">max</span><sub>g</sub>(ECE<sub>g</sub>) - <span class="math-fn">min</span><sub>g</sub>(ECE<sub>g</sub>)</div>
  </section>
</div>

Audit by gender, age band, international status, first-generation status, faculty, programme, and study mode. Use Azure ML Responsible AI outputs for cohort analysis and model explanations, then store the signed-off fairness report beside the model card. Also monitor reason-code distributions to catch proxy discrimination.
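
A sketch of the per-group audit metrics above in pandas, assuming a scored frame with `y_true`, `score`, and `group` columns; these names are illustrative rather than the demo script's schema.

```python
import pandas as pd

def fairness_gaps(df: pd.DataFrame, tau_amber: float) -> pd.DataFrame:
    """Per-group flag rate, FPR, and FNR from a scored, labeled frame."""
    flagged = df["score"] >= tau_amber
    out = []
    for g, grp in df.groupby("group"):
        f = flagged[grp.index]
        fp = (f & (grp["y_true"] == 0)).sum()
        tn = (~f & (grp["y_true"] == 0)).sum()
        fn = (~f & (grp["y_true"] == 1)).sum()
        tp = (f & (grp["y_true"] == 1)).sum()
        out.append({
            "group": g,
            "flag_rate": f.mean(),
            "fpr": fp / max(fp + tn, 1),
            "fnr": fn / max(fn + tp, 1),
        })
    # Recall parity gap: max over groups of (1 - fnr) minus the min.
    return pd.DataFrame(out)
```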

## Advisor Workflow

The advisor view should show only decision-support evidence:

- Student, programme, risk band, score range, and score trend.
- Top three actionable reasons with source freshness.
- Recommended support route: academic check-in, financial-aid referral, study-skills support, or wellbeing referral.
- Contact status, notes, and intervention outcome.
- "Not relevant" feedback, captured for monitoring.

Do not automate punitive, academic, or financial decisions from the score.
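
One way the top-three reasons could be derived for the interpretable baseline: rank the positive, risk-raising contributions of standardized features under the logistic model. The coefficients and feature values below are invented for illustration.

```python
def top_reasons(coefs: dict, features: dict, k: int = 3) -> list:
    """Rank risk-raising contributions of standardized features.

    Assumes a logistic baseline: contribution = coefficient * feature value.
    Only positive (risk-raising) contributions become advisor reason codes.
    """
    contribs = {name: coefs[name] * features.get(name, 0.0) for name in coefs}
    risky = [(n, c) for n, c in contribs.items() if c > 0]
    risky.sort(key=lambda nc: nc[1], reverse=True)
    return [name for name, _ in risky[:k]]

# Illustrative coefficients and one student's standardized features.
coefs = {"missing_assignments": 0.8, "days_since_lms": 0.6,
         "payment_overdue_days": 0.5, "prior_gpa": -0.9, "campus_drop": 0.4}
student = {"missing_assignments": 1.7, "days_since_lms": 2.1,
           "payment_overdue_days": 0.0, "prior_gpa": -0.5, "campus_drop": 0.9}
print(top_reasons(coefs, student))
# ['missing_assignments', 'days_since_lms', 'prior_gpa']
```

In production these names would map to the human-readable, actionable phrasings shown to advisors, never raw feature identifiers.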

## Controls And Operations

| Control | Production rule |
| --- | --- |
| Freshness | Azure Monitor alert when a Data Factory/Fabric pipeline misses SLA or an Event Hubs stream falls behind. |
| Completeness | Compare expected vs received enrolment and event counts by faculty and term in validation tables. |
| Identity | Track unmatched IDs and low-confidence matches; route exceptions to data stewardship. |
| Schema drift | Block breaking column, type, or semantic changes before silver promotion. |
| Score drift | Monitor score distribution, feature drift, and reason-code mix by term, faculty, programme, and study mode. |
| Governance | Register assets and lineage in Purview; version model, features, data, thresholds, reason-code logic, and approvals. |
| Security | Pseudonymous modeling, restricted PII, encryption, private access where needed, Entra groups, Key Vault secrets, and Power BI/Fabric row-level advisor access. |
| Audit | Store advisor access, data-product lineage, model release approval, and threshold approval in immutable audit tables. |
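
A sketch of the freshness control, assuming the `ingestion_audit` rows carry the `ingested_at` timestamp from the landing metadata; the SLA hours are placeholders to be set per data contract, and in production the alert would come from Azure Monitor rather than this check.

```python
import pandas as pd

# Hypothetical per-source freshness SLAs in hours; set per data contract.
SLA_HOURS = {"sis": 26, "lms": 26, "erp": 26, "campus": 26}

def stale_sources(audit: pd.DataFrame, now: pd.Timestamp) -> list:
    """Flag sources whose latest successful landing breaches its SLA.

    Assumes `audit` matches ingestion_audit above plus an ingested_at column.
    """
    latest = audit.groupby("source_system")["ingested_at"].max()
    breaches = []
    for source, sla in SLA_HOURS.items():
        last = latest.get(source)
        if last is None or (now - last) > pd.Timedelta(hours=sla):
            breaches.append(source)  # would raise an Azure Monitor alert
    return breaches
```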

## Brief Coverage Map

| Brief requirement | Where the solution addresses it |
| --- | --- |
| Four independent systems with different formats, update frequencies, APIs, files, and events. | Azure Data Factory/Fabric Data Factory, Event Hubs, source contracts, bronze ingestion metadata, schema drift controls, and per-source integration patterns. |
| Inconsistent identifiers across systems. | Canonical `identity_map` with source IDs, validity windows, and match confidence. |
| Student status changes over time. | Type-2 `student_status_history`, effective-dated finance snapshots, and point-in-time joins. |
| Predict before official records show the issue. | Weekly in-semester scoring, engagement/finance/campus features, and lead-time evaluation. |
| Train only on information available at prediction time. | `event_time <= t` and `available_at <= t` rule, feature snapshots, and closed label windows. |
| Diverse population and different engagement patterns. | Segmented baselines, protected-attribute audit, calibration checks, and reason-code distribution monitoring. |
| Explain why a student was flagged red. | Risk band, score range, top actionable reasons, source freshness, feature snapshot hash, and model version. |
| Show who accessed student data. | `access_audit` table with viewer, role, purpose, timestamp, fields returned, and prediction ID. |
| Keep working after launch. | Azure Monitor, Log Analytics, Purview, freshness, completeness, identity, schema drift, score drift, governance, security, rollback, and monitoring controls. |
| Make assumptions and trade-offs explicit. | Assumptions table and trade-off table, including batch vs streaming, interpretability, global/local model, and protected-attribute use. |

## Trade-Offs

| Decision | Recommendation | Why |
| --- | --- | --- |
| Batch vs streaming | Data Factory/Fabric daily batch plus Event Hubs capture for high-frequency streams | Good enough for advisor intervention while preserving the option for near-real-time signals later. |
| Model complexity | Interpretable first | Easier to defend, calibrate, and explain. |
| Global vs local model | One global model with context features | More stable at launch; local models need more data. |
| Raw vs aggregate exposure | Keep raw restricted, expose aggregates | Preserves audit while reducing privacy risk. |
| Protected attributes | Use for audit only | Needed to detect bias; not appropriate as reason codes. |
| Fabric vs custom app first | Power BI/Fabric advisor view first, custom workflow only if needed | Faster to govern with Entra groups, RLS, semantic models, and audit before building bespoke software. |

## Roadmap And Demo

1. Weeks 0-4: DPIA, lawful basis, Azure landing zone, residency decision, outcome definition, access model, data contracts.
2. Weeks 4-10: configure ADLS Gen2/Fabric workspaces, ingest SIS/LMS/ERP/campus, identity map, type-2 history.
3. Weeks 10-16: point-in-time Fabric feature products, labels, validation, first Azure ML model.
4. Weeks 16-22: Power BI/Fabric advisor pilot with two faculties, RLS, feedback capture, intervention outcomes.
5. Weeks 22-30: Responsible AI review, calibration, Azure Monitor alerts, Purview lineage, model registry, operational hardening.

Run the prototype:

```powershell
python .\student_outcome_platform_demo.py --out outputs --students 1200 --seed 42
```

Key outputs: `outputs/demo_dashboard.html`, `outputs/advisor_risk_list.csv`, `outputs/feature_student_week_and_predictions.csv`, `outputs/fairness_report.csv`, `outputs/model_validation_metrics.csv`, and `outputs/access_audit_sample.csv`.
