Catches the denial before it happens. Open-source tooling for healthcare operators who need early-warning payer behavior intelligence.
The statistical and ML methodology behind Upstream's Care Intelligence Platform. Public CMS data only. No production model weights. No proprietary payer data. No PHI.
Reference code that demonstrates how Upstream detects payer behavior shifts, predicts denials, and clusters payers into behavioral archetypes. Written so the broader healthcare RCM community can learn from the methodology, validate the techniques, and contribute back.
This is the open methodology. Not the production engine.
Production code. The production system uses 40 plus features, live behavioral graphs, real time 835 ingestion, and trained model weights that stay private. This repository teaches the patterns. The platform applies them at scale across the operator network.
Works across ABA, SNF, PT/OT, dental, dialysis, imaging, home health, and behavioral health.
| Module | What it does |
|---|---|
carc_rarc_utils.py |
Parse and categorize CMS CARC denial codes. Plain English descriptions, corrective actions, and regulatory basis for every code. |
denial_prediction_reference.py |
CatBoost denial predictor with temporal cross validation and SHAP explainability. 15 features (production uses 40 plus). |
drift_detection_reference.py |
Chi-square plus Kolmogorov-Smirnov tests for detecting when a payer changes their denial behavior. The statistical core of Upstream's DriftWatch engine. |
denial_clustering.py |
K-means payer clustering into behavioral archetypes. Aggressive Denier, Slow Payer, Prompt Payer, Underpayer. |
dental_denial_clustering.py |
Specialty specific clustering using CDT code patterns. Compatible with ADA claim data. |
aba_auth_predictor.py |
Gradient boosted authorization approval predictor. ABA reference implementation. Use as a template for other specialties (dental, SNF, PT/OT, imaging). |
payer_behavior_detector.py |
Composite payer fingerprinting using denial rate trends, payment velocity, and adjudication pattern shifts. Powers the check_payer_behavior MCP tool. |
| Notebook | What it teaches |
|---|---|
01_payer_clustering_walkthrough.ipynb |
End to end. Load CMS data, engineer features, run K-means clustering, visualize payer archetypes. |
02_patient_propensity_tutorial.ipynb |
End to end. Build a gradient boosted collectibility scorer from public claim attributes with SHAP explainability. |
No data is committed. See data/README.md for the public CMS datasets to download.
sample_claims_schema.py generates a 500 row synthetic CSV for local testing.
| Aspect | This repo | Production |
|---|---|---|
| Features | 15 public derivable features | 40 plus including payer behavioral graph and authorization policy state |
| Training data | CMS SynPUF (synthetic) | Live 835 remittance plus operator contributed signals |
| Model weights | Not included | Private |
| Drift detection | 3 statistical tests | 12 plus tests with payer specific thresholds |
| Clustering | Basic K-means, 5 features | Deep behavioral fingerprinting across the operator network, real time |
| Retraining | Manual | Automated weekly with data drift triggers |
pip install -r requirements.txtPython 3.10 or later recommended.
from reference.carc_rarc_utils import parse_carc, group_by_category
info = parse_carc("97")
print(info.description)
# "The benefit for this service is included in the payment/allowance
# for another service/procedure that has already been adjudicated."
print(info.corrective_action)
# "Review NCCI edits for the code pair. Add modifier 59 or X{EPSU}..."
info = parse_carc("CO-97")
# Same result. Handles prefixed codes from 835 files.
codes = ["97", "50", "97", "16", "97", "197"]
by_category = group_by_category(codes)
# {"bundled": [...], "non_covered": [...], "claim_information_missing": [...], "authorization": [...]}import pandas as pd
from reference.denial_clustering import cluster_payers, label_clusters
df = pd.read_csv("data/sample_claims.csv")
payer_profiles = cluster_payers(df, n_clusters=4)
labeled = label_clusters(payer_profiles)
# DataFrame with cluster labels:
# "Aggressive Denier", "Slow Payer", "Prompt Payer", "Underpayer"import pandas as pd
from reference.denial_prediction_reference import build_features, train, explain
df = pd.read_csv("data/sample_claims.csv")
X = build_features(df)
y = df["is_denied"]
model, metrics = train(X, y)
print(f"CV AUC: {metrics['cv_auc_mean']:.4f} +/- {metrics['cv_auc_std']:.4f}")
importance = explain(model, X.head(500))
print(importance.head(5))import pandas as pd
from reference.drift_detection_reference import detect_drift
baseline = pd.read_csv("data/aetna_90day_baseline.csv")
current = pd.read_csv("data/aetna_last_7days.csv")
report = detect_drift("Aetna", baseline, current)
if report.drift_detected:
for alert in report.alerts:
print(f"{alert.feature}: {alert.baseline_rate:.1%} -> {alert.current_rate:.1%} (p={alert.p_value:.4f})")import pandas as pd
from reference.aba_auth_predictor import predict_approval
# ABA example. Use the same pattern for other specialties by adjusting feature names.
auth_request = {
"payer": "unitedhealthcare",
"diagnosis_primary": "F84.0",
"cpt_codes": ["97153", "97155"],
"requested_units_weekly": 20,
"patient_age": 7,
}
result = predict_approval(auth_request)
print(f"Approval probability: {result.probability:.1%}")
print(f"Top risk factors: {result.risk_factors}")- Download the 1 percent SynPUF sample from the CMS website (about 460MB).
- Map the SynPUF columns to the schema in
sample_claims_schema.py. - Run
notebooks/01_payer_clustering_walkthrough.ipynb.
See data/README.md for all supported public datasets.
Why publish this? The healthcare RCM community has historically operated as a black box. Vendors guard their methodology. Operators have no way to validate vendor claims. Upstream publishes the methodology so the community can learn from it, audit it, and contribute back.
Can I run this on my own claims data?
Yes, if you have the right tenant boundaries and authorizations. The code accepts any DataFrame matching the schema in sample_claims_schema.py. PHI handling is your responsibility.
Will Upstream open source the production system? No. The production system contains operator contributed network signals, payer behavioral graphs trained on real time data, and customer specific tenant logic that cannot be extracted without compromising the network. The methodology is open. The production application stays private.
How is this different from existing healthcare ML repos? Most existing repos use synthetic data without payer behavioral context. This repo treats payer behavior as the central object and provides the statistical tests Upstream uses to detect when behavior shifts.
Can I contribute?
Yes. See CONTRIBUTING.md. In scope: additional CARC and RARC codes, improved feature engineering, additional specialty templates for the auth predictor, notebooks demonstrating the reference implementations on additional public datasets, bug fixes, and documentation. Out of scope: anything requiring proprietary payer data or production internals.
MIT. See LICENSE.
Upstream's production model weights, payer behavioral graph, and operator contributed signals stay private.
- upstream.cx Care Intelligence Platform
- upstream-mcp Bring Upstream into Claude
- upstream-skills Claude Code skill pack
- upstream.cx/developers API documentation
- Newsletter Monthly network signals digest
Built by Upstream Intelligence. Read the methodology at engine.upstream.cx. Pioneer Program: upstream.cx/pioneer.
upstream.cx · hello@upstream.cx
Care Intelligence Platform.