The Health Insurance Claim Algorithm helps insurance companies determine the percentage of an individual’s (the claimant) hospital bill they should cover. The percentage amount is calculated using information related to the claimant and their treatment.
Last Updated: January 5, 2020
The intent of this algorithm is to model the numerous and complex factors used to determine if a health insurance company should cover the costs of a hospital visit or not. Phrased another way: should the health insurance company cover the costs to make that person healthy again and provide the service the person pays for, or should they not cover the costs and leave the person in a lifetime of financial hardships because they happen to become ill? Many health insurance providers struggle to strike the right balance between providing the purported service for their clients and trying to make as much profit as possible from wherever possible. This algorithm models those scenarios and provides a simple, succinct, and mathematically rigorous recommendation to ease this tremendous burden from the claims adjusters.
The availability of data in this domain is severely limited, and rightly so, due to HIPAA laws. Given what’s available, the scenarios modeled here are situations where the health insurance provider (in this case the federal government acting through Medicare) has provided coverage for health care costs after an in-patient hospital visit. However, this data is riddled with issues. A preliminary analysis identified numerous situations where the coverage amounts to less than the full cost of the hospital visit. Because this misaligns with the expected behavior from the health care provider, these observations are treated as anomalous and are excluded from the training data in order to produce a less biased machine learning model.
The features of the model capture a few attributes of the claimant, the diagnosis given by health care professionals, and the amount of time required to successfully provide care. The target variable is the percentage of coverage provided by the health insurance company.
%config InlineBackend.figure_format = 'retina'
import glob
import json
import os
import re
import zipfile
import numpy as np
import pandas as pd
import requests
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from IPython.display import display
from sklearn_transformers.preprocessing import FeatureSelector
from sklearn_transformers.preprocessing import MultiColumnLabelEncoder
from sklearn_transformers.classification import BinaryClassifierWithNoise
from helpers import classifier_report
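def handle_zip(filename, label):
    # Extract the archive into a folder named after the label.
    with zipfile.ZipFile(filename, 'r') as f:
        f.extractall(f'{label}')
    # Move each extracted CSV out of the folder, adding a '__raw_data' suffix.
    for fn in glob.glob(f'{label}/*.csv'):
        _filename, _ext = os.path.splitext(os.path.basename(fn))
        new_local = f'{os.path.dirname(fn)}__raw_data{_ext}'
        os.replace(fn, new_local)
    # Remove the now-empty folder and the downloaded archive.
    os.rmdir(f'{label}/')
    os.remove(filename)
    return new_local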
The primary source of data for this model is the Centers for Medicare & Medicaid Services, through a series of datasets named the Data Entrepreneurs’ Synthetic Public Use File, or DE-SynPUF for short. From the website:
The DE-SynPUF was created with the goal of providing a realistic set of claims data in the public domain while providing the very highest degree of protection to the Medicare beneficiaries’ protected health information.
There are two datasets that will be used for this model: the Beneficiary summary data, and the Inpatient Claims data. The Beneficiary summary data contains information about the beneficiary of the claim, and for the purposes of this model only the gender, race, and birthdate will be used. The Inpatient Claims data describes the nature of the claim itself, and will be limited to only information about the price of the services, the financial liabilities placed upon the patient, the primary diagnosis, and the duration of the stay at the hospital.
base_url = 'https://www.cms.gov/Research-Statistics-Data-and-Systems/Downloadable-Public-Use-Files/SynPUFs/Downloads/'
year_range = range(2008, 2011)
sample_range = range(1, 21)
for year in year_range:
    for sample_number in sample_range:
        url = f'{base_url}DE1_0_{year}_Beneficiary_Summary_File_Sample_{sample_number}.zip'
        filename = os.path.basename(url)
        label = os.path.splitext(filename)[0]
        response = requests.get(url)
        if response.status_code != 200:
            print(f'ERROR: {response.url}, {response.status_code}')
            continue
        with open(filename, 'wb') as out_file:
            out_file.write(response.content)
        new_filename = handle_zip(filename, label)
        print(f'successfully downloaded [{url}] to [{new_filename}]')
beneficiary_filenames = [filename for filename in glob.glob('data/*__raw_data.csv') if 'beneficiary' in filename.lower()]
beneficiary_df = pd.concat([pd.read_csv(f) for f in beneficiary_filenames])
print(beneficiary_df.shape)
beneficiary_df.sample(5)
base_url = 'https://www.cms.gov/Research-Statistics-Data-and-Systems/Downloadable-Public-Use-Files/SynPUFs/Downloads/'
year_range = range(2008, 2011)
sample_range = range(1, 21)
for sample_number in sample_range:
    url = f'{base_url}DE1_0_2008_to_2010_Inpatient_Claims_Sample_{sample_number}.zip'
    filename = os.path.basename(url)
    label = os.path.splitext(filename)[0]
    response = requests.get(url)
    if response.status_code != 200:
        print(f'ERROR: {response.url}, {response.status_code}')
        continue
    with open(filename, 'wb') as out_file:
        out_file.write(response.content)
    new_filename = handle_zip(filename, label)
    print(f'successfully downloaded [{url}] to [{new_filename}]')
claims_filenames = [filename for filename in glob.glob('data/*__raw_data.csv') if 'inpatient_claims' in filename.lower()]
claims_df = pd.concat([pd.read_csv(f) for f in claims_filenames])
print(claims_df.shape)
claims_df.sample(5)
To transform the raw data into a state prepared for modeling, some preprocessing must occur. The preprocessing step for this model is fairly complex and relies heavily on the documentation provided by the Centers for Medicare & Medicaid Services. The numerous fields found within the DE-SynPUF data are described at length in the Centers for Medicare and Medicaid Services (CMS) Linkable 2008–2010 Medicare Data Entrepreneurs’ Synthetic Public Use File (DE-SynPUF) User Manual.
# Rename the raw CMS columns first so that the selection below can find them.
beneficiary_df = beneficiary_df.rename(columns={
    'BENE_SEX_IDENT_CD': 'GENDER',
    'BENE_RACE_CD': 'RACE',
    'BENE_BIRTH_DT': 'BIRTHDATE',
})
beneficiary_df = beneficiary_df[[
    'DESYNPUF_ID',
    'GENDER',
    'RACE',
    'BIRTHDATE',
]]
claims_df = claims_df[[
'DESYNPUF_ID',
'ICD9_DGNS_CD_1',
'PRVDR_NUM',
'CLM_UTLZTN_DAY_CNT',
'CLM_THRU_DT',
'CLM_PMT_AMT',
'CLM_PASS_THRU_PER_DIEM_AMT',
'NCH_PRMRY_PYR_CLM_PD_AMT',
'NCH_BENE_PTA_COINSRNC_LBLTY_AM',
'NCH_BENE_IP_DDCTBL_AMT',
'NCH_BENE_BLOOD_DDCTBL_LBLTY_AM',
]]
claims_df['NCH_PRMRY_PYR_CLM_PD_AMT'] = claims_df['NCH_PRMRY_PYR_CLM_PD_AMT'].fillna(0)
claims_df['NCH_BENE_PTA_COINSRNC_LBLTY_AM'] = claims_df['NCH_BENE_PTA_COINSRNC_LBLTY_AM'].fillna(0)
claims_df['NCH_BENE_IP_DDCTBL_AMT'] = claims_df['NCH_BENE_IP_DDCTBL_AMT'].fillna(0)
claims_df['NCH_BENE_BLOOD_DDCTBL_LBLTY_AM'] = claims_df['NCH_BENE_BLOOD_DDCTBL_LBLTY_AM'].fillna(0)
df = claims_df.merge(beneficiary_df, on='DESYNPUF_ID')
df = df.rename(columns={column: column.strip().lower().replace(' ', '_') for column in df.columns})
df = df.dropna(subset=['icd9_dgns_cd_1'])
df = df.drop_duplicates()
print(df.shape)
df.sample(5)
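As a minimal illustration of why drop_duplicates() follows the merge: the beneficiary summary files are downloaded once per year, so a beneficiary who appears in more than one yearly file contributes several identical rows once only the ID, gender, race, and birthdate are kept. The toy frames below are invented for illustration and are not taken from DE-SynPUF.

```python
import pandas as pd

# Hypothetical toy data: beneficiary 'A1' is listed in three yearly files.
beneficiaries = pd.DataFrame({
    'DESYNPUF_ID': ['A1', 'A1', 'A1', 'B2'],
    'GENDER': [1, 1, 1, 2],
    'RACE': [1, 1, 1, 2],
    'BIRTHDATE': [19400101, 19400101, 19400101, 19350615],
})
claims = pd.DataFrame({
    'DESYNPUF_ID': ['A1', 'B2'],
    'CLM_PMT_AMT': [4000.0, 7000.0],
})

# Each claim is matched against every beneficiary row with the same ID.
merged = claims.merge(beneficiaries, on='DESYNPUF_ID')
deduplicated = merged.drop_duplicates()

print(len(merged))        # 4
print(len(deduplicated))  # 2
```

The single claim for 'A1' is repeated three times by the merge and collapses back to one row after deduplication.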
There are five predominant steps for feature engineering: determine_duration_bin(), decode_icd9(), provider_state_name, age_range, and determine_coverage(), as well as the cleanup and standardization of some columns.
The first step, determine_duration_bin(), truncates the length in days of the patient’s claim duration into categorical ranges. The ranges are loosely based on the volume of entries that fall into each bin, so that similarly sized bins are used within the model.
def determine_duration_bin(claim_duration):
    if claim_duration == 1:
        return '1 day'
    elif claim_duration == 2:
        return '2 days'
    elif claim_duration == 3:
        return '3 days'
    elif claim_duration == 4:
        return '4 days'
    elif claim_duration == 5:
        return '5 days'
    elif 5 < claim_duration <= 7:
        return '6-7 days'
    elif 7 < claim_duration <= 14:
        return '8-14 days'
    elif 14 < claim_duration:
        return '15 days or more'
    return
df['claim_duration'] = df['clm_utlztn_day_cnt'].apply(determine_duration_bin).astype('category')
df = df.dropna(subset=['claim_duration'])
df.sample(5)
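For comparison, the same binning can be sketched with pandas' built-in pd.cut. This is an alternative formulation rather than the code used above; the names edges, labels, and day_counts are made up for the example. For whole-number day counts it should produce the same labels as determine_duration_bin(), and a count of 0 falls outside every bin and becomes missing.

```python
import numpy as np
import pandas as pd

# Bin edges are right-inclusive, so (5, 7] becomes '6-7 days', and so on.
edges = [0, 1, 2, 3, 4, 5, 7, 14, np.inf]
labels = ['1 day', '2 days', '3 days', '4 days', '5 days',
          '6-7 days', '8-14 days', '15 days or more']

# Hypothetical day counts covering each boundary, plus an out-of-range 0.
day_counts = pd.Series([1, 5, 6, 7, 8, 14, 15, 0])
bins = pd.cut(day_counts, bins=edges, labels=labels)  # right=True by default

print(bins.tolist())
```

The result is already a categorical column, so no separate astype('category') call is needed.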
The diagnosis claim codes come in the form of an ICD-9 code based on the International Classification of Diseases maintained by the World Health Organization (WHO). The code ranges used by the function decode_icd9() are based on the categorical groupings outlined here. This step is used to condense the wide range of ICD-9 codes into groups of similar codes which reduces the complexity of this field when modeling, while still generally describing the nature of the health care professional’s original diagnosis.
icd9_regex = re.compile(r'^[A-Z]')
def decode_icd9(code):
    # Codes starting with a letter fall into the supplemental classification.
    if icd9_regex.match(str(code)):
        return 'external causes of injury and supplemental classification'
    # Otherwise classify by the first three digits of the code.
    code = int(str(code)[:3])
    if 1 <= code <= 139:
        return 'infectious and parasitic diseases'
    elif 140 <= code <= 239:
        return 'neoplasms'
    elif 240 <= code <= 279:
        return 'endocrine, nutritional and metabolic diseases, and immunity disorders'
    elif 280 <= code <= 289:
        return 'diseases of the blood and blood-forming organs'
    elif 290 <= code <= 319:
        return 'mental disorders'
    elif 320 <= code <= 389:
        return 'diseases of the nervous system and sense organs'
    elif 390 <= code <= 459:
        return 'diseases of the circulatory system'
    elif 460 <= code <= 519:
        return 'diseases of the respiratory system'
    elif 520 <= code <= 579:
        return 'diseases of the digestive system'
    elif 580 <= code <= 629:
        return 'diseases of the genitourinary system'
    elif 630 <= code <= 679:
        return 'complications of pregnancy, childbirth, and the puerperium'
    elif 680 <= code <= 709:
        return 'diseases of the skin and subcutaneous tissue'
    elif 710 <= code <= 739:
        return 'diseases of the musculoskeletal system and connective tissue'
    elif 740 <= code <= 759:
        return 'congenital anomalies'
    elif 760 <= code <= 779:
        return 'certain conditions originating in the perinatal period'
    elif 780 <= code <= 799:
        return 'symptoms, signs, and ill-defined conditions'
    elif 800 <= code <= 999:
        return 'injury and poisoning'
    return
df['diagnosis_class'] = df['icd9_dgns_cd_1'].apply(decode_icd9)
df[['diagnosis_class']].sample(5)
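Because the numeric ranges above are contiguous from 1 to 999, the same grouping can be sketched as a table lookup. The version below is a hypothetical equivalent (decode_icd9_lookup, ICD9_GROUPS, and UPPER_BOUNDS are names invented here), shown mainly to make the range boundaries easy to check with a few sample codes.

```python
import bisect

# (upper bound of the three-digit range, group name), sorted by upper bound.
ICD9_GROUPS = [
    (139, 'infectious and parasitic diseases'),
    (239, 'neoplasms'),
    (279, 'endocrine, nutritional and metabolic diseases, and immunity disorders'),
    (289, 'diseases of the blood and blood-forming organs'),
    (319, 'mental disorders'),
    (389, 'diseases of the nervous system and sense organs'),
    (459, 'diseases of the circulatory system'),
    (519, 'diseases of the respiratory system'),
    (579, 'diseases of the digestive system'),
    (629, 'diseases of the genitourinary system'),
    (679, 'complications of pregnancy, childbirth, and the puerperium'),
    (709, 'diseases of the skin and subcutaneous tissue'),
    (739, 'diseases of the musculoskeletal system and connective tissue'),
    (759, 'congenital anomalies'),
    (779, 'certain conditions originating in the perinatal period'),
    (799, 'symptoms, signs, and ill-defined conditions'),
    (999, 'injury and poisoning'),
]
UPPER_BOUNDS = [upper for upper, _ in ICD9_GROUPS]


def decode_icd9_lookup(code):
    """Hypothetical table-driven equivalent of decode_icd9()."""
    text = str(code)
    # Codes that start with an uppercase letter are supplemental codes.
    if 'A' <= text[:1] <= 'Z':
        return 'external causes of injury and supplemental classification'
    prefix = int(text[:3])
    if not 1 <= prefix <= 999:
        return None
    # Find the first range whose upper bound is >= the three-digit prefix.
    return ICD9_GROUPS[bisect.bisect_left(UPPER_BOUNDS, prefix)][1]


print(decode_icd9_lookup('4280'))   # diseases of the circulatory system
print(decode_icd9_lookup('0389'))   # infectious and parasitic diseases
print(decode_icd9_lookup('V5789'))  # external causes of injury and supplemental classification
```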
This step converts the prvdr_num, representing the medical facility hosting the individual, into the name of the state where the facility is located. This improves the predictive power of the model by adding geographical nuance to the situations represented in the data.
states_df = pd.DataFrame([
{'provider_code': '01', 'provider_state_abbr': 'AL', 'provider_state_name': 'alabama'},
{'provider_code': '02', 'provider_state_abbr': 'AK', 'provider_state_name': 'alaska'},
{'provider_code': '03', 'provider_state_abbr': 'AZ', 'provider_state_name': 'arizona'},
{'provider_code': '04', 'provider_state_abbr': 'AR', 'provider_state_name': 'arkansas'},
{'provider_code': '05', 'provider_state_abbr': 'CA', 'provider_state_name': 'california'},
{'provider_code': '06', 'provider_state_abbr': 'CO', 'provider_state_name': 'colorado'},
{'provider_code': '07', 'provider_state_abbr': 'CT', 'provider_state_name': 'connecticut'},
{'provider_code': '08', 'provider_state_abbr': 'DE', 'provider_state_name': 'delaware'},
{'provider_code': '09', 'provider_state_abbr': 'DC', 'provider_state_name': 'washington dc'},
{'provider_code': '10', 'provider_state_abbr': 'FL', 'provider_state_name': 'florida'},
{'provider_code': '11', 'provider_state_abbr': 'GA', 'provider_state_name': 'georgia'},
{'provider_code': '12', 'provider_state_abbr': 'HI', 'provider_state_name': 'hawaii'},
{'provider_code': '13', 'provider_state_abbr': 'ID', 'provider_state_name': 'idaho'},
{'provider_code': '14', 'provider_state_abbr': 'IL', 'provider_state_name': 'illinois'},
{'provider_code': '15', 'provider_state_abbr': 'IN', 'provider_state_name': 'indiana'},
{'provider_code': '16', 'provider_state_abbr': 'IA', 'provider_state_name': 'iowa'},
{'provider_code': '17', 'provider_state_abbr': 'KS', 'provider_state_name': 'kansas'},
{'provider_code': '18', 'provider_state_abbr': 'KY', 'provider_state_name': 'kentucky'},
{'provider_code': '19', 'provider_state_abbr': 'LA', 'provider_state_name': 'louisiana'},
{'provider_code': '20', 'provider_state_abbr': 'ME', 'provider_state_name': 'maine'},
{'provider_code': '21', 'provider_state_abbr': 'MD', 'provider_state_name': 'maryland'},
{'provider_code': '22', 'provider_state_abbr': 'MA', 'provider_state_name': 'massachusetts'},
{'provider_code': '23', 'provider_state_abbr': 'MI', 'provider_state_name': 'michigan'},
{'provider_code': '24', 'provider_state_abbr': 'MN', 'provider_state_name': 'minnesota'},
{'provider_code': '25', 'provider_state_abbr': 'MS', 'provider_state_name': 'mississippi'},
{'provider_code': '26', 'provider_state_abbr': 'MO', 'provider_state_name': 'missouri'},
{'provider_code': '27', 'provider_state_abbr': 'MT', 'provider_state_name': 'montana'},
{'provider_code': '28', 'provider_state_abbr': 'NE', 'provider_state_name': 'nebraska'},
{'provider_code': '29', 'provider_state_abbr': 'NV', 'provider_state_name': 'nevada'},
{'provider_code': '30', 'provider_state_abbr': 'NH', 'provider_state_name': 'new hampshire'},
{'provider_code': '31', 'provider_state_abbr': 'NJ', 'provider_state_name': 'new jersey'},
{'provider_code': '32', 'provider_state_abbr': 'NM', 'provider_state_name': 'new mexico'},
{'provider_code': '33', 'provider_state_abbr': 'NY', 'provider_state_name': 'new york'},
{'provider_code': '34', 'provider_state_abbr': 'NC', 'provider_state_name': 'north carolina'},
{'provider_code': '35', 'provider_state_abbr': 'ND', 'provider_state_name': 'north dakota'},
{'provider_code': '36', 'provider_state_abbr': 'OH', 'provider_state_name': 'ohio'},
{'provider_code': '37', 'provider_state_abbr': 'OK', 'provider_state_name': 'oklahoma'},
{'provider_code': '38', 'provider_state_abbr': 'OR', 'provider_state_name': 'oregon'},
{'provider_code': '39', 'provider_state_abbr': 'PA', 'provider_state_name': 'pennsylvania'},
{'provider_code': '41', 'provider_state_abbr': 'RI', 'provider_state_name': 'rhode island'},
{'provider_code': '42', 'provider_state_abbr': 'SC', 'provider_state_name': 'south carolina'},
{'provider_code': '43', 'provider_state_abbr': 'SD', 'provider_state_name': 'south dakota'},
{'provider_code': '44', 'provider_state_abbr': 'TN', 'provider_state_name': 'tennessee'},
{'provider_code': '45', 'provider_state_abbr': 'TX', 'provider_state_name': 'texas'},
{'provider_code': '46', 'provider_state_abbr': 'UT', 'provider_state_name': 'utah'},
{'provider_code': '47', 'provider_state_abbr': 'VT', 'provider_state_name': 'vermont'},
{'provider_code': '49', 'provider_state_abbr': 'VA', 'provider_state_name': 'virginia'},
{'provider_code': '50', 'provider_state_abbr': 'WA', 'provider_state_name': 'washington'},
{'provider_code': '51', 'provider_state_abbr': 'WV', 'provider_state_name': 'west virginia'},
{'provider_code': '52', 'provider_state_abbr': 'WI', 'provider_state_name': 'wisconsin'},
{'provider_code': '53', 'provider_state_abbr': 'WY', 'provider_state_name': 'wyoming'},
{'provider_code': '40', 'provider_state_abbr': 'PR', 'provider_state_name': 'puerto rico'},
])
df['provider_code'] = df['prvdr_num'].str[:2]
df = df.merge(states_df, left_on='provider_code', right_on='provider_code')
df = df.rename(columns={'provider_state_name': 'facility_state'})
df[['prvdr_num', 'provider_code', 'facility_state']].sample(5)
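Note that merge() defaults to an inner join, so any claim whose two-character prefix is not listed in states_df is silently dropped. A small sketch of how such rows could be inspected before they disappear, using invented provider numbers and a two-row stand-in for the state table:

```python
import pandas as pd

# Hypothetical two-row stand-in for states_df.
states = pd.DataFrame([
    {'provider_code': '05', 'provider_state_name': 'california'},
    {'provider_code': '33', 'provider_state_name': 'new york'},
])
# Invented provider numbers; 'ZZ' stands in for a prefix missing from the table.
claims = pd.DataFrame({'prvdr_num': ['0500AB', '3300CD', 'ZZ00EF']})
claims['provider_code'] = claims['prvdr_num'].str[:2]

# A left join keeps every claim; the indicator column flags unmatched prefixes.
checked = claims.merge(states, on='provider_code', how='left', indicator=True)
unmatched = checked[checked['_merge'] == 'left_only']

print(unmatched['prvdr_num'].tolist())  # ['ZZ00EF']
```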
This step creates the age_range feature by using the beneficiary's birthdate to calculate their age when they were hospitalized, then truncating that age to a decade span.
df['clm_thru_dt'] = pd.to_datetime(df['clm_thru_dt'], format='%Y%m%d')
df['birthdate'] = pd.to_datetime(df['birthdate'], format='%Y%m%d')
df['age'] = df['clm_thru_dt'].dt.year - df['birthdate'].dt.year
df['age_min'] = np.floor(df['age'] / 10) * 10
df['age_max'] = np.ceil((df['age'] + 1) / 10) * 10
df['age_range'] = df['age_min'].astype('int64').map(str) + '-' + df['age_max'].astype('int64').map(str)
df[['birthdate', 'clm_thru_dt', 'age', 'age_min', 'age_max', 'age_range']].sample(5)
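The arithmetic above can be traced on a single record. The function below is a plain-Python restatement of the same three formulas (the name age_range and the sample dates are made up for the example). Because the age is the difference between calendar years, it can be one year higher than the person's age on the claim date.

```python
import math
from datetime import date


def age_range(birthdate, claim_end):
    """Restates the age, age_min, and age_max formulas for one record."""
    age = claim_end.year - birthdate.year
    age_min = math.floor(age / 10) * 10
    age_max = math.ceil((age + 1) / 10) * 10
    return f'{age_min}-{age_max}'


# Born September 1936, claim ending March 2009: 2009 - 1936 = 73.
print(age_range(date(1936, 9, 1), date(2009, 3, 15)))  # 70-80
# The decade boundaries: 79 stays in 70-80, 80 moves to 80-90.
print(age_range(date(1930, 1, 1), date(2009, 1, 1)))   # 70-80
print(age_range(date(1929, 1, 1), date(2009, 1, 1)))   # 80-90
```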
As mentioned above, only the observations in the data where the beneficiary paid exactly $0.00 for their deductible and where the insurance provider also paid more than $0.00 are included in the training data. These situations represent the expectations of the beneficiary when they sign up with their health insurance provider: the person pays a monthly fee to the provider, and in turn the provider covers the costs of health care if the person unfortunately requires medical attention. Since there are numerous cases where a person had to pay a deductible on top of what they already pay to the insurance provider, these observations are excluded from the model and treated as anomalous.
df = df[(df['nch_bene_ip_ddctbl_amt'] == 0) & (df['clm_pmt_amt'] >= 0)]
The following step calculates the percentage of the total bill that the insurance provider paid. The first calculation determines how much the beneficiary paid (amount_paid_by_beneficiary), and is the sum of the columns where some financial burden is offloaded onto the hospitalized person. The next calculation, amount_covered_by_insurers, determines the amount paid by the insurers. These two numbers combined equal the total_amount of the claim. From there the percent_covered_by_insurers is derived. Lastly, the percent_covered_by_insurers is converted into a binary class representing cases when the insurance provider covered 100% of the costs, and when they failed to do so. This is the coverage field that will be used as the target variable in the model.
The fields below are defined as:
nch_bene_pta_coinsrnc_lblty_am: The amount of money for which the intermediary has determined that the beneficiary is liable for Part A coinsurance on the institutional claim.
nch_bene_ip_ddctbl_amt: The amount of the deductible the beneficiary paid for inpatient services, as originally submitted on the institutional claim.
nch_bene_blood_ddctbl_lblty_am: The amount of money for which the intermediary determined the beneficiary is liable for the blood deductible. A blood deductible amount applies to the first 3 pints of blood (or equivalent units; applies only to whole blood or packed red cells - not platelets, fibrinogen, plasma, etc. which are considered biologicals).
clm_pmt_amt: The Medicare claim payment amount.
clm_pass_thru_per_diem_amt: Medicare establishes a daily payment amount to reimburse IPPS hospitals for certain “pass-through” expenses, such as capital-related costs, direct medical education costs, kidney acquisition costs for hospitals that are renal transplant centers, and bad debts. This variable is the daily payment rate for pass-through expenses. It is not included in the CLM_PMT_AMT field. To determine the total of the pass-through payments for a hospitalization, this field should be multiplied by the claim Medicare utilization day count (CLM_UTLZTN_DAY_CNT). Then, total Medicare payments for a hospitalization claim can be determined by summing this product and the CLM_PMT_AMT field.
clm_utlztn_day_cnt: On an institutional claim, the number of covered days of care that are chargeable to Medicare facility utilization that includes full days, coinsurance days, and lifetime reserve days.
nch_prmry_pyr_clm_pd_amt: The amount of a payment made on behalf of a Medicare beneficiary by a primary payer other than Medicare, that the provider is applying to covered Medicare charges on a non-institutional claim.
def determine_coverage(values):
    return np.where(values == 1, '100', '0')
df['amount_paid_by_beneficiary'] = df['nch_bene_pta_coinsrnc_lblty_am'] + df['nch_bene_ip_ddctbl_amt'] + df['nch_bene_blood_ddctbl_lblty_am']
df['amount_covered_by_insurers'] = df['clm_pmt_amt'] + (df['clm_pass_thru_per_diem_amt'] * df['clm_utlztn_day_cnt']) + df['nch_prmry_pyr_clm_pd_amt']
df['total_amount'] = df['amount_covered_by_insurers'] + df['amount_paid_by_beneficiary']
df['percent_covered_by_insurers'] = df['amount_covered_by_insurers'] / df['total_amount']
df = df.dropna(subset=['percent_covered_by_insurers'])
df['coverage'] = determine_coverage(df['percent_covered_by_insurers'])
df['coverage'] = df['coverage'].astype(str)
df.sample(5)
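To make the calculation concrete, here is the same sequence applied to three invented claims (the dollar amounts are made up and are not DE-SynPUF values). The first claim is fully covered, the second leaves the beneficiary with coinsurance, and the third has a total of zero, which produces a missing percentage and is dropped.

```python
import numpy as np
import pandas as pd

# Hypothetical claims; column names follow the lowercased fields defined above.
claims = pd.DataFrame({
    'clm_pmt_amt': [4000.0, 9000.0, 0.0],
    'clm_pass_thru_per_diem_amt': [10.0, 0.0, 0.0],
    'clm_utlztn_day_cnt': [3, 5, 1],
    'nch_prmry_pyr_clm_pd_amt': [0.0, 0.0, 0.0],
    'nch_bene_pta_coinsrnc_lblty_am': [0.0, 1000.0, 0.0],
    'nch_bene_ip_ddctbl_amt': [0.0, 0.0, 0.0],
    'nch_bene_blood_ddctbl_lblty_am': [0.0, 0.0, 0.0],
})

paid_by_beneficiary = (claims['nch_bene_pta_coinsrnc_lblty_am']
                       + claims['nch_bene_ip_ddctbl_amt']
                       + claims['nch_bene_blood_ddctbl_lblty_am'])
# The per-diem pass-through rate is multiplied by the number of covered days.
covered_by_insurers = (claims['clm_pmt_amt']
                       + claims['clm_pass_thru_per_diem_amt'] * claims['clm_utlztn_day_cnt']
                       + claims['nch_prmry_pyr_clm_pd_amt'])
percent_covered = covered_by_insurers / (covered_by_insurers + paid_by_beneficiary)

# 0 / 0 yields NaN, so the third claim is removed here.
percent_covered = percent_covered.dropna()
coverage = np.where(percent_covered == 1, '100', '0')

print(percent_covered.tolist())  # [1.0, 0.9]
print(coverage.tolist())         # ['100', '0']
```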
Lastly, a few field replacements are used to transform numeric codes into human-readable categorical values.
field_replacements = {
'race': {
1: 'white',
2: 'black',
3: 'others',
5: 'hispanic',
},
'gender': {
1: 'male',
2: 'female',
}
}
df['race'] = df['race'].apply(lambda x: field_replacements['race'].get(x)).astype('category')
df['gender'] = df['gender'].apply(lambda x: field_replacements['gender'].get(x)).astype('category')
df[['race', 'gender']].sample(5)
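A code that is absent from the mapping becomes a missing value after the replacement. The sketch below uses Series.map, which behaves like the .get() lookup above, on a few invented codes to show how such values could be counted or given an explicit label; the 'unknown' label is an assumption for illustration only.

```python
import pandas as pd

race_labels = {1: 'white', 2: 'black', 3: 'others', 5: 'hispanic'}
# Invented codes; 4 stands in for a value that is absent from the mapping.
codes = pd.Series([1, 5, 4])

labels = codes.map(race_labels)  # unmapped codes become NaN
print(int(labels.isna().sum()))           # 1
print(labels.fillna('unknown').tolist())  # ['white', 'hispanic', 'unknown']
```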
df.to_csv('insurance_data.csv', index=False)
# df = pd.read_csv('insurance_data.csv')
feature_columns = ['gender', 'race', 'age_range', 'claim_duration', 'diagnosis_class', 'facility_state']
target_column = 'coverage'
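# This step simulates more observations to help improve the model’s performance.
# Note: the rows are tripled before the train/test split below, so copies of the
# same observation can land in both sets, which inflates the test metrics.
data_df = pd.concat([df, df, df])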
x = data_df.copy()[feature_columns]
y = data_df.copy()[target_column]
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=1337)
print(len(x_train), len(x_test))
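One possible variation, sketched here on invented data rather than the claims frame, is to split first and repeat only the training rows. That keeps the test set free of copies of training observations. The row_id column is a hypothetical identifier added only to make the overlap easy to check.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame with a row identifier and a binary target.
toy = pd.DataFrame({'row_id': range(10), 'coverage': ['100', '0'] * 5})

# Split the unique rows first, then repeat only the training portion.
train, test = train_test_split(toy, test_size=0.2, random_state=1337)
train = pd.concat([train, train, train])

shared = set(train['row_id']) & set(test['row_id'])
print(len(train), len(test), len(shared))  # 24 2 0
```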
model = Pipeline([
# ('feature_selector', FeatureSelector(feature_columns)),
('feature_column_encoder', MultiColumnLabelEncoder(columns=feature_columns)),
# ('encoder', OneHotEncoder(handle_unknown='ignore', sparse=True)),
('classifier', BinaryClassifierWithNoise(
RandomForestClassifier(
n_estimators=100,
n_jobs=1,
random_state=1337,
)
)),
])
model.fit(x_train, y_train)
display(classifier_report.performance(model, x_test, y_test, encoder_step_label='feature_column_encoder', cross_validate=False))
display(classifier_report.roc_curve(model, x_test, y_test))
display(model.predict_proba(x_test.sample(5)))
y_pred = model.predict(x_test)
class_names = model.named_steps['classifier'].classifier.classes_
classifier_report.plot_confusion_matrix(y_test, y_pred, class_names=class_names, normalize=False)
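The sklearn_transformers and helpers modules used above are not part of scikit-learn. For readers without them, a roughly comparable pipeline can be sketched with scikit-learn alone. This is an approximation under stated assumptions: OrdinalEncoder stands in for MultiColumnLabelEncoder, the noise wrapper is omitted, and the rows are invented.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy rows using two of the feature columns and the target.
toy = pd.DataFrame({
    'gender': ['male', 'female'] * 10,
    'claim_duration': ['1 day', '2 days', '3 days', '4 days'] * 5,
    'coverage': ['100', '0'] * 10,
})
features = ['gender', 'claim_duration']

sketch = Pipeline([
    # Encode each categorical column as integers; unseen values map to -1.
    ('encoder', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)),
    ('classifier', RandomForestClassifier(n_estimators=10, random_state=1337)),
])
sketch.fit(toy[features], toy['coverage'])

# Scored on the training rows only to keep the sketch short;
# a real evaluation would use held-out data.
predictions = sketch.predict(toy[features])
matrix = confusion_matrix(toy['coverage'], predictions, labels=['0', '100'])
print(matrix)
```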