[Financial Risk Control Series] [3] Loan Default Detection


This article works through Kaggle's Home Credit Default Risk competition. Using the seven provided tables, including the client application table, we build a model to predict clients' repayment ability. After data cleaning and feature engineering that fuses information across tables into derived features, a LightGBM model reaches a public leaderboard score of 0.78277, offering a reference point for assessing loans to people with thin credit histories.


Home Credit Default Risk

This competition is hosted on Kaggle; the material here is for learning and exchange only.


Because their credit histories are thin or nonexistent, many people are classified as low-credit borrowers and struggle to obtain loans. To extend credit to this population, Home Credit uses alternative data, including telecom and transaction records, to predict clients' repayment ability.

Home Credit provides seven tables with 218 fields in total. The training set contains about 310,000 samples (roughly 8% in default) and the test set about 50,000.


Tables

application_train/test — the client application table

It contains:

  • the target variable (whether the client defaulted; a 0/1 variable)
  • details of the loan applied for (loan type, credit amount, annuity)
  • the client's basic information (gender, age, family, education, occupation, industry, residence)
  • the client's financial information (annual income, home/car ownership)
  • the documents supplied with the application, etc.

bureau/bureau_balance — the client's credit history at the credit bureau, as reported by other financial institutions (monthly data)

They contain the client's bureau records:

  • credit history,
  • overdue amounts,
  • overdue dates, etc.

recorded as a time series (one row per period).

POS_CASH_balance — monthly history of the client's POS (point of sales) and cash loans in the Home Credit database

It contains the client's

  • paid installments
  • unpaid installments

credit_card_balance — monthly snapshots of the client's credit cards in the Home Credit database

It contains the client's

  • number of transactions
  • transaction amounts

and so on.

previous_application — the client's previous applications

It contains all of the client's historical applications (application details, outcomes, etc.).

installments_payments — repayment history on the client's previous credits

It contains the client's repayment behaviour:

  • payment dates
  • whether payments were late
  • payment amounts
  • whether amounts remain unpaid, etc.
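The seven tables above join through three keys: application_train/test is the spine keyed by SK_ID_CURR; bureau joins on SK_ID_CURR and carries SK_ID_BUREAU for bureau_balance; previous_application joins on SK_ID_CURR and carries SK_ID_PREV for POS_CASH_balance, credit_card_balance and installments_payments. A minimal sketch with made-up toy rows (all values hypothetical) of the aggregate-then-left-join pattern used throughout this notebook:

```python
import pandas as pd

# Toy spine (application table) and one child table (bureau), values invented
app = pd.DataFrame({'SK_ID_CURR': [1, 2], 'TARGET': [0, 1]})
bureau = pd.DataFrame({'SK_ID_CURR': [1, 1, 2],
                       'SK_ID_BUREAU': [10, 11, 12],
                       'AMT_CREDIT_SUM': [100.0, 50.0, 200.0]})

# Child tables must be aggregated to one row per SK_ID_CURR before merging,
# otherwise the left join would duplicate application rows
agg = bureau.groupby('SK_ID_CURR')['AMT_CREDIT_SUM'].agg(['mean', 'sum'])
agg.columns = ['BUREAU_CREDIT_MEAN', 'BUREAU_CREDIT_SUM']
merged = app.merge(agg.reset_index(), on='SK_ID_CURR', how='left')
print(merged.shape)  # one row per applicant is preserved
```

The same pattern (groupby the key, aggregate, left-merge onto the spine) is what the `group` and `group_and_merge` helpers later in the notebook implement.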

References:

[1] https://zhuanlan.zhihu.com/p/43541825

[2] https://www.kaggle.com/xucheng/cv-7993-private-score-7996/

[3] https://zhuanlan.zhihu.com/p/40790434

[4] https://www.kaggle.com/tahmidnafi/cse499

[5] https://blog.csdn.net/zhangchen2449/article/details/83338978

Key fields (application table)

Field Description
SK_ID_CURR ID of this application
TARGET Repayment risk of this application: 1 = higher risk; 0 = lower risk
NAME_CONTRACT_TYPE Loan type: cash loan or revolving loan (apply once, draw down repeatedly)
CODE_GENDER Applicant's gender
FLAG_OWN_CAR Whether the applicant owns a car
FLAG_OWN_REALTY Whether the applicant owns real estate
CNT_CHILDREN Number of children the applicant has
AMT_INCOME_TOTAL Applicant's income
AMT_CREDIT Credit amount of this loan application
AMT_ANNUITY Loan annuity
AMT_GOODS_PRICE For consumer loans, the actual price of the goods
NAME_TYPE_SUITE Who accompanied the applicant when applying
NAME_INCOME_TYPE Applicant's income type
NAME_EDUCATION_TYPE Applicant's education level
NAME_FAMILY_STATUS Applicant's marital status
NAME_HOUSING_TYPE Applicant's housing situation (renting, owner-occupied, living with parents, etc.)
REGION_POPULATION_RELATIVE Population density of the applicant's region, normalized
DAYS_BIRTH Applicant's birth date, as days relative to the application date (negative)
DAYS_EMPLOYED Tenure in the applicant's current job, as days relative to the application date (negative)
DAYS_REGISTRATION Days since the applicant last changed their registration, relative to the application date (negative)
DAYS_ID_PUBLISH Days since the applicant last changed the identity document used for the application, relative to the application date (negative)
FLAG_MOBIL Whether the applicant provided a mobile phone (1 = yes, 0 = no)
FLAG_EMP_PHONE Whether the applicant provided a work phone (1 = yes, 0 = no)
FLAG_WORK_PHONE Whether the applicant provided a home phone (1 = yes, 0 = no)
FLAG_CONT_MOBILE Whether the applicant's mobile phone was reachable (1 = yes, 0 = no)
FLAG_EMAIL Whether the applicant provided an email address (1 = yes, 0 = no)
OCCUPATION_TYPE Applicant's occupation
REGION_RATING_CLIENT Home Credit's rating of the applicant's region of residence (1, 2, 3)
REGION_RATING_CLIENT_W_CITY Home Credit's rating of the applicant's region, taking the city into account (1, 2, 3)
WEEKDAY_APPR_PROCESS_START Day of the week on which the applicant started the application
HOUR_APPR_PROCESS_START Hour of day at which the applicant started the application
REG_REGION_NOT_LIVE_REGION Whether the applicant's permanent address differs from the contact address (1 = different, 0 = same; region level)
REG_REGION_NOT_WORK_REGION Whether the applicant's permanent address differs from the work address (1 = different, 0 = same; region level)
LIVE_REGION_NOT_WORK_REGION Whether the applicant's contact address differs from the work address (1 = different, 0 = same; region level)
REG_CITY_NOT_LIVE_CITY Whether the applicant's permanent address differs from the contact address (1 = different, 0 = same; city level)
REG_CITY_NOT_WORK_CITY Whether the applicant's permanent address differs from the work address (1 = different, 0 = same; city level)
LIVE_CITY_NOT_WORK_CITY Whether the applicant's contact address differs from the work address (1 = different, 0 = same; city level)
ORGANIZATION_TYPE Type of organization the applicant works for
EXT_SOURCE_1 Normalized score from external data source 1
EXT_SOURCE_2 Normalized score from external data source 2
EXT_SOURCE_3 Normalized score from external data source 3
APARTMENTS_AVG ... EMERGENCYSTATE_MODE Normalized statistics describing the applicant's housing
OBS_30_CNT_SOCIAL_CIRCLE ... DEF_60_CNT_SOCIAL_CIRCLE Counts of people in the applicant's social surroundings observed with (OBS_) or defaulting at (DEF_) 30/60 days past due
DAYS_LAST_PHONE_CHANGE Days since the applicant last changed phone number, relative to the application date (negative)
FLAG_DOCUMENT_2 ... FLAG_DOCUMENT_21 Whether the applicant additionally provided document 2, 3, 4, ..., 21
AMT_REQ_CREDIT_BUREAU_HOUR Number of credit-bureau enquiries about the applicant in the hour before the application
AMT_REQ_CREDIT_BUREAU_DAY Number of credit-bureau enquiries about the applicant in the day before the application
AMT_REQ_CREDIT_BUREAU_WEEK Number of credit-bureau enquiries about the applicant in the week before the application
AMT_REQ_CREDIT_BUREAU_MONTH Number of credit-bureau enquiries about the applicant in the month before the application
AMT_REQ_CREDIT_BUREAU_QRT Number of credit-bureau enquiries about the applicant in the quarter before the application
AMT_REQ_CREDIT_BUREAU_YEAR Number of credit-bureau enquiries about the applicant in the year before the application
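A note on the DAYS_* fields above: they count backwards from the application date (hence the negative values), and DAYS_EMPLOYED additionally uses 365243 as a missing-value sentinel, which the notebook replaces with NaN later. A small sketch with made-up values:

```python
import numpy as np
import pandas as pd

# Two hypothetical applicants: DAYS_* are negative offsets from the
# application date; 365243 in DAYS_EMPLOYED means "no value"
df = pd.DataFrame({'DAYS_BIRTH': [-12775, -18250],
                   'DAYS_EMPLOYED': [-1000, 365243]})
df['DAYS_EMPLOYED'] = df['DAYS_EMPLOYED'].replace(365243, np.nan)
df['AGE_YEARS'] = -df['DAYS_BIRTH'] / 365  # negate to get age in years
print(df['AGE_YEARS'].tolist())  # [35.0, 50.0]
```

This is the same conversion the `get_age_group` helper below relies on.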
In [20]
#!unzip -q -o data/data105246/home_credit_default_risk.zip -d /home/aistudio/data
       
        In [22]
# Install dependencies
!pip install xgboost
!pip install lightgbm
       
Looking in indexes: https://mirror.baidu.com/pypi/simple/
Requirement already satisfied: xgboost in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (1.3.3)
Requirement already satisfied: scipy in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from xgboost) (1.6.3)
Requirement already satisfied: numpy in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from xgboost) (1.20.3)
Looking in indexes: https://mirror.baidu.com/pypi/simple/
Requirement already satisfied: lightgbm in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (3.1.1)
Requirement already satisfied: scikit-learn!=0.22.0 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from lightgbm) (0.24.2)
Requirement already satisfied: numpy in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from lightgbm) (1.20.3)
Requirement already satisfied: scipy in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from lightgbm) (1.6.3)
Requirement already satisfied: wheel in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from lightgbm) (0.36.2)
Requirement already satisfied: threadpoolctl>=2.0.0 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from scikit-learn!=0.22.0->lightgbm) (2.1.0)
Requirement already satisfied: joblib>=0.11 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from scikit-learn!=0.22.0->lightgbm) (0.14.1)
        In [23]
import os
import gc
import numpy as np
import pandas as pd
from scipy.stats import kurtosis
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import MinMaxScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
import xgboost as xgb
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

warnings.simplefilter(action='ignore', category=FutureWarning)
    In [24]
DATA_DIRECTORY = "./data"
df_train = pd.read_csv(os.path.join(DATA_DIRECTORY, 'application_train.csv'))
df_test = pd.read_csv(os.path.join(DATA_DIRECTORY, 'application_test.csv'))
df = df_train.append(df_test)
del df_train, df_test; gc.collect()
       
39
                In [25]
df = df[df['AMT_INCOME_TOTAL'] < 20000000]
df = df[df['CODE_GENDER'] != 'XNA']
df['DAYS_EMPLOYED'].replace(365243, np.nan, inplace=True)
df['DAYS_LAST_PHONE_CHANGE'].replace(0, np.nan, inplace=True)
    In [26]
def get_age_group(days_birth):
    age_years = -days_birth / 365
    if age_years < 27: return 1
    elif age_years < 40: return 2
    elif age_years < 50: return 3
    elif age_years < 65: return 4
    elif age_years < 99: return 5
    else: return 0
    In [27]
docs = [f for f in df.columns if 'FLAG_DOC' in f]
df['DOCUMENT_COUNT'] = df[docs].sum(axis=1)
df['NEW_DOC_KURT'] = df[docs].kurtosis(axis=1)
df['AGE_RANGE'] = df['DAYS_BIRTH'].apply(lambda x: get_age_group(x))
    In [28]
df['EXT_SOURCES_PROD'] = df['EXT_SOURCE_1'] * df['EXT_SOURCE_2'] * df['EXT_SOURCE_3']
df['EXT_SOURCES_WEIGHTED'] = df.EXT_SOURCE_1 * 2 + df.EXT_SOURCE_2 * 1 + df.EXT_SOURCE_3 * 3
warnings.filterwarnings('ignore', r'All-NaN (slice|axis) encountered')
for function_name in ['min', 'max', 'mean', 'nanmedian', 'var']:
    feature_name = 'EXT_SOURCES_{}'.format(function_name.upper())
    df[feature_name] = getattr(np, function_name)(
        df[['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']], axis=1)
    In [29]
df['CREDIT_TO_ANNUITY_RATIO'] = df['AMT_CREDIT'] / df['AMT_ANNUITY']
df['CREDIT_TO_GOODS_RATIO'] = df['AMT_CREDIT'] / df['AMT_GOODS_PRICE']
df['ANNUITY_TO_INCOME_RATIO'] = df['AMT_ANNUITY'] / df['AMT_INCOME_TOTAL']
df['CREDIT_TO_INCOME_RATIO'] = df['AMT_CREDIT'] / df['AMT_INCOME_TOTAL']
df['INCOME_TO_EMPLOYED_RATIO'] = df['AMT_INCOME_TOTAL'] / df['DAYS_EMPLOYED']
df['INCOME_TO_BIRTH_RATIO'] = df['AMT_INCOME_TOTAL'] / df['DAYS_BIRTH']    
df['EMPLOYED_TO_BIRTH_RATIO'] = df['DAYS_EMPLOYED'] / df['DAYS_BIRTH']
df['ID_TO_BIRTH_RATIO'] = df['DAYS_ID_PUBLISH'] / df['DAYS_BIRTH']
df['CAR_TO_BIRTH_RATIO'] = df['OWN_CAR_AGE'] / df['DAYS_BIRTH']
df['CAR_TO_EMPLOYED_RATIO'] = df['OWN_CAR_AGE'] / df['DAYS_EMPLOYED']
df['PHONE_TO_BIRTH_RATIO'] = df['DAYS_LAST_PHONE_CHANGE'] / df['DAYS_BIRTH']
    In [30]
def do_mean(df, group_cols, counted, agg_name):
    gp = df[group_cols + [counted]].groupby(group_cols)[counted].mean().reset_index().rename(
        columns={counted: agg_name})
    df = df.merge(gp, on=group_cols, how='left')
    del gp
    gc.collect()
    return df
    In [31]
def do_median(df, group_cols, counted, agg_name):
    gp = df[group_cols + [counted]].groupby(group_cols)[counted].median().reset_index().rename(
        columns={counted: agg_name})
    df = df.merge(gp, on=group_cols, how='left')
    del gp
    gc.collect()
    return df
    In [32]
def do_std(df, group_cols, counted, agg_name):
    gp = df[group_cols + [counted]].groupby(group_cols)[counted].std().reset_index().rename(
        columns={counted: agg_name})
    df = df.merge(gp, on=group_cols, how='left')
    del gp
    gc.collect()
    return df
    In [33]
def do_sum(df, group_cols, counted, agg_name):
    gp = df[group_cols + [counted]].groupby(group_cols)[counted].sum().reset_index().rename(
        columns={counted: agg_name})
    df = df.merge(gp, on=group_cols, how='left')
    del gp
    gc.collect()
    return df
    In [34]
group = ['ORGANIZATION_TYPE', 'NAME_EDUCATION_TYPE', 'OCCUPATION_TYPE', 'AGE_RANGE', 'CODE_GENDER']
df = do_median(df, group, 'EXT_SOURCES_MEAN', 'GROUP_EXT_SOURCES_MEDIAN')
df = do_std(df, group, 'EXT_SOURCES_MEAN', 'GROUP_EXT_SOURCES_STD')
df = do_mean(df, group, 'AMT_INCOME_TOTAL', 'GROUP_INCOME_MEAN')
df = do_std(df, group, 'AMT_INCOME_TOTAL', 'GROUP_INCOME_STD')
df = do_mean(df, group, 'CREDIT_TO_ANNUITY_RATIO', 'GROUP_CREDIT_TO_ANNUITY_MEAN')
df = do_std(df, group, 'CREDIT_TO_ANNUITY_RATIO', 'GROUP_CREDIT_TO_ANNUITY_STD')
df = do_mean(df, group, 'AMT_CREDIT', 'GROUP_CREDIT_MEAN')
df = do_mean(df, group, 'AMT_ANNUITY', 'GROUP_ANNUITY_MEAN')
df = do_std(df, group, 'AMT_ANNUITY', 'GROUP_ANNUITY_STD')
    In [35]
def label_encoder(df, categorical_columns=None):
    if not categorical_columns:
        categorical_columns = [col for col in df.columns if df[col].dtype == 'object']
    for col in categorical_columns:
        df[col], uniques = pd.factorize(df[col])
    return df, categorical_columns
    In [36]
def drop_application_columns(df):
    drop_list = [
        'CNT_CHILDREN', 'CNT_FAM_MEMBERS', 'HOUR_APPR_PROCESS_START',
        'FLAG_EMP_PHONE', 'FLAG_MOBIL', 'FLAG_CONT_MOBILE', 'FLAG_EMAIL', 'FLAG_PHONE',
        'FLAG_OWN_REALTY', 'REG_REGION_NOT_LIVE_REGION', 'REG_REGION_NOT_WORK_REGION',
        'REG_CITY_NOT_WORK_CITY', 'OBS_30_CNT_SOCIAL_CIRCLE', 'OBS_60_CNT_SOCIAL_CIRCLE',
        'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_YEAR',
        'COMMONAREA_MODE', 'NONLIVINGAREA_MODE', 'ELEVATORS_MODE', 'NONLIVINGAREA_AVG',
        'FLOORSMIN_MEDI', 'LANDAREA_MODE', 'NONLIVINGAREA_MEDI', 'LIVINGAPARTMENTS_MODE',
        'FLOORSMIN_AVG', 'LANDAREA_AVG', 'FLOORSMIN_MODE', 'LANDAREA_MEDI',
        'COMMONAREA_MEDI', 'YEARS_BUILD_AVG', 'COMMONAREA_AVG', 'BASEMENTAREA_AVG',
        'BASEMENTAREA_MODE', 'NONLIVINGAPARTMENTS_MEDI', 'BASEMENTAREA_MEDI',
        'LIVINGAPARTMENTS_AVG', 'ELEVATORS_AVG', 'YEARS_BUILD_MEDI', 'ENTRANCES_MODE',
        'NONLIVINGAPARTMENTS_MODE', 'LIVINGAREA_MODE', 'LIVINGAPARTMENTS_MEDI',
        'YEARS_BUILD_MODE', 'YEARS_BEGINEXPLUATATION_AVG', 'ELEVATORS_MEDI', 'LIVINGAREA_MEDI',
        'YEARS_BEGINEXPLUATATION_MODE', 'NONLIVINGAPARTMENTS_AVG', 'HOUSETYPE_MODE',
        'FONDKAPREMONT_MODE', 'EMERGENCYSTATE_MODE'
    ]
    for doc_num in [2, 4, 5, 6, 7, 9, 10, 11, 12, 13, 14, 15, 16, 17, 19, 20, 21]:
        drop_list.append('FLAG_DOCUMENT_{}'.format(doc_num))
    df.drop(drop_list, axis=1, inplace=True)
    return df
    In [37]
df, le_encoded_cols = label_encoder(df, None)
df = drop_application_columns(df)
    In [38]
df = pd.get_dummies(df)
    In [39]
bureau = pd.read_csv(os.path.join(DATA_DIRECTORY, 'bureau.csv'))
    In [40]
bureau['CREDIT_DURATION'] = -bureau['DAYS_CREDIT'] + bureau['DAYS_CREDIT_ENDDATE']
bureau['ENDDATE_DIF'] = bureau['DAYS_CREDIT_ENDDATE'] - bureau['DAYS_ENDDATE_FACT']
bureau['DEBT_PERCENTAGE'] = bureau['AMT_CREDIT_SUM'] / bureau['AMT_CREDIT_SUM_DEBT']
bureau['DEBT_CREDIT_DIFF'] = bureau['AMT_CREDIT_SUM'] - bureau['AMT_CREDIT_SUM_DEBT']
bureau['CREDIT_TO_ANNUITY_RATIO'] = bureau['AMT_CREDIT_SUM'] / bureau['AMT_ANNUITY']
    In [41]
def one_hot_encoder(df, categorical_columns=None, nan_as_category=True):
    original_columns = list(df.columns)
    if not categorical_columns:
        categorical_columns = [col for col in df.columns if df[col].dtype == 'object']
    df = pd.get_dummies(df, columns=categorical_columns, dummy_na=nan_as_category)
    categorical_columns = [c for c in df.columns if c not in original_columns]
    return df, categorical_columns
    In [42]
def group(df_to_agg, prefix, aggregations, aggregate_by='SK_ID_CURR'):
    agg_df = df_to_agg.groupby(aggregate_by).agg(aggregations)
    agg_df.columns = pd.Index(['{}{}_{}'.format(prefix, e[0], e[1].upper())
                               for e in agg_df.columns.tolist()])
    return agg_df.reset_index()
    In [43]
def group_and_merge(df_to_agg, df_to_merge, prefix, aggregations, aggregate_by='SK_ID_CURR'):
    agg_df = group(df_to_agg, prefix, aggregations, aggregate_by=aggregate_by)
    return df_to_merge.merge(agg_df, how='left', on=aggregate_by)
    In [44]
def get_bureau_balance(path, num_rows=None):
    bb = pd.read_csv(os.path.join(path, 'bureau_balance.csv'))
    bb, categorical_cols = one_hot_encoder(bb, nan_as_category=False)
    # Rate of each status category per bureau loan
    bb_processed = bb.groupby('SK_ID_BUREAU')[categorical_cols].mean().reset_index()
    # Min, max, mean and count of the monthly balance records (months)
    agg = {'MONTHS_BALANCE': ['min', 'max', 'mean', 'size']}
    bb_processed = group_and_merge(bb, bb_processed, '', agg, 'SK_ID_BUREAU')
    del bb; gc.collect()
    return bb_processed
    In [45]
bureau, categorical_cols = one_hot_encoder(bureau, nan_as_category=False)
bureau = bureau.merge(get_bureau_balance(DATA_DIRECTORY), how='left', on='SK_ID_BUREAU')
bureau['STATUS_12345'] = 0
for i in range(1, 6):
    bureau['STATUS_12345'] += bureau['STATUS_{}'.format(i)]
    In [46]
features = ['AMT_CREDIT_MAX_OVERDUE', 'AMT_CREDIT_SUM_OVERDUE', 'AMT_CREDIT_SUM',
            'AMT_CREDIT_SUM_DEBT', 'DEBT_PERCENTAGE', 'DEBT_CREDIT_DIFF', 'STATUS_0', 'STATUS_12345']
agg_length = bureau.groupby('MONTHS_BALANCE_SIZE')[features].mean().reset_index()
agg_length.rename({feat: 'LL_' + feat for feat in features}, axis=1, inplace=True)
bureau = bureau.merge(agg_length, how='left', on='MONTHS_BALANCE_SIZE')
del agg_length; gc.collect()
       
39
                In [47]
BUREAU_AGG = {
    'SK_ID_BUREAU': ['nunique'],
    'DAYS_CREDIT': ['min', 'max', 'mean'],
    'DAYS_CREDIT_ENDDATE': ['min', 'max'],
    'AMT_CREDIT_MAX_OVERDUE': ['max', 'mean'],
    'AMT_CREDIT_SUM': ['max', 'mean', 'sum'],
    'AMT_CREDIT_SUM_DEBT': ['max', 'mean', 'sum'],
    'AMT_CREDIT_SUM_OVERDUE': ['max', 'mean', 'sum'],
    'AMT_ANNUITY': ['mean'],
    'DEBT_CREDIT_DIFF': ['mean', 'sum'],
    'MONTHS_BALANCE_MEAN': ['mean', 'var'],
    'MONTHS_BALANCE_SIZE': ['mean', 'sum'],
    'STATUS_0': ['mean'],
    'STATUS_1': ['mean'],
    'STATUS_12345': ['mean'],
    'STATUS_C': ['mean'],
    'STATUS_X': ['mean'],
    'CREDIT_ACTIVE_Active': ['mean'],
    'CREDIT_ACTIVE_Closed': ['mean'],
    'CREDIT_ACTIVE_Sold': ['mean'],
    'CREDIT_TYPE_Consumer credit': ['mean'],
    'CREDIT_TYPE_Credit card': ['mean'],
    'CREDIT_TYPE_Car loan': ['mean'],
    'CREDIT_TYPE_Mortgage': ['mean'],
    'CREDIT_TYPE_Microloan': ['mean'],
    'LL_AMT_CREDIT_SUM_OVERDUE': ['mean'],
    'LL_DEBT_CREDIT_DIFF': ['mean'],
    'LL_STATUS_12345': ['mean'],
}

BUREAU_ACTIVE_AGG = {
    'DAYS_CREDIT': ['max', 'mean'],
    'DAYS_CREDIT_ENDDATE': ['min', 'max'],
    'AMT_CREDIT_MAX_OVERDUE': ['max', 'mean'],
    'AMT_CREDIT_SUM': ['max', 'sum'],
    'AMT_CREDIT_SUM_DEBT': ['mean', 'sum'],
    'AMT_CREDIT_SUM_OVERDUE': ['max', 'mean'],
    'DAYS_CREDIT_UPDATE': ['min', 'mean'],
    'DEBT_PERCENTAGE': ['mean'],
    'DEBT_CREDIT_DIFF': ['mean'],
    'CREDIT_TO_ANNUITY_RATIO': ['mean'],
    'MONTHS_BALANCE_MEAN': ['mean', 'var'],
    'MONTHS_BALANCE_SIZE': ['mean', 'sum'],
}

BUREAU_CLOSED_AGG = {
    'DAYS_CREDIT': ['max', 'var'],
    'DAYS_CREDIT_ENDDATE': ['max'],
    'AMT_CREDIT_MAX_OVERDUE': ['max', 'mean'],
    'AMT_CREDIT_SUM_OVERDUE': ['mean'],
    'AMT_CREDIT_SUM': ['max', 'mean', 'sum'],
    'AMT_CREDIT_SUM_DEBT': ['max', 'sum'],
    'DAYS_CREDIT_UPDATE': ['max'],
    'ENDDATE_DIF': ['mean'],
    'STATUS_12345': ['mean'],
}

BUREAU_LOAN_TYPE_AGG = {
    'DAYS_CREDIT': ['mean', 'max'],
    'AMT_CREDIT_MAX_OVERDUE': ['mean', 'max'],
    'AMT_CREDIT_SUM': ['mean', 'max'],
    'AMT_CREDIT_SUM_DEBT': ['mean', 'max'],
    'DEBT_PERCENTAGE': ['mean'],
    'DEBT_CREDIT_DIFF': ['mean'],
    'DAYS_CREDIT_ENDDATE': ['max'],
}

BUREAU_TIME_AGG = {
    'AMT_CREDIT_MAX_OVERDUE': ['max', 'mean'],
    'AMT_CREDIT_SUM_OVERDUE': ['mean'],
    'AMT_CREDIT_SUM': ['max', 'sum'],
    'AMT_CREDIT_SUM_DEBT': ['mean', 'sum'],
    'DEBT_PERCENTAGE': ['mean'],
    'DEBT_CREDIT_DIFF': ['mean'],
    'STATUS_0': ['mean'],
    'STATUS_12345': ['mean'],
}
    In [48]
agg_bureau = group(bureau, 'BUREAU_', BUREAU_AGG)
active = bureau[bureau['CREDIT_ACTIVE_Active'] == 1]
agg_bureau = group_and_merge(active, agg_bureau, 'BUREAU_ACTIVE_', BUREAU_ACTIVE_AGG)
closed = bureau[bureau['CREDIT_ACTIVE_Closed'] == 1]
agg_bureau = group_and_merge(closed, agg_bureau, 'BUREAU_CLOSED_', BUREAU_CLOSED_AGG)
del active, closed; gc.collect()
for credit_type in ['Consumer credit', 'Credit card', 'Mortgage', 'Car loan', 'Microloan']:
    type_df = bureau[bureau['CREDIT_TYPE_' + credit_type] == 1]
    prefix = 'BUREAU_' + credit_type.split(' ')[0].upper() + '_'
    agg_bureau = group_and_merge(type_df, agg_bureau, prefix, BUREAU_LOAN_TYPE_AGG)
    del type_df; gc.collect()
for time_frame in [6, 12]:
    prefix = "BUREAU_LAST{}M_".format(time_frame)
    time_frame_df = bureau[bureau['DAYS_CREDIT'] >= -30 * time_frame]
    agg_bureau = group_and_merge(time_frame_df, agg_bureau, prefix, BUREAU_TIME_AGG)
    del time_frame_df; gc.collect()
    In [49]
sort_bureau = bureau.sort_values(by=['DAYS_CREDIT'])
gr = sort_bureau.groupby('SK_ID_CURR')['AMT_CREDIT_MAX_OVERDUE'].last().reset_index()
gr.rename(columns={'AMT_CREDIT_MAX_OVERDUE': 'BUREAU_LAST_LOAN_MAX_OVERDUE'}, inplace=True)
agg_bureau = agg_bureau.merge(gr, on='SK_ID_CURR', how='left')
agg_bureau['BUREAU_DEBT_OVER_CREDIT'] = \
    agg_bureau['BUREAU_AMT_CREDIT_SUM_DEBT_SUM']/agg_bureau['BUREAU_AMT_CREDIT_SUM_SUM']
agg_bureau['BUREAU_ACTIVE_DEBT_OVER_CREDIT'] = \
    agg_bureau['BUREAU_ACTIVE_AMT_CREDIT_SUM_DEBT_SUM']/agg_bureau['BUREAU_ACTIVE_AMT_CREDIT_SUM_SUM']
    In [50]
df = pd.merge(df, agg_bureau, on='SK_ID_CURR', how='left')
del agg_bureau, bureau
gc.collect()
       
39
                In [51]
prev = pd.read_csv(os.path.join(DATA_DIRECTORY, 'previous_application.csv'))
pay = pd.read_csv(os.path.join(DATA_DIRECTORY, 'installments_payments.csv'))
    In [52]
PREVIOUS_AGG = {
    'SK_ID_PREV': ['nunique'],
    'AMT_ANNUITY': ['min', 'max', 'mean'],
    'AMT_DOWN_PAYMENT': ['max', 'mean'],
    'HOUR_APPR_PROCESS_START': ['min', 'max', 'mean'],
    'RATE_DOWN_PAYMENT': ['max', 'mean'],
    'DAYS_DECISION': ['min', 'max', 'mean'],
    'CNT_PAYMENT': ['max', 'mean'],
    'DAYS_TERMINATION': ['max'],
    # Engineered features
    'CREDIT_TO_ANNUITY_RATIO': ['mean', 'max'],
    'APPLICATION_CREDIT_DIFF': ['min', 'max', 'mean'],
    'APPLICATION_CREDIT_RATIO': ['min', 'max', 'mean', 'var'],
    'DOWN_PAYMENT_TO_CREDIT': ['mean'],
}

PREVIOUS_ACTIVE_AGG = {
    'SK_ID_PREV': ['nunique'],
    'SIMPLE_INTERESTS': ['mean'],
    'AMT_ANNUITY': ['max', 'sum'],
    'AMT_APPLICATION': ['max', 'mean'],
    'AMT_CREDIT': ['sum'],
    'AMT_DOWN_PAYMENT': ['max', 'mean'],
    'DAYS_DECISION': ['min', 'mean'],
    'CNT_PAYMENT': ['mean', 'sum'],
    'DAYS_LAST_DUE_1ST_VERSION': ['min', 'max', 'mean'],
    # Engineered features
    'AMT_PAYMENT': ['sum'],
    'INSTALMENT_PAYMENT_DIFF': ['mean', 'max'],
    'REMAINING_DEBT': ['max', 'mean', 'sum'],
    'REPAYMENT_RATIO': ['mean'],
}

PREVIOUS_LATE_PAYMENTS_AGG = {
    'DAYS_DECISION': ['min', 'max', 'mean'],
    'DAYS_LAST_DUE_1ST_VERSION': ['min', 'max', 'mean'],
    # Engineered features
    'APPLICATION_CREDIT_DIFF': ['min'],
    'NAME_CONTRACT_TYPE_Consumer loans': ['mean'],
    'NAME_CONTRACT_TYPE_Cash loans': ['mean'],
    'NAME_CONTRACT_TYPE_Revolving loans': ['mean'],
}

PREVIOUS_LOAN_TYPE_AGG = {
    'AMT_CREDIT': ['sum'],
    'AMT_ANNUITY': ['mean', 'max'],
    'SIMPLE_INTERESTS': ['min', 'mean', 'max', 'var'],
    'APPLICATION_CREDIT_DIFF': ['min', 'var'],
    'APPLICATION_CREDIT_RATIO': ['min', 'max', 'mean'],
    'DAYS_DECISION': ['max'],
    'DAYS_LAST_DUE_1ST_VERSION': ['max', 'mean'],
    'CNT_PAYMENT': ['mean'],
}

PREVIOUS_TIME_AGG = {
    'AMT_CREDIT': ['sum'],
    'AMT_ANNUITY': ['mean', 'max'],
    'SIMPLE_INTERESTS': ['mean', 'max'],
    'DAYS_DECISION': ['min', 'mean'],
    'DAYS_LAST_DUE_1ST_VERSION': ['min', 'max', 'mean'],
    # Engineered features
    'APPLICATION_CREDIT_DIFF': ['min'],
    'APPLICATION_CREDIT_RATIO': ['min', 'max', 'mean'],
    'NAME_CONTRACT_TYPE_Consumer loans': ['mean'],
    'NAME_CONTRACT_TYPE_Cash loans': ['mean'],
    'NAME_CONTRACT_TYPE_Revolving loans': ['mean'],
}

PREVIOUS_APPROVED_AGG = {
    'SK_ID_PREV': ['nunique'],
    'AMT_ANNUITY': ['min', 'max', 'mean'],
    'AMT_CREDIT': ['min', 'max', 'mean'],
    'AMT_DOWN_PAYMENT': ['max'],
    'AMT_GOODS_PRICE': ['max'],
    'HOUR_APPR_PROCESS_START': ['min', 'max'],
    'DAYS_DECISION': ['min', 'mean'],
    'CNT_PAYMENT': ['max', 'mean'],
    'DAYS_TERMINATION': ['mean'],
    # Engineered features
    'CREDIT_TO_ANNUITY_RATIO': ['mean', 'max'],
    'APPLICATION_CREDIT_DIFF': ['max'],
    'APPLICATION_CREDIT_RATIO': ['min', 'max', 'mean'],
    # The following features are only for approved applications
    'DAYS_FIRST_DRAWING': ['max', 'mean'],
    'DAYS_FIRST_DUE': ['min', 'mean'],
    'DAYS_LAST_DUE_1ST_VERSION': ['min', 'max', 'mean'],
    'DAYS_LAST_DUE': ['max', 'mean'],
    'DAYS_LAST_DUE_DIFF': ['min', 'max', 'mean'],
    'SIMPLE_INTERESTS': ['min', 'max', 'mean'],
}

PREVIOUS_REFUSED_AGG = {
    'AMT_APPLICATION': ['max', 'mean'],
    'AMT_CREDIT': ['min', 'max'],
    'DAYS_DECISION': ['min', 'max', 'mean'],
    'CNT_PAYMENT': ['max', 'mean'],
    # Engineered features
    'APPLICATION_CREDIT_DIFF': ['min', 'max', 'mean', 'var'],
    'APPLICATION_CREDIT_RATIO': ['min', 'mean'],
    'NAME_CONTRACT_TYPE_Consumer loans': ['mean'],
    'NAME_CONTRACT_TYPE_Cash loans': ['mean'],
    'NAME_CONTRACT_TYPE_Revolving loans': ['mean'],
}
    In [53]
ohe_columns = [
    'NAME_CONTRACT_STATUS', 'NAME_CONTRACT_TYPE', 'CHANNEL_TYPE',
    'NAME_TYPE_SUITE', 'NAME_YIELD_GROUP', 'PRODUCT_COMBINATION',
    'NAME_PRODUCT_TYPE', 'NAME_CLIENT_TYPE']
prev, categorical_cols = one_hot_encoder(prev, ohe_columns, nan_as_category=False)
    In [54]
prev['APPLICATION_CREDIT_DIFF'] = prev['AMT_APPLICATION'] - prev['AMT_CREDIT']
prev['APPLICATION_CREDIT_RATIO'] = prev['AMT_APPLICATION'] / prev['AMT_CREDIT']
prev['CREDIT_TO_ANNUITY_RATIO'] = prev['AMT_CREDIT']/prev['AMT_ANNUITY']
prev['DOWN_PAYMENT_TO_CREDIT'] = prev['AMT_DOWN_PAYMENT'] / prev['AMT_CREDIT']
total_payment = prev['AMT_ANNUITY'] * prev['CNT_PAYMENT']
prev['SIMPLE_INTERESTS'] = (total_payment/prev['AMT_CREDIT'] - 1)/prev['CNT_PAYMENT']
    In [55]
# .copy() so that the DAYS_LAST_DUE_DIFF assignment below does not raise a
# SettingWithCopyWarning
approved = prev[prev['NAME_CONTRACT_STATUS_Approved'] == 1].copy()
active_df = approved[approved['DAYS_LAST_DUE'] == 365243]
active_pay = pay[pay['SK_ID_PREV'].isin(active_df['SK_ID_PREV'])]
active_pay_agg = active_pay.groupby('SK_ID_PREV')[['AMT_INSTALMENT', 'AMT_PAYMENT']].sum()
active_pay_agg.reset_index(inplace=True)
active_pay_agg['INSTALMENT_PAYMENT_DIFF'] = active_pay_agg['AMT_INSTALMENT'] - active_pay_agg['AMT_PAYMENT']
active_df = active_df.merge(active_pay_agg, on='SK_ID_PREV', how='left')
active_df['REMAINING_DEBT'] = active_df['AMT_CREDIT'] - active_df['AMT_PAYMENT']
active_df['REPAYMENT_RATIO'] = active_df['AMT_PAYMENT'] / active_df['AMT_CREDIT']
active_agg_df = group(active_df, 'PREV_ACTIVE_', PREVIOUS_ACTIVE_AGG)
active_agg_df['TOTAL_REPAYMENT_RATIO'] = active_agg_df['PREV_ACTIVE_AMT_PAYMENT_SUM'] / \
                                         active_agg_df['PREV_ACTIVE_AMT_CREDIT_SUM']
del active_pay, active_pay_agg, active_df; gc.collect()
       
0
                In [56]
prev['DAYS_FIRST_DRAWING'].replace(365243, np.nan, inplace= True)
prev['DAYS_FIRST_DUE'].replace(365243, np.nan, inplace= True)
prev['DAYS_LAST_DUE_1ST_VERSION'].replace(365243, np.nan, inplace= True)
prev['DAYS_LAST_DUE'].replace(365243, np.nan, inplace= True)
prev['DAYS_TERMINATION'].replace(365243, np.nan, inplace= True)
    In [57]
prev['DAYS_LAST_DUE_DIFF'] = prev['DAYS_LAST_DUE_1ST_VERSION'] - prev['DAYS_LAST_DUE']
approved['DAYS_LAST_DUE_DIFF'] = approved['DAYS_LAST_DUE_1ST_VERSION'] - approved['DAYS_LAST_DUE']
    In [58]
categorical_agg = {key: ['mean'] for key in categorical_cols}
    In [59]
agg_prev = group(prev, 'PREV_', {**PREVIOUS_AGG, **categorical_agg})
agg_prev = agg_prev.merge(active_agg_df, how='left', on='SK_ID_CURR')
del active_agg_df; gc.collect()
       
0
                In [60]
agg_prev = group_and_merge(approved, agg_prev, 'APPROVED_', PREVIOUS_APPROVED_AGG)
refused = prev[prev['NAME_CONTRACT_STATUS_Refused'] == 1]
agg_prev = group_and_merge(refused, agg_prev, 'REFUSED_', PREVIOUS_REFUSED_AGG)
del approved, refused; gc.collect()
       
0
                In [61]
for loan_type in ['Consumer loans', 'Cash loans']:
    type_df = prev[prev['NAME_CONTRACT_TYPE_{}'.format(loan_type)] == 1]
    prefix = 'PREV_' + loan_type.split(" ")[0] + '_'
    agg_prev = group_and_merge(type_df, agg_prev, prefix, PREVIOUS_LOAN_TYPE_AGG)
    del type_df; gc.collect()
    In [62]
pay['LATE_PAYMENT'] = pay['DAYS_ENTRY_PAYMENT'] - pay['DAYS_INSTALMENT']
pay['LATE_PAYMENT'] = pay['LATE_PAYMENT'].apply(lambda x: 1 if x > 0 else 0)
dpd_id = pay[pay['LATE_PAYMENT'] > 0]['SK_ID_PREV'].unique()
    In [63]
agg_prev = group_and_merge(prev[prev['SK_ID_PREV'].isin(dpd_id)], agg_prev,
                           'PREV_LATE_', PREVIOUS_LATE_PAYMENTS_AGG)
del dpd_id; gc.collect()
       
0
                In [64]
for time_frame in [12, 24]:
    time_frame_df = prev[prev['DAYS_DECISION'] >= -30 * time_frame]
    prefix = 'PREV_LAST{}M_'.format(time_frame)
    agg_prev = group_and_merge(time_frame_df, agg_prev, prefix, PREVIOUS_TIME_AGG)
    del time_frame_df; gc.collect()
del prev; gc.collect()
       
0
                In [65]
df = pd.merge(df, agg_prev, on='SK_ID_CURR', how='left')
    In [66]
train = df[df['TARGET'].notnull()]
test = df[df['TARGET'].isnull()]
del df
gc.collect()
       
98
                In [67]
labels = train['TARGET']
test_labels = test['TARGET']  # all NaN for the test set; kept only as a placeholder column
train = train.drop(columns=['TARGET'])
test = test.drop(columns=['TARGET'])
    In [68]
feature = list(train.columns)

train.replace([np.inf, -np.inf], np.nan, inplace=True)
test.replace([np.inf, -np.inf], np.nan, inplace=True)
test_df = test.copy()
train_df = train.copy()
train_df['TARGET'] = labels
test_df['TARGET'] = test_labels
    In [69]
imputer = SimpleImputer(strategy='median')
# Fit on the training set only, so test-set statistics do not leak into the medians
imputer.fit(train)
train = imputer.transform(train)
test = imputer.transform(test)
    In [70]
scaler = MinMaxScaler(feature_range=(0, 1))
# Again, fit on the training set only and apply the same scaling to both sets
scaler.fit(train)
train = scaler.transform(train)
test = scaler.transform(test)
    In [71]
from lightgbm import LGBMClassifier

lgbmc = LGBMClassifier()
lgbmc.fit(train, labels)
       
LGBMClassifier()
                In [72]
lgbm_pred = lgbmc.predict_proba(test)[:, 1]
    In [74]
submit = test_df[['SK_ID_CURR']].copy()
submit['TARGET'] = lgbm_pred
    In [75]
submit.to_csv('lgbm.csv', index = False)
   

Summary

The submission result is as follows (submitting requires access to kaggle.com):

Dataset Home Credit Default Risk
Public leaderboard score 0.78277
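The leaderboard score above comes from a single fit with default LGBMClassifier parameters. To iterate locally without submitting every change, one would typically estimate AUC with stratified K-fold cross-validation. A hedged sketch on synthetic data (scikit-learn's LogisticRegression stands in for the LGBMClassifier used above; `n_samples` and `weights` are illustrative, with `weights` chosen to mimic the roughly 8% default rate):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic imbalanced binary problem (~8% positives, like the competition data)
X, y = make_classification(n_samples=500, weights=[0.92, 0.08], random_state=0)

# Stratified folds preserve the class ratio in each split, which matters
# for a stable AUC estimate on imbalanced data
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         scoring='roc_auc', cv=cv)
print(round(scores.mean(), 3))
```

The mean of the per-fold AUCs gives a local proxy for the leaderboard metric, so feature-engineering changes can be compared offline before spending a submission.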





 2025-07-22
