Starter Code for DBSRA Practice Challenge

Author: Gauransh Kumar

Note: Create a copy of this notebook and use the copy for your submission. Go to File > Save a Copy in Drive to create the copy.

Downloading the Dataset

Installing aicrowd-cli

In [1]:

!pip install aicrowd-cli
%load_ext aicrowd.magic

Requirement already satisfied: aicrowd-cli in /home/gauransh/anaconda3/lib/python3.8/site-packages (0.1.10)
Requirement already satisfied: pyzmq==22.1.0 in /home/gauransh/anaconda3/lib/python3.8/site-packages (from aicrowd-cli) (22.1.0)
Requirement already satisfied: toml<1,>=0.10.2 in /home/gauransh/anaconda3/lib/python3.8/site-packages (from aicrowd-cli) (0.10.2)
Requirement already satisfied: requests-toolbelt<1,>=0.9.1 in /home/gauransh/anaconda3/lib/python3.8/site-packages (from aicrowd-cli) (0.9.1)
Requirement already satisfied: rich<11,>=10.0.0 in /home/gauransh/anaconda3/lib/python3.8/site-packages (from aicrowd-cli) (10.15.2)
Requirement already satisfied: tqdm<5,>=4.56.0 in /home/gauransh/anaconda3/lib/python3.8/site-packages (from aicrowd-cli) (4.62.2)
Requirement already satisfied: click<8,>=7.1.2 in /home/gauransh/anaconda3/lib/python3.8/site-packages (from aicrowd-cli) (7.1.2)
Requirement already satisfied: requests<3,>=2.25.1 in /home/gauransh/anaconda3/lib/python3.8/site-packages (from aicrowd-cli) (2.26.0)
Requirement already satisfied: GitPython==3.1.18 in /home/gauransh/anaconda3/lib/python3.8/site-packages (from aicrowd-cli) (3.1.18)
Requirement already satisfied: gitdb<5,>=4.0.1 in /home/gauransh/anaconda3/lib/python3.8/site-packages (from GitPython==3.1.18->aicrowd-cli) (4.0.9)
Requirement already satisfied: smmap<6,>=3.0.1 in /home/gauransh/anaconda3/lib/python3.8/site-packages (from gitdb<5,>=4.0.1->GitPython==3.1.18->aicrowd-cli) (5.0.0)
Requirement already satisfied: idna<4,>=2.5 in /home/gauransh/anaconda3/lib/python3.8/site-packages (from requests<3,>=2.25.1->aicrowd-cli) (3.1)
Requirement already satisfied: charset-normalizer~=2.0.0 in /home/gauransh/anaconda3/lib/python3.8/site-packages (from requests<3,>=2.25.1->aicrowd-cli) (2.0.0)
Requirement already satisfied: certifi>=2017.4.17 in /home/gauransh/anaconda3/lib/python3.8/site-packages (from requests<3,>=2.25.1->aicrowd-cli) (2021.10.8)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /home/gauransh/anaconda3/lib/python3.8/site-packages (from requests<3,>=2.25.1->aicrowd-cli) (1.26.6)
Requirement already satisfied: pygments<3.0.0,>=2.6.0 in /home/gauransh/anaconda3/lib/python3.8/site-packages (from rich<11,>=10.0.0->aicrowd-cli) (2.10.0)
Requirement already satisfied: colorama<0.5.0,>=0.4.0 in /home/gauransh/anaconda3/lib/python3.8/site-packages (from rich<11,>=10.0.0->aicrowd-cli) (0.4.4)
Requirement already satisfied: commonmark<0.10.0,>=0.9.0 in /home/gauransh/anaconda3/lib/python3.8/site-packages (from rich<11,>=10.0.0->aicrowd-cli) (0.9.1)

In [2]:

%aicrowd login

Please login here: https://api.aicrowd.com/auth/u4Ud9bic2o225Kr6qw1hMK4WaJJFpuO_YYET-Q2E21g
Opening in existing browser session.
API Key valid
Saved API Key successfully!

In [3]:

!rm -rf data
!mkdir data
%aicrowd ds dl -c dbsra -o data

Importing Libraries

In this baseline, we will use the sklearn library to train the model and generate the predictions.

In [4]:

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import AdaBoostClassifier
import os
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display

Reading the dataset

Here, we read train.csv, which contains both the training samples and their labels, and test.csv, which contains the test samples.

In [5]:

# Reading the CSV
train_data_df = pd.read_csv("data/train.csv")
test_data_df = pd.read_csv("data/test.csv")

# Previewing the first rows of each DataFrame
display(train_data_df.head())
display(test_data_df.head())

	encounter_id	patient_nbr	race	gender	age	weight	admission_type_id	discharge_disposition_id	admission_source_id	time_in_hospital	...	citoglipton	insulin	glyburide-metformin	glipizide-metformin	glimepiride-pioglitazone	metformin-rosiglitazone	metformin-pioglitazone	change	diabetesMed	readmitted
0	15905856	4622157	AfricanAmerican	Female	[70-80)	?	1	1	7	2	...	No	Steady	No	No	No	No	No	No	Yes	1
1	192457734	93617955	Caucasian	Female	[90-100)	?	3	1	1	8	...	No	Down	No	No	No	No	No	Ch	Yes	1
2	242557524	82335384	Caucasian	Female	[80-90)	?	1	2	7	1	...	No	Steady	No	No	No	No	No	No	Yes	0
3	319561658	39177882	Caucasian	Male	[60-70)	?	3	1	6	6	...	No	Steady	No	No	No	No	No	Ch	Yes	0
4	106914300	18587601	?	Female	[70-80)	?	1	3	6	3	...	No	No	No	No	No	No	No	No	No	0

5 rows × 50 columns

	encounter_id	patient_nbr	race	gender	age	weight	admission_type_id	discharge_disposition_id	admission_source_id	time_in_hospital	...	examide	citoglipton	insulin	glyburide-metformin	glipizide-metformin	glimepiride-pioglitazone	metformin-rosiglitazone	metformin-pioglitazone	change	diabetesMed
0	110939484	19274094	Caucasian	Female	[70-80)	?	1	1	6	11	...	No	No	Steady	No	No	No	No	No	No	Yes
1	170328306	65634327	Caucasian	Male	[50-60)	?	1	1	1	1	...	No	No	No	No	No	No	No	No	No	Yes
2	245688426	100657359	Caucasian	Female	[60-70)	?	3	6	1	4	...	No	No	No	No	No	No	No	No	No	Yes
3	150826224	83144448	Caucasian	Male	[30-40)	?	2	1	1	12	...	No	No	No	No	No	No	No	No	No	Yes
4	135993852	65234214	AfricanAmerican	Female	[60-70)	?	1	2	7	1	...	No	No	No	No	No	No	No	No	No	Yes

5 rows × 49 columns
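
In the previews above, several cells (for example in the weight and race columns) contain a literal `?`. Assuming that `?` marks a missing value in this dataset, a minimal sketch of how pandas could parse it as NaN at read time is shown here. The CSV text is made up for illustration and is not taken from train.csv.

```python
import io

import pandas as pd

# Tiny made-up CSV that imitates the "?" placeholders seen in the preview.
csv_text = """race,gender,weight,time_in_hospital
Caucasian,Female,?,2
?,Male,?,6
AfricanAmerican,Female,[75-100),3
"""

# na_values="?" asks pandas to treat the string "?" as NaN while reading.
toy_df = pd.read_csv(io.StringIO(csv_text), na_values="?")

# Fraction of missing values per column.
missing_share = toy_df.isna().mean()
print(missing_share)
```

The same `na_values="?"` argument could be passed when reading data/train.csv and data/test.csv, which makes it easier to see how sparse each column is before deciding what to keep.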

Data Preprocessing

In [6]:

# Selecting only the columns with np.int64 dtype.
# This is a shortcut that keeps the baseline simple;
# a real competition entry should not discard the other columns.

train_cols = []
for column in train_data_df.columns:
    if train_data_df[column].dtype == np.int64:
        train_cols.append(column)
print(train_cols)

['encounter_id', 'patient_nbr', 'admission_type_id', 'discharge_disposition_id', 'admission_source_id', 'time_in_hospital', 'num_lab_procedures', 'num_procedures', 'num_medications', 'number_outpatient', 'number_emergency', 'number_inpatient', 'number_diagnoses', 'readmitted']
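
The dtype loop above can also be written with `DataFrame.select_dtypes`. The sketch below uses a made-up toy frame with a few of the column names from the notebook. It additionally drops `encounter_id` and `patient_nbr`, on the assumption (based only on their names) that they are record identifiers rather than measurements. That step is a suggestion, not part of the original baseline.

```python
import pandas as pd

# Toy frame with made-up values; column names are borrowed from train.csv.
toy_df = pd.DataFrame({
    "encounter_id": [101, 102, 103],
    "race": ["Caucasian", "?", "AfricanAmerican"],
    "time_in_hospital": [2, 8, 1],
    "readmitted": [1, 1, 0],
}).astype({"encounter_id": "int64", "time_in_hospital": "int64", "readmitted": "int64"})

# Same selection as the loop: keep only the int64 columns.
int_cols = toy_df.select_dtypes(include=["int64"]).columns.tolist()

# Assumption: these two columns are identifiers, so they are excluded here.
id_cols = ["encounter_id", "patient_nbr"]
label_col = "readmitted"

# Selecting the label by name avoids relying on it being the last column.
feature_cols = [c for c in int_cols if c not in id_cols and c != label_col]
print(int_cols)
print(feature_cols)
```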

In [7]:

# Separating the features (every selected column except the last) from the label (readmitted)
X = train_data_df[train_cols[:-1]].to_numpy()
y = train_data_df[train_cols[-1]].to_numpy()
print(X.shape, y.shape)

(86501, 13) (86501,)

In [8]:

# Visualising the final label classes for training
sns.countplot(x=y)

Out[8]:

<AxesSubplot:ylabel='count'>
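
Alongside the plot, the class balance can be printed as numbers. This is a minimal sketch using `np.unique`; the label array is made up and only stands in for `y`.

```python
import numpy as np

# Made-up labels standing in for the notebook's y array.
y_toy = np.array([0, 1, 0, 0, 1, 0])

# Count how many samples fall into each class.
classes, counts = np.unique(y_toy, return_counts=True)
shares = counts / counts.sum()

for cls, cnt, share in zip(classes, counts, shares):
    print(f"class {cls}: {cnt} samples ({share:.1%})")
```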

Splitting the data

In [9]:

# Splitting the data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)
print(X_train.shape)
print(y_train.shape)

(69200, 13)
(69200,)
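
The split above is random on every run, so the validation score will vary between runs. One possible variation, sketched here on made-up data, fixes `random_state` for repeatability and passes `stratify` so both parts keep the same class proportions. The seed value 42 is arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 samples, 3 features, 60/40 class balance.
rng = np.random.default_rng(0)
X_toy = rng.integers(0, 10, size=(100, 3))
y_toy = np.array([0] * 60 + [1] * 40)

# random_state makes the split repeatable; stratify keeps class shares equal.
X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(X_tr.shape, X_va.shape)
print(y_tr.mean(), y_va.mean())
```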

In [10]:

X_train[0], y_train[0]

Out[10]:

(array([244163550, 102695985,         2,         1,         7,         2,
                3,         0,         3,         0,         0,         0,
                5]),
 0)

Training the Model

In [11]:

model = AdaBoostClassifier()
model.fit(X_train, y_train)

Out[11]:

AdaBoostClassifier()

Validation

In [12]:

model.score(X_val, y_val)

Out[12]:

0.592566903647188
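
An accuracy of about 0.59 is easier to judge next to a trivial reference point. The sketch below computes the accuracy of always predicting the most frequent training label. The helper name `majority_baseline_accuracy` and the label arrays are made up for illustration.

```python
import numpy as np


def majority_baseline_accuracy(y_train, y_val):
    """Accuracy obtained by always predicting the most frequent training label."""
    classes, counts = np.unique(y_train, return_counts=True)
    majority_class = classes[np.argmax(counts)]
    return float(np.mean(y_val == majority_class))


# Made-up labels standing in for y_train and y_val.
y_train_toy = np.array([0, 0, 0, 1, 1])
y_val_toy = np.array([0, 1, 0, 1])

baseline = majority_baseline_accuracy(y_train_toy, y_val_toy)
print(baseline)
```

Calling the helper with the notebook's `y_train` and `y_val` would show how much the AdaBoost model adds over this constant prediction.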

With the baseline trained, let's run it on the real test data and see how to submit the predictions to the challenge.

Predictions

In [13]:

# Selecting the same feature columns from the test DataFrame
X_test = test_data_df[train_cols[:-1]].to_numpy()
print(X_test.shape)

(15265, 13)

In [14]:

# Predicting the labels
predictions = model.predict(X_test)
predictions.shape

Out[14]:

(15265,)

In [15]:

# Converting the predictions array into a pandas DataFrame
submission = pd.DataFrame({"readmitted":predictions})
submission

Out[15]:

	readmitted
0	0
1	0
2	1
3	1
4	0
...	...
15260	0
15261	0
15262	1
15263	0
15264	0

15265 rows × 1 columns

In [16]:

# Saving the pandas dataframe
!rm -rf assets
!mkdir assets
submission.to_csv(os.path.join("assets", "submission.csv"), index=False)
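
The shell commands above can be replaced by `os.makedirs`, which also works outside a Unix shell. The sketch below writes a made-up submission into a temporary folder and reads it back to check the header and row count. The temporary directory is used only so the example leaves no files behind.

```python
import os
import tempfile

import pandas as pd

# Made-up predictions standing in for the model output.
predictions_toy = [0, 1, 1, 0]
submission_toy = pd.DataFrame({"readmitted": predictions_toy})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_dir = os.path.join(tmp_dir, "assets")
    # exist_ok=True means no error is raised if the folder already exists.
    os.makedirs(out_dir, exist_ok=True)
    out_path = os.path.join(out_dir, "submission.csv")
    submission_toy.to_csv(out_path, index=False)

    # Read the file back to confirm what was actually written.
    reloaded = pd.read_csv(out_path)

print(list(reloaded.columns), len(reloaded))
```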

Submitting our Predictions

Note: Please save the notebook before submitting it (Ctrl + S).

In [17]:

!!aicrowd submission create -c dbsra -f assets/submission.csv

Out[17]:

['submission.csv ━━━━━━━━━━━━━━━━━━━━━━ 100.0% • 32.2/30.5 KB • 7.3 MB/s • 0:00:00',
 '                                  ╭─────────────────────────╮                                  ',
 '                                  │ Successfully submitted! │                                  ',
 '                                  ╰─────────────────────────╯                                  ',
 '                                        Important links                                        ',
 '┌──────────────────┬──────────────────────────────────────────────────────────────────────────┐',
 '│  This submission │ https://www.aicrowd.com/challenges/dbsra/submissions/172191              │',
 '│                  │                                                                          │',
 '│  All submissions │ https://www.aicrowd.com/challenges/dbsra/submissions?my_submissions=true │',
 '│                  │                                                                          │',
 '│      Leaderboard │ https://www.aicrowd.com/challenges/dbsra/leaderboards                    │',
 '│                  │                                                                          │',
 '│ Discussion forum │ https://discourse.aicrowd.com/c/dbsra                                    │',
 '│                  │                                                                          │',
 '│   Challenge page │ https://www.aicrowd.com/challenges/dbsra                                 │',
 '└──────────────────┴──────────────────────────────────────────────────────────────────────────┘',
 "{'submission_id': 172191, 'created_at': '2022-01-16T07:10:05.148Z'}"]
