ID.ai User Manual

Version 1.0 : 23/09/2024

Introduction

This User Manual (UM) provides the information analysts need to use the ID.ai platform effectively for building statistical models – and their documentation – to support data-driven business decision making.

Corestrat’s Model.ai offers model-building intelligence by removing the complexity of developing a predictive model, without you having to write a single line of code. AI-embedded Model.ai builds predictive models for the uploaded data and your target goals in seconds, employing multiple ML techniques.

ID.ai is a no-code, enterprise-ready platform that allows you to build and deploy classification and/or regression predictive models in a few clicks. Model.ai helps enterprise users get closer to smarter business decisions by capturing actionable, hidden patterns in the data.

Model.ai automates a large part of the repetitive machine learning steps to ease the tasks of data scientists and non-data scientists alike, enabling enterprises to adopt ML solutions swiftly and to focus on more complex issues.

Software Requirements Specifications

This section provides information about the minimum hardware and software configurations for installing the ID.ai software.

Operating System: Windows 10 & higher
RAM: >= 16 GB
Disk Space: >= 200 GB
Software: MS Word 16 & higher

Any configuration below the acceptable versions/configurations described above will result in installation and/or performance issues of ID.ai.

Installation of ID.ai

This section provides the steps to install the ID.ai application on the user’s laptop.

On purchase, Corestrat will provide an executable file named “ID.ai”, which looks like the following image.

The user should click on this exe file; the pop-up screen shown below will appear, requesting the user to proceed with the subsequent installation steps. Click on “Install”. If privileges are insufficient for direct installation, the user should right-click the file and run it as administrator.

The user needs to click on “Finish” once installation is complete.

The user will get a shortcut to open the ID.ai application on the desktop.

Click on the ID.ai icon which is shown below.

Getting Started

On opening the application, the user must enter the username and license key provided at the time of purchase. Please enter these in the relevant boxes in the opening screen shown below to proceed further.

There are six stages in building the desired outcome to drive data-driven decision making at scale; these stages are illustrated in sequence in the picture below.

The various steps in the six stages will be described in the subsequent sections.

Upload Data Section

Project Creation & File Input

For building a new project, click on “New Project” and assign the project name in the pop-up screen as shown below.

Once the user enters the name of the new project, (s)he can either save it and work on it later or start working on it immediately, using the respective options in the pop-up shown below.

In the subsequent screen, the “project_name” and “project path” are displayed at top left side of screen. For starting a new project, the user must point to the folder location and file name of the input data source.

ID.ai supports data upload in any of the following formats:

  1. CSV
  2. Excel
  3. Feather
  4. Parquet
  5. Flat text files with delimiters

The minimum number of records needed is 200 with at least 5 columns and a minimum target variable count of 10.

If an invalid file is used for input, the following error message pops up.

User needs to click on “Close” in the pop-up and go back and upload a dataset in one of the correct formats.

Existing Projects

1. The list of projects the user has worked on recently is provided; the user also has the option to search the project names.

For importing a project from a different location user must

1. Select the Import Project option.

2. Browse to the location where your project is saved.

3. Choose the project folder you wish to import

4. Confirm the import, and the project will be added here for you to view.

Once the user clicks on “Launch Project”, the imported project is loaded.

Input Data Management

Once the input file has been uploaded, a summary records count will be displayed along with the first 1000 rows’ actual data. Do check that the row and column counts match with the actual input. The screen also provides information on empty and duplicate rows & columns.

The user has the following options to customize the inputs.

1. Remove duplicates and keep only unique records

2. Consider or ignore rows & columns with NaN (“not-a-number”) values

3. Provide a list of potential columns to exclude (e.g. phone number, PIN code) for faster computation.

These options can be selected/unselected by using the check box(es) under the “Rows and Columns” heading in the left of the screen as shown in the following screenshot.
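Outside the tool, these three clean-up options can be sketched in pandas (the column names and values here are hypothetical, for illustration only):

```python
import pandas as pd

df = pd.DataFrame({
    "age":    [25, 25, 40, None],
    "phone":  ["111", "111", "222", "333"],
    "income": [50, 50, 80, 60],
})

# 1. Remove duplicates and keep only unique records
df = df.drop_duplicates()

# 2. Drop rows containing NaN values (alternatively, impute and keep them)
df = df.dropna()

# 3. Exclude non-predictive identifier columns (e.g. phone number)
df = df.drop(columns=["phone"])
```

After these steps the toy frame retains only the two complete, unique records with the identifier column removed.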

ID.ai offers the following features:

1. Convert numerical columns to categorical columns and vice versa.

2. Provides a list of columns having fewer than 20 distinct variable values as likely candidates for treatment as categorical

3. Indicates some likely candidates for dropping from model build since they are likely to have no predictive power.

These options are highlighted on the right side of the screen as shown below.
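The distinct-value heuristic and the type conversion can be sketched in pandas (hypothetical columns; ID.ai applies the same idea internally with a threshold of 20 distinct values):

```python
import pandas as pd

df = pd.DataFrame({"grade": [1, 2, 1, 3, 2],
                   "income": [50.0, 80.0, 60.0, 70.0, 55.0]})

# Columns with fewer than 20 distinct values are flagged as categorical candidates
candidates = [c for c in df.columns if df[c].nunique() < 20]

# Convert a numerical column to categorical (use .astype(float) for the reverse)
df["grade"] = df["grade"].astype("category")
```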

If the user already has a dataset where the target variables have been defined and would like to use this file for the model build, (s)he can upload it using the “Add Meta Data” option in the top right portion of the screen. With this approach, the ignore-variables option comes pre-populated with likely exclusion candidates.

The procedure to use this file is similar to the normal approach described earlier – using either “Drag and Drop” or “Browse File” options as shown below.

Once the user clicks on “Show sample”, a small pop-up displays the likely candidates for the role categories as shown below.

If user would like to view the above in Excel format, (s)he needs to click on “Download Sample” and the same will be displayed in Excel format as shown below.

Once the Meta file is uploaded, the key characteristics are displayed on screen as shown below.

Once the user applies the relevant changes by clicking on “Apply Changes”, the summary of these is displayed.

Once the user has selected and customized the input variables, key summary statistics for these are provided in the next screen:

1. For numerical variables the total records count, count of records with missing values, mean, median, skewness and kurtosis values are provided.

2. For categorical variables the total records count, count of records with missing values, the value and count of the maximum occurring category within each categorical variable is provided.

To get a visual representation of any variable’s distribution, the user can click on the bar chart icon under the Histogram column (blue oval in the screenshot above). The resultant chart provides the lower and upper bound values, the 25th/50th/75th percentile values, as well as any outliers. The range is also split into 10 bins by default, and the count and share in each bin are provided.
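The summary statistics and default 10-bin split described above can be reproduced with pandas (toy data; not ID.ai’s internal code):

```python
import pandas as pd

s = pd.Series([1, 2, 2, 3, 4, 5, 6, 7, 8, 100])   # 100 is an outlier

summary = {
    "count":    int(s.count()),
    "missing":  int(s.isna().sum()),
    "mean":     s.mean(),
    "median":   s.median(),
    "skewness": s.skew(),
    "kurtosis": s.kurt(),
    "p25":      s.quantile(0.25),
    "p75":      s.quantile(0.75),
}

# Split the value range into 10 equal-width bins and count records per bin
bin_counts = pd.cut(s, bins=10).value_counts(sort=False)
```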

Select Variables Section

Target Variable Selection

The first step is to select the variable that represents the outcome being modeled. ID.ai provides the following:

1. A list of all variables in the dataset which have fewer than 10 distinct values, called “candidate target variables”.

2. The user can select one variable from this list by moving it from the list on the left to the right using the “>” arrow.

3. To reverse the previous step, the user can move variables from the right to the left using the “<” arrow. (Both arrows are indicated in the blue-oval-highlighted portion of the screenshot.)

Once the target variable is selected, the next screen takes the user to “Define target categories”. This provides a list of all the distinct values within the selected target variable. The user needs to select which among these distinct values will be considered desired outcomes and which will not.

Stratified Sampling

The next step is to split the input dataset into the “train” and “test” samples. The model will be built on the “train” sample and the results will be applied on the “test” sample to check the integrity of the model build. The default is set at 70/30 split between train and test respectively, although the user has the option to customize the split.
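The stratified 70/30 split can be sketched in pandas: sampling 70% within each target class keeps the target rate identical in train and test (toy data; ID.ai performs this internally).

```python
import pandas as pd

df = pd.DataFrame({"x": range(100), "target": [1] * 20 + [0] * 80})

# Sample 70% within each target class so train and test keep the same target rate
train = df.groupby("target", group_keys=False).sample(frac=0.7, random_state=42)
test = df.drop(train.index)
```

Both splits retain the original 20% target rate, which is the point of stratifying rather than sampling at random.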

Independent Variables Exclusion

The user can – based on business context – decide to remove one or more independent variables from consideration while building the model. The “variables to be ignored” tab provides the list of all independent variables and user can exclude specific ones by using the “>” arrow to move these from the left to the right.

Independent Variable Insights

Users can combine multiple values of each categorical variable into custom value groups based on similar predictive power or business context. The next few steps indicate the procedures for the same.

Categorical Variables

1. The tab named “Target Rate Insights by Variables” provides: a. the count, b. the correlation between the target variable and the independent variable, c. the information value, and d. a histogram of this independent variable’s splits and bad rate.

2. If the user needs to combine some of the categories within this independent variable to get fewer categories, user needs to click on “Perform Manual Binning”

3. This leads to the next screen, where the “student” and “premier” categories are assigned the same value of 2 (using the “Key in your splits” column), indicating they are combined into a single category, whereas the “regular” category is retained as a separate category with a value of 1.

4. The resultant histogram providing the results from this modification is also generated.

5. Once the user is comfortable with this modification, (s)he needs to click on “submit” to save these changes.

Numerical Variables

For numerical variables similar metrics are provided as in categorical variables; the main difference being that three default segments are used as splits for the independent variable.

The user can change the bin sizes and thresholds by keying in each bin’s upper and lower bound values using the “Key in Splits by Comma” box.

Once the user has analysed the results of the custom splits, user can “Confirm and Submit” the changes made.
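The comma-keyed split points map directly to bin edges; a minimal pandas sketch (hypothetical values):

```python
import pandas as pd

income = pd.Series([12, 25, 40, 55, 70, 95])

# Split points as keyed into "Key in Splits by Comma", e.g. "30,60"
splits = [30, 60]
edges = [float("-inf")] + splits + [float("inf")]

binned = pd.cut(income, bins=edges, labels=["low", "mid", "high"])
bin_counts = binned.value_counts(sort=False)
```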

Correlation and Multicollinearity

This tab provides the correlation values between the target variable and all the independent variables. Using the Variance Inflation Factor (VIF), variables are listed as very high (VIF > 10), high (VIF between 5 & 10), moderate (VIF between 1.5 & 5) and low (VIF < 1.5).

The buttons “Top 10 variables most correlated with target” and “Highly correlated variable pairs” provide the respective information in pop-ups as shown below.
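VIF for a variable is 1 / (1 − R²), where R² comes from regressing that variable on all the other predictors. A self-contained numpy sketch (generic formula, not ID.ai’s internal implementation):

```python
import numpy as np

def vif(X):
    """VIF for each column: 1 / (1 - R^2) from regressing it on the others."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    vifs = []
    for j in range(k):
        y = X[:, j]
        # Intercept plus the remaining columns as predictors
        A = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        r2 = 1 - ((y - A @ beta) ** 2).sum() / ((y - y.mean()) ** 2).sum()
        vifs.append(1.0 / (1.0 - r2))
    return vifs

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 0.95 * x1 + rng.normal(scale=0.1, size=200)   # nearly collinear with x1
x3 = rng.normal(size=200)                          # independent
vifs = vif(np.column_stack([x1, x2, x3]))
```

With this data, x1 and x2 land in the “very high” band (VIF > 10) while x3 stays “low” (VIF < 1.5).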

Information Value

Information Value (IV) is a measure of the predictive power of an independent variable on the target variable. The IV thresholds for suspicious, strong, moderate, weak and very weak are > 2.0, 0.5 to 2.0, 0.1 to 0.5, 0.02 to 0.1 and < 0.02 respectively. These are provided in a bar-chart representation. By hovering over any bar within the chart, the specific variable’s IV value can be seen.
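IV is built from the Weight of Evidence (WOE) of each bin: WOE = ln(%good / %bad) and IV = Σ (%good − %bad) × WOE. A worked sketch on a toy categorical variable (the standard formula; hypothetical data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "grade": ["A", "A", "B", "B", "B", "C", "C", "C", "C", "C"],
    "bad":   [0,   1,   0,   1,   0,   1,   1,   0,   1,   1],
})

# Per category: WOE = ln(%good / %bad); IV = sum((%good - %bad) * WOE)
grp = df.groupby("grade")["bad"].agg(bads="sum", total="count")
grp["goods"] = grp["total"] - grp["bads"]
pct_good = grp["goods"] / grp["goods"].sum()
pct_bad = grp["bads"] / grp["bads"].sum()
woe = np.log(pct_good / pct_bad)
iv = float(((pct_good - pct_bad) * woe).sum())
```

Here category B is good-heavy (positive WOE), C is bad-heavy (negative WOE), and the overall IV of roughly 0.8 falls in the “strong” band above.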

Train a Model Section

Model Settings and Root Node

Once all the input data has been finalized, the next stage is to build the model. The first screen provides the record count, target count and target rate for the train and test samples. This root node is always labelled Node “ID 0”.

Before starting a model build, User has the option to customize the following parameters by clicking on the settings symbol on the top right corner of the screen (highlighted using a small blue circle above):

  1. Global parameters: node size, maximum split levels, pairwise correlation limits, VIF and IV
  2. Score scaling: Base Score, Base Odds and PDO (Points to Double the Odds)
  3. Decision tree parameters: Minimum cases (#) or targets (# or %) in a node
  4. AI parameters: p-value limit

Once the user has input the desired settings, these are saved by clicking on the “Update” and “Save Settings” button on the respective pop-up boxes.
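The Base Score / Base Odds / PDO settings in step 2 typically follow the standard scorecard scaling convention (assumed here, since the manual does not spell out the formula): factor = PDO / ln 2, offset = Base Score − factor × ln(Base Odds), and score = offset + factor × ln(odds).

```python
import math

base_score, base_odds, pdo = 600, 50, 20   # hypothetical settings values

factor = pdo / math.log(2)
offset = base_score - factor * math.log(base_odds)

def scaled_score(odds_good):
    """Scaled score for a node with the given good:bad odds."""
    return offset + factor * math.log(odds_good)

s_base = scaled_score(base_odds)        # equals the base score
s_double = scaled_score(2 * base_odds)  # base score + PDO points
```

By construction, the base odds map to the base score, and every doubling of the odds adds exactly PDO points.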

Once the settings are finalized, the user can click on root node to get multiple options to either grow the decision tree (“Auto Grow”) or build an AI based model (“Run Model.ai”). In both these cases, the user does not have any control over the resulting tree and segmentation.

Auto Grow

When the “Auto Grow” option is selected, a tree is developed starting from the root node and progressing to subsequent levels based on robust separation within each level and across different levels.

If the user does not wish to split a particular node into sub-classes, they can go back to that node and click on the “Collapse node” option.

To insert an additional split within a node, the user can click on “Add your splits” within that node. The user then gets the option to choose any variable to be split and click on “Split node”; the IV values are provided to make an informed decision. Do note that the variables with the highest IVs are placed at the top.

Once the automatically generated splits are available, user can generate custom splits within this variable by clicking on “add your splits”. This leads to a dialog box where user can assign the variable ranges in integer-based groups.

Another way to split is to “Specify the split point by comma separated” values on the right of the screen. Press “Submit”, and after that click on “Expand node”.

Run Model.ai

In this option, the user just clicks on the “Run Model.ai” button and the resultant nodes are automatically generated. The user does not have the option to split or collapse these nodes.

User can click on “Click to view model.ai output” to see the following results for the node on which model.ai was run:

1. Node details, variable importance, model performance metrics & target rate by score.

2. Logistic regression technical details

3. Scorecard details for that node

4. User can click on “click to see the graph” in the model performance metrics tab to get a visual representation of KS and Gini values for the train and test samples.

Evaluate Your Model

Once the model has been built, the next stage involves analysing the performance of the model. This section provides a description of the various performance metrics available in ID.ai to evaluate the model. The headings in the bullets refer to the buttons within the tool.

1. Model summary by Node – information on all nodes used in the model, plus counts and bad rates in both Train and Test samples.

2. Model summary by score – counts and bad rates for each score bin and raw score for both Train and Test samples.

3. KS & Gini chart – visual representation of KS and Gini values for both Train and Test samples.

4. Model scorecard – used to scale the scores for the various nodes based on the predefined base score, base odds and PDO values.

The user can see the scores generated for the leaf nodes by clicking on the “Score card for Leaf Nodes” button, OR

the scores for the AI model node by clicking on the “Score card for Node ID 064” button.
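The KS and Gini values shown in the chart can be computed from scores and outcomes as follows; this is a generic numpy sketch of the standard definitions, not ID.ai’s internal implementation:

```python
import numpy as np

def ks_gini(scores, bad):
    scores = np.asarray(scores, dtype=float)
    bad = np.asarray(bad)
    order = np.argsort(scores)
    b = bad[order]
    # KS: max gap between cumulative bad and cumulative good shares by score
    cum_bad = np.cumsum(b) / b.sum()
    cum_good = np.cumsum(1 - b) / (1 - b).sum()
    ks = np.max(np.abs(cum_bad - cum_good))
    # Gini = |2*AUC - 1|, with AUC from the Mann-Whitney rank-sum statistic
    # (abs() because the sign flips depending on whether high scores mean good)
    ranks = scores.argsort().argsort() + 1   # 1-based ranks (no ties here)
    n_bad = b.sum()
    n_good = len(b) - n_bad
    auc = (ranks[bad == 1].sum() - n_bad * (n_bad + 1) / 2) / (n_bad * n_good)
    gini = abs(2 * auc - 1)
    return ks, gini

ks, gini = ks_gini([300, 400, 500, 600, 700, 800], [1, 1, 0, 1, 0, 0])
```

On this toy sample, where bads cluster at the low scores, KS is 2/3 and Gini is 7/9.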

Build Your Decision

Once the scorecard has been developed, the user can upload an OOT (out-of-time) dataset and simulate the decision based on the previously developed model. Simulation(s) can be done on the overall score or a particular sub-segment.

The first step is to upload a fresh OOT/unseen dataset which should contain ALL the variables in the model build stage. User needs to upload this file from the location to the tool as shown in the screenshot below. (The file specifications are the same as described earlier in Section 4.1.1)

User can choose between three different options before building a decision:

Option A – Reject Inferencing

Reject inference is a method for improving the quality of a model by incorporating data from previously rejected/unavailable records. Bias can result if a credit scorecard model is built only on accepted applications and does not account for rejected applications, which are unavailable and hence have unknown target status.

1. First click on “Choose the variable containing past decisions”.

2. Then click on “list of distinct values”.

3. Then click on “Build Decision”.

The next screen provides the population stability and characteristic stability.
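Population stability is conventionally measured with the Population Stability Index, PSI = Σ (actual% − expected%) × ln(actual% / expected%) across score bins. A minimal sketch with hypothetical bin shares (a common rule of thumb treats PSI below 0.1 as stable, though ID.ai’s exact thresholds are not stated here):

```python
import numpy as np

def psi(expected, actual):
    """Population Stability Index between two binned share distributions."""
    e = np.asarray(expected, dtype=float)
    a = np.asarray(actual, dtype=float)
    return float(np.sum((a - e) * np.log(a / e)))

# Hypothetical share of records per score bin: model build vs. OOT sample
build_share = [0.10, 0.20, 0.40, 0.30]
oot_share = [0.12, 0.18, 0.41, 0.29]
stability = psi(build_share, oot_share)   # small value => stable population
```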

User has the option to choose a distribution option between Overall, Decision Tree Leaf Nodes and the Specific Node using the drop-down menu in “Choose Distribution Option”.

The equivalent drop-down menu option for characteristic stability is available only when Model.ai is run

Option B – Define Actual Target

This approach enables the original model build dataset’s target variable to be used as the target variable in the newly uploaded dataset also.

  1. Click on “Choose the variable containing actual performance” to select the target variable
  2. Then click on “list of distinct values” to specify which values within the target variable should be considered as the target value
  3. Then click on “Build Decision” for validation

The next screen provides the population stability, KS & Gini and characteristic stability.

User has the option to choose a distribution option between Overall, Decision Tree Leaf Nodes and the Specific Node using the drop-down menu in “Choose Distribution Option”.

Population Distribution

K-S & Gini:

Characteristic stability:

Option C – Without Actual Target

In this approach, the OOT sample dataset does not have the target variable. The target is predicted by applying the model on the OOT dataset and predicting the target variable for each record based on the values of the independent variables.

The next screen provides the population stability and characteristic stability.

User has the option to choose a distribution option between Overall, Decision Tree Leaf Nodes and the Specific Node using the drop-down menu in “Choose Distribution Option”.

The equivalent drop-down menu option for characteristic stability is available only when Model.ai is run

Cut-off Decision Overall

This tab provides a summary of the scorecard performance when different cut-off thresholds are selected.  This overall approach does not allow segmented cut-offs which will be described in the next section.

This screen allows the user to view the score cut-offs and their impact on two pre-defined outcomes when compared to the existing scenario:

  1. Maintaining same target rate
  2. Maintaining same approval rate

Users can also choose actual score point values to analyse the impact of this on target and approval rates.

The impact of the above selections can be seen in terms of:

  1. Accept/Decline share before and after cut-off selection
  2. Bad rates of Accept/Decline before and after cut-off selection
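The impact of a chosen cut-off can be sketched in pandas (hypothetical scores and outcomes; ID.ai computes the same quantities from the OOT data):

```python
import pandas as pd

df = pd.DataFrame({
    "score": [520, 560, 590, 610, 640, 680, 700, 730],
    "bad":   [1,   1,   0,   1,   0,   0,   0,   0],
})

cutoff = 600   # accept cases scoring at or above this point

accept = df[df["score"] >= cutoff]
decline = df[df["score"] < cutoff]

approval_rate = len(accept) / len(df)      # share of cases accepted
accept_bad_rate = accept["bad"].mean()     # bad rate among accepts
decline_bad_rate = decline["bad"].mean()   # bad rate among declines
```

Moving the cut-off up or down trades approval rate against the bad rate of the accepted population, which is exactly the comparison this tab presents.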

Cut-off Decisions – Segmented

This option enables the user to set tailored cut-off decisions for specific segments by selecting a segmentation variable within individual score bins. This allows for decision making based on distinct groups within the chosen segmentation variable.

The screenshot below shows the distribution of all the cases by score bin.

Let us assume that the user would like to have a segmented cut-off decision using an additional variable named “APP_PROD_CODE” apart from the score. In this scenario, the user selects this variable in the Drop-down menu named “Choose segmented variable” available on the right side of the table – highlighted in blue oval.

The user will next be able to see the count of records within each score bin split into the different categories of the APP_PROD_CODE.

Two options are available to the user for segmented cut-offs.

Option 1:

  1. The user can use the segmentation variable as-is, clicking only on the specific boxes which will be accepted and leaving the remaining boxes blank.
  2. Once user clicks on “Apply” after doing the above, the results of this decision will be displayed including the population decline/accept counts and the bad rates.
  3. User can do multiple iterations of the accept/decline combinations and every time the impacts will be available.
  4. Once satisfied with the segmented cut-off decision, user can click on “Save Decision” to store this decision.

Option 2:

This option is to combine multiple categories of the segmented variable.

  1. Click on “segment Binning”.
  2. In the ensuing pop-up screen assign the same integer values to those categories of the segmented variable that need to be combined.
  3. Click on “Apply”.
  4. In the subsequent screen, click only on the specific boxes which will be accepted and leave the remaining boxes blank.
  5. Once user clicks on “Apply” after doing the above, the results of this decision will be displayed including the population decline/accept counts and the bad rates.
  6. User can do multiple iterations of the accept/decline combinations and every time the impacts will be available.
  7. Once satisfied with the segmented cut-off decision, user can click on “Save Decision” to store this decision.

Auto Documentation

Once the model-build and cut-off decisions have been completed, the final stage is the generation of the “Decision Tree Technical Document”, a comprehensive document that will be invaluable for audit-trail purposes. The user just needs to click on “Generate document” and an MS Word version of the document will be available in 2-3 minutes.

While almost the entire document is pre-filled with the relevant information, the following three sections within it need to be filled by the user before dissemination.

  1. Executive Summary 
  2. Data Sources and Sampling 
  3. Decision Tree Fact Sheet

Since these sections involve the business context and user’s knowledge of the data, these ideally should be filled by the user.

Appendix A – Statistical Terms

  1. p-value: Measure that quantifies the evidence against a null hypothesis. A low p-value suggests data is inconsistent with the null, potentially favouring an alternative hypothesis. Common significance thresholds are 0.05 or 0.01.
  2. IV (Information Value): A numerical value that quantifies the predictive power of an independent continuous variable x in capturing the binary dependent variable y. IV is helpful for reducing the number of variables as an initial step in preparing for Logistic Regression, especially when there are a large number of potential variables. IV is based on an analysis of each individual independent variable in turn without considering other predictor variables.
  3. WOE (Weight of evidence): Closely related to the IV value, WOE measures the strength of each grouped attribute in predicting the desired value of the Dependent Variable.
  4. VIF (Variance Inflation Factor): A measure of multicollinearity among the independent variables in a multiple regression model.
  5. OOT (Out of Time) Sample: Used to indicate a dataset from a period outside the original model build window; used to validate the accuracy of the model in other time periods.
  6. Gini coefficient: Gini coefficient, commonly known as Gini, is a metric widely used to evaluate classification models. It ranges from 0 to 1, with zero representing perfect equality (no discrimination) and one representing perfect inequality (perfect discrimination). In the context of credit risk modelling, a higher Gini coefficient indicates better model performance in terms of its ability to accurately rank borrowers based on their creditworthiness.
  7. K-S (“Kolmogorov-Smirnov”) Value: The KS value provides a measure of the discriminatory power of a model. It looks at the maximum difference between the distribution of cumulative events and cumulative non-events and is a way of comparing the cumulative sum of the positive and negative classes. It measures the maximum difference between the two over the range of predicted probabilities. A high KS score indicates that the model has a better separation between the positive (goods) and negative classes (bads).
  8. Skewness: Measure the degree of asymmetry of a distribution.
  9. Kurtosis: Measure of the tailedness (peakedness) of a distribution.
  10. Base score: The actual score point in the scaled scorecard which gives the base odds for the target variable to go into the desired state.
  11. Base odds: The odds of the target variable to go into the desired state at the base score.
  12. PDO (Points to Double the Odds): The score-point difference needed to double the odds of the target variable reaching the desired state.

Frequently Asked Questions

Q. What are the minimum specifications for a machine to install and run ID.ai?

A. The minimum configuration for a system hosting ID.ai is >= 16 GB RAM, 200 GB free disk space, and MS Word installed.

Q. Can I save the project in a custom folder other than the default folder path provided in ID.ai?

A. No, currently this facility is not available; it will be enabled in a future version of ID.ai.

Q. Does ID.ai run on Macbooks also?

A. The current version of ID.ai runs only in the Windows environment. Future versions will be Apple OS compatible also.

Q. What statistical technique is used for building the model?

A. Logistic regression

Q. What should user do if activation is unsuccessful?

A. Please reach out to your company’s system administrator who purchased the license keys from Corestrat or drop a mail to solutions@corestrat.ai with your license key

Q. Where can I find the current project saved?

A. The project is saved in the default local system path, usually “C:\users\<machinename>\documents\Idai\<projectname>”.

Q. Where is the generated auto-document saved?

A. The document is saved in the default local system path, usually “C:\users\<machinename>\documents\Idai\<projectname>\documents”.

Q. How can I start a new project while on one project?

A. In the home screen: Click on Home > All Projects > New Project

Q. How can I delete the project or dataset I uploaded?

A. There is a “delete” option (trash can icon) under each project in the path above.

Q. How is the performance of the model evaluated?

A. Through the metrics described in the “Evaluate Your Model” section.

Q. Can I make predictions with new data after training the model?

A. Yes; see the “Cut-off Decision Overall” section.

Q. Can I use the model to score new customers or cases?

A. Yes; see the “Cut-off Decisions – Segmented” section.

Q. Can I save and export my trained model?

A. Yes. The trained model is saved in the default folder path provided in ID.ai as a file in ‘.pkl’ format.

Q. Can I generate scorecards from the trained model?

A. Yes. The scorecard generated from the trained model is available in the ‘Model Scorecard’ tab of the ‘Evaluate Your Model’ section.

Q. Does ID.ai automatically detect the target variable?

A. ID.ai suggests a list of potential target variables; the user has the option to accept one of those or use another target variable. See the “Target Variable Selection” section.

Q. Can the user select the features to be included in the model?

A. Yes. See the “Independent Variables Exclusion” section.

Q. Where can I get the template for metadata?

A. Under “Data Preprocessing” tab, on the top right use the button named “Add Meta Data”

Version History & Feedback

Version History:  
Version No.   Date          Changes
1.0           17 Sep 2024   First version

Feedback:

Please reach out to us at solutions@corestrat.ai for any questions and issues.

Our office address is

# LGF, Tower ‘B’, Diamond District,
Old HAL airport Road,
Domlur, Bangalore 560008