ID.ai User Manual
Table of Contents
Introduction
This User Manual (UM) provides the information necessary for analysts to effectively use the ID.ai platform for building statistical models needed for data-driven business decision-making, and for documenting their development.
Corestrat’s Model.ai offers model-building intelligence by removing the complexity of developing a predictive model, without requiring you to write a single line of code. The AI-embedded Model.ai builds predictive models for the uploaded data and your target goals in a few seconds, employing multiple ML techniques.
The no-code, enterprise-ready platform enables you to build and deploy classification and/or regression predictive models in a few clicks. Model.ai helps enterprise users to get closer to smarter business decisions by capturing actionable and hidden patterns in the data.
Model.ai automates a large part of the repetitive machine learning steps to ease the tasks of data scientists and non-data scientists alike, thereby enabling enterprises to swiftly adopt ML solutions and to focus on more complex issues.
Software Requirements Specifications
This section provides information about the minimum hardware and software configurations for installing the ID.ai software.
Operating System | Windows 10 & Higher |
RAM | >=16 GB |
Disk Space | >=200 GB |
Software | MS Word 16 & Higher |
Any configuration below the minimum versions/configurations described above will result in installation and/or performance issues with ID.ai.
Installation of ID.ai
This section provides the steps to install the ID.ai application on the user’s laptop.
On purchase, Corestrat will provide an exe file named “ID.ai” which looks like the following image.
The user should click on this exe file; the pop-up screen shown below will appear, requesting the user to proceed with the subsequent installation steps. Click on “Install”. If privileges are insufficient for direct installation, the user should right-click the file and select “Run as Administrator”.
The user needs to click on “Finish” once installation is complete.
The user will get a shortcut to open the ID.ai application on the desktop.
Click on the ID.ai icon as shown below.
Getting Started
On opening the application, the user must input their username and license key, which would have been provided at the time of purchase. Please enter these in the relevant fields in the opening screen shown below to proceed further.
There are six stages in building the desired outcome to drive data-driven decision-making at scale; these stages are illustrated in sequence in the picture below.
The steps within the six stages are described in the subsequent sections.
Upload Data Section
Project Creation & File Input
ID.ai can be used to build a new model from scratch or refine an existing model using new assumptions or additional data elements. Both options are illustrated in the screenshot below.
For building a new project, click on “New Project” and assign the project name in the pop-up screen as shown below.
Once the user enters the name of the new project, they can either save it and work on it later, or start working on it immediately, using the respective options in the pop-up as shown below.
In the subsequent screen, the “project_name” and “project path” are displayed at the top left of the screen. For starting a new project, the user must point to the folder location and file name of the input data source.
ID.ai supports data upload in any of the following formats:
1. CSV
2. Excel
3. Feather
4. Parquet
5. Flat text files with delimiters
The minimum number of records needed is 200, with at least 5 columns and at least 10 records carrying the target value.
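For illustration, the following sketch (not ID.ai’s internal logic; the file name and the “target” column are placeholders) shows how a file in one of the supported formats could be loaded and the documented minimums checked with pandas:

```python
import pandas as pd

def load_input(path: str) -> pd.DataFrame:
    """Load an input file in any of the supported formats."""
    if path.endswith(".csv"):
        return pd.read_csv(path)
    if path.endswith((".xls", ".xlsx")):
        return pd.read_excel(path)
    if path.endswith(".feather"):
        return pd.read_feather(path)
    if path.endswith(".parquet"):
        return pd.read_parquet(path)
    return pd.read_csv(path, sep="|")  # flat text file with a "|" delimiter

df = load_input("applications.csv")                 # placeholder file name
assert len(df) >= 200, "need at least 200 records"
assert df.shape[1] >= 5, "need at least 5 columns"
assert (df["target"] == 1).sum() >= 10, "need at least 10 target records"
```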
If an invalid file is used for input, the following error message pops up. The user needs to click on “Close” in the pop-up, go back, and upload a dataset in one of the supported formats.
Import from Databricks
ID.ai supports data imports from Databricks.
1. Click on Import Data from DB.
2. Add Credentials and Click on Test Connection.
3. Write your SQL query to fetch and view the data. Once you are satisfied with the dataset and want to use it for model building, click “Next”.
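For reference, a minimal connection sketch using the open-source databricks-sql-connector package (the hostname, HTTP path, token, and query are placeholders; ID.ai’s own connector may differ):

```python
from databricks import sql  # pip install databricks-sql-connector

with sql.connect(
    server_hostname="<workspace-hostname>",
    http_path="<http-path>",
    access_token="<personal-access-token>",
) as conn:
    with conn.cursor() as cur:
        cur.execute("SELECT * FROM my_schema.applications LIMIT 1000")  # your SQL query
        rows = cur.fetchall()  # preview the data before clicking Next
```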
Existing Projects
1. The list of recent projects worked on by the user is provided; the user also has the option to search the project names.
2. For importing a project from a different location user must
1. Select the Import Project option.
2. Browse to the location where your project is saved.
3. Choose the project folder you wish to import.
4. Confirm the import, and the project will be added here for you to view.
3. Once the user clicks on “Launch Project”, the imported project is loaded.
Input Data Management
Once the input file has been uploaded, a summary record count is displayed along with the actual data for the first 1,000 rows. Check that the row and column counts match the actual input. The screen also provides information on empty and duplicate rows & columns.
The user has the following options to customize the inputs.
1. Remove duplicates and keep only unique records
2. Consider or ignore rows & columns with NaN (“not-a-number”) values
3. Provide a list of potential columns to exclude (e.g. phone number, PIN code) for faster computation.
These options can be selected/unselected by using the check box(es) under the “Rows and Columns” heading in the left of the screen as shown in the following screenshot.
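The equivalent operations, expressed as a pandas sketch (the file and column names are illustrative, not part of the tool):

```python
import pandas as pd

df = pd.read_csv("applications.csv")                 # placeholder input
df = df.drop_duplicates()                            # keep only unique records
df = df.dropna(how="all")                            # drop rows that are entirely NaN
df = df.dropna(axis=1, how="all")                    # drop columns that are entirely NaN
df = df.drop(columns=["phone_number", "pin_code"])   # exclude listed columns
```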
ID.ai offers the following features:
1. Converts numerical columns to categorical columns and vice versa.
2. Provides a list of columns having fewer than 20 distinct values as likely candidates for treatment as categorical.
3. Indicates likely candidates for dropping from the model build, since they are likely to have no predictive power.
These options are highlighted on the right side of the screen as shown below.
In case the user already has a dataset where the target variables have been defined and would like to use this file for the model build, they can upload it using the “Add Meta Data” option in the top right portion of the screen. With this approach, the ignore-variables option comes pre-populated with likely exclusion candidates.
The procedure to use this file is similar to the normal approach described earlier – using either “Drag and Drop” or “Browse File” options as shown below.
Once the user clicks on “Show Sample”, a small pop-up displays the likely candidates for the role categories as shown below.
If the user would like to view the above in Excel format, they need to click on “Download Sample” and the same will be displayed in Excel format as shown below.
Once the Meta file is uploaded, the key characteristics are displayed on screen as shown below.
Once the user applies the relevant changes by clicking on “Apply Changes”, the summary of these is displayed.
Feature Engineering:
Two-Way Interaction: A two-way interaction shows how the effect of one variable on an outcome changes depending on the level of another variable.
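For example, a two-way interaction between two categorical variables can be captured as a single combined feature, as in this sketch (the column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"product": ["card", "loan"], "region": ["south", "north"]})
# Combine two categorical variables into a single interaction feature.
df["product_x_region"] = df["product"].astype(str) + "_" + df["region"].astype(str)
```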
The user selects variables from the ‘List of Categorical Variables’ and clicks the ‘>’ symbol to move them to the ‘Selected Variables’ list.
The user then selects variables from ‘Selected Variables’ and clicks the ‘Apply’ button to move them to the “New Created Variable” list. Clicking the “Next” button moves on to the “Code-It-Yourself” section.
Code-It-Yourself: Perform feature engineering by writing your own custom Python code. Simply input your code to create or modify features based on your data. This gives you full flexibility to tailor features to your specific needs.
Variable List: Hover over each variable to check the variable type. Double-click on a variable to populate it in the code box immediately.
After clicking the “Compile” and “Execute” buttons, the new variable name is populated in the “List of Engineering Features” table.
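A sketch of the kind of custom code that could be entered in the code box (the column names are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [52000, 38000], "loan_amount": [10000, 8000],
                   "employment_type": ["salaried", "self-employed"]})

df["income_to_loan"] = df["income"] / df["loan_amount"]                # ratio feature
df["log_income"] = np.log1p(df["income"])                              # log transform
df["is_salaried"] = (df["employment_type"] == "salaried").astype(int)  # binary flag
```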
Once the user has selected and customized the input variables, key summary statistics for these are provided in the next screen:
1. For numerical variables the total record count, count of records with missing values, mean, median, skewness and kurtosis values are provided.
2. For categorical variables, the total record count, the count of records with missing values, and the value and count of the most frequent category within each categorical variable are provided.
To get a visual representation of any variable’s distribution, the user can click on the bar chart icon under the Histogram column (blue oval in the screenshot above). The resultant chart provides the lower and upper bound values, the 25th/50th/75th percentile values, as well as any outliers. The range is also split into 10 bins by default, and the count and share in each bin are provided.
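These statistics can be reproduced with pandas, as in this sketch (df denotes the uploaded dataset from the earlier sketches; the variable name is illustrative):

```python
import pandas as pd

num = df["income"]                                    # a numerical variable
print(num.count(), num.isna().sum())                  # total and missing counts
print(num.mean(), num.median(), num.skew(), num.kurt())
print(num.quantile([0.25, 0.50, 0.75]))               # 25th/50th/75th percentiles
print(pd.cut(num, bins=10).value_counts(sort=False))  # counts in the 10 default bins
```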
Variable Transformation:
Clicking on ‘Transform’ opens a pop-up screen displaying Box-Whisker and Histogram charts.
Choosing a transformation type generates new Box-Whisker and Histogram charts for the transformed variable.
Click on “Yes” to retain the transformed variable; otherwise it is not retained. The variable is saved under the default “Transformed Variable name” or a custom variable name; then click on the “Save” button.
For feature engineering of numerical variables, an option is available to transform the variable, as sketched below.
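As an example, a log transformation is one common transformation type (a sketch, with df as the uploaded dataset and the column name illustrative):

```python
import numpy as np

df["income_log"] = np.log1p(df["income"])            # transformed variable, default-style name
print(df["income"].skew(), df["income_log"].skew())  # compare asymmetry before/after
```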
Feature Engineering Categorical Variables
Select Variables Section
Target Variable Selection
The first step is to select the variable that represents the outcome being modeled. ID.ai provides the following:
1. A list of all variables in the dataset which have fewer than 10 distinct values, called “candidate target variables”.
2. The user can select one variable from this list by moving it from the list on the left to the right using the “>” arrow.
3. To reverse the selection, the user can move variables from the right to the left using the “<” arrow. (Both of these are indicated in the blue oval highlighted portion in the screenshot.)
Once the target variable is selected, the next screen takes the user to “Define Target Categories”. This provides a list of all the distinct values within the selected target variable. The user needs to select which among these distinct values will be considered desired outcomes and which will not.
Stratified Sampling
The next step is to split the input dataset into the “train” and “test” samples. The model will be built on the “train” sample and the results will be applied on the “test” sample to check the integrity of the model build. The default is set at a 70/30 split between train and test respectively, although the user has the option to customize the split.
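The default split is equivalent to the following scikit-learn sketch (df and the “target” column are from the earlier sketches; the random seed is illustrative):

```python
from sklearn.model_selection import train_test_split

train, test = train_test_split(
    df, test_size=0.30, stratify=df["target"], random_state=42
)
# Stratifying keeps the target rate (nearly) identical in train and test.
```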
Independent Variables Exclusion
The user can – based on business context – decide to remove one or more independent variables from consideration while building the model. The “variables to be ignored” tab provides the list of all independent variables, and the user can exclude specific ones by using the “>” arrow to move these from the left to the right.
Independent Variable Insights
Users can combine multiple values of each categorical variable into custom value groups based on similar predictive power or business context. The next few steps indicate the procedures for the same.
Categorical Variables
1. In the tab named “Target Rate Insights by Variables”, the following are provided: a) count, b) correlation between the target variable and the independent variable, c) information value, and d) a histogram of this independent variable’s splits and bad rate.
2. If the user needs to combine some of the categories within this independent variable to get fewer categories, they need to click on “Perform Manual Binning”.
3. This leads to the next screen, where the “student” and “premier” categories are assigned the same value of 2 (using the “Key in your splits” column), indicating they are both combined into this single category, whereas the “regular” category is retained as a separate category with a value of 1.
4. The resultant histogram providing the results of this modification is also generated.
5. Once the user is comfortable with this modification, they need to click on “Submit” to save these changes.
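The manual binning above, expressed as a pandas sketch (df is the uploaded dataset; the column name is illustrative):

```python
splits = {"student": 2, "premier": 2, "regular": 1}   # "Key in your splits"
df["cust_type_binned"] = df["cust_type"].map(splits)  # student & premier merged into bin 2
```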
Numerical Variables
For numerical variables, metrics similar to those for categorical variables are provided; the main difference is that three default segments are used as splits for the independent variable.
The user can change the bin sizes and thresholds by keying in each bin’s upper and lower bound values in the “Key in Splits by Comma” box.
Once the user has analysed the results of the custom splits, they can “Confirm and Submit” the changes made.
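Keying in splits by comma is equivalent to this pandas sketch (the bounds and column name are illustrative; df is the uploaded dataset):

```python
import numpy as np
import pandas as pd

bins = [-np.inf, 10000, 25000, np.inf]  # splits keyed in as "10000,25000"
df["income_bin"] = pd.cut(df["income"], bins=bins, labels=[1, 2, 3])
```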
Information Value
Information Value (IV) is a measure of the predictive power of an independent variable on the target variable. The IV thresholds for suspicious, strong, moderate, weak, and not predictive are >2.0, 0.5 to 2.0, 0.1 to 0.5, 0.02 to 0.1, and <0.02 respectively. These are provided in a bar chart representation. By hovering on any bar within the chart, the specific variable’s IV value can be seen.
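For reference, WoE and IV are conventionally computed per binned variable as in this sketch (bins with zero counts would need smoothing in practice):

```python
import numpy as np
import pandas as pd

def information_value(binned: pd.Series, target: pd.Series) -> float:
    tab = pd.crosstab(binned, target)
    dist_good = tab[0] / tab[0].sum()    # share of non-events in each bin
    dist_bad = tab[1] / tab[1].sum()     # share of events in each bin
    woe = np.log(dist_good / dist_bad)   # weight of evidence per bin
    return float(((dist_good - dist_bad) * woe).sum())
```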
Clustering:
This screen allows users to configure the VarClus Clustering Algorithm:
1. Number of Clusters or Variance Retention:
• Specify the desired number of clusters or the proportion of variance to retain.
• The algorithm will stop splitting clusters once either condition is met.
2. IV Threshold:
• Define an Information Value (IV) threshold. Variables falling outside this threshold will be excluded from the clustering process.
3. Clustering Method:
• Choose between clustering on the Weight of Evidence (WoE) of binned variables or on the original values of the variables.
Once the algorithm is executed, this screen displays:
1. Cluster Summary Table (Middle):
• Displays the number of clusters formed and the proportion of variance explained by the first principal component (PC1) of each cluster.
• Users can click on any cluster to manually select or change the variable that represents it.
2. Final Variable Table (Table 2):
• Lists the final set of variables selected for model building. All other variables will be discarded if the user saves the clustering results.
3. Summary Text Box (Right):
• Shows the proportion of variance explained by the final set of variables across the overall dataset.
Click the “Save” button to save the clustering results and proceed with the final variable set. Click the “Skip” button to discard the clustering results and retain the original variable set.
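ID.ai’s VarClus implementation is internal to the tool; as a rough analogue only, the sketch below clusters variables hierarchically on correlation and keeps one representative per cluster (df and numeric_cols are assumed from earlier sketches):

```python
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

corr = df[numeric_cols].corr().abs()               # numeric_cols: candidate variables
dist = 1 - corr                                    # distance = 1 - |correlation|
Z = linkage(squareform(dist.values, checks=False), method="average")
clusters = fcluster(Z, t=5, criterion="maxclust")  # e.g. stop at 5 clusters
# Per cluster, keep the variable with the highest IV and discard the rest.
```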
Correlation and Multicollinearity
This tab provides the correlation values between the target variable and all the independent variables. Using the measure of Variance Inflation Factor (VIF) the variables that fall into very high (VIF > 10), high (VIF between 5 & 10), moderate (VIF between 1.5 & 5) and low (VIF < 1.5) are listed.
The buttons “Top 10 variables most correlated with target” and “Highly correlated variable pairs” provide the respective information in pop-ups as shown below.
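VIF values like these can be computed with statsmodels, as in this sketch (df and numeric_cols are assumed from earlier sketches):

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

X = sm.add_constant(df[numeric_cols])  # add intercept before computing VIF
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(1, X.shape[1])],
    index=X.columns[1:],
)
print(vif[vif > 10])                   # variables with very high multicollinearity
```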
Model Comparison: Summary
– Create up to 3 models and compare them.
– Perform ensembling, compare models with KS & Gini, and select the final model to be used.
Train a Model Section
Model Settings and Root Node
Once all the input data has been finalized, the next stage is to build the model. The first screen provides the record count, target count and target rate for the train and test samples. This root node will always be called Node “ID 0”.
Before starting a model build, the user has the option to customize the following parameters by clicking on the settings symbol on the top right corner of the screen (highlighted using a small blue circle above):
1. Global parameters: node size, maximum split levels, pairwise correlation limits, VIF and IV
2. Score scaling: Base Score, Base Odds and PDO (Points to Double the Odds)
3. Decision tree parameters: Minimum cases (#) or targets (# or %) in a node
4. Logistic model parameters: p-value limit
5. Random Forest and XGBoost parameters
Once the user has input the desired settings, these are saved by clicking on the “Update” and “Save Settings” button on the respective pop-up boxes.
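The score-scaling parameters follow the standard scorecard convention, sketched below (the default values shown are illustrative, not ID.ai’s defaults):

```python
import math

def scale_score(odds: float, base_score=600, base_odds=50, pdo=20) -> float:
    """Scaled score: every PDO points, the good:bad odds double."""
    return base_score + pdo * math.log2(odds / base_odds)

print(scale_score(50))   # 600.0 - the base score at base odds
print(scale_score(100))  # 620.0 - doubling the odds adds exactly PDO points
```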
Once the settings are finalized, the user can click on the root node to get multiple options to either grow the decision tree (“Auto Grow”) or build an AI-based model (run Logistic, Random Forest, or XGBoost). In both cases, the user does not have any control over the resulting tree and segmentation.
Auto Grow
When the “Auto Grow” option is selected, a tree is developed starting from the root node and progressing to subsequent levels based on robust separation within each level and across different levels.
Using the ‘+’ button opens a new screen with the root node of the tree, where different operations can be run based on the user’s requirements.
A summary of all models looks like the screen below.
Based on the user’s final selection (or Model 1, which is selected automatically by default), click on “Evaluate Your Model” for further steps.
If user does not wish to split a particular node into sub-classes, they can go back to that node and click on “collapse node” option.
To insert an additional split within a node, the user can click on “Add your splits” within that node. The user gets the option to choose any variable to be split and then click on “Split node”; the IV values are provided to make an informed decision. Do note that the variables with the highest IVs are placed at the top.
Once the automatically generated splits are available, the user can generate custom splits within the variable by clicking on “Add your splits”. This leads to a dialog box where the user can assign the variable ranges in integer-based groups.
Another way to split is to specify the split points as comma-separated values on the right of the screen. Press “Submit”, and after that click on “Expand Node”.
Run Model.ai
In this option, the user just clicks on the “Run Model.ai” button and the results are automatically generated. The user does not have the option to split or collapse these nodes.
User can click on “Click to view Model.ai output” to see the following results for the node on which model.ai was run:
1. Node details, variable importance, model performance metrics & target rate by score.
2. Logistic regression technical details
3. Scorecard details for that node
4. User can click on “click to see the graph” in the model performance metrics tab to get a visual representation of KS and Gini values for the train and test samples.
1. Random Forest Model Tree
2. Random Forest provides a bar graph of variable importance and a SHAP chart, which can be viewed by clicking on ‘Click to view SHAP Chart’.
3. The user can click on “click to see the graph” in the model performance metrics tab to get a visual representation of KS and Gini values for the train and test samples.
1. XGBoost Model Tree
2. XGBoost provides a bar graph of variable importance and a SHAP chart, which can be viewed by clicking on ‘Click to view SHAP Chart’. It also provides KS and Gini charts.
3. The user can click on “click to see the graph” in the model performance metrics tab to get a visual representation of KS and Gini values for the train and test samples.
Evaluate Your Model
Once the model has been built, the next stage involves analysing the performance of the model. This section provides a description of the various performance metrics available in ID.ai to evaluate the model. The headings in the bullets refer to the buttons within the tool.
1. Model summary by Node – information on all nodes used in the model, with the counts and bad rates in both Train and Test samples.
2. Model summary by score – counts and bad rates for each score bin and raw score for both Train and Test samples.
3. KS & Gini chart – visual representation of KS and Gini values for both Train and Test samples.
4. Model scorecard – this is used to scale the scores for the various nodes based on the predefined base score, base odds, and PDO (Points to Double the Odds).
The user can see the scores generated for the leaf nodes by clicking on the “Score card for Leaf Nodes” button, OR
scores for the AI logistic model node by clicking on the “Score card for Node ID 04” button, OR
Shapley values for the AI Random Forest or XGBoost model node by clicking on the “Shapley values for Node ID 04” button.
Deployment:
The user clicks on “Click to Deploy” to get an “API Endpoint”.
Calling the API endpoint returns a JSON response.
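A sketch of calling such an endpoint (the URL and payload fields are placeholders, not ID.ai’s actual schema):

```python
import requests

resp = requests.post(
    "http://<host>:<port>/score",       # the generated API Endpoint
    json={"income": 52000, "age": 34},  # the model's input variables
    timeout=30,
)
print(resp.json())                      # JSON response containing the score
```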
Build Your Decision
Once the scorecard has been developed, the user can upload an OOT (out-of-time) dataset to simulate the decision based on the previously developed model. Simulation(s) can be done on the overall score or a particular sub-segment.
The first step is to upload a fresh OOT/unseen dataset which should contain ALL the variables in the model build stage. User needs to upload this file from the location to the tool as shown in the screenshot below. (The file specifications are the same as described earlier in Section 4.1.1)
User can choose between three different options before building a decision:
Option A – Reject Inferencing
Reject inference is a method for improving the quality of a model by incorporating data from previously rejected/unavailable records. Bias can result if a credit scorecard model is built only on accepts and does not account for applications that were rejected and hence have an unknown target status.
1. Click on “Choose the variable containing past decisions”.
2. After that, click on “List of Distinct Values”.
3. Then click on “Build Decision”.
The next screen provides the population stability and characteristic stability.
User has the option to choose a distribution option between Overall, Decision Tree Leaf Nodes and the Specific Node using the drop-down menu in “Choose Distribution Option”.
The equivalent drop-down menu option for characteristic stability is available only when Model.ai is run.
Option B – Define Actual Target
This approach enables the original model build dataset’s target variable to be used as the target variable in the newly uploaded dataset also.
1. Click on “Choose the variable containing actual performance” to select the target variable
2. Then click on “List of Distinct Values” to specify which values within the target variables should be considered as the target value
3. Then click on “Build Decision” for validation
The next screen provides the population stability, KS & Gini and characteristic stability.
User has the option to choose a distribution option between Overall, Decision Tree Leaf Nodes and the Specific Node using the drop-down menu in “Choose Distribution Option”.
Population Distribution:
K-S & Gini:
Characteristic stability:
Option C – Without Actual Target
In this approach, the OOT sample dataset does not have the target variable. The target is predicted by applying the model on the OOT dataset and predicting the target variable for each record based on the values of the independent variables.
The next screen provides the population stability and characteristic stability.
User has the option to choose a distribution option between Overall, Decision Tree Leaf Nodes and the Specific Node using the drop-down menu in “Choose Distribution Option”.
The equivalent drop-down menu option for characteristic stability is available only when Model.ai is run.
Cut-off Decision Overall
This tab provides a summary of the scorecard performance when different cut-off thresholds are selected. This overall approach does not allow segmented cut-offs, which will be described in the next section.
This screen allows the user to view the score cut-offs and their impact on two pre-defined outcomes when compared to the existing scenario:
1. Maintaining same target rate
2. Maintaining same approval rate
Users can also choose actual score point values to analyse the impact of this on target and approval rates.
The impact of the above selections can be seen in terms of:
1. Accept/Decline share before and after cut-off selection
2. Bad rates of Accept/Decline before and after cut-off selection
Cut-off Decisions – Segmented
This option enables the user to set tailored cut-off decisions for specific segments by selecting a segmentation variable within individual score bins. This allows for decision making based on distinct groups within the chosen segmentation variable.
The following screenshot shows the distribution of all the cases by score bin.
Let us assume that the user would like to have a segmented cut-off decision using an additional variable named “APP_PROD_CODE” apart from the score. In this scenario, the user selects this variable in the drop-down menu named “Choose segmented variable” available on the right side of the table (highlighted in a blue oval).
The user will next be able to see the count of records within each score bin split into the different categories of the APP_PROD_CODE.
Two options are available to the user for segmented cut-offs.
Option 1:
1. The user can use the segmentation variable as-is, click only on the specific boxes which will be accepted, and leave the remaining boxes blank.
2. Once user clicks on “Apply” after doing the above, the results of this decision will be displayed including the population decline/accept counts and the bad rates.
3. User can do multiple iterations of the accept/decline combinations and every time the impacts will be available.
4. Once satisfied with the segmented cut-off decision, user can click on “Save Decision” to store this decision.
Option 2:
This option is to combine multiple categories of the segmented variable.
1. Click on “segment Binning”.
2. In the ensuing pop-up screen assign the same integer values to those categories of the segmented variable that need to be combined.
3. Click on “Apply”
4. In the subsequent screen, click only on the specific boxes which will be accepted and leave the remaining boxes blank.
5. Once user clicks on “Apply” after doing the above, the results of this decision will be displayed including the population decline/accept counts and the bad rates.
6. User can do multiple iterations of the accept/decline combinations and every time the impacts will be available.
7. Once satisfied with the segmented cut-off decision, user can click on “Save Decision” to store this decision.
Auto Documentation
Once the model build and cut-off decisions have been completed, the final stage is the generation of the “Decision Tree Technical Document”, a comprehensive document that is invaluable for audit trail purposes. The user just needs to click on “Generate document” and an MS-Word version of the document will be available in 2-3 minutes.
While almost the entire document is pre-filled with the relevant information, the following three sections must be filled by the user before dissemination.
1. Executive Summary
2. Data Sources and Sampling
3. Decision Tree Fact Sheet
Since these sections involve the business context and user’s knowledge of the data, these ideally should be filled by the user.
Appendix A – Statistical Terms
1. p-value: Measure that quantifies the evidence against a null hypothesis. A low p-value suggests data is inconsistent with the null, potentially favouring an alternative hypothesis. Common significance thresholds are 0.05 or 0.01.
2. IV (Information Value): A numerical value that quantifies the predictive power of an independent continuous variable x in capturing the binary dependent variable y. IV is helpful for reducing the number of variables as an initial step in preparing for Logistic Regression, especially when there are a large number of potential variables. IV is based on an analysis of each individual independent variable in turn without considering other predictor variables.
3. WOE (Weight of evidence): Closely related to the IV value, WOE measures the strength of each grouped attribute in predicting the desired value of the Dependent Variable.
4. VIF (Variance Inflation Factor): A measure of multicollinearity among the independent variables in a multiple regression model.
5. OOT (Out of Time) Sample: Used to indicate a dataset from a period outside the original model build window; used to validate the accuracy of the model in other time periods.
6. Gini coefficient: Gini coefficient, commonly known as Gini, is a metric widely used to evaluate classification models. It ranges from 0 to 1, with zero representing perfect equality (no discrimination) and one representing perfect inequality (perfect discrimination). In the context of credit risk modelling, a higher Gini coefficient indicates better model performance in terms of its ability to accurately rank borrowers based on their creditworthiness.
7. K-S (“Kolmogorov-Smirnov”) Value: The KS value provides a measure of the discriminatory power of a model. It measures the maximum difference between the cumulative distributions of the positive (goods) and negative (bads) classes over the range of predicted probabilities. A high KS score indicates that the model achieves better separation between the two classes. (A computational sketch of KS and Gini follows this list.)
8. Skewness: Measures the degree of asymmetry of a distribution.
9. Kurtosis: Measures the peakedness of a distribution (the weight of its tails relative to a normal distribution).
10. Base score: The actual score point in the scaled scorecard which gives the base odds for the target variable to go into the desired state.
11. Base odds: The odds of the target variable to go into the desired state at the base score.
12. PDO (Points to Double the Odds): The score point difference needed to double the odds of the target variable reaching the desired state.
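As a computational sketch of the KS and Gini definitions above (the data shown is illustrative):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

y = np.array([0, 0, 1, 0, 1, 1, 0, 1])                  # actual outcomes
p = np.array([0.1, 0.3, 0.7, 0.2, 0.8, 0.6, 0.4, 0.9])  # predicted probabilities

fpr, tpr, _ = roc_curve(y, p)
ks = (tpr - fpr).max()              # max gap between cumulative good/bad distributions
gini = 2 * roc_auc_score(y, p) - 1  # Gini = 2*AUC - 1
print(ks, gini)
```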
Frequently Asked Questions
Q. What are the minimum specifications for a machine to install and run ID.ai?
A. The minimum configuration for a system hosting ID.ai is Windows 10 or higher, >=16 GB RAM, and 200 GB free disk space, with MS Word installed.
Q. Can I save the project in a custom folder other than the default folder path provided in ID.ai?
A. No, currently this facility is not available; it will be enabled in a future version of ID.ai.
Q. Does ID.ai run on Macbooks also?
A. The current version of ID.ai runs only in the Windows environment. Future versions will be Apple OS compatible also.
Q. What statistical technique is used for building the model?
A. Decision trees, logistic regression, Random Forest, and XGBoost (see the Train a Model section).
Q. What should user do if activation is unsuccessful?
A. Please reach out to your company’s system administrator who purchased the license keys from Corestrat or drop a mail to solutions@corestrat.ai with your license key
Q. Where can I find the current project saved?
A. The default location is saved in the default local system path usually having the following location “C:\users\<machinename>\documents\Idai\<projectname>”
Q. Where can I find the current auto-document saved?
A. The default location is saved in the default local system path, usually having the following location: “C:\users\<machinename>\documents\Idai\<projectname>\documents”
Q. How can I start a new project while on one project?
A. In the home screen: Click on Home > All Projects > New Project
Q. How can I delete the project or dataset I uploaded?
A. There is an option to “delete” (trash can icon) under each project in the path above
Q. How is the performance of the model evaluated?
A. See the “Evaluate Your Model” section.
Q. Can I make predictions with new data after training the model?
A. See the “Cut-off Decision Overall” section.
Q. Can I use the model to score new customers or cases?
A. See the “Cut-off Decisions – Segmented” section.
Q. Can I save and export my trained model?
A. Yes. The trained model is saved in the default folder path provided in ID.ai as a file in ‘.pkl’ format.
Q. Can I generate scorecards from the trained model?
A. Yes. The scorecard generated from the trained model is available in the ‘Model Scorecard’ tab of the ‘Evaluate Your Model’ section.
Q. Does ID.ai automatically detect the target variable?
A. ID.ai suggests a list of potential target variables; the user has the option to accept one of those or use another target variable. (See the “Target Variable Selection” section.)
Q. Can the user select the features to be included in the model?
A. Yes. See the “Independent Variables Exclusion” section.
Q. Where can I get the template for metadata?
A. Under “Data Preprocessing” tab, on the top right use the button named “Add Meta Data”
Version History & Feedback
Version History:
Version No. | Date | Changes |
1 | 17-Sep-24 | First version |
Feedback: Please reach out to us at solutions@corestrat.ai for any questions and issues.
Our office address is
# LGF, Tower ‘B’, Diamond District, Old HAL airport Road, Domlur, Bangalore 560008