Constructing a sample
A sample is a subset of historical data that you extract by applying a selection or sampling method to the data source. Constructing a sample produces the development, validation, and test data sets that you use for analysis and modeling.
In the Data preparation step, in the Sample construction workspace, from the Select the weight field if present drop-down list, select an available weight field. Typically, a weight field is available when you sampled the data before using it in the Prediction Studio portal. If you do not specify a weight field, each case counts as one.
In the Select the fields to sample grid, specify the fields you want to include in the sample:
In the Type column, select a field type from the drop-down list. Select the Not used type for fields that you want to exclude from the sample.
In the Description column, enter a field definition.
In the User defined field, type a new name for a field.
Select a sampling method:
If you want to sample a simple proportion of cases, select the Uniform sampling option.
This method fills the sample table with a random selection of records from the source. The probability of selection is set to achieve the specified percentage or number of cases.
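The idea behind uniform sampling can be sketched in a few lines of Python. This is a minimal illustration, not the Prediction Studio implementation: the function name uniform_sample and its parameters are hypothetical, and the sketch draws a fixed-size sample without replacement, whereas the product might select each record independently with a matching probability.

```python
import random

def uniform_sample(records, percentage=None, count=None, seed=None):
    """Return a random selection of records (hypothetical helper).

    Exactly one of `percentage` (0-100) or `count` must be given.
    """
    if (percentage is None) == (count is None):
        raise ValueError("Specify exactly one of percentage or count")
    rng = random.Random(seed)  # a seed makes the selection repeatable
    if count is None:
        # Convert the requested percentage into a number of cases.
        count = round(len(records) * percentage / 100)
    count = min(count, len(records))  # cannot sample more than the source holds
    # Every record has the same chance of being selected.
    return rng.sample(records, count)

source = list(range(1000))
sample = uniform_sample(source, percentage=10, seed=1)
print(len(sample))
```

Because every record has the same chance of selection, the sample keeps roughly the same mix of outcomes as the source.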
If you want to sample a different proportion of each value for the selected field (stratum) that represents the behavior to be predicted, perform the following actions:
- Select the Stratified sampling option.
- From the Stratum field drop-down list, select the field you want to sample.
- In the table with stratum values, in the Ratio column, set the proportion of population cases to source records.
- In the Sample percentage column, enter the percentage of records that you want to sample.
A population is a group of cases with known behavior that is consistent with the group of cases whose behavior you want to predict. You use the population to extract data samples for modeling and validation.
This method fills the sample table with a random selection of records from each value (class) of the stratum field.
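Stratified sampling can be illustrated with a short Python sketch. It is only an approximation of the behavior described above: the function stratified_sample, the churn field, and the percentages are hypothetical, and the Ratio column is not modeled here.

```python
import random
from collections import defaultdict

def stratified_sample(records, stratum_field, percentages, seed=None):
    """Sample a different percentage of records for each stratum value.

    `percentages` maps a stratum value to a sample percentage (0-100).
    Values that are missing from the mapping are not sampled.
    """
    rng = random.Random(seed)
    # Group the source records by the value of the stratum field.
    by_value = defaultdict(list)
    for record in records:
        by_value[record[stratum_field]].append(record)
    sample = []
    for value, group in by_value.items():
        pct = percentages.get(value, 0)
        k = min(len(group), round(len(group) * pct / 100))
        # Random selection within each stratum value.
        sample.extend(rng.sample(group, k))
    return sample

# Hypothetical data: 50 rare "yes" outcomes and 1000 common "no" outcomes.
source = [{"id": i, "churn": "yes" if i < 50 else "no"} for i in range(1050)]
sample = stratified_sample(source, "churn", {"yes": 100, "no": 10}, seed=3)
print(len(sample))
```

Keeping all of the rare outcomes while sampling only a fraction of the common ones is a typical reason to choose stratified over uniform sampling.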
In the Hold-out sets section, define the sample percentage that you want to use for development, validation, and testing:
- To divide cases among the sets, select the Setting percentages for each set option.
- To divide cases according to the values of a field, select the User defined field option.
Select a field from the data source to assign the records with the same value to one hold-out set. For example, you can place family members from the same household into one hold-out set. Family members might have similar profiles, which can hide overfitting during validation if the members are spread across different hold-out sets. The type of hold-out set is selected at random.
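The grouping behavior of the User defined field option can be sketched as follows. This is an illustrative sketch under stated assumptions, not the product's algorithm: the function assign_hold_out_sets, the household field, and the 60/20/20 weights are all hypothetical.

```python
import random

SETS = ("development", "validation", "test")

def assign_hold_out_sets(records, group_field, percentages=(60, 20, 20), seed=None):
    """Assign records to hold-out sets so that records sharing the same
    `group_field` value always end up in the same set."""
    rng = random.Random(seed)
    assignment = {}  # group value -> name of the chosen hold-out set
    result = {name: [] for name in SETS}
    for record in records:
        key = record[group_field]
        if key not in assignment:
            # The set for a new group is chosen at random (weighted here
            # by the assumed percentages).
            assignment[key] = rng.choices(SETS, weights=percentages, k=1)[0]
        result[assignment[key]].append(record)
    return result

# Hypothetical data: 100 households with three members each.
customers = [{"id": i, "household": i // 3} for i in range(300)]
sets = assign_hold_out_sets(customers, "household", seed=4)
print({name: len(rows) for name, rows in sets.items()})
```

Because whole groups are assigned rather than single records, the resulting set sizes only approximate the requested percentages.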
Confirm the sample construction by clicking Next.