Defining text analysis sampling
This is the second step of the Create text analysis models wizard. In this step, you upload a file containing the data that the system uses to train the text analysis model and determine its accuracy. The training data file consists of records. Each record is a text unit (for example, a tweet or a Facebook comment) with associated data, such as the expected result and the type of the record. The system classifies the records as training samples and test samples. The training samples are used to create and train the text analysis model. The test samples are used to validate the accuracy of the model that the system created.
- Click Choose File and, from your directory, select the .CSV, .XLS, or .XLSX file that contains the training data.
Note: The training file must contain at least ten records. To ensure the best accuracy of the text analysis, the file should contain the following columns (each column represents a distinct type of information about each record):
- Content - It contains the text units (for example, sample Facebook comments, tweets, emails, and so on) for analysis.
- Result - It contains the expected training result of the analysis (for example, the category a given text unit should be assigned to) for each record. The expected results that you specify in this column must match the expected results in the .CSV file associated with the taxonomy that you selected in the first step of the wizard.
- Type - It indicates whether a particular record belongs to the test or training sample. If you want a given record to be identified as part of the test sample, you must enter test in the type field of that record. If you want the system to identify a specific record as part of the training sample, you can leave the type field of that record blank. This column is not mandatory.
Note: You can download the data source template .XLSX file in which you can place the training data for the analysis.
- For the classification analysis: Select Use taxonomy data for training models if you want to include the taxonomy data in the training sample.
- Define the training sample details:
- Click Using 'type' column as percentage to include in the test sample all records from the training data file whose type is specified as test. The system uses the remaining records as the training sample. The number of records identified as part of the test sample must not exceed 70% of the total number of records.
- Click Using custom percentage to define a custom percentage of records to be included in the training sample. The training sample must comprise at least 70% of the total number of records. Select this option when the training data file does not contain the type column or when you want the system to randomly select records for the training sample.
- Click Generate preview.
- In the PREVIEW SAMPLING section, review the Training sample and the Test sample tabs generated for the model. The preview shows up to ten results of the analysis.
- Click Next.
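Before you upload the file, it can help to check it locally and see how each sampling option would divide the records. The following Python sketch is only an illustration of the rules described in this topic, not the system's own implementation. The function names, the treatment of Content and Result as required columns, the case-insensitive match on test, the rounding up of the training sample size, and the Complaint result value are all assumptions made for the example. The sketch does not enforce the size limit on the test sample for the type-based option.

```python
import csv
import io
import random

MIN_RECORDS = 10           # the training file must contain at least ten records
MIN_TRAINING_PERCENT = 70  # a custom training sample must be at least 70% of all records


def load_records(csv_text):
    """Parse CSV text with Content and Result columns and an optional Type column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Assumption: Content and Result are treated as required in this sketch.
    missing = {"Content", "Result"} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    records = list(reader)
    if len(records) < MIN_RECORDS:
        raise ValueError(f"need at least {MIN_RECORDS} records, got {len(records)}")
    return records


def is_test(record):
    """A record is a test record when its Type field is 'test'; blank means training."""
    # Assumption: surrounding whitespace and letter case are ignored.
    return (record.get("Type") or "").strip().lower() == "test"


def split_by_type(records):
    """Mimic the 'type' column option: marked records are test, the rest are training."""
    training = [r for r in records if not is_test(r)]
    test = [r for r in records if is_test(r)]
    return training, test


def split_by_percentage(records, training_percent, seed=None):
    """Mimic the custom percentage option with a random selection of training records."""
    if not MIN_TRAINING_PERCENT <= training_percent <= 100:
        raise ValueError("training_percent must be between 70 and 100")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    # Round up so the training share never falls below the requested (integer) percentage.
    cut = (len(shuffled) * training_percent + 99) // 100
    return shuffled[:cut], shuffled[cut:]


# Usage: ten hypothetical records, the first three marked as test.
rows = ["Content,Result,Type"]
rows += [f"Sample comment {i},Complaint,{'test' if i < 3 else ''}" for i in range(10)]
records = load_records("\n".join(rows))

training, test = split_by_type(records)
print(len(training), len(test))  # prints "7 3"
```

Passing a fixed seed to split_by_percentage makes the random selection repeatable, which is convenient when you want to compare the result with the preview more than once.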