Model Evaluation
The platform currently supports baseline evaluation. Baseline evaluation assesses the general capabilities of models using preset datasets (such as CMMLU, MMLU, and C-Eval) and custom datasets. Preset datasets provide a widely recognized standard for testing a model’s knowledge and reasoning abilities across various disciplines, while custom datasets allow users to evaluate specific tasks or domains. Baseline evaluation results can be used to compare model performance and gauge the effectiveness of model optimization.
Creating an Evaluation Task
Click the “Model Service Platform UModelVerse” product, open “Model Evaluation” in the function menu, and click “Create Evaluation Task.”
Evaluation Settings
- Evaluation Method: The default is currently “Baseline Evaluation.”
- Choose Dataset Source: “Preset Dataset” or “Custom Dataset.”
- If “Preset Dataset” is chosen, select an appropriate dataset from CMMLU, MMLU, C-Eval, with dataset characteristics as follows:
| Item | CMMLU | MMLU | C-Eval |
|---|---|---|---|
| Definition | A comprehensive Chinese evaluation benchmark designed to assess the knowledge and reasoning abilities of language models in a Chinese context. | A large-scale multi-task language understanding benchmark intended to evaluate the knowledge acquired by models through zero-shot and few-shot settings. | A comprehensive Chinese foundational model evaluation suite consisting of 13,948 multiple-choice questions covering 52 different subjects and four difficulty levels. |
| Discipline Coverage | 67 topics, covering natural sciences, social sciences, engineering, humanities, and common sense. | 57 subjects, including STEM, humanities, social sciences, etc. | 52 sub-classes, covering STEM, social sciences, humanities, and other fields. |
| Question Type | Single-choice questions | Multiple-choice questions | Multiple-choice questions |
| Applicable Language | Chinese | English | Chinese |
| Features | Includes China-specific content, such as “Chinese Cuisine Culture,” “Ethnic Studies,” “Chinese Driving Rules,” etc. | The test content covers both world knowledge and problem-solving abilities. | A comprehensive testing mechanism for Chinese models. |
- If “Custom Dataset” is chosen, select a dataset previously created in “Dataset Management” from the dropdown (an illustrative sample format is sketched after this list). If no custom dataset has been created, the system will guide you to the “Dataset Management” page to create one.
- Choose Evaluation Model: A baseline model must be selected; a comparison model can optionally be chosen for performance comparison. Both the baseline and comparison models can be selected from “Preset Model” or “My Model,” and the two types can be mixed.
- Storage Settings: Select an authorization token associated with the target bucket from the token list. Currently, only storage space in North China Region II is supported.
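The required schema for custom datasets is defined on the “Dataset Management” page and is not reproduced here. As a rough illustration only, using hypothetical field names (“prompt”, “reference”), each sample would pair a model input with the standard answer that the ROUGE/BLEU metrics described later are compared against:

```python
import json

# Hypothetical example only: the actual schema is defined in "Dataset Management".
# The field names "prompt" and "reference" are assumptions used for illustration.
samples = [
    {"prompt": "Summarize the passage in one sentence: ...",
     "reference": "The passage argues that ..."},
    {"prompt": "Translate to English: 今天天气很好。",
     "reference": "The weather is nice today."},
]

# One JSON object per line (JSONL) is a common packaging for evaluation sets.
with open("custom_eval_dataset.jsonl", "w", encoding="utf-8") as f:
    for sample in samples:
        f.write(json.dumps(sample, ensure_ascii=False) + "\n")
```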
Managing Tasks and Viewing Reports
After creating a task, you can view the task management list. Available actions for the tasks include: viewing reports, terminating, and deleting.
View Reports
- For preset datasets, reports include metric scores and visual radar charts.
- For custom datasets, reports include metric scores, radar charts, and detailed single-sentence scores. When viewing evaluation details, the platform supports filtering and sorting the single-sentence scores of each model.
About Evaluation Report Metrics
- Preset Dataset Evaluation Metrics:
| Score Item | Description |
|---|---|
| STEM | The model’s ability to handle science, technology, engineering, and mathematics subjects. |
| Social Sciences | The model’s ability to handle social science subjects. |
| Humanities | The model’s ability to handle humanities subjects. |
| Other | The model’s ability to handle other subjects. |
| Average | Weighted composite of the category scores above. |
- Custom Dataset Evaluation Metrics:
| Score Item | Description |
|---|---|
| ROUGE-1 | (Ignoring stop words) Recall computed by splitting the model-generated result and the standard (reference) result into unigrams. |
| ROUGE-2 | (Ignoring stop words) Recall computed by splitting the model-generated result and the standard result into bigrams. |
| ROUGE-L | (Ignoring stop words) Recall computed from the longest common subsequence between the model-generated result and the standard result. |
| BLEU | (Ignoring stop words) A measure of how closely model-generated sentences match the standard sentences, computed as a weighted average over unigram, bigram, trigram, and 4-gram matches. |
Notes:
Ⅰ) Unigram: each word in a sentence or text is treated as a basic unit, disregarding word order.
Ⅱ) Bigram: each pair of adjacent words in a sentence or text is treated as a basic unit, describing the order relationship between two words.
Ⅲ) Trigram: each group of three adjacent words in a sentence or text is treated as a basic unit, describing the order relationship among three words.
Ⅳ) 4-gram: each group of four adjacent words in a sentence or text is treated as a basic unit, describing the order relationship among four words.
Ⅴ) Longest common subsequence: the longest subsequence that appears in each of two or more strings, with the order preserved.
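For intuition, here is a minimal sketch of how such scores can be computed. It is not the platform’s implementation: the tokenizer, stop-word filtering, and BLEU weighting or smoothing the platform uses are not documented here, so the sketch assumes simple whitespace tokenization, no stop-word removal, and a uniform average of 1- to 4-gram precision for BLEU.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rouge_n_recall(candidate, reference, n):
    """ROUGE-N: clipped n-gram overlap divided by the number of reference n-grams (recall)."""
    ref_counts = Counter(ngrams(reference, n))
    cand_counts = Counter(ngrams(candidate, n))
    overlap = sum(min(cand_counts[g], count) for g, count in ref_counts.items())
    total = sum(ref_counts.values())
    return overlap / total if total else 0.0

def lcs_length(a, b):
    """Length of the longest common subsequence (same order, not necessarily contiguous)."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l_recall(candidate, reference):
    """ROUGE-L recall: LCS length divided by the reference length."""
    return lcs_length(candidate, reference) / len(reference) if reference else 0.0

def bleu_uniform(candidate, reference, max_n=4):
    """Simplified BLEU: uniform average of 1- to 4-gram precision.
    (Standard BLEU uses a geometric mean plus a brevity penalty; both are omitted here.)"""
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(candidate, n))
        ref_counts = Counter(ngrams(reference, n))
        overlap = sum(min(count, ref_counts[g]) for g, count in cand_counts.items())
        total = sum(cand_counts.values())
        precisions.append(overlap / total if total else 0.0)
    return sum(precisions) / max_n

# Toy example with whitespace tokenization and no stop-word filtering.
reference = "the quick brown fox jumps over the lazy dog".split()
generated = "the quick brown fox leaped over a lazy dog".split()
print("ROUGE-1:", round(rouge_n_recall(generated, reference, 1), 3))
print("ROUGE-2:", round(rouge_n_recall(generated, reference, 2), 3))
print("ROUGE-L:", round(rouge_l_recall(generated, reference), 3))
print("BLEU   :", round(bleu_uniform(generated, reference), 3))
```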
Task Termination and Deletion
- Termination: A task can be terminated at any time while it is queued or in progress.
- Deletion: Tasks that have failed, completed, or been terminated can be deleted.
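As a quick summary of the two rules above, the following sketch shows which action applies to which task state. The status names are assumptions for illustration; the platform’s actual status labels may differ.

```python
# Assumed status names for illustration; the platform's actual labels may differ.
TERMINABLE = {"queued", "running"}                 # terminate: queued or in-progress tasks
DELETABLE = {"failed", "completed", "terminated"}  # delete: finished tasks

def allowed_actions(status: str) -> list[str]:
    """Return the management actions available for a task in the given state."""
    actions = []
    if status in TERMINABLE:
        actions.append("terminate")
    if status in DELETABLE:
        actions.append("delete")
    return actions

print(allowed_actions("running"))     # ['terminate']
print(allowed_actions("terminated"))  # ['delete']
```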