Case Study
Using AI to Automate Feature Analysis for Navy Training Assessments
CRESST has developed a framework for the U.S. Navy that pairs AI-generated assessment drafts with expert review, enabling higher-quality job training assessments to be created more quickly and cost-effectively.
90+
enlisted job ratings requiring performance assessments
87%
cross-model agreement on core features
91%
of AI-proposed features on-target and aligned with tasks
The Client
The work was conducted for the U.S. Navy, which designs performance assessments across 90+ enlisted ratings, each with distinct tasks and competency requirements.
Services
In a pioneering project with the U.S. Navy, UCLA CRESST investigated whether large language models (LLMs) could augment the critical process of feature analysis. This process, fundamental to assessment design, involves qualitatively identifying the cognitive and task demands required for successful performance. The core objective was to determine whether modern AI could serve as an effective assistant to human experts in this complex, analytical work.

To explore this, the team developed and refined prototype workflows using commercial LLMs. The models read detailed task descriptions and generated initial proposals for candidate features, such as information analysis, documentation, and systems use. A significant focus was placed on iterative prompt engineering, incorporating clear definitions, constraints, and few-shot examples to improve the precision and coverage of the model’s output (a prompt sketch follows the summary list below). Crucially, the project maintained a “human-in-the-loop” philosophy: the LLM’s suggestions were treated strictly as decision support to inform and accelerate the work of subject-matter experts (SMEs), who retained full responsibility for curating and validating all final features.
- Prototyped workflows using commercial LLMs to read task descriptions and propose candidate features (e.g., information analysis, documentation, systems use).
- Iterated prompts (clear definitions, constraints, and few-shot examples) to improve precision and coverage.
- Treated model output as decision support, not a verdict: subject-matter experts (SMEs) curated and validated all suggestions.
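To make the prompt-engineering step concrete, the sketch below shows one way a feature-analysis prompt could be assembled, assuming a generic chat-style LLM. The candidate features, definitions, few-shot example, and the `query_llm` placeholder are illustrative assumptions, not the project’s actual materials or tooling.

```python
# Minimal sketch: assembling a feature-analysis prompt with definitions,
# constraints, and a few-shot example. Feature names, definitions, and the
# example tasks are illustrative placeholders, not the actual Navy materials.

CANDIDATE_FEATURES = {
    "information analysis": "Interpreting data, readings, or documents to reach a conclusion.",
    "documentation": "Recording actions, results, or status in the required format.",
    "systems use": "Operating digital or physical systems to complete the task.",
}

FEW_SHOT_EXAMPLE = (
    "Task: Review a travel claim for completeness and compute the reimbursable amount.\n"
    "Features: information analysis; documentation\n"
    "Justification: The reviewer interprets receipts against policy (information analysis) "
    "and records the determination on the claim form (documentation)."
)

def build_prompt(task_description: str) -> str:
    """Compose definitions, constraints, and a few-shot example into one prompt."""
    definitions = "\n".join(f"- {name}: {desc}" for name, desc in CANDIDATE_FEATURES.items())
    return (
        "You are assisting with feature analysis for a performance assessment.\n"
        f"Candidate features and definitions:\n{definitions}\n\n"
        "Constraints:\n"
        "- Select only features from the list above.\n"
        "- Give a one-sentence justification for each selected feature.\n"
        "- Answer in the format 'Features: ...' then 'Justification: ...'\n\n"
        f"Example:\n{FEW_SHOT_EXAMPLE}\n\n"
        f"Task: {task_description}\n"
    )

def query_llm(prompt: str) -> str:
    """Placeholder for a call to a commercial LLM provider's chat endpoint."""
    raise NotImplementedError("Wire this to the LLM provider of your choice.")

if __name__ == "__main__":
    prompt = build_prompt("Trace a simulated fire boundary and report status to the supervisor.")
    print(prompt)
    # response = query_llm(prompt)  # responses are triaged by SMEs, not accepted outright
```

In the project workflow, each model response was then triaged by SMEs against a human-defined rubric rather than adopted as-is.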
Research Questions
Traditional feature analysis is slow and costly, especially for the Navy, which is responsible for generating performance assessments across 90+ enlisted job ratings. We asked whether LLMs could:
- Accelerate early-stage analysis for new or revised assessments,
- Increase consistency in how features are documented across ratings, and
- Reduce costs while keeping SMEs in the loop to protect validity.
Feature Analysis: A More Complete Definition
Feature Analysis is a procedure used to qualitatively describe the characteristics of assessment and instructional tasks across content domains (or, in the case of the Navy, ratings). It is used both to confirm that design specifications were met (qualitative validity) and, in conjunction with outcome data, to identify the features that affect performance. The analysis may include attributes such as those found in the CRESST Training Assessment Framework (TAF), including types and components of cognitive demands (e.g., components of problem solving), linguistic elements, and estimated difficulty, documented in a standardized format. Feature Analysis results provide useful indicators for understanding training performance. A complete analysis typically covers the areas below (a sketch of a standardized record follows the list):
- Context & Conditions: operational setting, situation, scenario; tools or systems used; available references and resources; and other environmental conditions and constraints.
- Knowledge, Skills, and Cognitive Processes: domain knowledge, procedures, heuristics, and levels of reasoning (e.g., recall vs. analysis vs. judgment).
- Actions & Interactions: solo vs. team coordination, handoffs, communications, and use of digital or physical systems.
- Complexity & Difficulty Drivers: cue ambiguity, time pressure, information load, branching paths, and error likelihood.
- Standards & Criteria: what “good” looks like (e.g., observable behaviors, product/process evidence, accuracy thresholds, and tolerances).
- Fairness & Fidelity Considerations: minimizing construct-irrelevant variance, ensuring accessibility, and matching the realism needed to elicit valid evidence.
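To illustrate what such a standardized record might look like in practice, here is a minimal sketch of a feature-analysis entry organized around the categories above. The field names and sample values are assumptions made for illustration, not the TAF’s actual schema.

```python
# Minimal sketch of a standardized feature-analysis record, organized around the
# categories listed above. Field names and sample values are illustrative
# assumptions, not the actual TAF schema.
from dataclasses import dataclass, field

@dataclass
class FeatureAnalysisRecord:
    task: str
    context_and_conditions: list[str] = field(default_factory=list)      # setting, tools, references
    knowledge_skills_cognition: list[str] = field(default_factory=list)  # knowledge, procedures, reasoning level
    actions_and_interactions: list[str] = field(default_factory=list)    # solo/team, handoffs, systems used
    complexity_drivers: list[str] = field(default_factory=list)          # ambiguity, time pressure, info load
    standards_and_criteria: list[str] = field(default_factory=list)      # observable behaviors, tolerances
    fairness_and_fidelity: list[str] = field(default_factory=list)       # accessibility, realism, construct relevance

record = FeatureAnalysisRecord(
    task="Review a travel claim and compute the reimbursable amount",
    context_and_conditions=["office setting", "pay/personnel system", "travel policy references"],
    knowledge_skills_cognition=["policy knowledge", "procedural steps", "analysis of receipts"],
    actions_and_interactions=["solo task", "data entry in a digital system"],
    complexity_drivers=["incomplete receipts", "information load"],
    standards_and_criteria=["correct amount within tolerance", "complete documentation"],
    fairness_and_fidelity=["realistic claim documents", "minimal construct-irrelevant reading load"],
)
print(record.task, "->", record.knowledge_skills_cognition)
```

Keeping every task’s analysis in one consistent structure is what makes comparisons across ratings, and checks against outcome data, straightforward.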
Implementation Using AI - Approach at a Glance
- Supplied the AI with Grounded Task Examples: We provided the AI with concrete tasks associated with various jobs (e.g., a Personnel Specialist reviewing travel claims) across three Navy enlisted ratings: Damage Controlman, Personnel Specialist, and Fire Controlman. For each task (two per rating), the AI proposed the set of features it judged most relevant.
- Prompt Engineering: We crafted instructions provided to the AI (e.g., formatting, positive/negative examples, required justifications) to ensure high-quality outputs.
- SME Triage: Subject matter experts (SMEs) reviewed the AI-generated outputs and rapidly accepted, edited, or rejected suggestions using a human-defined rubric.
- Reliability Checks: We compared agreement across LLMs and prompt variants to gauge consistency and accuracy (see the agreement-scoring sketch below).
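As a rough illustration of this kind of reliability check, the sketch below scores agreement between the feature sets proposed by two models (or two prompt variants) for the same task using simple set overlap. The feature sets are made-up examples, and the project’s actual comparison procedure may have differed.

```python
# Minimal sketch: scoring cross-model agreement on proposed feature sets with
# Jaccard overlap. The feature sets below are made-up examples; the project's
# actual agreement metric may have differed.

def jaccard_agreement(features_a: set[str], features_b: set[str]) -> float:
    """Share of all proposed features that both sources proposed."""
    if not features_a and not features_b:
        return 1.0
    return len(features_a & features_b) / len(features_a | features_b)

# Hypothetical outputs for one task from two models (or two prompt variants).
model_a = {"information analysis", "documentation", "systems use"}
model_b = {"information analysis", "documentation"}

print(f"Agreement: {jaccard_agreement(model_a, model_b):.0%}")  # Agreement: 67%
```

Averaging per-task scores like this across tasks and model pairs yields a simple consistency indicator of the kind summarized under Impact, though the reported figures may rest on a different rubric.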
Impact
- Demonstrated Feasibility: LLMs were consistently found (91% of the time) to produce plausible, on-target features aligned with SME expectations for the tasks.
- Time & Cost Potential: As a first-pass generator, LLM support reduces assessment development time for designers and concentrates SME efforts on validation rather than drafting.
- Consistency Gains: Cross-model comparisons yielded stable identification of core features (87% agreement).
- Human-in-the-Loop Preserved: The most effective pattern was an LLM first draft followed by SME curation, balancing speed with rigor while protecting assessment validity.