Evaluating Generalist LLMs on Legal Knowledge Tasks: A Comprehensive Benchmark Integrating Bloom’s Taxonomy for IT Regulations

Abstract: This paper introduces a comprehensive benchmark designed to rigorously evaluate general-purpose Large Language Models (LLMs) in legal contexts, with a particular emphasis on regulatory compliance in Information Technology (IT). Employing Bloom’s taxonomy as a structural framework, we systematically organize evaluation tasks across six cognitive dimensions: memorization, understanding, applying, analyzing, evaluating, and creating. The benchmark features a meticulously constructed dataset containing multiple-choice questions (MCQs) and innovative open-ended analytical tasks tailored for legal analysis. Furthermore, we outline the creation of an associated training dataset using this structured approach to improve the practical performance of LLMs, initially focusing on IT regulations and subsequently extending to broader regulatory domains.

  1. Introduction:

Effective legal reasoning and regulatory compliance are essential yet cognitively demanding areas where precision and comprehensive understanding are paramount. Existing benchmarks, such as LEXam (2025), LegalBench (2023), LLeQA (2023), and LexEval (2024), have contributed significantly to assessing legal reasoning capabilities but typically lack detailed cognitive structuring or domain-specific regulatory contexts. Historically, Bloom’s taxonomy has been instrumental in structuring cognitive assessments, providing a systematic framework to evaluate a range of cognitive tasks essential for nuanced legal reasoning. Extending this cognitive framework explicitly into IT regulatory scenarios addresses existing gaps by providing a detailed and structured assessment method specifically tailored for evaluating generalist LLM capabilities.

  2. Related Work:

Previous influential benchmarks include LEXam (2025), LegalBench (2023), LLeQA (2023), and LexEval (2024).

Our benchmark synthesizes these approaches and introduces specialized IT regulatory tasks, effectively combining cognitive rigor with practical regulatory relevance.

  3. Dataset Construction:

Our dataset integrates Bloom’s taxonomy, meticulously structured to evaluate various cognitive abilities within legal contexts, especially focusing on Information Technology (IT) regulations. The dataset comprises three principal question types designed to thoroughly evaluate LLM capabilities:

  3.1. Multiple-Choice Questions (MCQs):

MCQs are carefully crafted to assess foundational cognitive skills such as memorization and basic comprehension of legal texts and regulatory standards. Questions explicitly target recall accuracy, terminology clarity, and immediate interpretive abilities.
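As an illustration only, a minimal scoring sketch for the MCQ subset could look like the following; the item fields, option labels, and exact-match rule are assumptions made for this sketch rather than the benchmark’s actual schema.

```python
from dataclasses import dataclass

@dataclass
class MCQItem:
    question: str
    options: dict[str, str]  # e.g. {"A": "...", "B": "...", "C": "...", "D": "..."}
    answer_key: str          # gold option label, e.g. "B"

def mcq_accuracy(items: list[MCQItem], predictions: list[str]) -> float:
    """Exact-match accuracy over predicted option labels (case-insensitive)."""
    if not items:
        return 0.0
    correct = sum(
        pred.strip().upper() == item.answer_key.strip().upper()
        for item, pred in zip(items, predictions)
    )
    return correct / len(items)
```

Exact match on the option label keeps the lower cognitive levels (memorization and basic comprehension) cheap to score automatically.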

  3.2. General Open-Ended Questions:

These tasks require LLMs to demonstrate intermediate cognitive skills, including interpretation, analysis, and application of broader legal and regulatory principles. General open-ended questions typically present scenarios without sector-specific constraints, assessing the model’s ability to reason and articulate legal arguments clearly and accurately.

  3.3. Domain-Specific Analytical Questions:

These tasks represent the core of our innovative dataset, structured to rigorously evaluate advanced cognitive skills such as deep analysis, critical evaluation, and creative synthesis within specific regulatory contexts (initially IT regulation). Each task explicitly aligns with Bloom's taxonomy:

Example Task:

  4. Scoring Methodology and Metrics:

Each task type employs a scoring methodology and metrics suited to its format and cognitive level, detailed in the experimental setup below.

  5. Experimental Setup:

Our experimental setup adopts a rigorous evaluation framework that combines customized metrics aligned explicitly with task types and cognitive levels defined by Bloom’s taxonomy. Additionally, we incorporate metrics specific to the inherent nature of AI computational tasks, namely Generation, Extraction, Single-Label Classification (SLC), and Multi-Label Classification (MLC).
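To make this routing concrete, the sketch below shows one way an evaluation harness could tag each benchmark item with its Bloom level and task nature and dispatch it to a matching scorer; the field names and registry mechanism are illustrative assumptions, not a description of our actual pipeline.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class BenchmarkItem:
    prompt: str
    bloom_level: str   # e.g. "memorization", "applying", "evaluating"
    task_nature: str   # "generation" | "extraction" | "slc" | "mlc"
    reference: Any     # gold answer; its type depends on the task nature

# Registry mapping task nature -> scoring function(prediction, reference) -> score in [0, 1]
SCORERS: dict[str, Callable[[Any, Any], float]] = {}

def register(nature: str) -> Callable:
    def wrap(fn: Callable[[Any, Any], float]) -> Callable[[Any, Any], float]:
        SCORERS[nature] = fn
        return fn
    return wrap

def evaluate(items: list[BenchmarkItem], predictions: list[Any]) -> dict[str, float]:
    """Mean score per Bloom level, using the scorer registered for each item's task nature."""
    per_level: dict[str, list[float]] = {}
    for item, prediction in zip(items, predictions):
        score = SCORERS[item.task_nature](prediction, item.reference)
        per_level.setdefault(item.bloom_level, []).append(score)
    return {level: sum(scores) / len(scores) for level, scores in per_level.items()}
```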

  5.1. Customized Metrics by Task Type and Cognitive Level:

To effectively evaluate LLM performance across diverse legal cognitive tasks, we introduce metrics tailored to each cognitive dimension and task format, ranging from recall accuracy and terminological precision at the lower cognitive levels to the ability to construct legal arguments and assess compliance risks at the higher levels.

  5.2. Metrics by AI Task Nature:

To further refine evaluations, we incorporate metrics categorized according to the intrinsic nature of the AI computation involved in each task: Generation, Extraction, Single-Label Classification (SLC), and Multi-Label Classification (MLC).
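The functions below sketch common default metrics for three of these task natures (exact label match for SLC, set-based F1 for MLC, and token-overlap F1 for Extraction); they are assumptions standing in for our full metric definitions and could, for instance, be plugged into the scorer registry sketched earlier.

```python
def slc_accuracy(prediction: str, gold: str) -> float:
    """Single-label classification: exact (case-insensitive) label match."""
    return float(prediction.strip().lower() == gold.strip().lower())

def mlc_f1(predicted: set[str], gold: set[str]) -> float:
    """Multi-label classification: F1 between predicted and gold label sets."""
    if not predicted and not gold:
        return 1.0
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0

def extraction_token_f1(predicted_span: str, gold_span: str) -> float:
    """Extraction: token-level overlap F1 between predicted and gold spans."""
    pred_tokens, gold_tokens = predicted_span.lower().split(), gold_span.lower().split()
    overlap = sum(min(pred_tokens.count(t), gold_tokens.count(t)) for t in set(pred_tokens))
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

Generation tasks are typically scored with reference-based measures such as ROUGE or BERTScore, complemented by the expert review described below.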

  5.3. Scoring and Expert Review Methodology:

All quantitative metrics are complemented by qualitative expert evaluations. Legal experts systematically review a representative sample of generated outputs, scoring them based on accuracy, interpretive validity, regulatory relevance, and practical applicability.
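A minimal sketch of aggregating such expert reviews, assuming each expert rates the four criteria above on a 1 to 5 scale (the scale and field names are assumptions made for illustration):

```python
from statistics import mean

# The four review criteria named above; the 1-5 rating scale is an assumption.
CRITERIA = ("accuracy", "interpretive_validity", "regulatory_relevance", "practical_applicability")

def aggregate_reviews(reviews: list[dict[str, int]]) -> dict[str, float]:
    """Mean score per criterion across all expert reviews of a single model output."""
    return {criterion: mean(review[criterion] for review in reviews) for criterion in CRITERIA}

# Example: two experts reviewing the same generated answer.
reviews = [
    {"accuracy": 4, "interpretive_validity": 3, "regulatory_relevance": 5, "practical_applicability": 4},
    {"accuracy": 5, "interpretive_validity": 4, "regulatory_relevance": 4, "practical_applicability": 3},
]
print(aggregate_reviews(reviews))  # {'accuracy': 4.5, 'interpretive_validity': 3.5, ...}
```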

  5.4. Tooling and Technical Considerations:

In addition to benchmark-level assessments, our SaaS platform “Jessica” by Contractzlab was employed to support human annotation and evaluation. Jessica integrates a compliance scoring engine tailored to regulatory assessment logic. This system enables precise scoring of model responses against predefined rules and compliance thresholds, including:

Computational Efficiency: Average inference time per question type and task nature (see the sketch below).

Scalability and Generalizability: Assessed through task performance consistency across multiple regulatory domains and contexts.
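For the computational-efficiency metric, a minimal measurement sketch might look as follows; generate_fn is a hypothetical placeholder for whatever call wraps the model under evaluation.

```python
import time
from collections import defaultdict

def average_latency(items, generate_fn):
    """Average wall-clock inference time in seconds per question type.

    `items` is an iterable of (question_type, prompt) pairs; `generate_fn` is a
    hypothetical callable wrapping the model under evaluation.
    """
    timings = defaultdict(list)
    for question_type, prompt in items:
        start = time.perf_counter()
        generate_fn(prompt)  # model call being timed
        timings[question_type].append(time.perf_counter() - start)
    return {qtype: sum(times) / len(times) for qtype, times in timings.items()}
```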

This comprehensive experimental setup, integrating tailored cognitive and computational metrics, ensures a nuanced, robust, and practical evaluation of generalist LLM performance in legal and regulatory tasks, providing clear guidance for model improvements and practical deployments.

  6. Results and Analysis:

We conducted an exhaustive evaluation of multiple general-purpose Large Language Models (LLMs), including DeepSeek, Llama-4-Maverick, Mistral, Phi-4, GPT-4.1, and GPT-4o, across a diverse set of legal tasks categorized by Bloom’s taxonomy. The results illustrate substantial variability in performance, underscoring strengths and limitations inherent to each model.

  6.1. Quantitative Performance Overview:

Models exhibited distinct performance patterns:

  6.2. Cognitive Level-Specific Insights:

  6.3. Detailed Comparative Analysis:

Quantitative analyses revealed nuanced insights into model performance:

  6.4. Model Limitations and Performance Gaps:

Despite strengths, substantial limitations emerged across models:

  6.5. Expert Validation and Qualitative Assessment:

Legal expert assessments validated quantitative findings, providing deeper insights:

Expert evaluations also highlighted the practical implications of model performance limitations, emphasizing the importance of context-specific accuracy, interpretative depth, and compliance relevance.

  7. Discussion and Future Directions:

The detailed analyses underscore clear pathways for further research and model improvement:

  8. Conclusion:

This comprehensive benchmarking exercise rigorously assessed multiple general-purpose Large Language Models across a structured array of legal cognitive tasks using Bloom’s taxonomy. Our results underscore both substantial strengths and pronounced limitations of current LLMs, particularly their struggles with nuanced, domain-specific regulatory reasoning. Recognizing these limitations, we are currently advancing our proprietary model, Mike, with a dedicated and systematic training regimen explicitly structured around Bloom’s taxonomy and specialized regulatory compliance datasets. This targeted approach positions Mike to overcome many of the identified limitations, potentially enabling it to deliver advanced legal reasoning, precise regulatory compliance analyses, and practical solutions tailored to specific industry contexts. Future developments will focus on enhancing Mike’s capacity for interpretive depth, nuanced reasoning, and regulatory precision, setting a robust foundation for its deployment in complex legal and regulatory environments.

References: