Evaluating Generalist LLMs on Legal Knowledge Tasks: A Comprehensive
Benchmark Integrating Bloom’s Taxonomy for IT Regulations
Abstract: This paper introduces a comprehensive
benchmark designed to rigorously evaluate general-purpose Large Language Models (LLMs) within legal
contexts, with a particular emphasis on regulatory compliance in Information Technology (IT). By employing
Bloom’s taxonomy as a structural framework, we systematically organize evaluation tasks across
cognitive dimensions—memorization, understanding, analyzing, applying, evaluating, and creating. This
benchmark features a meticulously constructed dataset containing multiple-choice questions (MCQs) and
innovative open-ended analytical tasks tailored for legal analysis. Furthermore, we outline the creation of
an associated training dataset using this structured approach to improve the practical performance of LLMs,
initially focusing on IT regulations and subsequently extending to broader regulatory domains.
- Introduction: Effective legal reasoning and
regulatory compliance are essential yet cognitively demanding areas where precision and comprehensive
understanding are paramount. Existing benchmarks, such as LEXam (2025), LegalBench (2023), LLeQA (2023),
and LexEval (2024), have contributed significantly to assessing legal reasoning capabilities but
typically lack detailed cognitive structuring or domain-specific regulatory contexts. Historically,
Bloom’s taxonomy has been instrumental in structuring cognitive assessments, providing a
systematic framework to evaluate a range of cognitive tasks essential for nuanced legal reasoning.
Extending this cognitive framework explicitly into IT regulatory scenarios addresses existing gaps by
providing a detailed and structured assessment method specifically tailored for evaluating generalist
LLM capabilities.
- Related Work: Previous influential
benchmarks include:
- LEXam (2025): Focused on traditional
legal examination formats emphasizing cognitive recall and comprehension through MCQs and open-ended
questions.
- LegalBench (2023): Provided in-depth
evaluations through varied legal analytical and practical application scenarios.
- LLeQA (2023): Specialized in assessing
textual coherence and detailed interpretative reasoning.
- LexEval (2024): Evaluated analytical
proficiency extensively within Chinese legal contexts.
Our benchmark synthesizes these approaches and introduces specialized IT
regulatory tasks, effectively combining cognitive rigor with practical regulatory relevance.
-
Dataset Construction:
Our dataset integrates Bloom’s taxonomy, meticulously structured to evaluate
various cognitive abilities within legal contexts, especially focusing on Information Technology (IT)
regulations. The dataset comprises three principal question types designed to thoroughly evaluate LLM
capabilities:
-
Multiple-Choice Questions (MCQs):
MCQs are carefully crafted to assess foundational cognitive skills such as
memorization and basic comprehension of legal texts and regulatory standards. Questions explicitly target
recall accuracy, terminology clarity, and immediate interpretive abilities.
- Principle: MCQs assess precise recall
and basic understanding of legal provisions and regulatory terms.
- Metrics: Accuracy rate, response consistency across repeated runs, and precision of factual recall (see the scoring sketch below).
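To make these metrics concrete, below is a minimal scoring sketch in Python, assuming each MCQ item stores a gold answer and the answers produced across repeated runs; the function and field names are illustrative and not part of the benchmark specification.

```python
from collections import Counter

def score_mcq(items):
    """Accuracy (majority vote) and run-to-run response consistency for MCQs.

    `items` is a list of dicts such as {"gold": "B", "runs": ["B", "B", "C"]},
    where `runs` holds the model's answer across repeated generations.
    """
    correct, consistency = 0, 0.0
    for item in items:
        majority, count = Counter(item["runs"]).most_common(1)[0]
        correct += int(majority == item["gold"])      # majority-vote accuracy
        consistency += count / len(item["runs"])      # share of runs agreeing with the mode
    n = len(items)
    return {"accuracy": correct / n, "response_consistency": consistency / n}

# Example: two items, one of which has a disagreeing run
print(score_mcq([{"gold": "B", "runs": ["B", "B", "C"]},
                 {"gold": "D", "runs": ["D", "D", "D"]}]))
# -> {'accuracy': 1.0, 'response_consistency': 0.83...}
```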
-
General Open-Ended Questions:
These tasks require LLMs to demonstrate intermediate cognitive skills, including
interpretation, analysis, and application of broader legal and regulatory principles. General open-ended
questions typically present scenarios without sector-specific constraints, assessing the model’s
ability to reason and articulate legal arguments clearly and accurately.
- Principle: Assess analytical depth,
interpretative coherence, and application accuracy.
- Metrics: Precision of legal reasoning,
interpretative coherence (BLEU and ROUGE scores), and qualitative evaluation by legal experts.
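For the BLEU and ROUGE components of these metrics, a minimal sketch is shown below, assuming the `sacrebleu` and `rouge-score` packages; the qualitative expert review remains a separate, manual step.

```python
# pip install sacrebleu rouge-score
import sacrebleu
from rouge_score import rouge_scorer

def coherence_scores(hypothesis: str, reference: str) -> dict:
    """Surface-overlap proxies (BLEU, ROUGE-L) for interpretative coherence."""
    bleu = sacrebleu.sentence_bleu(hypothesis, [reference]).score
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    rouge_l = scorer.score(reference, hypothesis)["rougeL"].fmeasure
    return {"bleu": bleu, "rougeL_f1": rouge_l}

print(coherence_scores(
    "Processing of health data requires explicit consent under Article 9 GDPR.",
    "Under Article 9 GDPR, explicit consent is required to process health data."))
```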
-
Domain-Specific Analytical Questions:
These tasks represent the core of our innovative dataset, structured to rigorously
evaluate advanced cognitive skills such as deep analysis, critical evaluation, and creative synthesis within
specific regulatory contexts (initially IT regulation). Each task explicitly aligns with Bloom's
taxonomy:
- Analyzing: Dissect complex regulatory
scenarios, identify critical legal issues, compliance requirements, and their implications.
- Applying: Direct application of legal
frameworks to practical scenarios, ensuring regulatory adherence.
- Evaluating: Critical assessment tasks
involving analysis of legal jurisdictions, risk assessments, and forecasting legal outcomes.
- Creating: Generative tasks requiring the
development of innovative compliance strategies and detailed policy recommendations.
Example Task:
- Context: CNIL v. Doctissimo (2023): CNIL sanctions Doctissimo for GDPR breaches involving its online symptom-checker algorithms.
- Facts: Doctissimo operated an online
symptom checker processing users' health data without explicit consent, breaching data minimization
principles, and retaining sensitive data excessively.
- Legal Rules: Articles 5, 6, and 9 GDPR; EU AI Act Recital 38 and Article 10.
- Expected Conditions: Explicit consent;
appropriate data retention; transparency in automated decision-making.
- Application Explanation: CNIL identified
violations due to the lack of explicit consent, insufficient user information, and excessive data
retention, highlighting high-risk implications under AI regulatory standards.
- Expected Result: A fine and mandated
corrective measures for compliance improvement.
- Case Reference: CNIL Sanction Decision No.
SAN-2023-001, Doctissimo, January 2023.
- Counterfactual Scenario: Implementation
of anonymized data processing methods and metadata collection only.
- New Fact: Redesigned system
incorporating anonymized user behavior logging without identifiable data.
- Expected Outcome: Compliance with GDPR
and lower risk categorization under the AI Act.
- Arguments Pro: Promotes digital health
innovation, rapid diagnostic access for users.
- Arguments Contra: Privacy violations,
consent breaches, and lack of transparency.
- Paraphrases: Automated health data
processing requires explicit consent, transparency, and strict adherence to GDPR and AI regulatory
frameworks.
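As an illustration of how such a structured analytical task could be serialized for evaluation, the sketch below encodes the example above as a Python record; the field names are ours and do not reflect the benchmark's actual storage schema.

```python
# Illustrative serialization of the Doctissimo example; field names are hypothetical.
doctissimo_task = {
    "context": "CNIL v. Doctissimo (2023)",
    "facts": "Online symptom checker processed users' health data without explicit consent, "
             "breaching data minimization and retaining sensitive data excessively.",
    "legal_rules": ["GDPR Art. 5", "GDPR Art. 6", "GDPR Art. 9",
                    "EU AI Act Recital 38", "EU AI Act Art. 10"],
    "expected_conditions": ["explicit consent", "appropriate data retention",
                            "transparency in automated decision-making"],
    "expected_result": "Fine and mandated corrective measures",
    "case_reference": "CNIL Sanction Decision No. SAN-2023-001, Doctissimo, January 2023",
    "counterfactual": {
        "new_fact": "Redesigned system logs anonymized user behaviour only",
        "expected_outcome": "GDPR compliance and lower risk categorization under the AI Act",
    },
    "arguments_pro": ["digital health innovation", "rapid diagnostic access for users"],
    "arguments_contra": ["privacy violations", "consent breaches", "lack of transparency"],
    "bloom_levels": ["analyzing", "applying", "evaluating", "creating"],
}
```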
-
Scoring Methodology and Metrics:
Each task type employs specific scoring methodologies and metrics:
- MCQs: Objective accuracy and recall.
- General Open-Ended Questions: Qualitative expert reviews, coherence metrics, and interpretative accuracy.
- Domain-Specific Analytical Questions: Multi-dimensional assessments combining legal accuracy, interpretive depth, cognitive rigor,
and practical applicability scores from experts.
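As a sketch of how these multi-dimensional expert assessments might be combined into a single task score, the snippet below assumes each dimension is normalized to a 0-1 scale; the uniform weights are illustrative defaults, not part of the benchmark definition.

```python
def aggregate_analytical_score(scores, weights=None):
    """Weighted combination of per-dimension expert scores (each in [0, 1]).

    Dimensions mirror the scoring methodology above; the uniform weights
    below are placeholders and can be rebalanced per task type.
    """
    dims = ("legal_accuracy", "interpretive_depth",
            "cognitive_rigor", "practical_applicability")
    weights = weights or {d: 1 / len(dims) for d in dims}
    return sum(weights[d] * scores[d] for d in weights)

print(aggregate_analytical_score({
    "legal_accuracy": 0.9, "interpretive_depth": 0.7,
    "cognitive_rigor": 0.8, "practical_applicability": 0.6}))  # -> 0.75
```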
-
Experimental Setup:
Our experimental setup adopts a rigorous evaluation framework that combines
customized metrics aligned explicitly with task types and cognitive levels defined by Bloom’s
taxonomy. Additionally, we incorporate metrics specific to the inherent nature of AI computational tasks,
namely Generation, Extraction, Single-Label Classification (SLC), and Multi-Label Classification
(MLC).
-
Customized Metrics by Task Type and Cognitive Level:
To effectively evaluate the LLM performance across diverse legal cognitive tasks, we
introduce metrics tailored specifically to each cognitive dimension and task format:
-
Memorization (MCQs):
- Accuracy: Percentage of correctly answered MCQs.
- Precision and Recall: Assessed for
questions involving terminological and definitional clarity.
-
Understanding (General Open-Ended Questions):
- BLEU and ROUGE Scores: Evaluate the
coherence and fluency of textual responses.
- Legal Concept Accuracy: Expert-validated
accuracy in interpreting and expressing legal principles.
-
Analyzing and Applying (Domain-Specific Analytical
Tasks):
- Analytical Depth Score: Expert-graded scale
assessing depth of legal analysis.
- Compliance Application Accuracy: Percentage
accuracy based on correctly identified and applied regulatory criteria.
-
Evaluating and Creating (Advanced Domain-Specific Tasks):
- Risk Assessment Accuracy: Evaluates the
model’s ability to correctly identify
and assess compliance risks.
- Creativity and Practicality Score: Expert
qualitative assessment rating innovative legal solutions and strategies.
-
Metrics Derived by AI Task Nature:
To further refine evaluations, we incorporate metrics categorized according to
the intrinsic nature of the AI computations involved in each task:
-
Generation Tasks (e.g., Creating):
- Coherence and Fluency (BLEU, ROUGE): Measures readability and semantic coherence.
- Novelty Index: Expert-assessed originality
and innovation in generated responses.
-
Extraction Tasks (e.g., Understanding, Analyzing):
- Extraction Precision and Recall: Quantitative metrics assessing the model's capability to accurately identify and extract
relevant legal facts, principles, and issues from textual scenarios.
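A minimal sketch of set-level extraction precision and recall follows, assuming the extracted items (legal facts, principles, issues) have been normalized so that exact set comparison is meaningful; the function and example labels are illustrative.

```python
def extraction_prf(predicted, gold):
    """Set-level precision, recall, and F1 for extracted legal items."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

print(extraction_prf(predicted={"gdpr art. 6", "gdpr art. 9", "ai act art. 10"},
                     gold={"gdpr art. 5", "gdpr art. 6", "gdpr art. 9"}))
# -> precision = recall = f1 = 0.67 (rounded)
```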
-
Single-Label Classification (SLC) Tasks (MCQs, simple regulatory
compliance checks):
- Accuracy Rate: Proportion of correct
single-classification responses.
- Confidence Score Distribution: Statistical
analysis of model confidence levels for correct vs. incorrect classifications.
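A small sketch of how this distribution could be summarized, assuming each prediction carries a confidence value (for example, the answer-token probability); the summary statistics shown are illustrative.

```python
import statistics

def confidence_split(records):
    """Mean and spread of confidence, separately for correct and incorrect answers.

    `records` is a list of (confidence, is_correct) pairs; how confidence is
    obtained (e.g. answer-token probability) depends on the model.
    """
    def summary(values):
        if not values:
            return None
        return {"mean": statistics.mean(values), "stdev": statistics.pstdev(values)}
    correct = [c for c, ok in records if ok]
    incorrect = [c for c, ok in records if not ok]
    return {"correct": summary(correct), "incorrect": summary(incorrect)}

print(confidence_split([(0.95, True), (0.80, True), (0.70, False), (0.55, False)]))
```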
-
Multi-Label Classification (MLC) Tasks (complex regulatory
assessments, compliance verification tasks):
- Multi-label Accuracy (Subset Accuracy): Evaluates exact-match correctness over the full set of applicable legal criteria identified simultaneously (see the sketch after this list).
- F1-Score (Macro and Micro): Assesses
balanced precision and recall across multiple regulatory criteria.
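The sketch below computes subset accuracy and macro/micro F1 with scikit-learn, assuming gold and predicted label sets are first converted to binary indicator matrices; the label names are illustrative.

```python
# pip install scikit-learn
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import accuracy_score, f1_score

gold = [{"art5", "art9"}, {"art6"}, {"art5", "art6", "art9"}]   # applicable criteria per item
pred = [{"art5", "art9"}, {"art6", "art9"}, {"art5", "art6"}]   # model predictions

mlb = MultiLabelBinarizer().fit(gold + pred)
Y, P = mlb.transform(gold), mlb.transform(pred)

print("subset accuracy:", accuracy_score(Y, P))          # exact match over label sets
print("macro F1:", f1_score(Y, P, average="macro"))
print("micro F1:", f1_score(Y, P, average="micro"))
```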
-
Scoring and Expert Review Methodology:
All quantitative metrics are complemented by qualitative expert evaluations. Legal
experts systematically review a representative sample of generated outputs, scoring them based on accuracy,
interpretive validity, regulatory relevance, and practical applicability.
- Expert Validity Rating: Scale of
1–5 assessing overall legal correctness and applicability.
- Regulatory Compliance Score: Expert-rated accuracy of models in identifying, evaluating, and applying regulatory criteria
correctly.
-
Tooling and Technical Considerations:
In addition to benchmark-level assessments, our SaaS platform “Jessica”
by Contractzlab was
employed to support human annotation and evaluation. Jessica integrates a
compliance
scoring engine tailored to regulatory assessment logic. This system enables precise
scoring of model responses against predefined rules and compliance thresholds, including:
- Legal principle matching.
- Alignment with sector-specific obligations.
- Scoring based on automated legal heuristics.
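Jessica's scoring engine itself is proprietary; the fragment below is only a minimal illustration of the general idea of matching a model response against predefined compliance rules and a pass threshold. The rule patterns and the 0.8 threshold are invented for this example and do not describe the actual engine.

```python
import re

# Hypothetical rule set: each rule is a regex the response is expected to satisfy.
COMPLIANCE_RULES = {
    "explicit_consent": r"\bexplicit consent\b",
    "data_minimization": r"\bdata minimi[sz]ation\b",
    "retention_limits": r"\bretention\b",
    "transparency": r"\btransparen(cy|t)\b",
}

def compliance_score(response, rules=COMPLIANCE_RULES, threshold=0.8):
    """Fraction of compliance rules matched in the response, plus a pass flag."""
    hits = {name: bool(re.search(pattern, response, re.IGNORECASE))
            for name, pattern in rules.items()}
    score = sum(hits.values()) / len(rules)
    return {"score": score, "passes": score >= threshold, "matched": hits}

print(compliance_score(
    "The redesign requires explicit consent, data minimisation, and shorter retention periods."))
# -> score 0.75, passes False (transparency not addressed)
```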
- Computational Efficiency: Average inference time per question type and task nature.
- Scalability and Generalizability: Assessed through task performance consistency across multiple regulatory domains and contexts.
This comprehensive experimental setup, integrating tailored cognitive and
computational metrics, ensures a nuanced, robust, and practical evaluation of generalist LLM performance in
legal and regulatory tasks, providing clear guidance for model improvements and practical
deployments.
-
Results and Analysis:
We conducted an exhaustive evaluation of multiple general-purpose Large Language
Models (LLMs), including DeepSeek, Llama-4-Maverick, Mistral, Phi-4, GPT-4.1, and GPT-4o, across a diverse
set of legal tasks categorized by Bloom’s taxonomy. The results illustrate substantial variability in
performance, underscoring strengths and limitations inherent to each model.

-
Quantitative Performance Overview
Models exhibited distinct performance patterns:
- Argument Mining (Accuracy): Mistral (96.9%)
and Phi-4 (66.4%) significantly outperformed other models. Conversely, GPT-4.1 (1.3%) exhibited poor
performance, indicating limitations in its capability to effectively extract nuanced arguments.
- Article Recitation (Rouge-L): GPT-4.1
(45.9%) demonstrated superior capability in generating coherent textual content. GPT-4o and
Llama-4-Maverick also showed commendable performance, whereas DeepSeek and Phi-4 displayed limitations,
suggesting challenges with complex textual synthesis.
- Consultation (Rouge-L): Llama-4-Maverick
(26.1%) and GPT-4o (24.5%) achieved relatively higher coherence, though overall performance across
models remained limited, pointing toward inherent complexities and contextual nuances involved in
consultative legal tasks.
-
Cognitive Level-Specific Insights
- Memorization & Understanding: GPT
variants excelled in foundational cognitive tasks requiring recall and basic comprehension, reflecting
robust pre-training on extensive factual datasets.
- Analyzing & Applying: Mistral and Phi-4
excelled in complex analytical scenarios, demonstrating advanced capabilities in dissecting regulatory
contexts and applying specific compliance frameworks accurately.
- Evaluating & Creating: GPT-4o and
GPT-4.1 displayed notable strengths in advanced evaluative and generative tasks, adeptly crafting
contextually relevant legal arguments and strategic recommendations, albeit sometimes lacking the detailed specificity required by highly specialized regulatory contexts.
-
Detailed Comparative Analysis
Quantitative analyses revealed nuanced insights into model performance:
- Generation Tasks: GPT-4o demonstrated
consistent superiority in generative capabilities, showcasing both fluency and contextual coherence, as
evidenced by Rouge-L scores.
- Extraction and Classification Tasks:
Models like Mistral and Phi-4 excelled, particularly in tasks requiring precise extraction and accurate
classification of regulatory details, suggesting their suitability for structured analytical legal
applications.
-
Model Limitations and Performance Gaps
Despite strengths, substantial limitations emerged across models:
- Generalist vs. Domain-Specific Contexts: All models, especially GPT-4.1, showed reduced efficacy in domain-specific tasks,
reflecting the intrinsic limitations of generalist models in highly specialized regulatory
scenarios.
- Complex Legal Reasoning: None of the
models consistently achieved high performance across all cognitive levels, indicating the need for more
specialized training and refinement to effectively handle nuanced legal analyses and regulatory
compliance tasks.
-
Expert Validation and Qualitative Assessment:
Legal expert assessments validated quantitative findings, providing deeper
insights:
- GPT variants received recognition for general clarity and logical
coherence but faced criticism for superficial handling of domain-specific regulatory nuances.
- Mistral and Phi-4 garnered positive feedback for analytical
rigor and precision, yet experts noted occasional difficulties in contextual interpretation and
higher-order reasoning.
Expert evaluations also highlighted the practical implications of model
performance limitations, emphasizing the importance of context-specific accuracy, interpretative depth, and
compliance relevance.
-
Discussion and Future Directions:
The detailed analyses underscore clear pathways for further research and model
improvement:
- Domain-Specific Fine-Tuning: Targeted
fine-tuning on specific regulatory datasets is crucial for enhancing model performance, especially in
specialized contexts.
- Hybrid Methodologies: Integrating
rule-based compliance frameworks with LLM outputs can substantially enhance accuracy and reliability,
particularly in compliance-heavy scenarios.
- Expansion of Benchmarks: Extending
evaluations into other regulated domains such as healthcare, finance, and environmental law will provide
broader insights into model generalizability and robustness.
-
Conclusion:
This comprehensive benchmarking exercise rigorously assessed multiple
general-purpose Large Language Models across a structured array of legal cognitive tasks using Bloom’s
taxonomy. Our results underscore both substantial strengths and pronounced limitations of current LLMs,
particularly their struggles with nuanced, domain-specific regulatory
reasoning. Recognizing these limitations, we are currently advancing our proprietary
model, Mike, with a dedicated and systematic training regimen explicitly structured around Bloom’s
taxonomy and specialized regulatory compliance datasets. This targeted approach positions Mike to overcome
many of the identified limitations, potentially enabling it to deliver advanced legal reasoning, precise
regulatory compliance analyses, and practical solutions
tailored to specific industry contexts. Future developments will focus on enhancing
Mike’s capacity for interpretive depth, nuanced reasoning, and regulatory precision, setting a robust
foundation for its deployment in complex legal and regulatory environments.
References:
- LEXam (2025), GitHub, HuggingFace.
- LegalBench (2023), GitHub, OpenReview.
- LLeQA (2023), GitHub, HuggingFace dataset.
- LexEval (2024), ArXiv, GitHub.