Evaluating Generalist LLMs on Legal Knowledge Tasks: A Comprehensive
Benchmark Integrating Bloom’s Taxonomy for IT Regulations
Abstract: This paper introduces a comprehensive
benchmark designed to rigorously evaluate general-purpose Large Language Models (LLMs) within legal
contexts, with a particular emphasis on regulatory compliance in Information Technology (IT). By employing
Bloom’s taxonomy as a structural framework, we systematically organize evaluation tasks across
cognitive dimensions—memorization, understanding, analyzing, applying, evaluating, and creating. This
benchmark features a meticulously constructed dataset containing multiple-choice questions (MCQs) and
innovative open-ended analytical tasks tailored for legal analysis. Furthermore, we outline the creation of
an associated training dataset using this structured approach to improve the practical performance of LLMs,
initially focusing on IT regulations and subsequently extending to broader regulatory domains.
- Introduction: Effective legal reasoning and
regulatory compliance are essential yet cognitively demanding areas where precision and comprehensive
understanding are paramount. Existing benchmarks, such as LEXam (2025), LegalBench (2023), LLeQA (2023),
and LexEval (2024), have contributed significantly to assessing legal reasoning capabilities but
typically lack detailed cognitive structuring or domain-specific regulatory contexts. Historically,
Bloom’s taxonomy has been instrumental in structuring cognitive assessments, providing a
systematic framework to evaluate a range of cognitive tasks essential for nuanced legal reasoning.
Extending this cognitive framework explicitly into IT regulatory scenarios addresses existing gaps by
providing a detailed and structured assessment method specifically tailored for evaluating generalist
LLM capabilities.
- Related Work: Previous influential
benchmarks include:
- LEXam (2025): Focused on traditional
legal examination formats emphasizing cognitive recall and comprehension through MCQs and open-ended
questions.
- LegalBench (2023): Provided in-depth
evaluations through varied legal analytical and practical application scenarios.
- LLeQA (2023): Specialized in assessing
textual coherence and detailed interpretative reasoning.
- LexEval (2024): Evaluated analytical
proficiency extensively within Chinese legal contexts.
Our benchmark synthesizes these approaches and introduces specialized IT
regulatory tasks, effectively combining cognitive rigor with practical regulatory relevance.
-
Dataset Construction:
Our dataset integrates Bloom’s taxonomy, meticulously structured to evaluate
various cognitive abilities within legal contexts, especially focusing on Information Technology (IT)
regulations. The dataset comprises three principal question types designed to thoroughly evaluate LLM
capabilities:
-
Multiple-Choice Questions (MCQs):
MCQs are carefully crafted to assess foundational cognitive skills such as
memorization and basic comprehension of legal texts and regulatory standards. Questions explicitly target
recall accuracy, terminology clarity, and immediate interpretive abilities.
- Principle: MCQs assess precise recall
and basic understanding of legal provisions and regulatory terms.
- Metrics: Accuracy rate, response consistency across repeated runs, and precision of factual recall (see the scoring sketch below).
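To make these metrics concrete, below is a minimal scoring sketch in Python, assuming each MCQ item stores a gold answer and the answers produced across repeated runs; the function and field names are illustrative and not part of the benchmark specification.

```python
from collections import Counter

def score_mcq(items):
    """Accuracy (majority vote) and run-to-run response consistency for MCQs.

    `items` is a list of dicts such as {"gold": "B", "runs": ["B", "B", "C"]},
    where `runs` holds the model's answer across repeated generations.
    """
    correct, consistency = 0, 0.0
    for item in items:
        majority, count = Counter(item["runs"]).most_common(1)[0]
        correct += int(majority == item["gold"])      # majority-vote accuracy
        consistency += count / len(item["runs"])      # share of runs agreeing with the mode
    n = len(items)
    return {"accuracy": correct / n, "response_consistency": consistency / n}

# Example: two items, one of which has a disagreeing run
print(score_mcq([{"gold": "B", "runs": ["B", "B", "C"]},
                 {"gold": "D", "runs": ["D", "D", "D"]}]))
# -> {'accuracy': 1.0, 'response_consistency': 0.83...}
```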
-
General Open-Ended Questions:
These tasks require LLMs to demonstrate intermediate cognitive skills, including
interpretation, analysis, and application of broader legal and regulatory principles. General open-ended
questions typically present scenarios without sector-specific constraints, assessing the model’s
ability to reason and articulate legal arguments clearly and accurately.
- Principle: Assess analytical depth,
interpretative coherence, and application accuracy.
- Metrics: Precision of legal reasoning,
interpretative coherence (BLEU and ROUGE scores), and qualitative evaluation by legal experts.
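For the BLEU and ROUGE components of these metrics, a minimal sketch is shown below, assuming the `sacrebleu` and `rouge-score` packages; the qualitative expert review remains a separate, manual step.

```python
# pip install sacrebleu rouge-score
import sacrebleu
from rouge_score import rouge_scorer

def coherence_scores(hypothesis: str, reference: str) -> dict:
    """Surface-overlap proxies (BLEU, ROUGE-L) for interpretative coherence."""
    bleu = sacrebleu.sentence_bleu(hypothesis, [reference]).score
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    rouge_l = scorer.score(reference, hypothesis)["rougeL"].fmeasure
    return {"bleu": bleu, "rougeL_f1": rouge_l}

print(coherence_scores(
    "Processing of health data requires explicit consent under Article 9 GDPR.",
    "Under Article 9 GDPR, explicit consent is required to process health data."))
```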
-
Domain-Specific Analytical Questions:
These tasks represent the core of our innovative dataset, structured to rigorously
evaluate advanced cognitive skills such as deep analysis, critical evaluation, and creative synthesis within
specific regulatory contexts (initially IT regulation). Each task explicitly aligns with Bloom's
taxonomy:
- Analyzing: Dissect complex regulatory
scenarios, identify critical legal issues, compliance requirements, and their implications.
- Applying: Direct application of legal
frameworks to practical scenarios, ensuring regulatory adherence.
- Evaluating: Critical assessment tasks
involving analysis of legal jurisdictions, risk assessments, and forecasting legal outcomes.
- Creating: Generative tasks requiring the
development of innovative compliance strategies and detailed policy recommendations.
Example Task:
- Context: CNIL v. Doctissimo (2023): CNIL sanctions Doctissimo for GDPR breaches involving its online symptom-checker algorithms.
- Facts: Doctissimo operated an online
symptom checker processing users' health data without explicit consent, breaching data minimization
principles, and retaining sensitive data excessively.
- Legal Rules: Articles 5, 6, and 9 GDPR; EU AI Act Recital 38 and Article 10.
- Expected Conditions: Explicit consent;
appropriate data retention; transparency in automated decision-making.
- Application Explanation: CNIL identified
violations due to the lack of explicit consent, insufficient user information, and excessive data
retention, highlighting high-risk implications under AI regulatory standards.
- Expected Result: A fine and mandated
corrective measures for compliance improvement.
- Case Reference: CNIL Sanction Decision No.
SAN-2023-001, Doctissimo, January 2023.
- Counterfactual Scenario: Implementation
of anonymized data processing methods and metadata collection only.
- New Fact: Redesigned system
incorporating anonymized user behavior logging without identifiable data.
- Expected Outcome: Compliance with GDPR
and lower risk categorization under the AI Act.
- Arguments Pro: Promotes digital health
innovation, rapid diagnostic access for users.
- Arguments Contra: Privacy violations,
consent breaches, and lack of transparency.
- Paraphrases: Automated health data
processing requires explicit consent, transparency, and strict adherence to GDPR and AI regulatory
frameworks.
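As an illustration of how such a structured analytical task could be serialized for evaluation, the sketch below encodes the example above as a Python record; the field names are ours and do not reflect the benchmark's actual storage schema.

```python
# Illustrative serialization of the Doctissimo example; field names are hypothetical.
doctissimo_task = {
    "context": "CNIL v. Doctissimo (2023)",
    "facts": "Online symptom checker processed users' health data without explicit consent, "
             "breaching data minimization and retaining sensitive data excessively.",
    "legal_rules": ["GDPR Art. 5", "GDPR Art. 6", "GDPR Art. 9",
                    "EU AI Act Recital 38", "EU AI Act Art. 10"],
    "expected_conditions": ["explicit consent", "appropriate data retention",
                            "transparency in automated decision-making"],
    "expected_result": "Fine and mandated corrective measures",
    "case_reference": "CNIL Sanction Decision No. SAN-2023-001, Doctissimo, January 2023",
    "counterfactual": {
        "new_fact": "Redesigned system logs anonymized user behaviour only",
        "expected_outcome": "GDPR compliance and lower risk categorization under the AI Act",
    },
    "arguments_pro": ["digital health innovation", "rapid diagnostic access for users"],
    "arguments_contra": ["privacy violations", "consent breaches", "lack of transparency"],
    "bloom_levels": ["analyzing", "applying", "evaluating", "creating"],
}
```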
-
Scoring Methodology and Metrics:
Each task type employs specific scoring methodologies and metrics:
- MCQs: Objective accuracy and recall.
- General Open-Ended Questions: Qualitative expert reviews, coherence metrics, and interpretative accuracy.
- Domain-Specific Analytical Questions: Multi-dimensional assessments combining legal accuracy, interpretive depth, cognitive rigor,
and practical applicability scores from experts.
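As a sketch of how these multi-dimensional expert assessments might be combined into a single task score, the snippet below assumes each dimension is normalized to a 0-1 scale; the uniform weights are illustrative defaults, not part of the benchmark definition.

```python
def aggregate_analytical_score(scores, weights=None):
    """Weighted combination of per-dimension expert scores (each in [0, 1]).

    Dimensions mirror the scoring methodology above; the uniform weights
    below are placeholders and can be rebalanced per task type.
    """
    dims = ("legal_accuracy", "interpretive_depth",
            "cognitive_rigor", "practical_applicability")
    weights = weights or {d: 1 / len(dims) for d in dims}
    return sum(weights[d] * scores[d] for d in weights)

print(aggregate_analytical_score({
    "legal_accuracy": 0.9, "interpretive_depth": 0.7,
    "cognitive_rigor": 0.8, "practical_applicability": 0.6}))  # -> 0.75
```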
-
Experimental Setup:
Our experimental setup adopts a rigorous evaluation framework that combines
customized metrics aligned explicitly with task types and cognitive levels defined by Bloom’s
taxonomy. Additionally, we incorporate metrics specific to the inherent nature of AI computational tasks,
namely Generation, Extraction, Single-Label Classification (SLC), and Multi-Label Classification
(MLC).
-
Customized Metrics by Task Type and Cognitive Level:
To effectively evaluate the LLM performance across diverse legal cognitive tasks, we
introduce metrics tailored specifically to each cognitive dimension and task format:
-
Memorization (MCQs):
- Accuracy: Percentage of correctly answered MCQs.
- Precision and Recall: Assessed for
questions involving terminological and definitional clarity.
-
Understanding (General Open-Ended Questions):
- BLEU and ROUGE Scores: Evaluate the
coherence and fluency of textual responses.
- Legal Concept Accuracy: Expert-validated
accuracy in interpreting and expressing legal principles.
-
Analyzing and Applying (Domain-Specific Analytical
Tasks):
- Analytical Depth Score: Expert-graded scale
assessing depth of legal analysis.
- Compliance Application Accuracy: Percentage
accuracy based on correctly identified and applied regulatory criteria.
-
Evaluating and Creating (Advanced Domain-Specific Tasks):
- Risk Assessment Accuracy: Evaluates the
model’s ability to correctly identify
and assess compliance risks.
- Creativity and Practicality Score: Expert
qualitative assessment rating innovative legal solutions and strategies.
-
Metrics Derived by AI Task Nature:
To further refine evaluations, we incorporate metrics categorized according to
the intrinsic nature of the AI computations involved in each task:
-
Generation Tasks (e.g., Creating):
- Coherence and Fluency (BLEU, ROUGE): Measures readability and semantic coherence.
- Novelty Index: Expert-assessed originality
and innovation in generated responses.
-
Extraction Tasks (e.g., Understanding, Analyzing):
- Extraction Precision and Recall: Quantitative metrics assessing the model's capability to accurately identify and extract
relevant legal facts, principles, and issues from textual scenarios.
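A minimal sketch of set-level extraction precision and recall follows, assuming the extracted items (legal facts, principles, issues) have been normalized so that exact set comparison is meaningful; the function and example labels are illustrative.

```python
def extraction_prf(predicted, gold):
    """Set-level precision, recall, and F1 for extracted legal items."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

print(extraction_prf(predicted={"gdpr art. 6", "gdpr art. 9", "ai act art. 10"},
                     gold={"gdpr art. 5", "gdpr art. 6", "gdpr art. 9"}))
# -> precision = recall = f1 = 0.67 (rounded)
```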
-
Single-Label Classification (SLC) Tasks (MCQs, simple regulatory
compliance checks):
- Accuracy Rate: Proportion of correct
single-classification responses.
- Confidence Score Distribution: Statistical
analysis of model confidence levels for correct vs. incorrect classifications.
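A small sketch of how this distribution could be summarized, assuming each prediction carries a confidence value (for example, the answer-token probability); the summary statistics shown are illustrative.

```python
import statistics

def confidence_split(records):
    """Mean and spread of confidence, separately for correct and incorrect answers.

    `records` is a list of (confidence, is_correct) pairs; how confidence is
    obtained (e.g. answer-token probability) depends on the model.
    """
    def summary(values):
        if not values:
            return None
        return {"mean": statistics.mean(values), "stdev": statistics.pstdev(values)}
    correct = [c for c, ok in records if ok]
    incorrect = [c for c, ok in records if not ok]
    return {"correct": summary(correct), "incorrect": summary(incorrect)}

print(confidence_split([(0.95, True), (0.80, True), (0.70, False), (0.55, False)]))
```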
-
Multi-Label Classification (MLC) Tasks (complex regulatory
assessments, compliance verification tasks):
- Multi-label Accuracy (Subset Accuracy): Evaluates exact-match correctness over the full set of applicable legal criteria identified simultaneously (see the sketch after this list).
- F1-Score (Macro and Micro): Assesses
balanced precision and recall across multiple regulatory criteria.
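The sketch below computes subset accuracy and macro/micro F1 with scikit-learn, assuming gold and predicted label sets are first converted to binary indicator matrices; the label names are illustrative.

```python
# pip install scikit-learn
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import accuracy_score, f1_score

gold = [{"art5", "art9"}, {"art6"}, {"art5", "art6", "art9"}]   # applicable criteria per item
pred = [{"art5", "art9"}, {"art6", "art9"}, {"art5", "art6"}]   # model predictions

mlb = MultiLabelBinarizer().fit(gold + pred)
Y, P = mlb.transform(gold), mlb.transform(pred)

print("subset accuracy:", accuracy_score(Y, P))          # exact match over label sets
print("macro F1:", f1_score(Y, P, average="macro"))
print("micro F1:", f1_score(Y, P, average="micro"))
```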
-
Scoring and Expert Review Methodology:
All quantitative metrics are complemented by qualitative expert evaluations. Legal
experts systematically review a representative sample of generated outputs, scoring them based on accuracy,
interpretive validity, regulatory relevance, and practical applicability.
- Expert Validity Rating: Scale of
1–5 assessing overall legal correctness and applicability.
- Regulatory Compliance Score: Expert-rated accuracy of models in identifying, evaluating, and applying regulatory criteria
correctly.
-
Tooling and Technical Considerations:
In addition to benchmark-level assessments, our SaaS platform “Jessica”
by Contractzlab was
employed to support human annotation and evaluation. Jessica integrates a
compliance
scoring engine tailored to regulatory assessment logic. This system enables precise
scoring of model responses against predefined rules and compliance thresholds, including:
- Legal principle matching.
- Alignment with sector-specific obligations.
- Scoring based on automated legal heuristics.
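Jessica's scoring engine itself is proprietary; the fragment below is only a minimal illustration of the general idea of matching a model response against predefined compliance rules and a pass threshold. The rule patterns and the 0.8 threshold are invented for this example and do not describe the actual engine.

```python
import re

# Hypothetical rule set: each rule is a regex the response is expected to satisfy.
COMPLIANCE_RULES = {
    "explicit_consent": r"\bexplicit consent\b",
    "data_minimization": r"\bdata minimi[sz]ation\b",
    "retention_limits": r"\bretention\b",
    "transparency": r"\btransparen(cy|t)\b",
}

def compliance_score(response, rules=COMPLIANCE_RULES, threshold=0.8):
    """Fraction of compliance rules matched in the response, plus a pass flag."""
    hits = {name: bool(re.search(pattern, response, re.IGNORECASE))
            for name, pattern in rules.items()}
    score = sum(hits.values()) / len(rules)
    return {"score": score, "passes": score >= threshold, "matched": hits}

print(compliance_score(
    "The redesign requires explicit consent, data minimisation, and shorter retention periods."))
# -> score 0.75, passes False (transparency not addressed)
```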
- Computational Efficiency: Average inference time per question type and task nature.
- Scalability and Generalizability: Assessed through task performance consistency across multiple regulatory domains and contexts.
This comprehensive experimental setup, integrating tailored cognitive and
computational metrics, ensures a nuanced, robust, and practical evaluation of generalist LLM performance in
legal and regulatory tasks, providing clear guidance for model improvements and practical
deployments.
-
Results and Analysis:
We conducted an exhaustive evaluation of multiple general-purpose Large Language
Models (LLMs), including DeepSeek, Llama-4-Maverick, Mistral, Phi-4, GPT-4.1, and GPT-4o, across a diverse
set of legal tasks categorized by Bloom’s taxonomy. The results illustrate substantial variability in
performance, underscoring strengths and limitations inherent to each model.

-
Quantitative Performance Overview
Models exhibited distinct performance patterns:
- Argument Mining (Accuracy): Mistral (96.9%)
and Phi-4 (66.4%) significantly outperformed other models. Conversely, GPT-4.1 (1.3%) exhibited poor
performance, indicating limitations in its capability to effectively extract nuanced arguments.
- Article Recitation (Rouge-L): GPT-4.1
(45.9%) demonstrated superior capability in generating coherent textual content. GPT-4o and
Llama-4-Maverick also showed commendable performance, whereas DeepSeek and Phi-4 displayed limitations,
suggesting challenges with complex textual synthesis.
- Consultation (Rouge-L): Llama-4-Maverick
(26.1%) and GPT-4o (24.5%) achieved relatively higher coherence, though overall performance across
models remained limited, pointing toward inherent complexities and contextual nuances involved in
consultative legal tasks.
-
Cognitive Level-Specific Insights
- Memorization & Understanding: GPT
variants excelled in foundational cognitive tasks requiring recall and basic comprehension, reflecting
robust pre-training on extensive factual datasets.
- Analyzing & Applying: Mistral and Phi-4
excelled in complex analytical scenarios, demonstrating advanced capabilities in dissecting regulatory
contexts and applying specific compliance frameworks accurately.
- Evaluating & Creating: GPT-4o and
GPT-4.1 displayed notable strengths in advanced evaluative and generative tasks, adeptly crafting
contextually relevant legal arguments and strategic recommendations, albeit sometimes lacking the detailed specificity required by highly specialized regulatory contexts.
-
Detailed Comparative Analysis
Quantitative analyses revealed nuanced insights into model performance:
- Generation Tasks: GPT-4o demonstrated
consistent superiority in generative capabilities, showcasing both fluency and contextual coherence, as
evidenced by Rouge-L scores.
- Extraction and Classification Tasks:
Models like Mistral and Phi-4 excelled, particularly in tasks requiring precise extraction and accurate
classification of regulatory details, suggesting their suitability for structured analytical legal
applications.
-
Model Limitations and Performance Gaps
Despite strengths, substantial limitations emerged across models:
- Generalist vs. Domain-Specific Contexts: All models, especially GPT-4.1, showed reduced efficacy in domain-specific tasks,
reflecting the intrinsic limitations of generalist models in highly specialized regulatory
scenarios.
- Complex Legal Reasoning: None of the
models consistently achieved high performance across all cognitive levels, indicating the need for more
specialized training and refinement to effectively handle nuanced legal analyses and regulatory
compliance tasks.
-
Expert Validation and Qualitative Assessment:
Legal expert assessments validated quantitative findings, providing deeper
insights:
- GPT variants received recognition for general clarity and logical
coherence but faced criticism for superficial handling of domain-specific regulatory nuances.
- Mistral and Phi-4 garnered positive feedback for analytical
rigor and precision, yet experts noted occasional difficulties in contextual interpretation and
higher-order reasoning.
Expert evaluations also highlighted the practical implications of model
performance limitations, emphasizing the importance of context-specific accuracy, interpretative depth, and
compliance relevance.
-
Discussion and Future Directions:
The detailed analyses underscore clear pathways for further research and model
improvement:
- Domain-Specific Fine-Tuning: Targeted
fine-tuning on specific regulatory datasets is crucial for enhancing model performance, especially in
specialized contexts.
- Hybrid Methodologies: Integrating
rule-based compliance frameworks with LLM outputs can substantially enhance accuracy and reliability,
particularly in compliance-heavy scenarios.
- Expansion of Benchmarks: Extending
evaluations into other regulated domains such as healthcare, finance, and environmental law will provide
broader insights into model generalizability and robustness.
-
Conclusion:
This comprehensive benchmarking exercise rigorously assessed multiple
general-purpose Large Language Models across a structured array of legal cognitive tasks using Bloom’s
taxonomy. Our results underscore both substantial strengths and pronounced limitations of current LLMs,
particularly their struggles with nuanced, domain-specific regulatory
reasoning. Recognizing these limitations, we are currently advancing our proprietary
model, Mike, with a dedicated and systematic training regimen explicitly structured around Bloom’s
taxonomy and specialized regulatory compliance datasets. This targeted approach positions Mike to overcome
many of the identified limitations, potentially enabling it to deliver advanced legal reasoning, precise
regulatory compliance analyses, and practical solutions
tailored to specific industry contexts. Future developments will focus on enhancing
Mike’s capacity for interpretive depth, nuanced reasoning, and regulatory precision, setting a robust
foundation for its deployment in complex legal and regulatory environments.
References:
- LEXam (2025), GitHub, HuggingFace.
- LegalBench (2023), GitHub, OpenReview.
- LLeQA (2023), GitHub, HuggingFace dataset.
- LexEval (2024), ArXiv, GitHub.