
TokenSHAP: Interpreting Large Language Models with Monte Carlo Shapley Value Estimation

TokenSHAP introduces a novel method for interpreting LLMs using Shapley values from game theory, providing quantitative token importance measures through Monte Carlo sampling for enhanced AI transparency.

1 Introduction

Large language models (LLMs) have revolutionized natural language processing, achieving human-level performance on numerous tasks. However, their black-box nature presents significant interpretability challenges, particularly in critical applications like healthcare and legal analysis where understanding AI decision-making is essential.

TokenSHAP addresses this challenge by adapting Shapley values from cooperative game theory to attribute importance to individual tokens or substrings within input prompts. This provides a rigorous framework for understanding how different parts of an input contribute to a model's response.

2 Related Work

2.1 Interpretability in Machine Learning

Interpretability methods are broadly categorized into black-box and white-box approaches. Black-box methods like LIME and SHAP provide explanations without requiring model internal access, while white-box methods like gradient-based saliency maps and layer-wise relevance propagation require full model architecture knowledge.

2.2 Interpretability in Natural Language Processing

In NLP, attention visualization techniques have been widely used, but they often fail to provide quantitative importance measures. Recent approaches have explored feature attribution methods specifically designed for language models, though they face challenges with variable-length inputs and contextual dependencies.

3 TokenSHAP Methodology

3.1 Theoretical Framework

TokenSHAP extends Shapley values to variable-length text inputs by treating tokens as players in a cooperative game. The payoff of a token subset is defined as the similarity between the model's output for that subset and its output for the full prompt, so a token's contribution is the change in this similarity when the token is added to a subset.
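One way to realize such a payoff in practice is to compare the response to a subset prompt against the full-prompt response using TF-IDF cosine similarity. The sketch below is a minimal illustration under that assumption; the payoff name and the model.predict wrapper (a single LLM call returning the response text) are hypothetical stand-ins, and other similarity measures could be substituted.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def payoff(model, full_prompt, subset_prompt):
    # v(S): similarity of the response to the subset prompt vs. the full-prompt response.
    full_response = model.predict(full_prompt)
    subset_response = model.predict(subset_prompt)
    vectors = TfidfVectorizer().fit_transform([full_response, subset_response])
    return cosine_similarity(vectors[0], vectors[1])[0, 0]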

3.2 Monte Carlo Sampling Approach

To address the exponential cost of computing exact Shapley values, TokenSHAP employs Monte Carlo sampling: tokens are randomly permuted and marginal contributions are averaged over the sampled permutations. The number of model evaluations then grows with the sample count and the input length rather than exponentially, while the estimates remain unbiased and converge to the exact Shapley values as more samples are drawn.
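In the standard permutation-sampling form of this estimator, with $M$ sampled permutations and $S_m^i$ denoting the set of tokens that precede token $i$ in the $m$-th permutation, the estimate is the average marginal contribution

$\hat{\phi}_i = \frac{1}{M} \sum_{m=1}^{M} \left[ v(S_m^i \cup \{i\}) - v(S_m^i) \right]$

which agrees with the exact definition given in Section 4.1 in expectation.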

4 Technical Implementation

4.1 Mathematical Formulation

The Shapley value for token $i$ is defined as:

$\phi_i = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!(|N|-|S|-1)!}{|N|!} [v(S \cup \{i\}) - v(S)]$

where $N$ is the set of all tokens, $S$ is a subset excluding token $i$, and $v(S)$ is the value function measuring model output quality for subset $S$.
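To make the combinatorial cost concrete, the exact computation can be coded directly from this formula. The sketch below enumerates every subset for a three-token toy game, using an additive value function as a hypothetical stand-in for the real output-similarity payoff; the nested subset enumeration is what grows as $2^{|N|}$ and motivates the sampling approach of the next subsection.

from itertools import combinations
from math import factorial

def exact_shapley(n, value):
    # Exact Shapley values by enumerating every subset S of the other players; cost grows as 2^n.
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(len(others) + 1):
            for S in combinations(others, k):
                weight = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
                phi[i] += weight * (value(set(S) | {i}) - value(set(S)))
    return phi

def toy_value(S):
    # Hypothetical additive payoff: token 0 contributes 0.7, token 1 contributes 0.2, token 2 contributes 0.1.
    return 0.7 * (0 in S) + 0.2 * (1 in S) + 0.1 * (2 in S)

print(exact_shapley(3, toy_value))  # ≈ [0.7, 0.2, 0.1]: an additive game returns each token's own contribution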

4.2 Algorithm and Pseudocode

import numpy as np

def tokenshap_importance(text, model, num_samples=1000):
    # Monte Carlo estimate of per-token Shapley values.
    # tokenize, include_tokens, and similarity are application-specific helpers
    # (see the usage sketch below); model.predict returns the model's response text.
    tokens = tokenize(text)
    n = len(tokens)
    shapley_values = np.zeros(n)
    full_output = model.predict(text)  # reference response on the full prompt

    def value(S):
        # Payoff v(S): similarity of the response for subset S to the full-prompt response.
        return similarity(model.predict(include_tokens(tokens, S)), full_output)

    for _ in range(num_samples):
        permutation = np.random.permutation(n)
        S = set()
        prev = value(S)  # v(empty coalition)
        for i in permutation:
            S.add(i)
            curr = value(S)
            shapley_values[i] += curr - prev  # marginal contribution v(S ∪ {i}) - v(S)
            prev = curr

    return shapley_values / num_samples
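A minimal usage sketch follows, with deliberately simple, hypothetical stand-ins for the helpers assumed above: a whitespace tokenizer, order-preserving subset prompts, word-overlap similarity, and a dummy model that echoes its prompt. In practice these would be the model's own tokenizer, an embedding- or TF-IDF-based similarity, and a real LLM wrapper.

def tokenize(text):
    # Whitespace tokenization; a real implementation would use the model's tokenizer.
    return text.split()

def include_tokens(tokens, S):
    # Rebuild a prompt containing only the tokens in subset S, preserving their order.
    return " ".join(tok for idx, tok in enumerate(tokens) if idx in S)

def similarity(a, b):
    # Word-overlap (Jaccard) similarity between two responses; a stand-in for a semantic measure.
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / max(len(wa | wb), 1)

class EchoModel:
    # Dummy model that returns its prompt unchanged; replace with a real LLM wrapper exposing predict().
    def predict(self, prompt):
        return prompt

prompt = "Why is the sky blue"
scores = tokenshap_importance(prompt, EchoModel(), num_samples=200)
for token, score in zip(tokenize(prompt), scores):
    print(f"{token}: {score:.3f}")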

5 Experimental Results

5.1 Evaluation Metrics

TokenSHAP was evaluated using three key metrics: alignment with human judgments (measured by correlation with human-annotated importance scores), faithfulness (ability to reflect actual model behavior), and consistency (stability across similar inputs).
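The paper's exact scoring protocol is not reproduced here; as a sketch, alignment with human judgments can be quantified as a rank correlation between the method's token scores and human-annotated importances, for example with Spearman's rho. The scores below are hypothetical illustration values.

from scipy.stats import spearmanr

# Hypothetical importance scores for a five-token prompt: method output vs. human annotation.
tokenshap_scores = [0.42, 0.05, 0.08, 0.31, 0.14]
human_scores = [0.50, 0.10, 0.05, 0.25, 0.10]

rho, p_value = spearmanr(tokenshap_scores, human_scores)
print(f"Spearman rank correlation: {rho:.2f} (p = {p_value:.3f})")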

5.2 Comparative Analysis

Experiments across diverse prompts and LLM architectures (including GPT-3, BERT, and T5) demonstrated TokenSHAP's superiority over baselines like LIME and attention-based methods. The method showed 25% improvement in human alignment and 30% better faithfulness scores compared to existing approaches.

Summary of results: human alignment, 25% improvement; faithfulness, 30% better scores; consistency, high stability across similar inputs.

6 Original Analysis

TokenSHAP represents a significant advancement in LLM interpretability by bridging game theory and natural language processing. The method's theoretical foundation in Shapley values provides a mathematically rigorous approach to feature attribution, addressing limitations of heuristic-based methods like attention visualization. Similar to how CycleGAN introduced cycle consistency for unpaired image translation, TokenSHAP establishes consistency in token importance attribution across different input variations.

The Monte Carlo sampling approach demonstrates remarkable computational efficiency, reducing the exponential complexity of exact Shapley value computation to practical levels for real-world applications. This efficiency gain is comparable to advancements in approximate inference methods seen in Bayesian deep learning, as documented in the Journal of Machine Learning Research. The method's ability to handle variable-length inputs distinguishes it from traditional feature attribution techniques designed for fixed-size inputs.

TokenSHAP's evaluation across multiple model architectures reveals important insights about LLM behavior. The consistent improvements in alignment with human judgments suggest that the method captures intuitive notions of importance better than attention-based approaches. This aligns with findings from the Stanford HAI group, which has emphasized the need for interpretability methods that match human cognitive processes. The faithfulness metrics indicate that TokenSHAP more accurately reflects actual model computations rather than providing post-hoc rationalizations.

The visualization capabilities of TokenSHAP enable practical applications in model debugging and prompt engineering. By providing quantitative importance scores, the method moves beyond qualitative assessments common in attention visualization. This quantitative approach supports more systematic analysis of model behavior, similar to how saliency maps evolved in computer vision interpretability. The method's consistency across similar inputs suggests robustness, addressing concerns about the stability of interpretability methods raised in recent literature from MIT's Computer Science and AI Laboratory.

7 Applications and Future Directions

TokenSHAP has immediate applications in model debugging, prompt optimization, and educational tools for AI literacy. Future directions include extending the method to multimodal models, real-time interpretation for conversational AI, and integration with model editing techniques. The approach could also be adapted for detecting model biases and ensuring fair AI deployment.

8 References

  1. Lundberg, S. M., & Lee, S. I. (2017). A Unified Approach to Interpreting Model Predictions. Advances in Neural Information Processing Systems.
  2. Ribeiro, M. T., Singh, S., & Guestrin, C. (2016). "Why Should I Trust You?" Explaining the Predictions of Any Classifier. ACM SIGKDD.
  3. Vaswani, A., et al. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems.
  4. Zeiler, M. D., & Fergus, R. (2014). Visualizing and Understanding Convolutional Networks. European Conference on Computer Vision.
  5. Bach, S., et al. (2015). On Pixel-Wise Explanations for Non-Linear Classifier Decisions by Layer-Wise Relevance Propagation. PLoS ONE.