Explainable AI Frameworks


Trust comes through understanding. As AI models grow in complexity, they often resemble a “black box,” where their decision-making processes become increasingly opaque. This lack of transparency can be a roadblock, especially when we need to trust and understand these decisions. Explainable AI (XAI) is the approach that aims to make AI’s decisions more transparent, interpretable, and understandable. As the demand for transparency in AI systems intensifies, a number of frameworks have emerged to bridge the gap between machine complexity and human interpretability. Some of the leading Explainable AI Frameworks include:

LIME (Local Interpretable Model-Agnostic Explanations)

Developed by Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin, LIME was introduced in their seminal 2016 paper “Why Should I Trust You?”: Explaining the Predictions of Any Classifier. The trio, affiliated with the University of Washington, set out to address a pressing concern in the AI community: machine learning models were achieving state-of-the-art performance across many tasks, but their lack of transparency and interpretability posed significant challenges.

LIME’s primary objective is to provide explanations that are both locally faithful and interpretable. “Locally faithful” means that the explanation reflects the behavior of the model in the vicinity of the instance being explained, even if the model’s global behavior is complex. “Interpretable” means that a human can understand the explanation, which LIME achieves by using simple models such as sparse linear models or shallow decision trees.
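
The original paper states this trade-off as an optimization problem. In its notation, f is the black-box model, G is a class of interpretable models, π_x measures proximity to the instance x, Ω(g) is the complexity of a candidate explanation g, and z′ is the interpretable representation of a perturbed sample z. The explanation is the g that minimizes locality-weighted unfaithfulness plus complexity. For sparse linear explanations, the paper uses a squared loss weighted by an exponential kernel over a distance D with width σ:

```latex
\xi(x) = \operatorname*{arg\,min}_{g \in G} \; \mathcal{L}(f, g, \pi_x) + \Omega(g),
\qquad
\mathcal{L}(f, g, \pi_x) = \sum_{z, z' \in \mathcal{Z}} \pi_x(z)\,\bigl(f(z) - g(z')\bigr)^2,
\qquad
\pi_x(z) = \exp\!\bigl(-D(x, z)^2 / \sigma^2\bigr)
```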

What sets LIME apart from many other interpretability frameworks is its model-agnostic nature. While some methods are tailored to specific types of models, LIME can be applied to any machine learning model, from deep neural networks to ensemble methods. It achieves this by perturbing the instance being explained, obtaining the black-box model’s predictions for the perturbed samples, weighting those samples by their proximity to the original instance, and then training an interpretable model on them that approximates the complex model’s decisions locally.
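
The following minimal NumPy sketch shows the perturb-query-weight-fit loop for tabular data. The black-box function, the Gaussian sampling scale, and the kernel width are illustrative assumptions. The official `lime` package adds steps that are omitted here, such as feature selection and, for tabular data, discretization of continuous features.

```python
import numpy as np

def black_box(X):
    # Hypothetical opaque model: a nonlinear function of two features.
    return 1.0 / (1.0 + np.exp(-(3.0 * X[:, 0] - 2.0 * X[:, 1] ** 2)))

def lime_tabular(f, x, n_samples=5000, scale=0.3, kernel_width=0.5, seed=0):
    """LIME-style sketch: fit a proximity-weighted linear surrogate around x."""
    rng = np.random.default_rng(seed)
    # 1. Perturb: sample points from a Gaussian neighborhood of x.
    Z = x + rng.normal(0.0, scale, size=(n_samples, x.shape[0]))
    # 2. Query the black box on every perturbed sample.
    y = f(Z)
    # 3. Weight samples by proximity to x with an exponential kernel.
    w = np.exp(-np.sum((Z - x) ** 2, axis=1) / kernel_width ** 2)
    # 4. Weighted least squares: intercept plus one local weight per feature.
    A = np.hstack([np.ones((n_samples, 1)), Z - x])
    sqrt_w = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(A * sqrt_w[:, None], y * sqrt_w, rcond=None)
    return coef[0], coef[1:]

x = np.array([0.5, 1.0])
intercept, weights = lime_tabular(black_box, x)
print(f"local prediction ~ {intercept:.3f}, feature weights = {np.round(weights, 3)}")
```

For the instance (0.5, 1.0), the surrogate should give feature 0 a positive weight and feature 1 a negative one. This matches the black box’s local slope around that point, even though the model is nonlinear globally.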

Another distinguishing feature of LIME is its flexibility in the choice of interpretable models and its ability to handle different types of data, including tabular data, text, and images. This versatility has made LIME a popular choice among researchers and practitioners seeking to demystify their machine learning models.

WIT (What-If Tool)

Developed by Google’s PAIR (People + AI Research) initiative, the What-If Tool (WIT) was introduced with the primary goal of making machine learning models more understandable, transparent, and accessible to experts and non-experts alike. WIT provides an interactive visual interface that allows users to probe, visualize, and analyze the behaviour of their models with minimal coding.

One of the standout features of WIT is its ability to facilitate counterfactual analysis. Users can modify input data points and instantly observe how these changes impact model predictions. This is particularly useful for understanding model behaviour in edge cases or for identifying potential biases in the model’s predictions.
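
WIT also exposes a related idea as a “nearest counterfactual” search. It finds the datapoint in the loaded dataset that is most similar to a selected one but receives a different prediction. The plain-Python sketch below illustrates that idea. The loan model, the data, the per-feature scales, and the scaled L1 distance are all hypothetical choices, and this is not WIT’s own code.

```python
def predict(row):
    # Hypothetical loan model: income in thousands, credit score in points.
    income, credit_score = row
    return int(income >= 50 and credit_score >= 650)

def nearest_counterfactual(x, dataset, predict, scales):
    """Return the row closest to x (scaled L1 distance) whose prediction differs from x's."""
    target = predict(x)
    best, best_dist = None, float("inf")
    for row in dataset:
        if predict(row) == target:
            continue  # only rows with a different outcome are counterfactuals
        # Divide by a per-feature scale so thousands and points are comparable.
        dist = sum(abs(a - b) / s for a, b, s in zip(x, row, scales))
        if dist < best_dist:
            best, best_dist = row, dist
    return best, best_dist

dataset = [(40, 700), (55, 640), (60, 700), (52, 660), (30, 500)]
applicant = (48, 690)
cf, dist = nearest_counterfactual(applicant, dataset, predict, scales=(10, 100))
print(predict(applicant), cf, round(dist, 3))
```

The rejected applicant’s nearest approved neighbour is (52, 660): slightly more income, slightly lower credit score. This points to income as the binding constraint for this applicant.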

Furthermore, WIT supports comparison of two models side-by-side, enabling users to understand how different models or model versions perform on the same dataset. This comparative analysis is invaluable for model evaluation, especially when trying to choose the best model for a specific task.

Another critical aspect of WIT is its fairness analysis. With growing concerns about biases in AI systems, WIT provides tools to analyze model performance across different user-defined slices of data. This helps in identifying whether a model is treating different groups of data (e.g., based on gender, race, or age) fairly or if there are discrepancies in its predictions.
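
At its core, slice-based fairness analysis computes the same metrics separately for each group and compares them. The minimal sketch below assumes hypothetical binary predictions and labels tagged with a `group` field. WIT computes and visualizes comparisons of this kind interactively; the code only illustrates the underlying arithmetic.

```python
from collections import defaultdict

def metrics_by_slice(records, slice_key):
    """Per-slice positive-prediction rate and accuracy for binary predictions."""
    stats = defaultdict(lambda: {"n": 0, "pos": 0, "correct": 0})
    for r in records:
        s = stats[r[slice_key]]
        s["n"] += 1
        s["pos"] += r["pred"]
        s["correct"] += int(r["pred"] == r["label"])
    return {
        group: {"positive_rate": s["pos"] / s["n"], "accuracy": s["correct"] / s["n"]}
        for group, s in stats.items()
    }

# Hypothetical predictions (pred) and ground truth (label) for two groups.
records = [
    {"group": "A", "pred": 1, "label": 1}, {"group": "A", "pred": 1, "label": 0},
    {"group": "A", "pred": 0, "label": 0}, {"group": "A", "pred": 1, "label": 1},
    {"group": "B", "pred": 0, "label": 1}, {"group": "B", "pred": 0, "label": 0},
    {"group": "B", "pred": 1, "label": 1}, {"group": "B", "pred": 0, "label": 0},
]
report = metrics_by_slice(records, "group")
for group, m in sorted(report.items()):
    print(group, m)
```

Both groups have the same accuracy (0.75), yet group A receives positive predictions three times as often as group B. A single aggregate accuracy number would hide that disparity.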

WIT is integrated into TensorBoard, making it easy for TensorFlow users to add it to their workflows. It can also run in Jupyter and Colab notebooks, where it works with models trained in other machine learning frameworks.

The team describes the tool in their paper, The What-If Tool: Interactive Probing of Machine Learning Models.

SHAP (SHapley Additive exPlanations)

Shapley values, and the SHAP framework built on them, have emerged as another leading approach in the quest to demystify complex machine learning models. Shapley values originate in cooperative game theory and were introduced by Lloyd Shapley in 1953. In that setting, they fairly distribute the gains of a cooperative game among its players according to their individual contributions. Each player’s worth is determined by considering all possible coalitions of players and how much value the player adds to each one.
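
In symbols, take a set of players N and a value function v that assigns a payoff v(S) to every coalition S ⊆ N. Player i’s Shapley value is a weighted average of its marginal contributions v(S ∪ {i}) - v(S) over all coalitions that exclude it. The weights correspond to averaging over every order in which the players could join:

```latex
\phi_i(v) = \sum_{S \subseteq N \setminus \{i\}}
  \frac{|S|!\,\bigl(|N| - |S| - 1\bigr)!}{|N|!}\,
  \bigl( v(S \cup \{i\}) - v(S) \bigr)
```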

Fast forward to the realm of machine learning, and Shapley values have found a new playground. They are used to explain the output of any machine learning model by attributing the prediction output to its input features. In essence, for a given prediction, Shapley values help answer the question: “How much does each feature contribute to the prediction?”

The SHAP (SHapley Additive exPlanations) framework, introduced by Scott M. Lundberg and Su-In Lee in their 2017 paper “A Unified Approach to Interpreting Model Predictions,” builds upon Shapley values to provide a unified measure of feature importance for machine learning models. What sets SHAP apart is its foundation in game theory, which gives a principled, fair allocation of the prediction among the features. The resulting explanations are consistent and locally accurate. Local accuracy means that the feature attributions for an instance add up exactly to the model’s output for that instance, relative to a baseline.

One of the standout features of SHAP is its ability to handle complex models, including ensemble models and deep learning networks. It provides visually intuitive plots, such as force plots, summary plots, and dependence plots, that allow users to grasp the model’s decision-making process quickly.

Furthermore, SHAP values have the desirable property of additivity. The sum of the SHAP values for all features equals the difference between the model’s prediction for the instance and the average prediction over a background dataset (the expected model output). As a result, the attributions fully account for the gap between this prediction and the baseline, leaving none of it unexplained.
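
The additivity property can be checked directly on a small example. The sketch below computes exact Shapley values by enumerating every coalition. It defines the value of a coalition S as the model’s average output when the features in S come from the instance and the remaining features come from each row of a background dataset. The model, instance, and background rows are hypothetical. Exhaustive enumeration costs time exponential in the number of features, which is why the `shap` library relies on approximations (such as Kernel SHAP) or model-specific algorithms (such as Tree SHAP) instead.

```python
from itertools import combinations
from math import factorial

def model(x):
    # Hypothetical black-box model with an interaction between features 1 and 2.
    return 2.0 * x[0] + x[1] * x[2] - 0.5 * x[2]

def value(S, x, background, f):
    """v(S): mean output with features in S taken from x and the rest from each background row."""
    total = 0.0
    for b in background:
        z = [x[j] if j in S else b[j] for j in range(len(x))]
        total += f(z)
    return total / len(background)

def shapley_values(x, background, f):
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):  # coalition sizes 0 .. n-1
            for subset in combinations(others, k):
                S = set(subset)
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                phi[i] += weight * (value(S | {i}, x, background, f) - value(S, x, background, f))
    return phi

background = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0], [2.0, 0.0, 1.0]]
x = [1.0, 2.0, 3.0]
phi = shapley_values(x, background, model)
avg = sum(model(b) for b in background) / len(background)
print(phi, sum(phi), model(x) - avg)
```

Here the background rows average to a prediction of 2.0 and the instance scores 6.5, so the three attributions sum to 4.5. Feature 0 receives zero: it enters the model linearly, and its value at the instance equals its background mean.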

DeepLIFT (Deep Learning Important FeaTures)

DeepLIFT, introduced by researchers Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje in 2017, is an interpretability algorithm tailored for deep learning models. Its primary objective is to compute the contribution of each neuron (or feature) to every prediction, providing insights into which parts of the input drive the model’s decisions.

Unlike some other interpretability methods that rely on input perturbations or surrogate models, DeepLIFT operates by comparing the activation of each neuron to a “reference activation” and determining how much each neuron contributed to the difference in output. This reference can be thought of as a baseline or neutral input, against which the actual input is compared.
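
The paper formalizes this as the “summation-to-delta” property. Let Δt = t - t⁰ be the difference between a target neuron’s output t and its reference value t⁰, and let Δx_i = x_i - x_i⁰ for each input. The contribution scores C must then add up to the total difference. Dividing a contribution by its input difference gives a multiplier m, and multipliers compose through layers by a chain rule:

```latex
\sum_{i=1}^{n} C_{\Delta x_i \Delta t} = \Delta t,
\qquad
m_{\Delta x \Delta t} = \frac{C_{\Delta x \Delta t}}{\Delta x},
\qquad
m_{\Delta x_i \Delta t} = \sum_{j} m_{\Delta x_i \Delta y_j}\, m_{\Delta y_j \Delta t}
```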

One of the distinguishing features of DeepLIFT is its ability to handle non-linearities in deep neural networks. It does so by backpropagating the contributions from the output layer to the input layer, ensuring that the computed contributions are consistent with the actual flow of information within the network. This backpropagation approach allows DeepLIFT to capture intricate relationships and dependencies between neurons, offering a more comprehensive view of the network’s inner workings.

Furthermore, DeepLIFT addresses the saturation problem commonly encountered in deep networks. Saturation occurs when a neuron’s output stops responding to small variations in its input, so its local gradient is zero or near zero. Gradient-based methods then assign such inputs little or no importance, even when they matter. Because DeepLIFT works with differences from the reference rather than local gradients, it can still assign non-zero contributions in saturated regions, avoiding this pitfall.
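
The saturation problem, and DeepLIFT’s answer to it, can be reproduced on a toy network. The sketch below applies DeepLIFT’s Rescale rule to a one-hidden-layer ReLU network written in NumPy. At each ReLU, the multiplier is the change in output divided by the change in input relative to the reference, and multipliers are chained back to the inputs. The network y = 1 - ReLU(1 - x1 - x2), the all-zeros reference, and the helper names are assumptions for illustration. This is not the authors’ reference implementation, and it does not cover the paper’s RevealCancel rule.

```python
import numpy as np

def forward(x, W1, b1, w2, b2):
    """One-hidden-layer ReLU network: y = w2 . relu(W1 x + b1) + b2."""
    z = W1 @ x + b1
    h = np.maximum(z, 0.0)
    return z, h, float(w2 @ h + b2)

def deeplift_rescale(x, x_ref, W1, b1, w2, b2, eps=1e-12):
    """DeepLIFT contributions under the Rescale rule, relative to reference x_ref."""
    z, h, y = forward(x, W1, b1, w2, b2)
    z0, h0, y0 = forward(x_ref, W1, b1, w2, b2)
    dz, dh = z - z0, h - h0
    # Rescale rule: ReLU multiplier = (change in output) / (change in input);
    # where the pre-activation did not change, fall back to the ReLU gradient.
    changed = np.abs(dz) > eps
    m_relu = np.where(changed, dh / np.where(changed, dz, 1.0), (z > 0).astype(float))
    # Chain rule for multipliers: output <- w2 <- ReLU <- W1 <- input.
    m_input = (w2 * m_relu) @ W1
    return m_input * (x - x_ref), y - y0

def gradient(x, W1, b1, w2, b2):
    """Plain input gradient of the same network, for comparison."""
    z = W1 @ x + b1
    return (w2 * (z > 0)) @ W1

# Toy saturating network: y = 1 - relu(1 - x1 - x2).
W1, b1 = np.array([[-1.0, -1.0]]), np.array([1.0])
w2, b2 = np.array([-1.0]), 1.0
x, x_ref = np.array([1.0, 1.0]), np.zeros(2)

contrib, delta_y = deeplift_rescale(x, x_ref, W1, b1, w2, b2)
print("DeepLIFT:", contrib, "sum =", contrib.sum(), "delta y =", delta_y)
print("gradient:", gradient(x, W1, b1, w2, b2))
```

At x = (1, 1) the ReLU is switched off, so the gradient is zero for both inputs, even though the output moved from 0 at the reference to 1. DeepLIFT assigns each input a contribution of 0.5, and the contributions sum to the change in output.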

In essence, DeepLIFT serves as a bridge between the complex, multi-layered world of deep neural networks and the realm of human understanding. By quantifying the importance of each feature and providing a clear map of how information flows through the network, DeepLIFT helps make deep learning applications more transparent, accountable, and trustworthy.

Skater

Developed by the team at DataScience Inc., Skater is a unified framework that leverages a variety of model-agnostic and model-specific interpretation techniques to shed light on the inner workings of machine learning models. Its primary goal is to help data scientists, domain experts, and decision-makers gain a deeper understanding of how models make predictions, thereby fostering trust and facilitating better decision-making.

One of Skater’s standout features is its model-agnostic approach. Regardless of whether you’re working with a linear regression, a random forest, or a deep neural network, Skater provides a suite of tools to interpret and visualize the model’s behavior. This flexibility makes it a valuable asset for practitioners working with diverse modeling techniques.

Skater offers a range of interpretation methods, including:

  • Feature Importance: Quantify the significance of each feature in making predictions.
  • Partial Dependence Plots: Visualize the relationship between features and the predicted outcome.
  • Local Interpretable Model-agnostic Explanations (LIME): Understand model predictions on a per-instance basis by approximating complex models with simpler, interpretable ones.
  • Activation Maps: For deep learning models, visualize which parts of an input (e.g., regions of an image) are most influential in driving the model’s decisions.
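
Partial dependence is straightforward to compute: fix the feature of interest to each value on a grid for every row in the dataset, then average the model’s predictions. The NumPy sketch below shows that computation with a hypothetical model and toy data. It does not use Skater’s API.

```python
import numpy as np

def partial_dependence(f, X, feature, grid):
    """Average prediction with `feature` forced to each grid value across all rows of X."""
    values = []
    for v in grid:
        X_mod = X.astype(float)  # astype returns a copy, so X itself is not modified
        X_mod[:, feature] = v
        values.append(f(X_mod).mean())
    return np.array(values)

def model(X):
    # Hypothetical additive model: linear in feature 0, quadratic in feature 1.
    return 3.0 * X[:, 0] + X[:, 1] ** 2

X = np.array([[0.0, 1.0], [1.0, 2.0], [2.0, 3.0]])
print(partial_dependence(model, X, feature=0, grid=[0.0, 1.0, 2.0]))
```

Because this model is additive, the partial dependence on feature 0 is the line 3v shifted by the average of x1² over the data (14/3 here).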

Additionally, Skater provides tools for model simplification, allowing users to approximate complex models with simpler, more interpretable ones without sacrificing too much accuracy. This can be particularly useful when deploying models in environments where interpretability is paramount.
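
A common form of model simplification is a global surrogate. You train an interpretable model on the complex model’s own predictions, then measure how faithfully it reproduces them. The sketch below uses a linear surrogate to stay short; tree- or rule-based surrogates follow the same recipe. The two black-box functions are hypothetical, and the code does not reproduce Skater’s API.

```python
import numpy as np

def linear_surrogate(f, X):
    """Fit a linear model to a black box's predictions; return coefficients and fidelity (R^2)."""
    y_bb = f(X)  # targets are the model's outputs, not ground-truth labels
    A = np.hstack([np.ones((X.shape[0], 1)), X])
    coef, *_ = np.linalg.lstsq(A, y_bb, rcond=None)
    residual = y_bb - A @ coef
    r2 = 1.0 - np.sum(residual ** 2) / np.sum((y_bb - y_bb.mean()) ** 2)
    return coef, r2

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(500, 2))

def nearly_linear(X):
    # Hypothetical black box that a linear surrogate can mimic well.
    return 1.0 + 2.0 * X[:, 0] - X[:, 1] + 0.05 * np.sin(5.0 * X[:, 0])

def curved(X):
    # Hypothetical black box that a linear surrogate mimics poorly.
    return X[:, 0] ** 2

for f in (nearly_linear, curved):
    coef, r2 = linear_surrogate(f, X)
    print(f.__name__, np.round(coef, 2), "fidelity R^2 =", round(r2, 3))
```

Fidelity decides whether a surrogate is usable. A surrogate with low R², as for the curved function, should not be used to explain the model, however readable it is.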

AIX360 (AI Explainability 360)

Developed by IBM’s trusted AI team, AIX360 offers a diverse set of algorithms that cater to different explainability needs, ranging from directly interpretable models to post-hoc explanations and metrics. The toolkit is grounded in state-of-the-art research, ensuring that users have access to cutting-edge techniques to decipher their models.

Key features of AIX360 include:

  • Diverse Algorithms: AIX360 encompasses a wide array of explainability methods, ensuring that users can choose the most appropriate technique for their specific model and application. These include LIME, SHAP, and the Contrastive Explanations Method (CEM), among others.
  • Interdisciplinary Approach: Recognizing that explainability is not just a technical challenge but also a cognitive and sociological one, AIX360 integrates insights from various disciplines, including cognitive psychology, to make explanations more intuitive and user-friendly.
  • Interactive Demos: To facilitate hands-on exploration, AIX360 provides interactive demos that allow users to visualize and understand the behavior of different algorithms in real-time.
  • Extensibility: Designed with the broader AI community in mind, AIX360 is built to be extensible, enabling researchers and developers to contribute new algorithms and improve existing ones.

One of the standout features of AIX360 is its emphasis on “neural network interpretability,” addressing the challenges posed by deep learning models, which are often perceived as highly opaque. By providing tools that can dissect these intricate models layer by layer, AIX360 makes it possible to gain insights into their decision-making processes.

In a world where AI-driven decisions affect everything from healthcare diagnoses to financial loan approvals, understanding and trusting these decisions is essential. AIX360 represents a significant stride toward more transparent, accountable, and ethical AI. By equipping practitioners with the tools they need to demystify their models, AIX360 helps keep AI systems comprehensible to the humans they serve.

Activation Atlases

Developed by researchers at OpenAI in collaboration with Google, Activation Atlases provide a high-level overview of the features a neural network detects. By aggregating activation vectors across a vast number of input samples, this method creates a comprehensive “atlas” that showcases how different regions of the model’s feature space respond to specific input patterns.

Key aspects of Activation Atlases include:

  • Granular Insights: Unlike traditional visualization methods that focus on individual neurons or layers, Activation Atlases capture the collective behavior of groups of neurons. This offers a more holistic view of the model’s internal representations.
  • Feature Visualization: Activation Atlases employ feature visualization techniques to generate synthetic images that represent the kind of input patterns that activate specific regions of the model’s feature space. This allows researchers to see, for instance, what kind of image features a convolutional neural network might be looking for at different layers.
  • Interactivity: One of the standout features of Activation Atlases is their interactive nature. Researchers can explore different regions of the atlas, zooming in and out to uncover varying levels of detail, making the exploration of complex models more intuitive.
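
The aggregation step behind an atlas fits in a few lines. Collect activation vectors for many inputs, project them to two dimensions, lay a grid over the projection, and average the vectors that fall into each cell. Each averaged vector would then be rendered with feature visualization, which is outside the scope of this sketch. The activations here are random placeholders, and PCA stands in for the non-linear projection (UMAP) used in the original work.

```python
import numpy as np

def atlas_cells(acts, grid_size=4):
    """Project activations to 2-D, bin them into a grid, and average each cell's vectors."""
    centered = acts - acts.mean(axis=0)
    # PCA via SVD: a simple linear stand-in for a non-linear projection.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    coords = centered @ vt[:2].T
    # Rescale coordinates to [0, 1) and assign each point to a grid cell.
    lo, hi = coords.min(axis=0), coords.max(axis=0)
    norm = (coords - lo) / (hi - lo + 1e-9)
    cells = np.minimum((norm * grid_size).astype(int), grid_size - 1)
    atlas = {}
    for i, j in {(int(a), int(b)) for a, b in cells}:
        mask = (cells[:, 0] == i) & (cells[:, 1] == j)
        # Each cell keeps its mean activation vector and how many samples fell in it.
        atlas[(i, j)] = (acts[mask].mean(axis=0), int(mask.sum()))
    return atlas

# Hypothetical activations: 1,000 samples from a 64-unit layer.
rng = np.random.default_rng(0)
acts = rng.normal(size=(1000, 64))
atlas = atlas_cells(acts, grid_size=4)
print(len(atlas), "non-empty cells")
```

Cells that contain only a few samples have noisy averages, so it can help to drop them before rendering.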

Rulex Explainable AI

Rulex is an AI platform that focuses on generating logical rules from data, providing clear and actionable insights. Unlike traditional black-box models that offer predictions without context, Rulex’s approach is rooted in the generation of explicit rules that can be easily understood and validated by humans.

Key features of Rulex Explainable AI include:

  • Transparent Rule Generation: Rulex processes raw data and extracts patterns, translating them into a set of logical rules. These rules are not only interpretable but also actionable, allowing users to understand the rationale behind AI-driven decisions.
  • Model Simplicity: By focusing on rule-based logic, Rulex sidesteps the complexity of deep neural networks, ensuring that the generated models are both efficient and transparent.
  • Versatility: Rulex is designed to handle a wide range of data types and is applicable across various domains, from finance to healthcare, making it a versatile tool for diverse AI applications.
  • Human-in-the-loop Approach: Rulex places a strong emphasis on human expertise. By generating clear rules, it allows domain experts to validate, refine, or even challenge the AI’s findings, fostering a collaborative approach to decision-making.
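
Rulex’s own rule-generation algorithms are not reproduced here. To illustrate what learning explicit rules from data looks like, the sketch below uses OneR, a classic and much simpler rule learner. For each feature, OneR maps every value to its most common label and keeps the feature whose rules make the fewest errors on the training data. The loan data is hypothetical.

```python
from collections import Counter, defaultdict

def one_rule(rows, labels, features):
    """OneR: per feature, map each value to its majority label; keep the feature with fewest errors."""
    best = None
    for feat in features:
        by_value = defaultdict(Counter)
        for row, y in zip(rows, labels):
            by_value[row[feat]][y] += 1
        rules = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
        # Errors = rows whose label differs from the majority label for their value.
        errors = sum(sum(c.values()) - c[rules[v]] for v, c in by_value.items())
        if best is None or errors < best[2]:
            best = (feat, rules, errors)
    return best

rows = [
    {"income": "high", "history": "good"}, {"income": "high", "history": "bad"},
    {"income": "low", "history": "good"}, {"income": "low", "history": "bad"},
    {"income": "high", "history": "good"}, {"income": "low", "history": "bad"},
]
labels = ["approve", "reject", "approve", "reject", "approve", "reject"]

feature, rules, errors = one_rule(rows, labels, ["income", "history"])
for value, label in rules.items():
    print(f"IF {feature} == {value!r} THEN {label}")
print("training errors:", errors)
```

The output is a pair of human-readable IF/THEN statements that a domain expert can inspect, question, or override. This is the kind of validation loop the rule-based approach is meant to support.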


Explainable AI Frameworks are more than just tools; they are the future of AI. As AI continues to play a pivotal role in various industries, the ability to interpret and understand its decisions becomes crucial. By leveraging these frameworks, we can pave the way for more transparent and accountable AI systems.

For 30+ years, I've been committed to protecting people, businesses, and the environment from the physical harm caused by cyber-kinetic threats, blending cybersecurity strategies and resilience and safety measures. Lately, my worries have grown due to the rapid, complex advancements in Artificial Intelligence (AI). Having observed AI's progression for two decades and penned a book on its future, I see it as a unique and escalating threat, especially when applied to military systems, disinformation, or integrated into critical infrastructure like 5G networks or smart grids. More about me, and about Defence.AI.
