Building Bayesian Networks with Python
Bayesian networks are probabilistic graphical models that represent variables as nodes in a directed acyclic graph, with each edge marking a direct dependency and each node carrying a conditional probability distribution given its parents. They are well suited to reasoning under uncertainty and are used in domains such as medical diagnosis, fault detection, and decision making.
1. Installation
We’ll be using the “pgmpy” library for working with Bayesian networks in Python. If you haven’t already, install it using pip:
pip install pgmpy
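To confirm the install worked, the package version can be looked up from Python. The snippet below is a small sketch that relies only on the standard library; the helper name `installed_version` is made up for this example and is not part of pgmpy.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("pgmpy")
print(version if version else "pgmpy is not installed in this environment")
```

If this prints a version number, the import statements in the following sections should resolve.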
2. Creating a Bayesian Network
Let’s create a simple Bayesian network for a hypothetical scenario:
We have three variables:
- Cloudy: Whether it’s cloudy or not (True/False)
- Sprinkler: Whether the sprinkler is on (True/False)
- WetGrass: Whether the grass is wet (True/False)
Our intuition suggests the following relationships:
- Cloudy affects whether the sprinkler is on.
- Both cloudiness and the sprinkler affect whether the grass is wet.
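These relationships form a directed acyclic graph, so the joint distribution factorizes as P(Cloudy, Sprinkler, WetGrass) = P(Cloudy) · P(Sprinkler | Cloudy) · P(WetGrass | Cloudy, Sprinkler). As a rough, library-free sketch, the same structure can be written as an edge list and checked for cycles with the standard library; the variable names `edges`, `parents`, and `order` are illustrative choices, not pgmpy API.

```python
from graphlib import CycleError, TopologicalSorter

# Directed edges (parent, child), matching the relationships listed above.
edges = [
    ("Cloudy", "Sprinkler"),
    ("Cloudy", "WetGrass"),
    ("Sprinkler", "WetGrass"),
]

# Map every node to the set of its parents.
parents = {}
for parent, child in edges:
    parents.setdefault(parent, set())
    parents.setdefault(child, set()).add(parent)

# TopologicalSorter expects node -> predecessors, which is what `parents` holds.
# A CycleError would mean the edges do not describe a valid Bayesian network.
try:
    order = list(TopologicalSorter(parents).static_order())
except CycleError:
    order = None

print(parents)
print(order)
```

The topological order lists parents before children, which is also the order in which the factors of the joint distribution are written.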
2.1. Defining the Structure
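# The class name depends on the pgmpy version: older releases call it
# BayesianModel, and recent releases rename it to DiscreteBayesianNetwork.
from pgmpy.models import BayesianNetwork

# Define the network structure as a list of (parent, child) edges
model = BayesianNetwork([
    ('Cloudy', 'Sprinkler'),
    ('Cloudy', 'WetGrass'),
    ('Sprinkler', 'WetGrass'),
])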
2.2. Specifying the Conditional Probability Tables (CPTs)
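from pgmpy.factors.discrete import TabularCPD

# States are encoded as 0 (False) and 1 (True). In `values`, each row is a
# state of the variable and each column is one combination of parent states,
# with the last evidence variable changing fastest.

# Define CPT for Cloudy: P(Cloudy)
cpd_cloudy = TabularCPD(variable='Cloudy', variable_card=2,
                        values=[[0.5], [0.5]])

# Define CPT for Sprinkler: P(Sprinkler | Cloudy)
# Columns: Cloudy=0, Cloudy=1
cpd_sprinkler = TabularCPD(variable='Sprinkler', variable_card=2,
                           values=[[0.5, 0.9],
                                   [0.5, 0.1]],
                           evidence=['Cloudy'], evidence_card=[2])

# Define CPT for WetGrass: P(WetGrass | Cloudy, Sprinkler)
# Columns: (C=0, S=0), (C=0, S=1), (C=1, S=0), (C=1, S=1)
cpd_wetgrass = TabularCPD(variable='WetGrass', variable_card=2,
                          values=[[0.9, 0.2, 0.9, 0.01],
                                  [0.1, 0.8, 0.1, 0.99]],
                          evidence=['Cloudy', 'Sprinkler'],
                          evidence_card=[2, 2])

# Add the CPTs to the model
model.add_cpds(cpd_cloudy, cpd_sprinkler, cpd_wetgrass)

# Check if the model is valid
model.check_model()  # Output: True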
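One condition a valid CPT must satisfy is that every column, meaning every combination of parent states, sums to 1. The sketch below checks this in plain Python for the numbers used in this section. It only illustrates the idea and is not how pgmpy implements `check_model`; the helper name `columns_sum_to_one` is invented for this example.

```python
import math

# Same numbers as the CPTs above. Each inner list is one row (a state of the
# variable); each column is one combination of parent states.
cpts = {
    "Cloudy": [[0.5], [0.5]],
    "Sprinkler": [[0.5, 0.9], [0.5, 0.1]],
    "WetGrass": [[0.9, 0.2, 0.9, 0.01], [0.1, 0.8, 0.1, 0.99]],
}


def columns_sum_to_one(table, tol=1e-9):
    """Return True if every column of the table sums to 1 within tolerance."""
    return all(math.isclose(sum(col), 1.0, abs_tol=tol) for col in zip(*table))


for name, table in cpts.items():
    print(name, columns_sum_to_one(table))
```

A table whose columns do not sum to 1, for instance one with a mistyped entry, makes the helper return False.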
3. Parameter Learning
In real-world scenarios, we often need to learn the parameters (CPT values) of a Bayesian network from data.
3.1. Generating Sample Data
from pgmpy.inference import VariableElimination  # used in section 4

# Simulate data using the defined model
data = model.simulate(n_samples=1000)
data.head()
# Example output (rows differ between runs because sampling is random;
# states are 0 = False, 1 = True):
#    Cloudy  Sprinkler  WetGrass
# 0       0          1         1
# 1       0          1         1
# 2       1          0         0
# 3       1          1         1
# 4       1          0         0
# ...
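Conceptually, drawing a row from a Bayesian network can be done by forward (ancestral) sampling: sample each variable after its parents, using the CPT column selected by the parents' sampled values. The following is a minimal standard-library sketch of that idea with the CPT values from section 2.2 hard-coded. It is an illustration only, and pgmpy's `simulate` may use a different or more general procedure. The function name `sample_row` and the seed are arbitrary choices for this example.

```python
import random


def sample_row(rng):
    """Draw one (cloudy, sprinkler, wet_grass) row, parents before children."""
    # P(Cloudy=1) = 0.5
    cloudy = int(rng.random() < 0.5)
    # P(Sprinkler=1 | Cloudy)
    p_sprinkler = {0: 0.5, 1: 0.1}[cloudy]
    sprinkler = int(rng.random() < p_sprinkler)
    # P(WetGrass=1 | Cloudy, Sprinkler)
    p_wet = {(0, 0): 0.1, (0, 1): 0.8, (1, 0): 0.1, (1, 1): 0.99}
    wet_grass = int(rng.random() < p_wet[(cloudy, sprinkler)])
    return cloudy, sprinkler, wet_grass


rng = random.Random(0)  # fixed seed so repeated runs give the same rows
rows = [sample_row(rng) for _ in range(1000)]
cloudy_rate = sum(r[0] for r in rows) / len(rows)
print(rows[:5])
print(cloudy_rate)
```

With 1,000 rows the fraction of cloudy samples should land close to 0.5, the value specified for P(Cloudy=1).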
3.2. Learning Parameters from Data
from pgmpy.estimators import MaximumLikelihoodEstimator

# Create a Maximum Likelihood Estimator
estimator = MaximumLikelihoodEstimator(model, data)

# Learn the parameters from the data
learned_cpds = estimator.get_parameters()
for cpd in learned_cpds:
    print(cpd)

# Output: one table per variable. The exact numbers depend on the random
# sample, but with 1,000 rows the estimates should be close to the CPTs
# specified in section 2.2 (state 1 = True), for example:
#   P(Cloudy=1)                           ≈ 0.5
#   P(Sprinkler=1 | Cloudy=0)             ≈ 0.5
#   P(Sprinkler=1 | Cloudy=1)             ≈ 0.1
#   P(WetGrass=1 | Cloudy=0, Sprinkler=0) ≈ 0.1
#   P(WetGrass=1 | Cloudy=0, Sprinkler=1) ≈ 0.8
#   P(WetGrass=1 | Cloudy=1, Sprinkler=0) ≈ 0.1
#   P(WetGrass=1 | Cloudy=1, Sprinkler=1) ≈ 0.99
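For discrete variables with fully observed data, maximum likelihood estimation reduces to counting: each CPT entry is the relative frequency of a child state among the rows that share a given parent configuration. The sketch below shows this for P(Sprinkler | Cloudy) on a tiny hand-made dataset. The eight rows are invented purely for illustration, and the helper name is hypothetical.

```python
from collections import Counter

# Tiny hand-made dataset of (cloudy, sprinkler) pairs, for illustration only.
rows = [(0, 0), (0, 1), (0, 1), (0, 0), (1, 0), (1, 0), (1, 0), (1, 1)]

pair_counts = Counter(rows)                   # counts of (cloudy, sprinkler)
cloudy_counts = Counter(c for c, _ in rows)   # counts of cloudy alone


def mle_sprinkler_given_cloudy(sprinkler, cloudy):
    """Relative frequency of `sprinkler` among rows with the given `cloudy`."""
    return pair_counts[(cloudy, sprinkler)] / cloudy_counts[cloudy]


print(mle_sprinkler_given_cloudy(1, 0))  # 2 of the 4 Cloudy=0 rows -> 0.5
print(mle_sprinkler_given_cloudy(1, 1))  # 1 of the 4 Cloudy=1 rows -> 0.25
```

With so few rows the estimates are noisy, which is why the tutorial samples 1,000 rows before fitting.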
4. Inference
Once our Bayesian network is defined and parameterized, we can use it for inference, answering queries like “what is the probability of wet grass given it is cloudy?”
4.1. Variable Elimination
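from pgmpy.inference import VariableElimination

inference = VariableElimination(model)

# Query for P(WetGrass | Cloudy=1), where state 1 means True
probability = inference.query(variables=['WetGrass'], evidence={'Cloudy': 1})
print(probability)
# Output (layout varies slightly between pgmpy versions):
# +-------------+-----------------+
# | WetGrass    |   phi(WetGrass) |
# +=============+=================+
# | WetGrass(0) |          0.8110 |
# +-------------+-----------------+
# | WetGrass(1) |          0.1890 |
# +-------------+-----------------+
Because model still holds the CPTs specified in section 2.2 (the learned CPTs were never added to it), the probability of the grass being wet given that it is cloudy is 0.9 × 0.1 + 0.1 × 0.99 = 0.189, or roughly 0.19. Cloudy weather makes the sprinkler unlikely to be on, and in this model the sprinkler is the main cause of wet grass.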
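For a network this small, the query can be cross-checked by brute-force enumeration of the joint distribution, using the CPT values from section 2.2 with state 1 meaning True. This is a plain-Python sketch for checking the arithmetic, not a description of how variable elimination works internally; the names `joint` and `p_wet_given_cloudy` are illustrative.

```python
# CPT entries copied from section 2.2 (state 1 = True, state 0 = False).
p_cloudy = {0: 0.5, 1: 0.5}
p_sprinkler_on = {0: 0.5, 1: 0.1}  # P(Sprinkler=1 | Cloudy)
p_wet = {(0, 0): 0.1, (0, 1): 0.8, (1, 0): 0.1, (1, 1): 0.99}  # P(WetGrass=1 | C, S)


def joint(c, s, w):
    """P(Cloudy=c, Sprinkler=s, WetGrass=w) via the chain-rule factorization."""
    ps = p_sprinkler_on[c] if s == 1 else 1 - p_sprinkler_on[c]
    pw = p_wet[(c, s)] if w == 1 else 1 - p_wet[(c, s)]
    return p_cloudy[c] * ps * pw


def p_wet_given_cloudy(c):
    """P(WetGrass=1 | Cloudy=c), summing Sprinkler out of the joint."""
    numerator = sum(joint(c, s, 1) for s in (0, 1))
    denominator = sum(joint(c, s, w) for s in (0, 1) for w in (0, 1))
    return numerator / denominator


print(round(p_wet_given_cloudy(1), 3))  # 0.189
```

Enumeration grows exponentially with the number of variables, which is why algorithms such as variable elimination are used for larger networks.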
5. Conclusion
This tutorial provided a basic introduction to creating and learning Bayesian networks in Python using pgmpy. The library offers a wide range of functionalities for defining, manipulating, learning, and performing inference on these powerful probabilistic models.