Probability Distribution through Python Code: Data Science in Experiment


Data science is one of the fastest-growing domains and most sought-after career paths. It applies scientific methods, statistics, and computer science algorithms to extract facts and insights from many kinds of datasets. It is a proven tool for predicting user requirements, generating organizational insights, analyzing operational costs, and building analytical visualizations. Among its many techniques, probability distributions play a vital role in data analysis. This article walks through the most common types of probability distributions, along with the Python programs data analysts use to explore them on large datasets.

Probability Distribution in Python

A probability distribution is a statistical function that describes how likely each possible value of a random variable is. It covers every value the random variable can take within a range, and that range is bounded by the minimum and maximum possible values.

Several summary statistics characterize a distribution; the mean (average), standard deviation, and skewness are the most prominent. Probability distributions help data analysts identify patterns in large datasets, and so play a crucial role in deciding which data to consider from a large cluster of semi-structured and unstructured data. In Python, density functions and distribution routines make it possible to plot data, analyze it visually, and extract insights from it.

General Properties of Probability Distributions

A probability distribution describes how likely each outcome of a random variable is. Mathematically, for a specific value x, p(x) gives the likelihood that the random variable takes that value. Probability distributions share the general properties listed below:

· The probabilities of all possible values add up to 1.

· The probability of any particular value, or of any range of values, must lie between 0 and 1.

· A probability distribution shows how the values are dispersed. Accordingly, the type of variable (discrete or continuous) determines the kind of probability distribution that applies.
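The first two properties are easy to check numerically. Here is a minimal sketch using only the standard library; the fair six-sided die and the name die_pmf are illustrative assumptions, not something taken from a real dataset.

```python
from fractions import Fraction

# Hypothetical example: probability mass function of a fair six-sided die
die_pmf = {face: Fraction(1, 6) for face in range(1, 7)}

# Property 1: the probabilities of all possible values add up to 1
total = sum(die_pmf.values())

# Property 2: every individual probability lies between 0 and 1
all_valid = all(0 <= prob <= 1 for prob in die_pmf.values())

# The probability of a range of values (rolling a 5 or a 6) also lies in [0, 1]
p_high = die_pmf[5] + die_pmf[6]

print(total, all_valid, p_high)
```

Exact fractions are used here so that the sum is exactly 1 rather than a floating-point approximation.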

List of some well-known Probability distributions used in Data Science

Here is a list of the popular types of probability distributions, each explained with Python code, that every data science aspirant should know. (Use Jupyter Notebook to practice them.)

· Bernoulli Distribution: It is one of the simplest and most common probability distributions. It is the special case of the binomial distribution where n = 1: a binomial distribution counts successes over 'n' independent trials, whereas the Bernoulli distribution describes a single trial. Each such trial, known as a Bernoulli trial, is a random experiment with exactly two outcomes (either a failure or a success). The event occurs with probability 'p' and does not occur with probability '1-p'.

Program –

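import matplotlib.pyplot as mp
import seaborn as sb
from scipy.stats import bernoulli

def bernoulliDist():
    # Draw 860 samples from a Bernoulli distribution with success probability p = 0.6
    bernoulli_data = bernoulli.rvs(size = 860, p = 0.6)
    # histplot and kdeplot stand in for distplot, which seaborn has deprecated
    aw = sb.histplot(bernoulli_data, discrete = True, stat = 'density', color = 'b', alpha = 1)
    sb.kdeplot(bernoulli_data, color = 'r', linewidth = 3, label = 'KDE', ax = aw)
    aw.set(xlabel = 'Bernoulli Values', ylabel = 'Frequency Distribution')
    aw.legend()
    mp.show()

bernoulliDist()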
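To see the numbers behind the plot without any plotting library, here is a small sketch using only the standard library. The helper names bernoulli_pmf and bernoulli_sample are hypothetical, and p = 0.6 is simply the value used in the program above.

```python
import random

def bernoulli_pmf(k: int, p: float) -> float:
    """P(X = k) for a single Bernoulli trial: p for k == 1, 1 - p for k == 0."""
    if k not in (0, 1):
        return 0.0
    return p if k == 1 else 1.0 - p

def bernoulli_sample(p: float, size: int, seed: int = 0) -> list:
    """Simulate `size` Bernoulli trials (1 = success, 0 = failure)."""
    rng = random.Random(seed)
    return [1 if rng.random() < p else 0 for _ in range(size)]

samples = bernoulli_sample(p=0.6, size=10_000)
# The observed share of successes should land close to p
success_rate = sum(samples) / len(samples)

print(bernoulli_pmf(1, 0.6), bernoulli_pmf(0, 0.6), success_rate)
```

With more trials the observed success rate drifts closer to p, which is what the two bars in the histogram reflect.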

· Normal Distribution: Also known as the Gaussian distribution, it is another popular probability distribution that is symmetric around the mean. It shows that data near the mean occur more frequently than data far from the mean. The program below uses the standard normal distribution, where mean = 0 and variance = 1; in general, the mean can take any value and the variance any finite positive value.

Program:

import numpy as np
import matplotlib.pyplot as mpl
from scipy.stats import norm

def normalDistri() -> None:
    fig, aw = mpl.subplots(1, 1)
    # Mean, variance, skewness and excess kurtosis of the standard normal
    mean, vari, skew, kurt = norm.stats(moments = 'mvsk')
    xx = np.linspace(norm.ppf(0.001), norm.ppf(0.95), 90)
    # Density (pdf) and cumulative distribution (cdf) over the same grid
    aw.plot(xx, norm.pdf(xx), 'y-', lw = 5, alpha = 0.6, label = 'norm pdf')
    aw.plot(xx, norm.cdf(xx), 'g-', lw = 5, alpha = 0.6, label = 'norm cdf')
    # Sanity check: cdf undoes ppf
    vals = norm.ppf([0.001, 0.5, 0.999])
    assert np.allclose([0.001, 0.5, 0.999], norm.cdf(vals))
    # Histogram of 2000 random draws; density = True replaces the removed normed = True
    r = norm.rvs(size = 2000)
    aw.hist(r, density = True, histtype = 'stepfilled', alpha = 0.2)
    aw.legend(loc = 'best', frameon = False)
    mpl.show()

normalDistri()
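The symmetry and the concentration of data near the mean can also be checked with plain numbers. This is a minimal sketch using NormalDist from Python's standard statistics module; the variable names are illustrative only.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1
std_normal = NormalDist(mu=0.0, sigma=1.0)

# Symmetry: the density at -x equals the density at +x
symmetric = abs(std_normal.pdf(-1.5) - std_normal.pdf(1.5)) < 1e-12

# Share of the probability mass within 1, 2 and 3 standard deviations of the mean
within = {k: std_normal.cdf(k) - std_normal.cdf(-k) for k in (1, 2, 3)}

print(symmetric)
for k, share in within.items():
    print(f"within {k} sigma: {share:.4f}")
```

The three shares come out at roughly 68%, 95% and 99.7%, which is why values far from the mean are so rare.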

· Uniform Distribution (continuous): In this type of probability distribution, all outcomes within a given range are equally likely. Every value between the lower bound a and the upper bound b has the same probability density, 1/(b-a), so intervals of equal length inside the range are equally probable.

Program –

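import matplotlib.pyplot as mp
from numpy import random
import seaborn as sbrn

def contDist():
    # 1600 draws from the uniform distribution on [0, 1), shown as a density curve
    # (kdeplot stands in for the deprecated distplot(..., hist = False))
    sbrn.kdeplot(random.uniform(size = 1600))
    mp.show()

contDist()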
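The 1/(b-a) density can be written out directly. Here is a minimal sketch, assuming illustrative bounds a = 2 and b = 6; the function names uniform_pdf and uniform_cdf are hypothetical helpers.

```python
def uniform_pdf(x: float, a: float, b: float) -> float:
    """Density of the continuous uniform distribution on [a, b]."""
    return 1.0 / (b - a) if a <= x <= b else 0.0

def uniform_cdf(x: float, a: float, b: float) -> float:
    """P(X <= x) for the continuous uniform distribution on [a, b]."""
    if x <= a:
        return 0.0
    if x >= b:
        return 1.0
    return (x - a) / (b - a)

a, b = 2.0, 6.0
density = uniform_pdf(3.0, a, b)

# Two intervals of equal length inside [a, b] carry the same probability
p_first = uniform_cdf(4.0, a, b) - uniform_cdf(3.0, a, b)
p_second = uniform_cdf(6.0, a, b) - uniform_cdf(5.0, a, b)

print(density, p_first, p_second)
```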

· Log-normal distribution: It is a continuous distribution of a variable whose logarithm is normally distributed. Programmers and statisticians can therefore transform log-normal data into normally distributed data by taking the logarithm.

Program –

import numpy as np
import matplotlib.pyplot as mp

def lognormDistri():
    mue, sigma = 8, 1
    s = np.random.lognormal(mue, sigma, 1000)
    # density = True replaces normed = True, which newer matplotlib releases removed
    cnt, bins, ignored = mp.hist(s, 85, density = True, align = 'mid', color = 'r')
    xx = np.linspace(min(bins), max(bins), 10000)
    # Theoretical log-normal density, for comparison with the histogram
    calc = (np.exp(-(np.log(xx) - mue) ** 2 / (2 * sigma ** 2))
            / (xx * sigma * np.sqrt(2 * np.pi)))
    mp.plot(xx, calc, linewidth = 3.0, color = 'g')
    mp.axis('tight')
    mp.show()

lognormDistri()
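The link between the log-normal and the normal distribution can be checked by taking logarithms of simulated data. This is a rough sketch using only the standard library; the seed and sample size are arbitrary assumptions, while mue = 8 and sigma = 1 mirror the program above.

```python
import math
import random
import statistics

rng = random.Random(42)          # fixed seed so the run is repeatable
mue, sigma = 8, 1

# Draw log-normal samples, then take the natural log of each one
samples = [rng.lognormvariate(mue, sigma) for _ in range(20_000)]
logs = [math.log(s) for s in samples]

# If the samples are log-normal, their logs should look normal(mue, sigma)
log_mean = statistics.fmean(logs)
log_std = statistics.stdev(logs)

print(round(log_mean, 2), round(log_std, 2))
```

The mean and standard deviation of the logged values should sit close to mue and sigma, which is the sense in which log-normal data can be turned back into normal data.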

· Binomial Distribution: It is one of the most well-known distributions and gives the likelihood of 'x' successes in 'n' trials. It is commonly used when data analysts want the probability of SUCCESS or FAILURE across repeated runs of an experiment, dataset, or survey. A binomial distribution assumes a fixed number of trials; the trials must be independent, and the chance of success or failure must remain the same in every trial.

Program –

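from numpy import random
import matplotlib.pyplot as mp
import seaborn as sbrn

def binoDist():
    # Overlay a normal sample and a binomial sample for visual comparison
    # (histplot stands in for the deprecated distplot)
    sbrn.histplot(random.normal(loc = 50, scale = 6, size = 1400), kde = True, stat = 'density', label = 'normal dist')
    sbrn.histplot(random.binomial(n = 100, p = 0.6, size = 1400), kde = True, stat = 'density', label = 'binomial dist')
    mp.legend()
    mp.show()

binoDist()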
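The probability of exactly 'x' successes in 'n' trials follows the standard binomial formula C(n, x) * p^x * (1-p)^(n-x). Here is a minimal sketch of it using only the standard library; binomial_pmf is a hypothetical helper name, and n = 100, p = 0.6 are the values from the program above.

```python
from math import comb

def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(exactly k successes in n independent trials with success probability p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 100, 0.6
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

# The probabilities over all possible success counts add up to 1
total = sum(pmf)

# Mean and variance computed from the pmf match n*p and n*p*(1-p)
mean = sum(k * pk for k, pk in enumerate(pmf))
variance = sum((k - mean) ** 2 * pk for k, pk in enumerate(pmf))

print(round(total, 6), round(mean, 6), round(variance, 6))
```

For these parameters the mean is n*p = 60 and the variance is n*p*(1-p) = 24, so the binomial histogram is centred near 60.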

· Pareto Distribution: It is a continuous distribution defined by a shape parameter, α, and a scale parameter, x_m, which is the minimum possible value. It is a skewed distribution used for modeling the distribution of incomes and/or city populations. It follows a power law and is used to describe quality control, social, experimental, actuarial, and other observable phenomena. Compared with most other distributions, it places noticeable probability on very large outcomes (a heavy tail).

Program –

import numpy as np
from matplotlib import pyplot as mp
from scipy.stats import pareto

def paretoDistri():
    xm = 1.4            # scale: the minimum possible value
    alph = [3, 6, 14]   # shape parameters to compare
    xx = np.linspace(0, 3, 700)
    # One pdf curve per shape parameter; the density is 0 for x < xm
    output = np.array([pareto.pdf(xx, scale = xm, b = aa) for aa in alph])
    mp.plot(xx, output.T)
    mp.show()

paretoDistri()
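The power law behind those curves can be written in a couple of lines. This sketch assumes the usual Pareto (Type I) density α * x_m^α / x^(α+1) for x ≥ x_m; the function names are hypothetical, and xm = 1.4 with α = 3 are taken from the program above.

```python
def pareto_pdf(x: float, alpha: float, xm: float) -> float:
    """Pareto (Type I) density: alpha * xm**alpha / x**(alpha + 1) for x >= xm."""
    return alpha * xm**alpha / x ** (alpha + 1) if x >= xm else 0.0

def pareto_survival(x: float, alpha: float, xm: float) -> float:
    """P(X > x): the tail probability, which decays as a power of x."""
    return (xm / x) ** alpha if x >= xm else 1.0

xm, alpha = 1.4, 3

peak = pareto_pdf(xm, alpha, xm)            # density is highest at the minimum value
tail = pareto_survival(2 * xm, alpha, xm)   # chance of exceeding twice the minimum
below = pareto_pdf(1.0, alpha, xm)          # no probability below xm

print(peak, tail, below)
```

Doubling x divides the tail probability by 2^α, so a smaller α means a heavier tail.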

· Geometric Distribution: The geometric distribution is a special case of the negative binomial distribution that deals with the number of trials needed to get a single success. It gives the probability that an event with likelihood 'p' first occurs on the n-th Bernoulli trial. Here 'n' is a discrete random variable, and the experiment repeats until the first success.

Program –

import matplotlib.pyplot as mpl

def probability_to_occur_at(attempt, probability):
    # P(first success on this attempt) = (1 - p)^(attempt - 1) * p
    return (1 - probability) ** (attempt - 1) * probability

p = 0.3
attempt = 4
attempts_to_show = list(range(1, 21))   # attempts 1 to 20

print(f'Possibility that this event will occur on try {attempt}: ', probability_to_occur_at(attempt, p))

mpl.xlabel('Number of Trials')
mpl.ylabel('Probability of the Event')
barlist = mpl.bar(attempts_to_show, height = [probability_to_occur_at(x, p) for x in attempts_to_show], tick_label = attempts_to_show)
# Bars are 0-indexed, so the bar for the chosen attempt sits at index attempt - 1
barlist[attempt - 1].set_color('g')
mpl.show()
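Two follow-up quantities are often useful here: the chance of succeeding within the first n trials, and the average number of trials needed. A minimal sketch using only the standard library, with hypothetical helper names and the same p = 0.3 as above:

```python
def geometric_pmf(n: int, p: float) -> float:
    """P(first success happens exactly on trial n)."""
    return (1 - p) ** (n - 1) * p

def geometric_cdf(n: int, p: float) -> float:
    """P(first success happens on or before trial n) = 1 - (1 - p)**n."""
    return 1 - (1 - p) ** n

p = 0.3

# Adding up the pmf over trials 1..4 matches the closed-form cdf
within_four = sum(geometric_pmf(k, p) for k in range(1, 5))

# The average number of trials until the first success approaches 1 / p
expected_trials = sum(k * geometric_pmf(k, p) for k in range(1, 501))

print(round(within_four, 4), round(geometric_cdf(4, p), 4), round(expected_trials, 4))
```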

· Exponential Distribution: It is the probability distribution of the time between events. It applies to a process in which events occur continuously and independently at a constant average rate; in other words, it describes the time elapsed between events in a Poisson process.

Program –

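from numpy import random
import matplotlib.pyplot as mp
import seaborn as sbrn

def expoDistri():
    # 1400 draws from the exponential distribution (default scale = 1.0), shown as a density curve
    # (kdeplot stands in for the deprecated distplot(..., hist = False))
    sbrn.kdeplot(random.exponential(size = 1400))
    mp.show()

expoDistri()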
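A constant average rate has a notable consequence: the distribution is memoryless, meaning the time already waited does not change how much longer the wait is likely to be. Here is a small sketch of that property; the rate of 1.0 and the waiting times s and t are illustrative assumptions.

```python
import math

def exp_survival(x: float, rate: float) -> float:
    """P(X > x) for an exponential distribution with the given rate."""
    return math.exp(-rate * x)

def exp_cdf(x: float, rate: float) -> float:
    """P(X <= x) = 1 - exp(-rate * x)."""
    return 1.0 - exp_survival(x, rate)

rate = 1.0
s, t = 2.0, 1.5

# Memorylessness: P(X > s + t | X > s) equals P(X > t)
conditional = exp_survival(s + t, rate) / exp_survival(s, rate)
unconditional = exp_survival(t, rate)

# Half of all waiting times are shorter than ln(2) / rate (the median)
median_prob = exp_cdf(math.log(2) / rate, rate)

print(round(conditional, 6), round(unconditional, 6), round(median_prob, 6))
```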

· Poisson Distribution: It is one of the well-accepted forms of discrete distribution and gives the number of times an event is likely to happen in a particular time frame. It can be derived as the limit of the binomial distribution when the number of Bernoulli trials grows large while the expected number of successes stays fixed. Data analysts use the Poisson distribution to model independent events that happen within a fixed time interval at a constant average rate.

Program:

from scipy.stats import poisson
import seaborn as sbrn
import matplotlib.pyplot as mp

def poissonDistri():
    mp.figure(figsize = (8, 8))
    # 4600 event counts drawn from a Poisson distribution with an average of 4 events
    data_poisson = poisson.rvs(mu = 4, size = 4600)
    # discrete = True gives one bar per integer count; histplot and kdeplot stand in for the deprecated distplot
    ae = sbrn.histplot(data_poisson, discrete = True, stat = 'density', color = 'b')
    sbrn.kdeplot(data_poisson, color = 'g', linewidth = 4, label = 'KDE', ax = ae)
    ae.set(xlabel = 'Poisson Distributed Data', ylabel = 'Frequency of Data')
    ae.legend()
    mp.show()

poissonDistri()
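The connection to Bernoulli trials can be made concrete by comparing the Poisson formula e^(-μ) * μ^k / k! with a binomial distribution that has many trials and a small success probability. This is a minimal sketch using only the standard library; the helper names and the choice of n = 10,000 trials are assumptions, and mu = 4 matches the program above.

```python
from math import comb, exp, factorial

def poisson_pmf(k: int, mu: float) -> float:
    """P(exactly k events) when events occur at an average of mu per interval."""
    return exp(-mu) * mu**k / factorial(k)

def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(exactly k successes in n independent Bernoulli trials)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

mu = 4
n = 10_000   # many trials, each with a small success probability mu / n

# Largest difference between the two pmfs over the counts 0..20
max_gap = max(abs(poisson_pmf(k, mu) - binomial_pmf(k, n, mu / n)) for k in range(21))

# The Poisson probabilities add up to 1 (counts above 60 are negligible for mu = 4)
total = sum(poisson_pmf(k, mu) for k in range(61))

print(round(poisson_pmf(4, mu), 4), max_gap < 1e-3, round(total, 6))
```

Increasing n while keeping n * p equal to mu shrinks the gap further.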

Let’s Wrap Up –

Although each of these distributions has its own significance and use, the most popular are the Binomial, Poisson, Bernoulli, and Normal distributions. Today, enterprises and firms hire data science professionals across many departments and sectors, including engineering verticals, insurance, healthcare, arts & design, and even social science, where probability distributions act as a core tool for filtering data from a large dataset and turning it into valuable insight. Therefore, every data science professional and data analyst should know how to use them.


Karlos G. Ray [Masters | BS-Cyber-Sec | MIT | LPU]

I’m the CTO at Keychron :: Technical Content Writer, Cyber-Sec Enggr, Programmer, Book Author (2x), Research-Scholar, Storyteller :: Love to predict Tech-Future