from entropy to cross entropy the math behind

September 13, 2024


The reason for this article is that, while doing background reading for one of my research projects on fitting machine learning models to unbalanced data, I found no discussion of how cross entropy got its name.

To answer this, we have to dig into the math behind it and start from the definition of the entropy in cross entropy.

from entropy to cross entropy

In information theory, given a discrete random variable $x$, which takes values in the set (or space) $\mathcal{X}$ and is distributed according to $p(x) \in [0,1]$ with $\sum_{x \in \mathcal{X}} p(x) = 1$, the entropy is:

$$H(\mathcal{X}) = -\sum_{x \in \mathcal{X}} p(x)\log p(x)$$

The choice of base for the logarithm varies between applications. Base 2 gives the unit of bits (or "shannons"), base $e$ gives natural units ("nats"), and base 10 gives units of "dits", "bans", or "hartleys". An equivalent definition of entropy is the expected value of the self-information of a variable.

Simply put, the choice of base fixes the unit used to encode outcomes of the variable, and the entropy, for the given base, is the average number of such units (bits, nats, etc.) needed to encode an outcome drawn from the distribution.
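As a minimal sketch in plain Python (no external libraries; the helper name `entropy` is my own), the base only changes the unit in which the same quantity is measured:

```python
import math

def entropy(p, base=2.0):
    """Entropy of a discrete distribution given as a list of probabilities."""
    return -sum(pi * math.log(pi, base) for pi in p if pi > 0)

# A fair coin costs exactly 1 bit per outcome on average;
# the same quantity in nats is log(2) ~= 0.693.
fair_coin = [0.5, 0.5]
print(entropy(fair_coin, base=2))       # 1.0 bit
print(entropy(fair_coin, base=math.e))  # ~0.693 nats

# A biased coin is more predictable, so its entropy is lower.
print(entropy([0.9, 0.1], base=2))      # ~0.469 bits
```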

Given that $p(x)$ is the actual probability mass function (PMF) and $q(x)$ is the corresponding predicted PMF, the cross entropy is:

$$H(p, q) = -\sum_{x \in \mathcal{X}} p(x)\log q(x)$$

In information theory, the cross-entropy between two probability distributions p and q, over the same underlying set of events, measures the average number of bits needed to identify an event drawn from the set when the coding scheme used for the set is optimized for an estimated probability distribution q, rather than the true distribution p.
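Continuing the sketch above (the distributions `p` and `q` and the helper `cross_entropy` are illustrative, not from any particular library):

```python
def cross_entropy(p, q, base=2.0):
    """Cross entropy H(p, q): average code length, in the chosen unit,
    when outcomes follow p but the code is optimized for q."""
    return -sum(pi * math.log(qi, base) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]   # true distribution
q = [0.4, 0.4, 0.2]   # estimated distribution used to build the code

print(entropy(p))           # ~1.157 bits: the optimal average code length
print(cross_entropy(p, q))  # ~1.422 bits: longer, because the code assumes q
print(cross_entropy(p, p))  # equals the entropy when the estimate is exact
```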

Kullback–Leibler divergence and cross-entropy minimization

The definition of cross entropy may be formulated using the Kullback–Leibler divergence $D_{\mathrm{KL}}(p \parallel q)$, the divergence of p from q (also known as the relative entropy of p with respect to q):

$$H(p, q) = H(p) + D_{\mathrm{KL}}(p \parallel q)$$

In mathematical statistics, the Kullback–Leibler (KL) divergence is a type of statistical distance: a measure of how one probability distribution P differs from a second, reference probability distribution Q. For discrete probability distributions P and Q defined on the same sample space $\mathcal{X}$, it is defined as:

$$D_{\mathrm{KL}}(p \parallel q) = \sum_{x \in \mathcal{X}} p(x)\log\frac{p(x)}{q(x)} = -\sum_{x \in \mathcal{X}} p(x)\log\frac{q(x)}{p(x)}$$

Cross-entropy minimization is frequently used in optimization and rare-event probability estimation. When comparing a distribution q against a fixed reference distribution p, cross entropy and KL divergence are identical up to an additive constant (since p is fixed). According to Gibbs' inequality, both take on their minimal value when p = q: 0 for the KL divergence, and $H(p)$ for the cross entropy.
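A quick numerical check of this decomposition, reusing the illustrative helpers from the sketches above:

```python
def kl_divergence(p, q, base=2.0):
    """KL divergence D_KL(p || q) for discrete distributions
    (assumes q > 0 wherever p > 0)."""
    return sum(pi * math.log(pi / qi, base) for pi, qi in zip(p, q) if pi > 0)

# The decomposition H(p, q) = H(p) + D_KL(p || q), checked numerically.
lhs = cross_entropy(p, q)
rhs = entropy(p) + kl_divergence(p, q)
print(lhs, rhs)                # both ~1.422 bits
print(abs(lhs - rhs) < 1e-12)  # True, up to floating-point error
```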

proof of minimum function value by Gibbs' inequality

If I understand right (to quote the Stack Exchange question linked in the references), the general cross-entropy cost function can be written as:

$$H(p, q) = -\sum_{x \in \mathcal{X}} p(x)\log q(x)$$

where $p$ is the 'true' discrete PMF and $q$ is the predicted PMF for the current input. Is it easy to prove that $p(x) = q(x), \forall x \in \mathcal{X}$ minimizes the cost?

Suppose that $P = \{p_1, \ldots, p_n\}$ and $Q = \{q_1, \ldots, q_n\}$ are discrete probability distributions. Then

$$-\sum_{i=1}^{n} p_i \log p_i \leq -\sum_{i=1}^{n} p_i \log q_i$$

with equality if and only if $p_i = q_i, \forall i \in \{1, 2, \cdots, n\}$. Put into words, the information entropy of a distribution P is less than or equal to its cross entropy with any other distribution Q.

The difference between the two quantities is the Kullback–Leibler divergence or relative entropy, so the inequality can also be written:

$$D_{\mathrm{KL}}(P \parallel Q) \equiv \sum_{i=1}^{n} p_i \log\frac{p_i}{q_i} \geq 0.$$

Note that the use of base-2 logarithms is optional, and allows one to refer to the quantity on each side of the inequality as an "average surprisal" measured in bits.
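Since the heading above promises a proof, here is a short sketch of the standard argument, filled in by me following the Gibbs' inequality proof linked in the references. Using natural logarithms and the elementary bound $\ln x \leq x - 1$ (with equality only at $x = 1$), and summing over the indices $i$ with $p_i > 0$:

$$-D_{\mathrm{KL}}(P \parallel Q) = \sum_{i} p_i \ln\frac{q_i}{p_i} \leq \sum_{i} p_i \left(\frac{q_i}{p_i} - 1\right) = \sum_{i} q_i - \sum_{i} p_i \leq 1 - 1 = 0$$

so $D_{\mathrm{KL}}(P \parallel Q) \geq 0$, with equality exactly when $q_i = p_i$ for every $i$. Combined with $H(p, q) = H(p) + D_{\mathrm{KL}}(p \parallel q)$, this answers the question above: the cross-entropy cost is minimized at $q = p$, where it equals $H(p)$.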

asymmetry and alternatives

The Kullback–Leibler (KL) divergence is clearly asymmetric: if we swap p and q, the log ratio flips sign and the weighting factor changes from $p(x)$ to $q(x)$, so the two sums are in general not equal. Kullback & Leibler (1951) also considered the symmetrized function:

$$D_{\text{KL}}(P \parallel Q) + D_{\text{KL}}(Q \parallel P)$$

which they referred to as the "divergence", though today "KL divergence" refers to the asymmetric function. This symmetric, nonnegative function had already been defined and used by Harold Jeffreys in 1948, and is accordingly called the Jeffreys divergence. It has sometimes been used for feature selection in classification problems, where P and Q are the conditional pdfs of a feature under two different classes. In the banking and finance industries, this quantity is known as the Population Stability Index (PSI) and is used to assess distributional shifts in model features through time.
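A sketch of the symmetrized divergence with the same illustrative helpers as above (conventions for the logarithm base vary between uses; PSI, for instance, is typically computed with natural logs):

```python
def jeffreys_divergence(p, q, base=2.0):
    """Symmetrized KL divergence: D_KL(P || Q) + D_KL(Q || P)."""
    return kl_divergence(p, q, base) + kl_divergence(q, p, base)

# KL itself is asymmetric; the symmetrized version is not.
print(kl_divergence(p, q), kl_divergence(q, p))                # two different values
print(jeffreys_divergence(p, q) == jeffreys_divergence(q, p))  # True
```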

An alternative is given via the $\lambda$-divergence,

$$D_{\lambda}(P \parallel Q) = \lambda D_{\text{KL}}(P \parallel \lambda P + (1-\lambda)Q) + (1-\lambda) D_{\text{KL}}(Q \parallel \lambda P + (1-\lambda)Q),$$

which can be interpreted as the expected information gain about X from discovering which probability distribution X is drawn from, P or Q, if they currently have probabilities $\lambda$ and $1-\lambda$ respectively.

The value $\lambda = 0.5$ gives the Jensen–Shannon divergence, defined by

$$D_{\text{JS}} = \frac{1}{2} D_{\text{KL}}(P \parallel M) + \frac{1}{2} D_{\text{KL}}(Q \parallel M)$$

where M is the average of the two distributions,

$$M = \frac{1}{2}(P + Q).$$
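And a corresponding sketch of the Jensen–Shannon divergence, again reusing the illustrative `kl_divergence` helper and the example distributions from above:

```python
def js_divergence(p, q, base=2.0):
    """Jensen-Shannon divergence: symmetric, and bounded by 1 when base=2."""
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m, base) + 0.5 * kl_divergence(q, m, base)

print(js_divergence(p, q))                         # ~0.067 bits
print(js_divergence(p, q) == js_divergence(q, p))  # symmetric: True
```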

references

https://en.wikipedia.org/wiki/Entropy_(information_theory)
https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence
https://stats.stackexchange.com/questions/488032/minimum-value-of-cross-entropy-cost-function
https://en.wikipedia.org/wiki/Gibbs%27_inequality#Proof