Statistics and Machine Learning
In the world of search and affiliate marketing, a lot of data is generated. I mean a LOT. On the modern web, people are clicking on links, opening emails, logging in to sites, filling out forms, and seeing ads in their browsers; pixels are firing, tracking cookies are reporting in, and data gets generated and stored in a hundred other ways.
Most of this data is generally seen as worthless - or worth only its original purpose; e.g., a tracking pixel said that this person completed this form, so pay our affiliate. Some companies know better, and they mine this data for its clickthrough potential when developing targeting heuristics. These companies are able to charge crazy amounts for their campaigns, which translates into making loads of money.
So, how can you take your data and make it into something useful? My friend, the answer lies in statistics and the magic of machine learning.
Basic Probability
Say you have a list of names, cars, and zip codes:
| Name | Car | Zip |
|---|---|---|
| Bob | Honda | 95343 |
| Joe | Ford | 64023 |
| Sally | Toyota | 17648 |
| Bob | Honda | 95345 |
| Tim | Honda | 58384 |
| Bob | Honda | 58845 |
| Joe | Honda | 85438 |
| Bob | Ford | 76845 |
| Joe | Ford | 34782 |
| Bob | Toyota | 23114 |
| Tim | Honda | 57438 |
| Bob | Ford | 39824 |
(This data set is not that great for a few reasons: it's very small, and statistics work better on large data sets; and it's arbitrary, so it doesn't actually mean anything. But for our example it will be OK.)
So the first step is to understand how probabilities work and how they are calculated. To calculate the probability (notated as P) of any one data item, like P(name=Tim), you count how many Tims you find (2 matches), count how many rows in total were searched (12 rows), and divide.
P(name=Tim) = 2 / 12 = .166...
Note that a probability is always a number between 0 and 1.
So now that we know the probability that Tim occurred in this data set, if this data set were a good representation of all the data, then we could safely assume that Tims make up 17% of the world. By running the numbers, we would also be able to assume that 25% of people are Joe, and that 50% of cars are Hondas.
We also know something else: what percent are NOT Tim. This is found by taking 1 - P.
P(^name=Tim) = 1 - P(name=Tim) = .833...
Thus, 83% of people are not Tim.
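These counts are easy to reproduce in code. Here is a minimal sketch in Python; the `ROWS` table mirrors the one above, and the `COL` mapping and `p` helper are names I'm inventing for illustration, not any particular library's API:

```python
from fractions import Fraction

# The example table above, as (name, car, zip) rows.
ROWS = [
    ("Bob", "Honda", "95343"), ("Joe", "Ford", "64023"), ("Sally", "Toyota", "17648"),
    ("Bob", "Honda", "95345"), ("Tim", "Honda", "58384"), ("Bob", "Honda", "58845"),
    ("Joe", "Honda", "85438"), ("Bob", "Ford", "76845"), ("Joe", "Ford", "34782"),
    ("Bob", "Toyota", "23114"), ("Tim", "Honda", "57438"), ("Bob", "Ford", "39824"),
]
COL = {"name": 0, "car": 1, "zip": 2}

def p(col, val):
    """P(col=val): matching rows divided by total rows."""
    matches = sum(1 for r in ROWS if r[COL[col]] == val)
    return Fraction(matches, len(ROWS))

print(p("name", "Tim"))      # 1/6, i.e. .166...
print(1 - p("name", "Tim"))  # 5/6, i.e. .833...
```

Using `Fraction` instead of floats keeps the probabilities exact, which makes it easy to check them against the hand calculations.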
Probability Averages
So, taking into consideration our list above, we can now look at probability averages and use them to discover things about our data.
To discover the average probability for a column of data, we calculate the probability for each row, add them up, and divide by the number of rows.
P(name) = [ P(name=Bob) + P(name=Joe) + P(name=Sally) + P(name=Bob) +
            P(name=Tim) + P(name=Bob) + P(name=Joe) + P(name=Bob) +
            P(name=Joe) + P(name=Bob) + P(name=Tim) + P(name=Bob) ] / 12
        = ( 6 * 1/2 + 3 * 1/4 + 2 * 1/6 + 1 * 1/12 ) / 12
        = .3472...
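The same averaging can be sketched in code. This reuses the table layout from the earlier snippet (again, `ROWS` and the variable names are my own invention):

```python
from fractions import Fraction

# The example table as (name, car, zip) rows.
ROWS = [
    ("Bob", "Honda", "95343"), ("Joe", "Ford", "64023"), ("Sally", "Toyota", "17648"),
    ("Bob", "Honda", "95345"), ("Tim", "Honda", "58384"), ("Bob", "Honda", "58845"),
    ("Joe", "Honda", "85438"), ("Bob", "Ford", "76845"), ("Joe", "Ford", "34782"),
    ("Bob", "Toyota", "23114"), ("Tim", "Honda", "57438"), ("Bob", "Ford", "39824"),
]
n = len(ROWS)

# Count how often each name appears.
counts = {}
for name, _, _ in ROWS:
    counts[name] = counts.get(name, 0) + 1

# Average the per-row probabilities: sum P(name) for every row, divide by rows.
avg = sum(Fraction(counts[name], n) for name, _, _ in ROWS) / n
print(avg)  # 25/72, about .3472
```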
So the average probability is about .3472. Since there are only 4 names in the list, an even distribution (1/4 each) would give .25. The difference between these numbers shows that there is a very heavy trend for some value or values to be very high or low, amongst other things.
Independence
Some data elements have a dependence on other data elements, while others are nearly independent. Knowing how each element relates to the others is important for figuring out how to deal with data.
To figure out if there is a strong or weak dependency between two data elements within a data set, compare the product of the individual probabilities ( P(A)*P(B) ) to the joint probability ( P(A and B) ).
P(name=Bob)*P(car=Honda) = 1/2 * 1/2 = 1/4
P(name=Bob and car=Honda) = 3/12 = 1/4
This hints that it may not just be Bobs that like Hondas, but everyone.
P(name=Joe)*P(zip=64023) = 1/4 * 1/12 = 1/48
P(name=Joe and zip=64023) = 1/12
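Both comparisons can be reproduced with a short sketch, using the same hypothetical `ROWS` table and `p` helper as in the earlier snippets, plus a `joint` helper:

```python
from fractions import Fraction

# The example table as (name, car, zip) rows.
ROWS = [
    ("Bob", "Honda", "95343"), ("Joe", "Ford", "64023"), ("Sally", "Toyota", "17648"),
    ("Bob", "Honda", "95345"), ("Tim", "Honda", "58384"), ("Bob", "Honda", "58845"),
    ("Joe", "Honda", "85438"), ("Bob", "Ford", "76845"), ("Joe", "Ford", "34782"),
    ("Bob", "Toyota", "23114"), ("Tim", "Honda", "57438"), ("Bob", "Ford", "39824"),
]
COL = {"name": 0, "car": 1, "zip": 2}

def p(col, val):
    """P(col=val): matching rows divided by total rows."""
    return Fraction(sum(1 for r in ROWS if r[COL[col]] == val), len(ROWS))

def joint(col_a, val_a, col_b, val_b):
    """P(A and B): rows matching both conditions, divided by total rows."""
    hits = sum(1 for r in ROWS
               if r[COL[col_a]] == val_a and r[COL[col_b]] == val_b)
    return Fraction(hits, len(ROWS))

# Product close to the joint -> little evidence of dependence.
print(p("name", "Bob") * p("car", "Honda"))  # 1/4
print(joint("name", "Bob", "car", "Honda"))  # 1/4

# Joint much larger than the product -> likely dependence.
print(p("name", "Joe") * p("zip", "64023"))  # 1/48
print(joint("name", "Joe", "zip", "64023"))  # 1/12
```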
This hints that there is a relationship between Joe and the zip code 64023. If we take the average of these comparisons across our data set, then the overall degree of dependence can be determined.
Conditional Probability
Conditional probabilities describe relationships between data. So, let's say we wanted to find what portion of Bobs drove Hondas (represented as P(car=Honda|name=Bob), the probability of a car given a name). This is calculated by counting the number of Bobs with Hondas (3 matches), divided by the total number of Bobs (6 matches). The opposite principle also applies here, so the probability that Bob doesn't drive a Honda is 1 - P.
This calculation can also be done by dividing the joint probability by the prior probability for a value ( P(A|B) = P(A and B) / P(B) ).
P(car=Honda|name=Bob) = 3 / 6 = 1/2
or
P(car=Honda|name=Bob) = 1/4 / 1/2 = 1/2
and
P(^car=Honda|name=Bob) = 1 - 1/2 = 1/2
This tells us that 50% of Bobs have Hondas. Likewise, 33% have Fords and 17% have Toyotas. If the variables are independent, then the conditional probability is simply equal to the base probability; e.g., P(A) = P(A|B).
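The conditional calculation can be sketched directly from counts, again using the invented `ROWS` table from the earlier snippets:

```python
from fractions import Fraction

# The example table as (name, car, zip) rows.
ROWS = [
    ("Bob", "Honda", "95343"), ("Joe", "Ford", "64023"), ("Sally", "Toyota", "17648"),
    ("Bob", "Honda", "95345"), ("Tim", "Honda", "58384"), ("Bob", "Honda", "58845"),
    ("Joe", "Honda", "85438"), ("Bob", "Ford", "76845"), ("Joe", "Ford", "34782"),
    ("Bob", "Toyota", "23114"), ("Tim", "Honda", "57438"), ("Bob", "Ford", "39824"),
]
COL = {"name": 0, "car": 1, "zip": 2}

def count(pairs):
    """Number of rows matching every (column, value) pair."""
    return sum(1 for r in ROWS if all(r[COL[c]] == v for c, v in pairs))

def cond(col_a, val_a, col_b, val_b):
    """P(A|B) = P(A and B) / P(B), computed directly from counts."""
    return Fraction(count([(col_a, val_a), (col_b, val_b)]),
                    count([(col_b, val_b)]))

print(cond("car", "Honda", "name", "Bob"))      # 1/2
print(cond("car", "Ford", "name", "Bob"))       # 1/3
print(1 - cond("car", "Honda", "name", "Bob"))  # 1/2
```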
Using Probability
So, what use is all this information? It can be useful in several different ways.
Analytics
The primary traditional use for statistics is for analytics purposes. If you want analytics, I would suggest you hire a statistician.
Predictive
So we are going to try to sell Bob a car (or present Bob with an offer). If we take a look at what kinds of cars Bobs already have, then we can say 'Bobs tend to drive Hondas' and present him with that opportunity.
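That prediction amounts to picking the car with the highest conditional probability given the name. A sketch, with the same invented table and a `most_likely` helper of my own naming:

```python
from fractions import Fraction

# The example table as (name, car, zip) rows.
ROWS = [
    ("Bob", "Honda", "95343"), ("Joe", "Ford", "64023"), ("Sally", "Toyota", "17648"),
    ("Bob", "Honda", "95345"), ("Tim", "Honda", "58384"), ("Bob", "Honda", "58845"),
    ("Joe", "Honda", "85438"), ("Bob", "Ford", "76845"), ("Joe", "Ford", "34782"),
    ("Bob", "Toyota", "23114"), ("Tim", "Honda", "57438"), ("Bob", "Ford", "39824"),
]
COL = {"name": 0, "car": 1, "zip": 2}

def cond(col_a, val_a, col_b, val_b):
    """P(A|B): rows matching both A and B, divided by rows matching B."""
    both = sum(1 for r in ROWS
               if r[COL[col_a]] == val_a and r[COL[col_b]] == val_b)
    return Fraction(both, sum(1 for r in ROWS if r[COL[col_b]] == val_b))

def most_likely(target_col, given_col, given_val):
    # Score every value seen in the target column by P(target|given)
    # and pick the highest scorer.
    candidates = {r[COL[target_col]] for r in ROWS}
    return max(candidates,
               key=lambda v: cond(target_col, v, given_col, given_val))

print(most_likely("car", "name", "Bob"))  # Honda
```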
Classification
We see that someone named Bob has walked in and said he lives in zip code 95345. By comparing the probability that this is a particular known instance against the probability that it isn't, a heuristic can be developed to classify this person as a previously known person.
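One very simple way to turn that comparison into code is to check whether the observed name/zip combination co-occurs more often than independence would predict. This is only a toy heuristic I'm inventing for illustration, built on the same hypothetical `ROWS` table and helpers as the earlier snippets:

```python
from fractions import Fraction

# The example table as (name, car, zip) rows.
ROWS = [
    ("Bob", "Honda", "95343"), ("Joe", "Ford", "64023"), ("Sally", "Toyota", "17648"),
    ("Bob", "Honda", "95345"), ("Tim", "Honda", "58384"), ("Bob", "Honda", "58845"),
    ("Joe", "Honda", "85438"), ("Bob", "Ford", "76845"), ("Joe", "Ford", "34782"),
    ("Bob", "Toyota", "23114"), ("Tim", "Honda", "57438"), ("Bob", "Ford", "39824"),
]
COL = {"name": 0, "car": 1, "zip": 2}

def p(col, val):
    """P(col=val): matching rows divided by total rows."""
    return Fraction(sum(1 for r in ROWS if r[COL[col]] == val), len(ROWS))

def joint(col_a, val_a, col_b, val_b):
    """P(A and B): rows matching both conditions, divided by total rows."""
    hits = sum(1 for r in ROWS
               if r[COL[col_a]] == val_a and r[COL[col_b]] == val_b)
    return Fraction(hits, len(ROWS))

def looks_known(name, zipc):
    # Toy rule: the pair co-occurring more often than independence
    # predicts is taken as evidence of a previously seen person.
    return joint("name", name, "zip", zipc) > p("name", name) * p("zip", zipc)

print(looks_known("Bob", "95345"))    # True
print(looks_known("Sally", "95345"))  # False
```

A real system would weigh more attributes and use proper thresholds, but the core idea - compare the probability of "same person" against "different person" - is the same.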