Archive for November, 2007

IE Cookie Handling Policies

This is a summary of Internet Explorer settings for handling cookies, under the so-called “Privacy” options; IE6 and IE7 are the same, although some of the wording has changed in the descriptions. It’s important to keep these in mind when issuing cookies. The Wikipedia article on HTTP Cookies outlines some of the alternatives.

  • Block All Cookies
    Blocks all cookies from all web sites from being accepted, and won’t send any existing cookies. Should be renamed “Unusable”.
  • High
    Blocks all cookies from websites that do not carry a compact privacy policy (P3P; an example compact policy header appears below this list) and cookies that contain personally identifiable (contact) information. A November 2007 study shows that only about 4% of sites use P3P, so this security setting is almost as unusable as “Block All Cookies”.
  • Medium High
    Same as “High” for 3rd party cookies. Also blocks first party cookies that contain personally identifiable information.
  • Medium
    As above, but rather than “blocking” first party cookies that contain personally identifiable information, it only “restricts” them. Efforts to find the difference between “block” and “restrict” have so far been fruitless. It may mean that cookies are accepted but not sent (how useful would that be?), or that such cookies can only be used in the same web page that created them (i.e. a restriction on the domain/path components of the cookie), or that the cookies are not kept beyond the current session.
  • Low
    Same as “High” for 3rd party cookies. No restrictions on first party cookies.
  • Accept All Cookies
    No restrictions.
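
For reference, the “compact privacy policy” that the High/Medium settings look for is delivered as a P3P HTTP response header sent alongside the cookie. The snippet below is only an illustrative sketch; the policy tokens and the cookie value are made up for the example and do not describe any real site’s privacy practices:

  P3P: CP="CAO DSP COR CUR ADM DEV OUR BUS"
  Set-Cookie: session=abc123; path=/; domain=.example.com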

In contrast, Safari (Macintosh) allows three simple options: Accept Cookies Always, Never, or Only from sites you navigate to (i.e. Always / Never / First Party only). Firefox by default allows all cookies except where specific exceptions have been defined. There do not seem to be any Firefox extensions that emulate the IE or Safari behaviour, which perhaps puts into perspective the real threat that third party cookies are(n’t) in general.

References

P3P Usage Survey

Comments

Sphinx::Search 0.08 released to CPAN

I have just uploaded to CPAN the latest version of Sphinx::Search, the Perl API for the Sphinx Search Engine.

Search for Sphinx::Search on CPAN to get the latest.

Version 0.08 is suitable for Sphinx 0.9.8-svn-r871 and later (currently r909). This version fixes a couple of bugs related to error checking.

I have been asked a few times what makes Sphinx::Search different from the Perl API that comes bundled in the contrib directory of the Sphinx distribution. The bundled Sphinx.pm was used as the starting point of Sphinx::Search. Maintenance of that version appears to have lapsed at sphinx-0.9.7, so many of the newer API calls are not available there. Sphinx::Search is mostly compatible with the old Sphinx.pm except:

  • On failure, Sphinx::Search returns undef rather than 0 or -1.
  • Sphinx::Search ‘Set’ functions are cascadable, e.g. you can do
    Sphinx::Search->new
        ->SetMatchMode(SPH_MATCH_ALL)
        ->SetSortMode(SPH_SORT_RELEVANCE)
        ->Query("search terms")
    (a fuller sketch is given after this list).
  • Sphinx::Search also provides documentation and unit tests, which were the main motivations for branching from the earlier work.
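
To illustrate both points, here is a minimal sketch of a cascaded query with error checking. It assumes the usual Sphinx API methods (SetServer, GetLastError) and the exported match/sort constants; the index name, search terms and port are made up for the example:

  use Sphinx::Search;

  my $sphinx = Sphinx::Search->new;

  # Cascaded setup, then query; Query() returns undef on failure.
  my $results = $sphinx
      ->SetServer('localhost', 3312)
      ->SetMatchMode(SPH_MATCH_ALL)
      ->SetSortMode(SPH_SORT_RELEVANCE)
      ->Query('search terms', 'my_index');

  if (!defined $results) {
      die "Search failed: " . $sphinx->GetLastError . "\n";
  }
  printf "%d documents matched\n", $results->{total};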

Sphinx has proven to be a very efficient search engine, and of much better quality than the built-in MySQL full text search. It is an order of magnitude faster for large data sets and provides better options for controlling search result relevancy.

Comments (1)

Google Searches per Day

As of August 2007, Google is handling 1200 Million searches per day on average worldwide, according to a Clickz article reporting on Comscore data. Yahoo is a long way behind at 275 Million, and MSN at 70 Million. Baidu (a Chinese search engine) beats MSN, coming in at 105 Million.

2006 figures for the US only put Google at 91 Million searches per day.

In June 2007, Wikipedia received an average of 55.6 Million referrals per day from Google. Guessing that each Google search results in 2 click-throughs on average, that means Wikipedia is getting about 2% of Google’s organic traffic. Wikipedia is ranked #8 in Alexa’s traffic rankings, behind Yahoo, Google, YouTube, Live & MSN, Myspace and Facebook. Given that these probably get most of their traffic directly rather than via search engines, Wikipedia may be the most-referred site from Google.
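
The rough arithmetic behind that estimate, using the click-through guess above:

\frac{55.6\ \mathrm{M\ referrals/day}}{1200\ \mathrm{M\ searches/day} \times 2\ \mathrm{clicks/search}} \approx 2.3\%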

Alexa’s top 50 is dominated by search engines and various forms of social networking sites, with Ebay, Amazon and a few porn sites thrown in.

Google lists its most popular search terms on a daily basis in its Hot Trends area. Leading up to Thanksgiving in the US, about 10% of the top 100 search terms contained the word “turkey” – turkey cooking time, turkey recipe, how long to cook a turkey, roasting turkey, turkey thermometer, turkey soup, and more. About 40% of the most popular queries related to Thanksgiving in some way.

References:

Clickz article: Worldwide Internet: Now Serving 61 Billion Searches per Month

Alexa Top 500
Wikimedia page views

Search Engine Watch 2006 data

Comments

Unicode Character Classes

These are the Unicode “General Category” character class names used in regular expression matching, e.g. in Perl, \pP or \p{Punctuation} to match all Unicode characters having the “punctuation” property.
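
As a quick illustration of how these classes are used in Perl regular expressions (the sample string is made up for the example):

  #!/usr/bin/perl
  use strict;
  use warnings;
  use utf8;                     # the source below contains non-ASCII literals
  binmode STDOUT, ':utf8';

  my $text = "Voilà, 42 spöken!";

  # Keep only letters (\p{L}), numbers (\p{N}) and space separators (\p{Zs})
  (my $clean = $text) =~ s/[^\p{L}\p{N}\p{Zs}]//g;
  print "$clean\n";             # prints "Voilà 42 spöken"

  # Count characters carrying the "punctuation" property (the comma and "!")
  my $count = () = $text =~ /\p{Punctuation}/g;
  print "punctuation characters: $count\n";    # prints 2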

Expression Syntax Long Name Description
Letter :L Letter Matches any letter, Ll | Lm | Lo | Lt | Lu
Uppercase letter :Lu Uppercase_Letter Matches any one capital letter. For example, :Luhe matches “The” but not “the”.
Lowercase letter :Ll Lowercase_Letter Matches any one lower case letter. For example, :Llhe matches “the” but not “The”.
Title case letter :Lt Titlecase_Letter Matches characters that combine an uppercase letter with a lowercase letter, such as Nj and Dz.
Modifier letter :Lm Modifier_Letter Matches letters or punctuation, such as commas, cross accents, and double prime, used to indicate modifications to the preceding letter.
Other letter :Lo Other_Letter Matches other letters, such as gothic letter ahsa.
Cased letter :LC Cased_Letter Matches any letter with case, Ll | Lt | Lu
Mark :M Mark Matches any mark, Mc | Me | Mn
Non-spacing mark :Mn Nonspacing_Mark Matches non-spacing marks.
Combining mark :Mc Spacing_Mark Matches combining marks.
Enclosing mark :Me Enclosing_Mark Matches enclosing marks.
Number :N Number Matches any number, Nd | Nl | No
Decimal digit :Nd Decimal_Number Matches decimal digits such as 0-9 and their full-width equivalents.
Letter digit :Nl Letter_Number Matches letter digits such as roman numerals and ideographic number zero.
Other digit :No Other_Number Matches other digits such as old italic number one.
Punctuation :P Punctuation Matches any punctuation, Pc | Pd | Pe | Pf | Pi | Po | Ps
Connector punctuation :Pc Connector_Punctuation Matches the underscore or underline mark.
Dash punctuation :Pd Dash_Punctuation Matches the dash mark.
Open punctuation :Ps Open_Punctuation Matches opening punctuation such as open brackets and braces.
Close punctuation :Pe Close_Punctuation Matches closing punctuation such as closing brackets and braces.
Initial quote punctuation :Pi Initial_Punctuation Matches initial double quotation marks.
Final quote punctuation :Pf Final_Punctuation Matches single quotation marks and ending double quotation marks.
Other punctuation :Po Other_Punctuation Matches commas (,), ?, ", !, @, #, %, &, *, \, colons (:), semi-colons (;), ', and /.
Symbol :S Symbol Matches any symbol, Sc | Sk | Sm | So
Math symbol :Sm Math_Symbol Matches +, =, ~, |, <, and >.
Currency symbol :Sc Currency_Symbol Matches $ and other currency symbols.
Modifier symbol :Sk Modifier_Symbol Matches modifier symbols such as circumflex accent, grave accent, and macron.
Other symbol :So Other_Symbol Matches other symbols, such as the copyright sign, pilcrow sign, and the degree sign.
Separator :Z Separator Matches any separator, Zl | Zp | Zs
Paragraph separator :Zp Paragraph_Separator Matches the Unicode character U+2029.
Space separator :Zs Space_Separator Matches blanks.
Line separator :Zl Line_Separator Matches the Unicode character U+2028.
Other control :Cc Control Matches control characters, such as end of line.
Other format :Cf Format Formatting control character such as the bidirectional control characters.
Surrogate :Cs Surrogate Matches one half of a surrogate pair.
Other private-use :Co Private_Use Matches any character from the private-use area.
Other not assigned :Cn Unassigned Matches characters that do not map to a Unicode character.

References:

unicode.org

Unicode Character Properties

Unicode Regular Expressions

Unicode Property Aliases 

Perl Regular Expressions

PCRE

Comments

Using User Agent statistics to detect website conversion problems

If it’s not tested, it doesn’t work.

This has been my long-standing mantra. For any bit of software – be it a website or other program – if it hasn’t been tested against certain inputs, chances are there’s a bug in there that has never been uncovered. Browser nuances and incompatibilities are notorious for making sites look ugly, behave badly or simply not work at all. So unless you have tested your website against all of the major browsers, you can’t be sure that there isn’t a problem with the site causing some of your users to just go away in disgust.

It is impossible to test your website against every web browser out there…

In a recent test, more than 30,000 different “user agent” (browser) identifiers were found in a sample of one month of data from one site – and that excludes search engine crawlers and other Internet robots. Most, in fact, are Internet Explorer (IE) variants of some sort, with different combinations of plugins. Here is a typical identifier:

  Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; InfoPath.1; .NET CLR 2.0.50727)

‘Mozilla/4.0’ identifies the browser’s basic capabilities. MSIE 6.0 is the browser version. Windows NT 5.1 is the operating system (NT 5.1 is the internal version number for Windows XP). SV1 is “Security Version 1” as introduced in Windows XP Service Pack 2. InfoPath is a Microsoft Office plugin. .NET CLR … is the version of the .NET Common Language Runtime that the browser supports.

This is just one example; other browsers identify themselves as Mozilla/5.0, or MSIE 7.0, or Windows NT 5.0, or with other plugins/toolbars, or with multiple versions of .NET CLR, and so on. And then there are the other browsers – Firefox, Safari, Opera, Konqueror, and many more. There is a combinatorial explosion – and it is impossible to test every single possibility.

How much testing do you need to do?

Here is the mix of browser types from three different types of sites:

  • Site A: a B2B site in the electronics area, likely to be dominated by technically savvy male users, looking for products for immediate purchase
  • Site B: a general homewares and gadgets type site, representing a broad cross-section of home users
  • Site C: a women’s fashion site
Browser Site A Site B Site C
MSIE (Total) 75.208% 80.174% 76.683%
Firefox (Total) 21.204% 13.834% 17.859%
Safari (Total) 1.306% 4.913% 4.285%
Opera (Total) 1.347%   0.512%
Other 2.139% 2.366% 2.670%
 
MSIE 6.0 49.760% 40.461% 40.088%
MSIE 7.0 24.984% 39.374% 35.890%
Firefox 2.0 19.613% 12.792% 16.472%
Firefox 1.5 1.025% 0.607% 0.852%
Firefox 1.0 0.540%   0.511%
Safari 4xx 0.923% 3.894% 3.518%
Safari 3xx   0.507%  
Opera 9.2 1.016%    

Clearly, testing against IE 6.0, IE 7.0 and Firefox 2.0 is mandatory, these being the dominant browsers in the market. But what about Safari? It could be up to 5% of your market that you are losing if your website does not work well with Safari. And what about all of those variants – could your site test satisfactorily for IE 6.0 but be broken if the InfoPath plugin is added to the browser? Even considering the most common browser/platform/option combinations results in a substantial number of variants, and every extra platform required for testing is an additional expense – both to maintain the test suite and to run the tests on each update. Is there another way?

A Statistical Approach

Perhaps, by looking at the distribution of user agents for all the visitors on a website, and then looking at the same for those visitors that convert, it would be possible to identify any browser types that are not converting at the expected rate. This might indicate a problem with the site for that particular browser type.

This of course assumes that users are equally likely to convert regardless of which type of browser they are using. As it turns out, this assumption is completely wrong.
The following table compares browser strings at the website point of entry with browser strings at the point of transaction for site B and shows that the difference is significant.

Browser Visitors Transactions
MSIE 7.0 39.374% 45.657%
MSIE 6.0 40.461% 38.482%
Firefox 2.0 12.792% 9.207%
Safari 4xx 3.894% 3.726%

On Site B, users are generally spending their own money; one might speculate that Firefox users represent a younger (poorer) group of users and therefore convert at a lower rate, and perhaps IE 7.0 users are either cashed up and have just bought Vista, or are people who feel more comfortable transacting on the Internet because they know their computer is running the latest up-to-date software. The statistics for Site C show Firefox users converting at half the rate of other users; in contrast, for Site A (the B2B site where users are probably spending their employer’s money, and where browsers are governed by company IT policy) the conversion rates are fairly constant across all browser types.

However, the basic idea of detecting problems using statistics is sound, provided two points of comparison can be found where the assumption holds that the distribution of browsers is the same. For example, any two points within the https conversion funnel, or any two product pages, or a product page and a category page. Or, you could take one page and compare it against the statistics for all pages in a similar section of the site.

The analysis proceeds as follows:

  1. Take your two data sets – the HTTP request logs comprising at least the user agent strings – one data set for each URL to be compared, and calculate the frequency statistics for the user agents in each data set. Perl is an ideal tool for this analysis (a sketch of steps 1 and 2 follows this list). You will probably need about a month of data to get reliable results.
  2. Where the frequency in both data sets for the same user agent is less than about 10, add this to a user agent called “Other” and throw away the original data. Where the frequency falls below about 10, the assumptions of the chi-square test start to fail, and since this is a distribution with a very long tail, the tail errors can end up dominating the result.
  3. Perform a chi-square test to see if the distributions are different. If they are, keep going to find out where they are different.
  4. Place the resulting data in a spreadsheet – frequencies of each sample, expected values, maximum errors between expected and actual frequencies (in absolute and/or relative terms). Discard any samples where the expected values are too low to be worth considering.
  5. Sort the result by error descending. The browsers appearing at the top are causing the most trouble.
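
Here is a rough Perl sketch of steps 1 and 2. The input file names and layout (one user agent string per line, pre-extracted from the logs for each URL) are assumptions for the example; the chi-square step itself is covered in a separate post below:

  #!/usr/bin/perl
  # Build user agent frequency tables for two URLs and merge rare agents into "Other".
  use strict;
  use warnings;

  my %freq;    # $freq{$user_agent}[0] = count in set A, [1] = count in set B

  my @files = ('url_a_agents.txt', 'url_b_agents.txt');
  for my $set (0 .. 1) {
      open my $fh, '<', $files[$set] or die "$files[$set]: $!";
      while (my $ua = <$fh>) {
          chomp $ua;
          $freq{$ua}[$set]++;
      }
      close $fh;
  }

  # Fill in zero counts where a user agent appears in only one of the sets.
  for my $ua (keys %freq) {
      $freq{$ua}[0] ||= 0;
      $freq{$ua}[1] ||= 0;
  }

  # Merge user agents that are rare in both sets into a single "Other" bin.
  my @other = (0, 0);
  for my $ua (keys %freq) {
      if ($freq{$ua}[0] < 10 && $freq{$ua}[1] < 10) {
          $other[0] += $freq{$ua}[0];
          $other[1] += $freq{$ua}[1];
          delete $freq{$ua};
      }
  }
  $freq{'Other'} = \@other;

  # Frequency table, ready for the chi-square comparison.
  printf "%-60.60s %8d %8d\n", $_, @{ $freq{$_} } for sort keys %freq;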

In Action

This approach was applied to a tracking code implementation. A random sample of users received a cookie. The presence of the cookie was then tested on a particular page. The expectation was that the distribution of browsers should be the same with or without the cookie. The results, however, showed that the distributions were different, and the final analysis pointed to the Safari browser running on Mac as the problem, per the picture below. Tests against other sites using the same tracking code confirmed the result. Note, though, that the site wasn’t completely broken on Safari, as some users had managed to acquire the cookie. This is the type of problem that testing directly with a browser may or may not have picked up – depending on what settings the tester’s browser happened to be using.

Summary

Test, validate, test, validate, then validate some more. It is absolutely necessary to test your website with the more popular browsers, but not always sufficient. User agent statistics provide an additional test, potentially a stronger test and definitely a cheaper test compared to manual testing, for browser-specific problems.

References:

http://blogs.msdn.com/ie/archive/2004/09/02/224902.aspx

Comments

Testing whether two distributions are different

Use the chi-square test to test whether two distributions are different.

The chi-square test is

\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}

where:

O_i = observed data in bin i

E_i = expected data in bin i

The above can be used directly when comparing a set of observations with a known (expected) distribution. In this case the number of degrees of freedom is equal to the number of bins.

Given two sets of binned data A and B, the expected value in each bin of each set is its proportion of the total, i.e.:

E_i^A =  \frac{A_i + B_i}{N_A + N_B} N_A

E_i^B =  \frac{A_i + B_i}{N_A + N_B} N_B

where N_A is the total number of samples in set A, etc.

Thus the chi-square statistic is

\chi^2 = \sum_i \left[ \frac{(A_i - E_i^A)^2}{E_i^A} + \frac{(B_i - E_i^B)^2}{E_i^B} \right]

which can also be written:

\chi^2 = \sum_i \frac{(N_B A_i - N_A B_i)^2}{N_A N_B (A_i + B_i)}

If the total number of samples in each set is the same, i.e. N_A = N_B , then this simplifies down to:

\chi^2 = \sum_i \frac{(A_i - B_i)^2}{A_i + B_i}

The number of degrees of freedom is (number of bins – 1).

Testing against a significance level

Choose a confidence level and look up the inverse chi square cumulative distribution for the given number of degrees of freedom, e.g. at 95% confidence and 1 degree of freedom, the threshold is \chi^2_t = 3.84 . If \chi^2 > \chi^2_t , then it can be said with the given level of confidence that the distributions differ.

Since the chi square distribution strictly describes the sum of squares of normally distributed random variables, this test should only be used when there are enough samples for the normal approximation to hold. It will normally be acceptable so long as no more than 10% of the bins have expected frequencies below 5. Where there is only 1 degree of freedom, the approximation is not reliable if expected frequencies are below 10.
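
A minimal Perl sketch of the homogeneity test between two binned samples, using the general form of the statistic above and the Statistics::Distributions module listed below (the bin counts are made up for the example):

  use strict;
  use warnings;
  use Statistics::Distributions;

  # Bin counts for the two samples, in the same bin order.
  my @A = (120, 80, 40, 15);
  my @B = (100, 95, 30, 20);

  my ($n_a, $n_b) = (0, 0);
  $n_a += $_ for @A;
  $n_b += $_ for @B;

  # chi^2 = sum_i (N_B*A_i - N_A*B_i)^2 / (N_A*N_B*(A_i + B_i))
  my $chi2 = 0;
  for my $i (0 .. $#A) {
      my $sum = $A[$i] + $B[$i];
      next unless $sum;    # skip empty bins
      $chi2 += ($n_b * $A[$i] - $n_a * $B[$i])**2 / ($n_a * $n_b * $sum);
  }

  my $df = @A - 1;    # degrees of freedom = number of bins - 1

  # Critical value at 95% confidence, and probability of a chi^2 this large by chance.
  my $threshold = Statistics::Distributions::chisqrdistr($df, 0.05);
  my $p         = Statistics::Distributions::chisqrprob($df, $chi2);

  printf "chi^2 = %.3f (df = %d, p = %.4f, 95%% threshold = %.3f)\n",
      $chi2, $df, $p, $threshold;
  print $chi2 > $threshold
      ? "The distributions differ at the 95% confidence level\n"
      : "No significant difference detected\n";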

Code Pointers

Octave – chisquare_inv, chisquare_test_homogeneity

Perl – Statistics::Distributions

Spreadsheet – chiinv

References

Numerical Recipes

Chi Square Distribution

Chi Square Test

Comments (1)

Writing equations in Wordpress using LaTeX

First, you need the wp-latex and fauxml WordPress plugins.

The latex and dvipng programs must also be installed on the server.

Insert equations using $latex .. $

Example $latex E = mc^2 $

E = mc^2

Syntax Reference

Wordpress Reference

Comments

Determining whether two means come from the same distribution

Useful when: you have two sets of measurements, and want to know if there has been a shift in mean value.

Use Student’s t-test for significantly different means.

Sample variance for sample A:

s_A^2 = \frac{1}{N_A - 1}\sum_{i}(a_i - \bar{a})^2

Similarly for sample B. Then

t = \frac{\bar{a} - \bar{b}}{s_D}

where

s_D = \sqrt{ \frac{(N_A - 1)s_A^2 + (N_B - 1) s_B^2}{N_A + N_B - 2} \left(\frac{1}{N_A} + \frac{1}{N_B}\right) }

Finding the significance level

Use the t-distribution with N_A + N_B - 2 degrees of freedom to compute the significance level, which is the probability that |t| could be larger, by chance, for distributions of equal means. Thus a significance of 0.05 suggests that the means are different with 95% confidence.

Testing against a significance level

For e.g. a test with 95% confidence level, find the threshold value of t at 0.05 from the t-distribution. If the computed t exceeds the threshold, the means are considered different to that level of confidence.
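
A minimal Perl sketch of the whole calculation, using the formulas above and the Statistics::Distributions module for the significance (the sample data is made up for the example; the Statistics::TTest module listed below wraps the same steps):

  use strict;
  use warnings;
  use Statistics::Distributions;

  my @a = (10.2, 9.8, 10.5, 10.1, 9.9, 10.3);
  my @b = ( 9.6, 9.9,  9.5, 10.0, 9.4,  9.7);

  # Mean and sample variance of a list.
  sub mean_var {
      my @x    = @_;
      my $n    = @x;
      my $mean = 0;
      $mean += $_ / $n for @x;
      my $var = 0;
      $var += ($_ - $mean)**2 / ($n - 1) for @x;
      return ($mean, $var, $n);
  }

  my ($mean_a, $var_a, $n_a) = mean_var(@a);
  my ($mean_b, $var_b, $n_b) = mean_var(@b);

  # Pooled standard error s_D as defined above.
  my $s_d = sqrt( (($n_a - 1) * $var_a + ($n_b - 1) * $var_b)
                  / ($n_a + $n_b - 2)
                  * (1 / $n_a + 1 / $n_b) );

  my $t  = ($mean_a - $mean_b) / $s_d;
  my $df = $n_a + $n_b - 2;

  # Two-sided significance: probability that |t| could be this large by chance.
  my $p = 2 * Statistics::Distributions::tprob($df, abs($t));

  printf "t = %.3f, df = %d, significance = %.4f\n", $t, $df, $p;
  print $p < 0.05
      ? "Means differ at the 95% confidence level\n"
      : "No significant difference in means detected\n";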

Code pointers

Octave – t_test

Perl – Statistics::TTest

Spreadsheet – TTEST, TDIST, TINV

References:

http://en.wikipedia.org/wiki/Student%27s_t-test

Numerical Recipes

Comments