Magenic Technologies Community Blog

Aaron's Technology Musings

Who let this guy on the podium?

One step closer to F# for Business Intelligence

I have been beating this drum for some time now, but the latest post from Don Syme confirms the trend.

Not only is VS 2010 going to include F# right in the box, but the scenarios it is targeting are, according to Don Syme:

“In this first supported release, our aim has to be to focus on the core strengths of F# for exploratory programming with F# Interactive, programming with data and implementing parallel and asynchronous components.”

So, let’s put all three of those ideas together: exploratory programming with data, implemented using parallel and asynchronous components.  If I am someone thinking about BI, this should be nothing short of nirvana in a box.  When you add to this the visualization components and the integration capabilities (i.e. components developed in F# are easy to use in any .NET environment), the implications are staggering.

It is becoming more and more evident to me that F# is the first step towards a real democratization of BI – moving BI from purely being in the realm of “stuff you do on the database” to being a proper discipline of its own, composed of good data, but adding to it sophisticated models and functional libraries that help you figure stuff out.

Human intelligence is more than just the contents of memory stored in your hippocampus.  The ability to store and memorize facts, what we know as a database, is only a small part of the overall BI picture.  It is obviously important, but by itself it doesn’t do much for you.  To really implement BI, you need to be able to infer things from facts and to build rich models of interaction: a cerebral cortex that not only acts as a digest of information, but is capable of integrating information from various sources to tell you things that will not always be obvious to your competitors.

To be fair, we have had languages on the database for BI for a while, such as SQL, MDX and so forth.  The problem is that all these languages exist on an island.  MDX does not live anywhere but in a database, tied to a specific database product.  Even SQL, the most general-purpose of the languages that people use with data, is different on each data source, so much so that developers have spent an entire generation inventing technologies to deal with the differences.

F# provides the first suite of tools available to nearly all developers (not just the “Data Dudes”) on the Microsoft platform that will allow them to do real BI work.  What is exciting to me about F# is that we will have many more tools at our disposal to deal with data.

Here is one such case.  In F#, the Seq module has a function called pairwise.  It takes data in the form 1, 2, 3, 4 and turns it into the tuples (1, 2), (2, 3), (3, 4).  Knowing this, let's say we want to do something useful, like write a script that reads web server logs to determine whether a crawler is screen scraping your ecommerce site.

The pseudocode (okay, SQL from memory) to do this would look something like (and forgive me if I miss some syntax...)

--grab the 100 IP address and page combinations that most frequently appear in the web server log

select top 100 IPAddress, pageRequested, count(*) as requestCount from logentries group by IPAddress, pageRequested order by count(*) desc

--bind each row to @IP and @pageRequested, run this query on each returned row (max 100)

-- grab the IP addresses where the time difference between consecutive requests follows a regular pattern, such as "every 100 milliseconds".

select IPAddress as AddressToBlock from (select IPAddress, pairwise logtimestamp as (first, next) from logentries where IPAddress = @IP and PageRequested = @pageRequested order by logtimestamp) where (first - next) modulo 100 = 0

Now, while I am sure there are lots of mistakes in this (I am doing it from memory), the biggest one is the keyword pairwise, which does not exist in SQL.  In F#, you use it, per the above, to generate a sequence of tuples that lets you compare each item with the one that follows it.  That is really useful in a lot of situations, particularly when looking for patterns in sequences of data, such as a web server log that may show a crawler hitting your site.
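Before the full script, here is a minimal sketch of what Seq.pairwise does on its own.  The names pairs, requestTimes and gaps are only for illustration, and the request times are made-up millisecond offsets rather than real log data.

```fsharp
// Seq.pairwise turns a sequence into consecutive (previous, next) tuples.
let pairs = [1; 2; 3; 4] |> Seq.pairwise |> Seq.toList

// The same idea applied to hypothetical request times in milliseconds:
// the gap between each request and the one that follows it.
let requestTimes = [3; 103; 203; 303]
let gaps =
    requestTimes
    |> Seq.pairwise
    |> Seq.map (fun (prev, next) -> next - prev)
    |> Seq.toList

printfn "%A" pairs
printfn "%A" gaps
```

A run of identical gaps like this is the kind of regularity the script looks for.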

The F# code might look something like this (including code for setting up dummy data):

#light
open System

type LogEntry = { TimeStamp:DateTime; Page:string; IP:string }

let someTestLogEntries = [|
  //these are suspect IPs - regular pattern, same address
  {TimeStamp=new DateTime(2000,1,1,0,0,0,3);Page="ProductPage";IP="127.0.0.1"};
  {TimeStamp=new DateTime(2000,1,1,0,0,0,103);Page="ProductPage";IP="127.0.0.1"};
  {TimeStamp=new DateTime(2000,1,1,0,0,0,203);Page="ProductPage";IP="127.0.0.1"};
  {TimeStamp=new DateTime(2000,1,1,0,0,0,303);Page="ProductPage";IP="127.0.0.1"};
  //these come from different addresses, no big deal
  {TimeStamp=new DateTime(2000,1,1,0,0,0,0);Page="ProductPage";IP="127.0.0.2"};
  {TimeStamp=new DateTime(2000,1,1,0,0,0,0);Page="ProductPage";IP="127.0.0.3"};
  {TimeStamp=new DateTime(2000,1,1,0,0,0,0);Page="ProductPage";IP="127.0.0.4"};
  {TimeStamp=new DateTime(2000,1,1,0,0,0,0);Page="ProductPage";IP="127.0.0.5"};
  //these come from same address, but interval is more irregular, no big deal
  {TimeStamp=new DateTime(2000,1,1,0,0,0,0);Page="ProductPage";IP="127.0.0.6"};
  {TimeStamp=new DateTime(2000,1,1,0,0,0,134);Page="ProductPage";IP="127.0.0.6"};
  {TimeStamp=new DateTime(2000,1,1,0,0,0,458);Page="ProductPage";IP="127.0.0.6"};
  {TimeStamp=new DateTime(2000,1,1,0,0,0,846);Page="ProductPage";IP="127.0.0.6"};
|]

//this function tells me, based on a time series of requests, whether they are coming in
//at regular intervals
let timeSeriesIsSuspect(series:seq<DateTime>) =
  //map the series into the millisecond component (could use ticks too)
  series |> Seq.map( fun x -> x.Millisecond )
         //arrange them into pairs based on sequence
         |> Seq.pairwise
         //keep only the pairs where prev minus next is an exact multiple of 100
         |> Seq.filter( fun(x,y) -> (x - y) % 100 = 0 )
         //get the length of those that meet that criteria
         |> Seq.length
         //if there is more than zero (perhaps use 1 or 2 to soften the filter), it's suspect
         > 0

//this pipeline takes the log entries, groups them by IP and Page, filters based on suspect time series
//and generates a sequence of IPs to ban
let ipsToBan = someTestLogEntries
              //group by IP and Page
              |> Seq.groupBy (fun x -> (x.IP,x.Page))
              //order by number of requests, busiest first
              |> Seq.sortBy (fun x -> -1 * (snd x |> Seq.length) )
              //check if time series is suspect
              |> Seq.filter (fun x -> snd x |> Seq.map( fun x -> x.TimeStamp) |> timeSeriesIsSuspect )
              //map out the IP to ban
              |> Seq.map (fun x -> fst (fst x))
           
Console.WriteLine "IPs To Ban:"
ipsToBan |> Seq.iter (fun x -> Console.WriteLine x)
Console.ReadKey()
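
One caveat about the check above: Millisecond is only the 0 to 999 component of the timestamp, so comparing it works for a 100 ms interval (100 divides 1000 evenly) but can miss gaps that cross a second boundary for other intervals.  It also counts two requests with identical timestamps as a hit.  Here is a sketch of a variant that compares real elapsed time and takes the interval and threshold as parameters.  The name timeSeriesIsSuspectBy, its parameters and the sample data are my own assumptions for illustration, not part of the script above.

```fsharp
open System

// Hypothetical variant of timeSeriesIsSuspect. intervalMs is the regular gap to
// look for, minHits is how many matching gaps are needed before we call it suspect.
let timeSeriesIsSuspectBy (intervalMs: int64) (minHits: int) (series: seq<DateTime>) =
    let hits =
        series
        |> Seq.sort
        |> Seq.pairwise
        // whole milliseconds elapsed between each request and the next
        |> Seq.map (fun (prev: DateTime, next: DateTime) ->
            (next - prev).Ticks / TimeSpan.TicksPerMillisecond)
        // keep non-zero gaps that are an exact multiple of the interval
        |> Seq.filter (fun gap -> gap > 0L && gap % intervalMs = 0L)
        |> Seq.length
    hits >= minHits

// Made-up sample data: one series every 300 ms (crossing a second boundary),
// one with the irregular offsets used in the test entries above.
let start = DateTime(2000, 1, 1)
let every300ms =
    [0.0; 300.0; 600.0; 900.0; 1200.0] |> List.map (fun ms -> start.AddMilliseconds ms)
let irregular =
    [0.0; 134.0; 458.0; 846.0] |> List.map (fun ms -> start.AddMilliseconds ms)

printfn "%b" (timeSeriesIsSuspectBy 300L 2 every300ms)
printfn "%b" (timeSeriesIsSuspectBy 100L 1 irregular)
```

Raising minHits is one way to "soften the filter", as the comment in the original function suggests.  Whether zero gaps should count is a judgment call, and this sketch simply ignores them.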

The difference here is that, in F#, I can state intent more clearly.  Moreover, I can easily interface with other systems (from a database, for example, it would be hard to write code that bans the IP), which lets me use my "nervous system" to act on the intelligence.  In F#, I also have a richer set of language constructs available: all the dimensions I may ever need, the ability to extend the language, and so on.

The only issue is that a large set of data won't fit in memory.  However, given the focus on programming with data in the quote above, it is very likely that F# will ship with a data library (perhaps EF 2.0, I do not know for sure), which would make it very simple to query large databases using F# without having to resort to data APIs.

Does this mean that BI, as we know it, is dead?  Not by a long shot.  The tooling for regular BI, built around tools that business end users can use, is good and will certainly be heavily used in the short to medium term.  However, as far as what is on the horizon, BI via F# is clearly not as far-fetched as one might have thought a mere six months ago.

Published Thursday, December 11, 2008 10:47 PM by Anonymous

Comments

# re: One step closer to F# for Business Intelligence @ Saturday, December 20, 2008 4:24 PM

You seriously think the F# samples more clearly state intent?

jdn

# re: One step closer to F# for Business Intelligence @ Sunday, December 21, 2008 1:33 PM

It can - though I will admit my example is probably not the best for that.  It would not be that hard to put a fluent interface on top of this to make it code that could be "business readable".

One could pretty easily write some extension methods over a concept called Sequence of IPs that would make it much easier to make the intent more clear.

Anonymous
