One step closer to F# for Business Intelligence
I have been beating this drum for some time now, but the
latest post from Don Syme confirms the trend.
Not only is VS 2010 going to include F# right in the box, but
the scenarios that it is being targeted for are, according to Don Syme:
“In this first supported release, our aim has to be to focus
on the core strengths of F# for exploratory programming with F# Interactive,
programming with data and implementing parallel and asynchronous
components.”
So, let’s put all three of those ideas together: exploratory
programming with data, implemented using parallel and asynchronous
components. If I am someone thinking about BI, this should be
nothing short of nirvana in a box. When you add to this the
visualization components and the integration capabilities (components
developed in F# are easy to use in any .NET environment), the implications are
staggering.
It is becoming more and more evident to me that F# is the first step towards a real democratization
of BI – moving BI from purely being in the realm of “stuff you do on the
database” to being a proper discipline of its own, composed of good data, but
adding to it sophisticated models and functional libraries that help you figure stuff out.
Human intelligence is more than just the contents of memory
stored in your hippocampus. The ability to store and memorize
facts (what we know as a database) is only a small part of the overall BI
picture. It is obviously important, but by itself it doesn’t do much for
you. To really implement BI, you need to be able to infer things
from facts – to build rich models of interaction – a cerebral cortex that not
only acts as a digest of information, but is capable of integrating information
from various sources to tell you things that will not always be obvious to your
competitors.
To be fair, we have had languages on the database for a while
for BI, such as SQL, MDX and so forth. The problem is that all
these languages exist on an island. MDX does not live anywhere
but in a database, tied to a specific database product.
Even SQL, the most general purpose of languages that people use with data, is
different on each data source, so much so that developers have been inventing
different technologies for an entire generation to deal with the
differences.
F# provides the first suite of tools available to nearly all
developers (not just the “Data Dudes”) on the Microsoft platform that will allow
them to do real BI work. What is exciting to me about F# is that we will have many more tools at our disposal to deal with data.
Here is one such case. In F#, the Seq module has a function called pairwise. Basically, pairwise takes data that might be in the form 1, 2, 3, 4 and turns it into the tuples (1,2); (2,3); (3,4). Now, knowing this, let's say we want to do something useful, like write a script that reads web server logs to determine if a crawler is screen scraping your ecommerce site.
The pseudocode (okay, SQL from memory) to do this would look something like the following (and forgive me if I miss some syntax):
--grab the 100 IP addresses that most frequently appear in the web server log
select top 100 IPAddress, pageRequested, count(*) as requestCount from logentries group by IPAddress, pageRequested order by count(*) desc
--bind each row to @potentialCrawlerIP and @pageRequested, then run this query on each returned row (max 100)
-- grab the IP addresses where the time difference between requests follows
-- a pattern of anything more regular than "every 100 milliseconds".
select IPAddress as AddressToBlock from (select IPAddress, pairwise logtimestamp as (first, next) from logentries where IPAddress = @potentialCrawlerIP and PageRequested=@pageRequested order by logtimestamp) where (first-next) modulo 100 = 0
Now, while I am sure there are lots of mistakes in this (I am doing it from memory), the biggest one is the keyword pairwise, which does not exist in SQL. In F#, you use it, per the above, to generate a sequence of tuples that lets you compare each entry with the one after it. That is really useful in a lot of situations, particularly when finding patterns in sequences of data, such as spotting a crawler's request pattern in a web server log.
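To make pairwise concrete on its own, here is a minimal sketch. The sample values and the names pairs, requestTimes and gaps are made up for illustration and are not part of the log example.

```fsharp
// Seq.pairwise turns a sequence into overlapping (previous, next) tuples.
let pairs = [1; 2; 3; 4] |> Seq.pairwise |> Seq.toList
// pairs = [(1, 2); (2, 3); (3, 4)]

// Hypothetical request times in milliseconds (made-up sample data).
let requestTimes = [3; 103; 203; 303]

// The gap between each request and the one before it.
let gaps =
    requestTimes
    |> Seq.pairwise
    |> Seq.map (fun (prev, next) -> next - prev)
    |> Seq.toList
// gaps = [100; 100; 100]
```

A run of identical gaps like this is the kind of regularity the crawler check looks for.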
The F# code might look something like this (including code for setting up dummy data):
#light
open System
type LogEntry = { TimeStamp:DateTime; Page:string; IP:string }
let someTestLogEntries = [|
    //these are suspect IPs - regular pattern, same address
    {TimeStamp=new DateTime(2000,1,1,0,0,0,3);Page="ProductPage";IP="127.0.0.1"};
    {TimeStamp=new DateTime(2000,1,1,0,0,0,103);Page="ProductPage";IP="127.0.0.1"};
    {TimeStamp=new DateTime(2000,1,1,0,0,0,203);Page="ProductPage";IP="127.0.0.1"};
    {TimeStamp=new DateTime(2000,1,1,0,0,0,303);Page="ProductPage";IP="127.0.0.1"};
    //these come from different addresses, no big deal
    {TimeStamp=new DateTime(2000,1,1,0,0,0,0);Page="ProductPage";IP="127.0.0.2"};
    {TimeStamp=new DateTime(2000,1,1,0,0,0,0);Page="ProductPage";IP="127.0.0.3"};
    {TimeStamp=new DateTime(2000,1,1,0,0,0,0);Page="ProductPage";IP="127.0.0.4"};
    {TimeStamp=new DateTime(2000,1,1,0,0,0,0);Page="ProductPage";IP="127.0.0.5"};
    //these come from same address, but interval is more irregular, no big deal
    {TimeStamp=new DateTime(2000,1,1,0,0,0,0);Page="ProductPage";IP="127.0.0.6"};
    {TimeStamp=new DateTime(2000,1,1,0,0,0,134);Page="ProductPage";IP="127.0.0.6"};
    {TimeStamp=new DateTime(2000,1,1,0,0,0,458);Page="ProductPage";IP="127.0.0.6"};
    {TimeStamp=new DateTime(2000,1,1,0,0,0,846);Page="ProductPage";IP="127.0.0.6"};
|]
//this function tells me, based on a time series of requests, whether they are coming in
//at regular intervals
let timeSeriesIsSuspect (series:seq<DateTime>) =
    let regularGapCount =
        series
        //map the series into the millisecond component (could use ticks too)
        |> Seq.map (fun x -> x.Millisecond)
        //arrange them into pairs based on sequence
        |> Seq.pairwise
        //keep those where, after subtracting next from prev, you get a number that mods at 100
        |> Seq.filter (fun (prev, next) -> (prev - next) % 100 = 0)
        //get the length of those that meet that criteria
        |> Seq.length
    //if there is more than zero (perhaps use 1 or 2 to soften the filter) - it's suspect
    regularGapCount > 0
//this pipeline takes the log entries, groups them by IP and Page, filters based on suspect time series
//and generates a list of IPs to ban
let ipsToBan =
    someTestLogEntries
    //group by IP and Page
    |> Seq.groupBy (fun x -> (x.IP, x.Page))
    //order by number of requests, most frequent first (Seq.sortBy; there is no Seq.orderBy)
    |> Seq.sortBy (fun x -> -1 * (snd x |> Seq.length))
    //check if time series is suspect
    |> Seq.filter (fun x -> snd x |> Seq.map (fun entry -> entry.TimeStamp) |> timeSeriesIsSuspect)
    //map out the IP to ban
    |> Seq.map (fun x -> fst (fst x))
Console.WriteLine "IPs To Ban:"
ipsToBan |> Seq.iter (fun x -> Console.WriteLine x)
Console.ReadKey() |> ignore
The difference here is that, in F#, I can state intent more clearly. Moreover, I can easily interface with other systems (from inside a database, it would be hard to write code to ban the IP), enabling me to use my "nervous system" to act on the intelligence. As well, in F#, I have available a richer set of language constructs: all the dimensions I may ever need, the ability to extend the language, etc.
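The code comments hint at a softer filter ("perhaps use 1 or 2") and at using ticks instead of the millisecond component. A rough sketch of both ideas follows. The function isSuspectWithThreshold, its minRegularGaps parameter and the sample timestamps are hypothetical names and data for illustration only, not part of the script above.

```fsharp
open System

// Hypothetical variant: compares whole timestamps (via ticks) rather than only
// the Millisecond component, and requires at least `minRegularGaps` regular gaps.
let isSuspectWithThreshold (minRegularGaps: int) (series: seq<DateTime>) =
    let regularGaps =
        series
        |> Seq.pairwise
        // whole milliseconds elapsed between consecutive requests
        |> Seq.map (fun (prev, next) -> (next - prev).Ticks / TimeSpan.TicksPerMillisecond)
        // keep the gaps that are exact multiples of 100 ms
        |> Seq.filter (fun gap -> gap % 100L = 0L)
        |> Seq.length
    regularGaps >= minRegularGaps

// Made-up sample: three requests spaced exactly 100 ms apart (two gaps).
let sample =
    [ DateTime(2000, 1, 1, 0, 0, 0, 3)
      DateTime(2000, 1, 1, 0, 0, 0, 103)
      DateTime(2000, 1, 1, 0, 0, 0, 203) ]

let strict = isSuspectWithThreshold 2 sample    // two regular gaps, threshold met
let stricter = isSuspectWithThreshold 3 sample  // only two gaps, threshold not met
```

Raising the threshold trades missed crawlers for fewer false alarms; the right value would depend on real traffic.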
The only issue is that if I have a large set of data, I won't be able to handle it in memory. However, given the above, it is very likely that F# will ship with a data library (perhaps EF 2.0, I do not know for sure), which will make it very simple to query large databases using F# without having to resort to data APIs.
Does this mean that BI, as we know it, is dead? Not by a long shot. The tooling for regular BI, tied to tools that business end users can use, is good and will certainly be heavily used in the short to medium term. However, as far as what is on the horizon, BI via F# is clearly not as far-fetched as one might have thought a mere six months ago.