sbutler.com

Data Mining the Financial Markets

April 25th, 2008

Thomas A. Rathburn has written a series of three articles on data mining the financial markets. Rathburn takes a detailed look at the successes and failures of his efforts in the markets, and with 10-year US bonds in particular. You can check it out here: part 1, part 2, and part 3. The articles are also available as a podcast here: 1, 2, 3.

[via KDnuggets]

Experian Bolsters Data With Hitwise Acquisition

May 4th, 2007

Tim O’Reilly points to the news that Experian has made a significant move to improve the quality of its online and demographic data with the acquisition of Hitwise for US$240 million. Hitwise collects user traffic from ISPs in several countries, including Australia, and uses that information to provide companies with insight into their online market share. Although not mentioned in the press release, the Hitwise data will likely be a huge boon for Experian’s marketing services, and will probably allow them to develop more accurate geo-demographic profiles.

Winning the DARPA Grand Challenge

September 17th, 2006

Sebastian Thrun of Stanford Racing gives a great talk on what it took to build an autonomous vehicle to win the DARPA Grand Challenge. There are lots of cool technical details on the use of machine learning to achieve this. You can watch it on Google Video here.

In-cell Graphing

August 11th, 2006

The guys from Juice Analytics have put together an interesting series on in-cell graphing (parts 1, 2, & 3). This is a feature that is due in the upcoming Excel 2007; however, the technique the Juice guys use works across all versions of Excel and is quite visually appealing too. As an added bonus, I can confirm it works in OpenOffice.org, Gnumeric and even Google Spreadsheets (all to varying degrees).

Article: HCF gets a helping hand from predictive analytics

June 13th, 2006

From the ComputerWorld article:

Private health insurer HCF has implemented a predictive analytics suite to help weed out fraudulent claims, target individual members and streamline the monotonous labour of data analysis.

Data Mining with Oracle

May 30th, 2006

If you are interested in data mining and haven’t already seen the Oracle Data Mining and Analytics blog, it is worth checking out. It has some great how-tos, including time series forecasting (parts 1, 2, 3) and real-time scoring & model management (parts 1, 2, 3).

Smart SPAM & Fighting it

May 13th, 2006

For any machine-learning-based SPAM filter, such as the popular Bayesian methods, the key to success is the training data: the body of previously identified SPAM and HAM (valid emails). To trick the filter, a spammer must try to be more HAM-like. The way to beat this is to give your email classifier as much training data as possible and to update it continually. Just learning from your company’s emails is probably not foolproof when you consider the volume and variety of SPAM on the net. Web-based email, on the other hand, like Gmail and the hosted version, should never have this problem because the filter learns from thousands of users’ SPAM folders.

Researchers from the University of Calgary claim that the next evolution of SPAM will be smart SPAM, which will infiltrate your computer via spyware/viruses and ‘mine’ your emails. By creating emails based on the actual messages you’ve previously sent, the spammers hope they will be more believable to readers.

I would argue, however, that such a situation would merely make services like Gmail more attractive. Firstly because they have a truly massive body of knowledge to use to fine-tune their spam filters, and secondly because it is unlikely such spyware could infiltrate a web-based system. Even if a program was distributed that waited for someone to log on and then took over, Google could have it effectively neutralised in a matter of hours.

Data Mining Cup 2006

May 5th, 2006

The Data Mining Cup (DMC2006) has launched for 2006. This year the competition focuses on eBay auctions. The target is to predict, for each new auction, whether the actual sales revenue is higher than the average sales revenue of the product category.
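To make the target concrete, here is a rough sketch of how such a label could be derived. Everything in it is an assumption for illustration: the "category,revenue" input format, the function name label_auctions, and the choice to compute each category average from the same rows being labelled. The actual DMC2006 data layout and rules may differ.

```shell
#!/bin/sh
# Hypothetical helper (not from the competition material): reads
# "category,revenue" lines on stdin and appends 1 if the revenue is strictly
# higher than the average revenue of its category, otherwise 0.
label_auctions() {
    awk -F, '
        {
            cat[NR] = $1          # remember the category of each row
            rev[NR] = $2 + 0      # force a numeric value
            sum[$1] += $2         # running total per category
            n[$1]++               # row count per category
        }
        END {
            for (i = 1; i <= NR; i++) {
                avg = sum[cat[i]] / n[cat[i]]
                print cat[i] "," rev[i] "," (rev[i] > avg ? 1 : 0)
            }
        }'
}
```

For example, piping a file of made-up rows through label_auctions would print each row with its 0/1 label appended, which is the value a classifier would be trained to predict.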

DARPA Grand Challenge

May 4th, 2006

Start your engines: the DARPA Grand Challenge is on again, only this time it’s an urban challenge! The last two competitions were to race an autonomous vehicle through a desert, with the 2005 winner, Stanford, taking home a US$2 million prize.

[Images: stanford1.png, stanford2.png]
Stanford’s software in action: Input from GPS and many sensors feeds the algorithms to determine the safe path (see tech report).

Using Gmail for Backups

May 3rd, 2006

While writing a thesis it is obviously imperative to have foolproof backups in place. So why not back up to that free 2.7GB Gmail account? Here’s what you have to do:

  1. Install “email” (Gentoo users: emerge net-mail/email)
  2. Edit /etc/email/email.conf (Gentoo users: as a minimum you must set REPLY_TO)
  3. Test the commands. They are:
    cd /path/to/your/thesis/
    tar -czf /tmp/thesis.tar.gz *.*
    email --blank-mail --smtp-server mail.yourserver.com --from-name "your name" --from-addr you@youremail.com --subject "Cron: Thesis Backup (`date`)" you@gmail.com --attach /tmp/thesis.tar.gz > /dev/null 2>&1
    rm -f /tmp/thesis.tar.gz
  4. Now add this as a /etc/crontab entry. This example sends the backup at 7am each day.
    0 7 * * * unixusername cd /path/to/your/thesis/; tar -czf /tmp/thesis.tar.gz *.*; email --blank-mail --smtp-server mail.yourserver.com --from-name "your name" --from-addr you@youremail.com --subject "Cron: Thesis Backup (`date`)" you@gmail.com --attach /tmp/thesis.tar.gz > /dev/null 2>&1; rm -f /tmp/thesis.tar.gz
  5. The final step is to create a Gmail filter! It would be nice if it were possible to stop the backup emails being downloaded via POP, but I think this may require a filter that moves the incoming backup emails to Trash.

Obviously you don’t have to use this for backing up a thesis; it could easily be modified to back up whatever you want.
Note: I can’t see any mention of TLS support in the email client, so that’s why I’ve suggested you use your own SMTP server rather than Google’s.
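
As a variation on the archive step, the tar command could be wrapped in a small function that writes a dated archive, so each day's backup gets its own file name. This is only a sketch: make_backup_archive and the thesis-YYYYMMDD.tar.gz naming are my own assumptions, and unlike the *.* pattern above it archives everything in the directory, including files without a dot in their name.

```shell
#!/bin/sh
# Hypothetical helper (not part of the "email" client): archive a directory
# into a dated tarball and print the path of the archive that was written.
make_backup_archive() {
    src_dir=$1                      # directory to back up
    out_dir=$2                      # where to write the archive (keep it outside src_dir)
    stamp=$(date +%Y%m%d)           # e.g. 20060503
    archive="$out_dir/thesis-$stamp.tar.gz"
    # -C switches into src_dir first, so paths inside the archive are relative
    tar -czf "$archive" -C "$src_dir" . || return 1
    printf '%s\n' "$archive"
}
```

The printed path could then be handed to the --attach option in place of /tmp/thesis.tar.gz, for example via archive=$(make_backup_archive /path/to/your/thesis /tmp).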