Super-fast fuzzy search & super slow cooking.

Long-time fishing buddy, Paul, put me on to Rafi’s Spicebox.

OK – that fish in the picture is far too big to be one of Paul’s, but the hat is just about silly enough to be right.

The spices from Rafi’s are all pre-mixed and ready for adding other ingredients of your choosing, and each pack is labelled with instructions. I’ve made a couple already, and the Malaysian Beef Rendang I cooked at the weekend was very good indeed.

They arrive as mix sachets and need a kilo of veg or meat or both, and are best cooked in a slow cooker for around eight hours. You need to either dish it up to a family or freeze what’s left over.

On fast fuzzy searching, I refer you to the left-hand clover leaf labelled Identity Resolution. This is where we tie up data from an incoming document (a scanned invoice, maybe) to a record on a host computer (a supplier from the supplier table, or an order – it’s all agnostic and driven by parameters).

If the information to be matched has come from an unstructured document, by scanning and OCR, probably, we need to be able to check the text in the document against the host tables in order to resolve the identity of the document.

This presents a whole host of problems: the addresses may not match ordinally word for word, some words may be lost in the OCR or mangled and others may just be spelled incorrectly. Depending on the quality of the incoming document, whole batches might never match with equality searching, and need approximate matching as a matter of course.

This is where Fuzzy Matching comes into the equation. It is not a new technology, but the downside is that, because it does so much processor-intensive work, it can be quite slow, even against small target data sets. Often the only way to get a batch matched is with a server process that might take several minutes to run. So, you just sit and wait.
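To make the idea concrete, here is a minimal sketch of approximate address matching in Python. It is not the code I wrote for the Clover Node (that is in assembly, and far quicker); it just leans on the standard library's difflib to show how a match can survive shuffled word order and mangled spellings. The function names and the sample addresses are made up for illustration.

```python
from difflib import SequenceMatcher


def tokens(text):
    """Lower-case the text and split it into alphanumeric words."""
    cleaned = "".join(c.lower() if c.isalnum() else " " for c in text)
    return cleaned.split()


def token_similarity(a, b):
    """Similarity of two words in the range 0.0 to 1.0."""
    return SequenceMatcher(None, a, b).ratio()


def address_score(query, candidate):
    """Average, over the query words, of the best-matching candidate word.

    Word order is ignored, so '12 High Street' and 'High Street 12'
    score the same against any candidate.
    """
    q, c = tokens(query), tokens(candidate)
    if not q or not c:
        return 0.0
    total = sum(max(token_similarity(t, u) for u in c) for t in q)
    return total / len(q)


def best_match(query, candidates):
    """Return the candidate with the highest score for the query."""
    return max(candidates, key=lambda cand: address_score(query, cand))


# Hypothetical host table and a badly OCR'd query.
hosts = [
    "12 High Street, Anytown",
    "45 Station Road, Otherville",
    "7 Mill Lane, Anytown",
]
print(best_match("Hgh Stret 12 Anytwn", hosts))
```

Every query gets compared with every host record, word by word, which is exactly why the naive approach gets slow as the host table grows.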

I have been focused on real-time fuzzy matching with one million addresses to match against. I have chosen 1M addresses as the test set because it exceeds the largest of our customers’ source tables many times over, and I thought that if I could get to a 10 second match (i.e. 1 second per hundred-thousand addresses) then we would be looking at very quick server matching or even real-time identity resolution for our customers.

In testing, on a quick PC, I achieved a match against 1M in under 2 seconds. The impact on the CPU was a quick spike and it sipped RAM, even though I ran the code in one thousand threads.
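For a feel of what those numbers mean per record, here is the arithmetic as a few lines of Python. The figures are the ones quoted above (a 10 second target and an under-2-second result against 1M addresses); the measured value is an upper bound, and the function name is just for illustration.

```python
ADDRESSES = 1_000_000
TARGET_SECONDS = 10.0    # the goal: 1 second per hundred-thousand addresses
MEASURED_SECONDS = 2.0   # upper bound: the test ran in under 2 seconds


def microseconds_per_address(total_seconds, address_count):
    """Average time budget per address, in microseconds."""
    return total_seconds * 1_000_000 / address_count


target_us = microseconds_per_address(TARGET_SECONDS, ADDRESSES)
measured_us = microseconds_per_address(MEASURED_SECONDS, ADDRESSES)

print(f"target:   {target_us:.0f} microseconds per address")
print(f"measured: under {measured_us:.0f} microseconds per address")
print(f"speed-up over target: at least {target_us / measured_us:.0f}x")
```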

This puts real-time Identity Resolution and Matching firmly on the table for Softology’s Clover Node. Mind you, I had to fall back on skills I learned thirty-odd years ago to cut the code in x86 Assembly.

That was when Information Technology Co. MD Paul (my fishing buddy) of Antar and I raced to produce as much IBM 370 Assembly code in as short a time as possible: a baptism by fire.

Now, we’re very happy to fish in a highly non-competitive way, although my fish are much bigger than his.

And the music? I only nailed that fast, fuzzy search last Monday, so I’ve been feeling pretty triumphant. Consequently, I am listening to King King – raunchy, guitar-driven blues and rock-infused ballads. Love it!
