Magic Lamp Software

IBM Datacap Capture – Straight talk about Straight Through

Posted on December 8th, 2017

As data capture capabilities continue to improve and become viable for documents that were previously considered too complex for automated capture, more organizations are getting their first exposure to data capture technology.  On top of that, as more organizations consider automated data capture, sales organizations are scrambling to meet the added demand and sometimes fail to set proper expectations about what automated data capture can do, how it works, and what measure of success you should expect.

One question I am hearing more and more is “What is your straight through processing (STP) rate?”  What people mean by that is: if you scan 100 documents, how many of them will go into the back-end system (repository) without any human involvement?

When I hear that question, in most cases, I know the expectations of the customer have not been properly set.  Straight through processing is a very poor and misleading measure of capture performance.  I have seen sales presentations where the entire ROI is computed solely from how many documents go straight through, and in my opinion, nothing is more useless for calculating system success or savings than an STP rate.

Wait, did you say misleading and useless?

Yes.  It is misleading and useless because it concentrates on only a (usually) small part of the overall system benefit.  If, for instance, I tell you that I would expect 30% of your documents to go straight through a system without operator intervention, you would tend to ignore the fact that most of the data was also captured on the 70% of the documents that didn’t go straight through.

It is like a salutatorian considering his/her academic years completely wasted because that one A- in first hour gym class kept him/her from delivering the commencement day speech.  Success is not only measured in perfection.  I love Mizzou football and consider them very successful most years, yet they have never been the national champion.  (Next year, maybe…)

Okay, so what *is* Datacap’s straight through processing rate?

A skeptic, huh?  That is okay, I respect that.  Being from the Show-me state myself, I admit I owe you more of an explanation.

Simple answer: “Your straight through rate will be whatever you want it to be.”  0% to 100%, and anything in between.  It is totally configurable in Datacap systems.
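To make that concrete, here is a minimal Python sketch.  It assumes the knob being turned is a minimum character confidence, and the ten confidence values are invented purely for illustration; this is not Datacap code or real job data.  The point is only that the same batch produces any STP rate you like depending on where the threshold sits.

```python
# Hypothetical data: the lowest character confidence (0-100) found on each
# of ten documents. These values are invented for illustration only.
min_confidence_per_doc = [99, 97, 95, 92, 90, 88, 85, 80, 72, 60]


def stp_rate(min_confidences, threshold):
    """Fraction of documents whose weakest character meets the threshold."""
    passed = sum(1 for c in min_confidences if c >= threshold)
    return passed / len(min_confidences)


# Sweep the (assumed) threshold from "accept everything" to "accept nothing".
for threshold in (0, 80, 90, 101):
    rate = stp_rate(min_confidence_per_doc, threshold)
    print(f"threshold {threshold:>3}: STP rate {rate:.0%}")
```

Nothing about the documents changed between the first line of output and the last, only how much doubt we were willing to let through.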

An only slightly better question would have been, “How many documents go through your system, no-touch (another term for STP), and have 100% accuracy?”

The answer to that question gets a bit more involved, but those rates depend on basically two things: the amount of data you are capturing from the document, and the quality of the image that we are capturing from.  No two systems that I have worked on have ever been the same.  Some have very high STP rates, like one job that was averaging 330,000 documents per day where we were only reading a barcode from the document.  On average, only 4 or 5 documents per day failed and needed someone to look at them to capture them correctly.

On something like an invoice with perhaps pages of line item detail, where there are hundreds, thousands, or tens of thousands of different characters to recognize, the straight through rate would probably be low.  But, on the ones that didn’t go straight through, perhaps typing half a dozen keystrokes allows the data to be corrected, as opposed to typing ALL of those characters without OCR.  That is where the savings are, and those savings are completely overlooked by focusing on an STP rate.

The BEST capture systems flag all erroneously recognized data for operators to view, while flagging as little correct data as possible.  This is where the Datacap rules engine excels.

Okay, you are starting to make sense.  What else can you tell me?

OCR engines are 100% straight through.  Without software like IBM Datacap Capture, OCR errors will go into your repository.  The function of Datacap is to STOP straight through processing when errors are detected, or if the confidence associated with even a single character on the document is low.  In those cases, we show the document to an operator to verify.  That way, we deliver 100% accuracy, or very close to it, to your repository.

In some applications, certain things that are being captured may be deemed less important than other data.  An example would be a description of an item on an invoice.  Perhaps, in that case, “Part No 22763 Capture Software – 1BM Datacap” is just as acceptable as “Part No 22763 Capture Software -IBM Datacap”.  If it isn’t used in a search, and anyone who sees it in the future is unlikely to be confused about what it means, then why spend the effort (money) to correct minor mistakes?  The confidence required for a field to flag the document for verification is configurable by field.  We can also flag by the NUMBER or PERCENTAGE of low confidence characters in a field if that is deemed more appropriate for the data being captured.
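Here is a rough Python sketch of what per-field flagging could look like.  The FieldRule class, its parameter names, and the threshold values are all hypothetical names I made up for illustration; they are not Datacap's actual configuration format or API.

```python
from dataclasses import dataclass


@dataclass
class FieldRule:
    """Hypothetical per-field rule; not Datacap's actual configuration format."""
    min_char_confidence: int        # characters below this are "low confidence"
    max_low_count: int = 0          # tolerated NUMBER of low-confidence characters
    max_low_fraction: float = 0.0   # tolerated PERCENTAGE (as a fraction) of them


def needs_verification(char_confidences, rule):
    """Return True when the field should be shown to an operator."""
    if not char_confidences:
        return True  # nothing was read, so someone should look
    low = sum(1 for c in char_confidences if c < rule.min_char_confidence)
    within_count = low <= rule.max_low_count
    within_fraction = low / len(char_confidences) <= rule.max_low_fraction
    return not (within_count or within_fraction)


# Assumed rules: strict for a search key, lenient for a free-text description.
RULES = {
    "invoice_number": FieldRule(min_char_confidence=90),
    "description": FieldRule(min_char_confidence=90, max_low_count=2),
}

invoice_number = [99, 98, 70, 99]   # one doubtful character
description = [99] * 40 + [60]      # one doubtful character in a long field

print(needs_verification(invoice_number, RULES["invoice_number"]))  # True
print(needs_verification(description, RULES["description"]))        # False
```

The same single doubtful character stops the invoice number but not the description, which is exactly the trade-off you get to choose field by field.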

Mistaking a 1 for an I in an invoice number or employee number could be a big deal, however.  When someone uses that value as search criteria in the repository, the search won’t pull up the document at all, and the document is essentially lost unless the search is altered to find it another way.

Datacap’s rules engine can alter confidence so that data is not flagged if it is deemed to be correct by another method.  For example, if the quantity of an item is 7, the unit price is $2.00, and the total gets recognized as $14.00 with a low confidence, the confidence of the total can be raised because the quantity * the unit price was checked against it, and there is a very high probability that the line total is correct.  If that was the only field that would have flagged the document for verification, your STP rate just went up.
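The idea behind that cross-check can be sketched in a few lines of Python.  The function name, the string inputs, and the boosted confidence value of 100 are my own assumptions for illustration; a real Datacap rule would be built with the product's own rules and actions.

```python
from decimal import Decimal


def cross_check_line_total(quantity, unit_price, total, total_confidence,
                           boosted=100):
    """Raise the total's confidence when quantity * unit price confirms it.

    Hypothetical helper: values arrive as strings, the way OCR would
    deliver them, and Decimal avoids floating-point rounding on currency.
    """
    if Decimal(quantity) * Decimal(unit_price) == Decimal(total):
        return max(total_confidence, boosted)
    return total_confidence  # no confirmation, so leave the doubt in place


print(cross_check_line_total("7", "2.00", "14.00", 40))  # 100
print(cross_check_line_total("7", "2.00", "74.00", 40))  # 40
```

A fuller rule would probably also insist that the quantity and unit price were themselves read with reasonable confidence before trusting the math.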

There are lots of things we can do to programmatically raise or lower confidence, based on your needs, and we can even automatically correct data with MagicLamp’s Cognitive Correction™.  It is brand new technology that I hope to write a detailed article about very soon.

Remember this:  STP gains are very small.  STP risks are very high.


STP gains are small because even if we stop and show you a document where we have a doubt about a character or two, in a matter of seconds your operator hits a button and moves on to the next document.  In data capture, saving minutes each on many documents is more important than saving a couple of seconds on a few.

STP risks are high because if we DIDN’T stop to show you the document, and one of those characters was, in fact, wrong, then bad data goes into your repository.  If that bad data was in a critical field (let’s say an Invoice Number) and some vendor calls to ask about the invoice, searching the repository will not find it.  And, if that happens, maybe the person researching tells the vendor that their invoice has not been received.  The vendor resends the invoice, which adds extra work at best.  At worst, the system doesn’t catch the error the second time either, and you STILL can’t find a record of it.  Researching items after the fact is expensive.

One time, I ordered a book that I fondly remembered from my high school years (Alas, Babylon, by Pat Frank, if you are interested) that had a new electronic version available for me to read on my e-reader.  I was very excited to dig into it, but it was so full of OCR errors that I was quickly disappointed.  I did finish it, then called the seller to complain about the quality and had the price quickly refunded.  Go to Amazon and check out the reviews.  I can’t find any review of the electronic version that doesn’t also mention OCR errors or proofreading.  I am not generally a reviewer, but if you browse the many one-star reviews, you can try to guess which one is mine.  Good book, though.  Buy the hard copy.

Now, I can imagine whoever got this electronic version out in record time may have gotten a bonus for their “outstanding success”, but I am sure that was short-lived.  I expect many got their money back, and this publisher’s reputation has undoubtedly suffered from that “outstanding success”.  Personally, I now read the reviews before ordering any older books that would have used OCR.  When I do, I burn a mental effigy of whoever put in that capture effort.

You have convinced me.  What question should I really ask?

You should ask how much overall data-entry effort is saved.  That is the critical question, and the sole factor that should be used in computing an ROI.  In most jobs where there is a significant amount of data captured from a document, it should be 60-80%.  I have had quite a few jobs come in higher, including the big barcode-only job that was very nearly 100%, but in general, those are the numbers you should be thinking about.  If it will fall below that, the technology may not be available yet to handle your requirements and the solution may not be viable… yet.  Be patient, our capabilities grow every release.
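To see why the two measures tell such different stories, here is a small Python sketch.  Every number in it is made up for illustration: I am assuming 120 seconds to key a document entirely by hand, 40 seconds to verify a flagged one, and a ten-document batch where three go straight through.

```python
def stp_rate(verify_seconds):
    """Share of documents that needed no operator time at all."""
    return sum(1 for s in verify_seconds if s == 0) / len(verify_seconds)


def effort_saved(verify_seconds, manual_seconds_per_doc):
    """Share of manual data-entry time avoided across the whole batch."""
    manual_total = manual_seconds_per_doc * len(verify_seconds)
    return 1 - sum(verify_seconds) / manual_total


# Hypothetical batch: operator seconds spent per document with capture in place.
# Three documents went straight through; seven needed a 40-second verification.
verify_seconds = [0, 0, 0, 40, 40, 40, 40, 40, 40, 40]
MANUAL_SECONDS_PER_DOC = 120  # assumed time to key one document by hand

print(f"STP rate:     {stp_rate(verify_seconds):.0%}")
print(f"Effort saved: {effort_saved(verify_seconds, MANUAL_SECONDS_PER_DOC):.0%}")
```

With these assumed numbers the STP rate is only 30%, yet roughly three quarters of the data-entry time is gone, and that second figure is the one an ROI should be built on.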

Engaging an experienced professional and having a frank discussion about your documents, the data you want to capture, and how to get the fastest ROI from your investment is the key to a successful project.

Written by: Tom Stuart, Vice President of Development, MagicLamp Software Solutions