Role of Visual Signatures in Performing Document Type Recognition and Classification
TIF and PDF images have been used extensively for decades as standards for document file types and are found in functional activities including contracts and agreements, document control and land records. Many of an organization’s most important documents are stored as TIF and PDF images, often with poor metadata defining what type of document the file is. As a result, risks are present related to managing security and governance of these documents because current tools rely on extracting and analyzing text from these documents, which is a lengthy and expensive process to obtain the required accuracy levels.
NeuralVision Technologies has developed novel methods to leverage computer vision and neural network technology to execute document type recognition and classification, effectively performing “facial recognition for documents”. This method does not rely on text, requires no training, and works equally well on both small and large document sets. As a completely automated process with inherently high accuracy levels, it presents a step change in performance, cost and speed compared to current industry methods. Modes of use include a bottom-up evaluation of all target documents, or a bloodhound “more like this” capability which is ideal to support security and data loss prevention objectives.
This white paper provides a detailed review of TIF and PDF characteristics, uses and limitations in the context of the major functional risks being managed by all companies and organizations. It then explains the visual NeuralVision solution, modes of use and deployment scenarios. The following table summarizes the risk elements commonly incurred by most organizations and the solution which NeuralVision Technologies brings for mitigation.
|Risk Element||NeuralVision Solution|
|Restricted Information Data Loss||Visual signature-based tagging of restricted information across data repositories, including email systems|
|Inability to Isolate PII and PHI||Visual signature-based tagging of documents known to contain PII and PHI|
|Inability to find Critical Records for Audits and Investigations||Visual signature-based tagging of critical records|
|Inability to Manage Retention||Bottom-up review of document clusters created by self-formed visual signature groups and application of retention requirements|
|Over-Production of Data during Litigation||Bottom-up review of target data and elimination of records which are out of scope|
|Incomplete Contracts Database||Visual signature-based tagging of documents matching a contract visual signature|
|Inability to manage Document Control||Parsing of turnover data using visual signatures to locate critical data|
TIF and PDF Images – Background
Document images are files with a .tif or .pdf file extension. They are prevalent in the business world and emanate from a variety of workflows. They constitute the vast majority of documents that have undergone a print-and-scan process and are found across every industry. TIF images are, by definition, always images, which means they have no embedded text file. PDF files can be either an image (again, with no embedded text file) or a ‘searchable’ document, meaning one that carries an embedded text file. PDFs that originate from a native Microsoft Office file inherently have an embedded text file, while PDFs that emanate from a scanning process may or may not have one, depending on the options chosen during scanning. It is usually difficult to discern whether a PDF is searchable without using special software tools.
TIF images can be processed to create a PDF with an embedded text file, using a technology known as optical character recognition (“OCR”). OCR technology has been around for several decades, yet creating perfect text files remains an elusive goal. Because OCR loses text at random and in unknown amounts, relying on OCR’d text is risky. Even though the characters and words are visible when viewing the image, searching for a word in an OCR’d document will result in random misses because the text was corrupted or dropped during the OCR process and is therefore not present to search against. Most users are unaware of these intricacies and therefore assume that all files are searchable and that the lack of a search result means the target document doesn’t exist.
The following is an example of a scanned 1980’s vintage document:
Here is the text file created by OCR’ing the document:
Note that the most important words that would identify this document as a purchase order are missing, as they were lost during the text extraction process. Therefore, keyword and phrase searching, or any other text analytics method, will fail to identify this document. This document happens to be a critical record, as it contains the materials specifications used to build a pipeline asset transporting hydrocarbons.
A summary of TIF and PDF file type characteristics is as follows.
|Single Page TIF||Non-searchable image, connected to other pages in the same document using a unique document ID and an underscore page number (837382_1, 837382_2, etc.)|
|Multi-Page TIF||Non-searchable image, and all pages in the same file.|
|PDF||Can be searchable or not, depending on whether OCR was performed (with losses if the source is not a native file). All pages are typically in the same file.|
|Aggregate PDF||Searchable or not. Multiple documents back-to-back inside a single file.|
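The single-page TIF convention in the table implies that pages must be reassembled into documents before any review. The following is a minimal sketch of that grouping step, assuming numeric document IDs and file names of the form `<document ID>_<page number>.tif`; the function name `group_pages` and the sample file names are hypothetical.

```python
import re
from collections import defaultdict

# Assumed naming pattern: <document ID>_<page number>.tif (e.g. 837382_1.tif).
PAGE_NAME = re.compile(r"^(?P<doc_id>\d+)_(?P<page>\d+)\.tif$", re.IGNORECASE)

def group_pages(filenames):
    """Map each document ID to its page files, ordered by page number."""
    pages = defaultdict(list)
    for name in filenames:
        match = PAGE_NAME.match(name)
        if match is None:
            continue  # not a single-page TIF in the assumed pattern
        pages[match["doc_id"]].append((int(match["page"]), name))
    # Sort numerically so that page 10 follows page 2 rather than page 1.
    return {doc_id: [n for _, n in sorted(items)] for doc_id, items in pages.items()}

files = ["837382_2.tif", "837382_1.tif", "837382_10.tif", "914200_1.tif", "notes.pdf"]
documents = group_pages(files)
print(documents)
```

Sorting on the integer page number rather than the file name avoids the common mistake of placing page 10 before page 2.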
TIF and PDF Image Use Case Examples
Contracts and Agreements
Consider the example of generating a contract or agreement; workers typically start with a Microsoft Word template and edit the document to their needs. Once that version is ready and approved for use, the authorized person will print, sign, scan and email it to the counterparty for execution. The counterparty will open the PDF file, print it, sign it, then scan it again and email it back to the sender. Each additional scan degrades the image, which increases text loss if and when the document is OCR’d in the future.
Since contract and agreement generation and execution occur in many departments and involve multiple workers, keeping track of all those files is challenging. Attempts to gain control of this activity across the span of the organization are met with resistance, especially where rogue employees buck the system and wish to keep their activities under local control and off the corporate radar. Since many contracts and agreements span years in length and have auto-renewal and evergreen provisions, in-force documents can go back many years.
At the same time, corporate legal and procurement are attempting to corral all of these agreements and put them into a common system. Extensive data mining is often performed on these documents to build a database which summarizes all of the key terms and provisions, in order to optimize spending, facilitate negotiations and manage risk. Key events including litigation and M&A necessitate a thorough understanding of the corporation’s holistic contractual responsibilities. Incomplete contracts repositories become an Achilles heel.
TIF and PDF images make the contracts management process harder, starting with identifying where all of the contracts and agreements are located across the organization.
Document Control
Document control is a common function in many industries including construction, manufacturing, energy and pharma. It entails the fulfillment of contractual obligations by a third party, usually a construction or equipment vendor performing work to engineer and build facilities, equipment and hardware for their client. Contracts stipulate what is to be delivered, the specifications, drawings, and other design and construction information. Vendors typically complete work in phases and turn over completed work product commensurate with those obligations. The turnover includes extensive documentation, typically in the form of PDF documents. Often the documentation is compiled into large, multi-page and multi-document PDFs which are hundreds of pages in length, with a sparse or non-existent index or inventory of what’s included. Recipients of this documentation are faced with parsing these files to find and isolate key documents such as materials specifications, performance data, user manuals and design drawings. Client recipients typically don’t require better organization as part of the contract and therefore spend inordinate amounts of time and effort figuring out what they received and determining whether or not it met the requirements of the contract. Inevitably there is litigation, and due diligence preparation for the complaint is exhaustive, requiring hundreds or thousands of man hours.
TIF and PDF images inhibit the document control process by making critical records harder to find using conventional text analysis methods.
Mortgage Backed Securities
Mortgage backed securities (“MBS”) involve the packaging of individual loans into tranches of assets which are then securitized and sold to investors. Each loan has a required number of documents, including documents recorded in the county courthouse as well as forms such as HUD-1 settlement statements. All of these documents are TIF and PDF files and are among the worst quality you will find, due to the varying methods used by multiple parties to scan the original paper records during the transaction. The lack of controls around document generation, document quality and indexing makes auditing and forensic analysis very challenging, requiring armies of people to review, benchmark, and index these records. Triangulation of recorded instruments to loan performance and underwriting databases and documentation is a nightmare.
Risks Associated with TIF and PDF Images
All companies possess what is considered “Restricted Information” from a security standpoint; the classical definition is information whose leakage will cause material harm to the organization. Restricted Information commonly includes such topics as intellectual property, design information, architecture, strategic and proprietary data, trade secrets, source code, financials, patents and contracts. In addition, companies must worry about protecting documents which contain personally identifiable information (“PII”) and protected health information (“PHI”), which are obligations under State and Federal laws and regulations. Much of the data listed above is resident in TIF and PDF images, often with poor or missing descriptions and indices that would otherwise make them easy to identify. Rogue employee behavior would include taking a document containing Restricted Information, creating a TIF image of it (if it is not one already) and emailing it to a personal email account. If the company doesn’t know where these documents are and hasn’t tagged them for what they are, there is no way to prevent this behavior. Similarly, IT departments can’t firewall off these documents from the general employee population if they have no visibility into their existence. The risk of a successful cyber attack is heightened as well, as hackers gain access to poorly controlled Restricted Information.
As one might imagine, the ability of an organization to find and protect Restricted Information is flawed as a function of this TIF and PDF image problem, exposing them to significant risk on several fronts.
In addition to the security issues, companies face information governance challenges around managing retention of records in accordance with corporate policies, whereby policies dictate which types of records must be retained and for how long, and actively managing destruction of records which have met their retention requirements. Over-retention of records incurs unnecessary operational costs, but more importantly exposes the company to inflated discovery and data production costs in the event of litigation and investigations. If records were eligible for destruction but retained anyway, they are in-scope for discovery. Ample case studies have documented the excess costs incurred due to this phenomenon.
TIF and PDF images inhibit governance because they make it hard to identify what kind of record they are.
Current State of Text-Based Analytics
People have generally become spoiled by search engines, based on our infatuation with smart phones and the web in general, and have come to rely on the instant gratification that is delivered by virtue of the vastness of the internet. The expectation of getting high quality results fast and deep allows us to speed along at a quick pace and never wonder or worry about what’s being left behind.
Search engines work, first and foremost, by having access to text; without text there are no results. Take the Google index, for example, which is a giant database maintained by Google containing the text of the web pages, blogs, and articles it has been able to index. On top of the text are more tools, such as those which guess what you’re looking for and suggest words and topics based on natural language and a compiled history of what has been searched for in the past. Since many people are searching at once, it’s feasible to compile statistics on which searches are popular (“trending searches”) and on the sentiment of the blogs and tweets around those searches.
In the context of this white paper, the reader should take away the fact that searches are only as good as the availability of the underlying text. Special tools can be used to allow for a certain amount of text omission or corruption; for instance, OCR’d text in which the letter ‘t’ was replaced with the number ‘7’ and the letter ‘o’ with the number ‘0’ turns “corruption” into “c0rrup7i0n”. Very few people will think to search for such variants, and fewer still have access to the custom advanced search tools needed to screen out these errors.
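The kind of error-tolerant search described above can be sketched in a few lines. This is an illustration only, assuming just the two digit-for-letter substitutions named in the text; the names `OCR_CONFUSIONS`, `normalize` and `tolerant_search` are hypothetical, and a real tool would work from a measured OCR error profile.

```python
# Assumed character confusions: only the two substitutions named in the text.
OCR_CONFUSIONS = str.maketrans({"7": "t", "0": "o"})

def normalize(text: str) -> str:
    """Lower-case the text and undo the assumed digit-for-letter substitutions."""
    return text.lower().translate(OCR_CONFUSIONS)

def tolerant_search(text: str, keyword: str) -> bool:
    """Return True if keyword occurs in text once both are normalized."""
    return normalize(keyword) in normalize(text)

ocr_text = "Evidence of c0rrup7i0n in the OCR'd text"
print("corruption" in ocr_text.lower())         # False: the exact search misses
print(tolerant_search(ocr_text, "corruption"))  # True: the normalized search hits
```

Note the trade-off: normalizing in this way also rewrites genuine digits (a year such as 2007 becomes "2oot"), so the approach recovers missed hits at the expense of some false matches.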
Text clustering technology is interesting because it avoids the large set of tasks associated with building and testing the training models needed to classify documents. Text clustering has two basic types: (1) Literal Sameness, where it is looking for versions of the same document, with identical word patterns in the same sequence, and (2) Semantic Similarity, where synonyms are used to interpret meaning even if the same words aren’t used. For example, an ‘automobile architect’ and a ‘car designer’ have the same meaning. Text clustering typically requires significant volumes of historical content to train on to prime the pump, which means it doesn’t work well on smaller batches of data.
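To make the distinction concrete, the sketch below scores literal sameness as the Jaccard overlap of three-word sequences ("shingles"). This is one common technique chosen for illustration, not necessarily what any particular clustering product uses; the sample sentences are made up.

```python
def shingles(text: str, size: int = 3) -> set:
    """Return the set of consecutive word sequences of the given size."""
    words = text.lower().split()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}

def jaccard(a: set, b: set) -> float:
    """Shared shingles divided by all distinct shingles (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

v1 = "the car designer approved the final drawing"
v2 = "the car designer approved the final drawing today"
v3 = "the automobile architect signed off on the last sketch"

print(jaccard(shingles(v1), shingles(v2)))  # high: near-identical versions
print(jaccard(shingles(v1), shingles(v3)))  # 0.0: similar meaning, no shared word runs
```

The third sentence means roughly the same as the first yet scores zero, which is the gap that semantic similarity methods are meant to close. Both approaches still depend on the words being present in the text file.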
Machine learning is a sub-category of artificial intelligence and has many different types and applications, ranging from self-driving cars to benchmarking network traffic patterns, to classifying text. Like search engines and clustering, it is dependent on good quality text to perform its job.
In the context of document recognition and classification, machine learning usually follows clustering: clustering is used to identify training documents for the machine learning software. For each type of document to be classified using machine learning, 50-100 examples are needed to achieve critical mass for training. Where there are dozens, hundreds or thousands of document types, the training function becomes a massive activity.
When you introduce the TIF and PDF image problem to both the text clustering and machine learning process, bad things happen.
- Because of the poor text, the images get pushed around randomly and end up in unknown places as pure errors.
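- Precision (the share of returned documents that are correct, which falls as false positives rise) and recall (the share of relevant documents that are found, which falls as false negatives rise) range from 60-80%. This compares to a target of 99%.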
- This means the cost of performing quality control on the output is extensive and expensive and barely more effective than brute force coding by humans.
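For readers less familiar with the two metrics, the following sketch computes them from raw counts. The counts are hypothetical and chosen only to land inside the 60-80% range quoted above.

```python
def precision_recall(true_positives: int, false_positives: int, false_negatives: int):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return precision, recall

# Hypothetical counts for one document type within a batch.
p, r = precision_recall(true_positives=700, false_positives=300, false_negatives=300)
print(f"precision={p:.0%} recall={r:.0%}")  # precision=70% recall=70%
```

In this example, 300 of every 1,000 documents tagged as the target type are wrong and 300 genuine examples are never found, which is why the output cannot be trusted without human quality control.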
Again, TIF and PDF images constitute some of the most important, toxic and restricted use documents that an organization possesses and which they seek to manage the most closely for all of the reasons discussed here.
Typical Workflow and Costs to Achieve High Accuracy Levels
For a typical application using text-based clustering followed by machine learning, there are a number of steps in the workflow. The following table identifies all of the resource requirements, end-to-end, that are needed including the cost of software, support from IT, subject matter experts (“SME”) and document review QC personnel.
|Unknown Composition TIF Document Collection|
|Batch Size, Documents:||100,000|
|Pages per Document, Average||3.5|
|Task||Resource||Metric||Quantity||Unit Cost||Batch Cost||Notes|
|Load and Ingest Images||IT||Hours||5||$60||$300|
|Execute Batch and Monitor||IT||Hours||5||$60||$300|
|Stage Data for Text Analytics||IT||Hours||5||$60||$300|
|CLUSTER TEXT FILES|
|Choose Clusters for Submission to Machine Learning Process||SME||Hours||5||$60||$300|
|BUILD/TEST MACHINE LEARNING MODELS|
|Machine Learning Software||Vendor||Document||100,000||$0.02||$2,000|
|Create Initial Models||SME||Hours||20||$60||$1,200|
|Test and Iterate Models||SME||Hours||20||$60||$1,200|
|Initial QC||Data Analyst||Hours||20||$30||$600|
|Coding Program Configuration||IT||Hours||5||$60||$300|
|First Pass||Coding Specialist||Hours||278||$15||$4,167||10 sec/doc|
|Second Pass||Coding Specialist||Hours||278||$15||$4,167||10 sec/doc|
|Elapsed Linear Time, Hours||671|
Takeaways from this illustration include the following.
- The software costs (OCR, clustering, and machine learning) are only about 20-30% of the overall cost.
- There are numerous tasks required by IT to move the work through the multiple stages of activity.
- QC costs comprise 40-50% of the overall cost of the project.
- Some of the costs are non-linear (such as model building, unless the number of document types expands with volume), while others are linear to volume (software, QC).
- Elapsed time to complete all the tasks in the list is measured in weeks (QC can be a multi-person function, depending on the availability of trained staff).
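The coding-pass rows in the table can be reproduced with simple arithmetic. The sketch below uses only figures stated in the table (100,000 documents, 10 seconds per document, $15 per hour); the function name `coding_pass` is illustrative.

```python
def coding_pass(documents: int, seconds_per_doc: float = 10, hourly_rate: float = 15):
    """Return (hours, cost) for one human review pass over the batch."""
    hours = documents * seconds_per_doc / 3600
    return hours, hours * hourly_rate

hours, cost = coding_pass(100_000)
print(round(hours), round(cost))  # 278 4167

# Review effort is linear in volume: doubling the batch doubles hours and cost.
hours_2x, cost_2x = coding_pass(200_000)
print(round(hours_2x), round(cost_2x))  # 556 8333
```

Because each pass scales directly with document count, the human review steps dominate as batches grow, whereas model building is closer to a fixed cost.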
Introduction of Visual Signature Technology
Visual Signature – Defined
Visual signatures are analogous to a human brain remembering what a document looks like, and using that intelligence to look for other documents which look the same. These signatures are algorithmically generated and are based on known examples of target data provided by the client, as well as discovered versions uncovered during processing.
Visual signature technology leverages a form of artificial intelligence called computer vision to generate the signatures as well as a form of deep learning technology called neural networks. NeuralVision Technologies has developed software incorporating these capabilities to perform document recognition and classification in a simple-to-use process that has three main steps as described in the following diagram.
The process of generating this visual signature is consistent and immune to the variability inherent in text-based methods, which look for the presence of common words and word frequencies. As previously discussed, for TIF and PDF images requiring OCR, the useful words may not even be available for analysis. In our process, the visual signature is an immutable and consistent entity; the example below shows how it appears to the human eye.
Each type of document has its own visual signature, with individual distinguishing characteristics. Below are some examples of the visual signatures for different document types.
Modes of Use
These visual signatures are compared to each other and to prior signatures, with prior signatures imparting their document type tag to new signatures being processed. In a steady state process, few new types are found, and batches of data needing primary document type classification are processed and returned.
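The idea of a prior signature imparting its tag to a new one can be sketched as a nearest-match lookup. This is not NeuralVision's implementation, which the paper does not disclose: the sketch assumes signatures can be represented as numeric vectors and uses cosine similarity with an arbitrary threshold. The vectors, tags, threshold and the name `inherit_tag` are all hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def inherit_tag(signature, tagged, threshold=0.95):
    """Return the tag of the most similar prior signature, or None when no
    prior signature reaches the threshold (a possible new document type)."""
    best_tag, best_score = None, threshold
    for prior, tag in tagged:
        score = cosine(signature, prior)
        if score >= best_score:
            best_tag, best_score = tag, score
    return best_tag

# Hypothetical prior signatures that already carry a document type tag.
prior_signatures = [
    ([0.9, 0.1, 0.0, 0.2], "purchase order"),
    ([0.1, 0.8, 0.3, 0.0], "contract"),
]
print(inherit_tag([0.88, 0.12, 0.02, 0.19], prior_signatures))  # purchase order
print(inherit_tag([0.0, 0.1, 0.1, 0.9], prior_signatures))      # None
```

A result of `None` corresponds to the case described above where a new type is found; in a steady state process most incoming signatures would match an existing tag.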
There are two ways to use NeuralVision technology.
1. Feed NeuralVision examples of electronic documents (file types including TIF, PDF, MS Office and JPEG) which are confirmed to be on the target list for data loss prevention (“DLP”). This would include such content as contracts, financials, design documents, intellectual property and other types of data which the organization seeks to protect. Our software generates visual signature models representing those documents. Execute the “seek” command to locate other documents which match the visual signature model.
2. In addition to, or as an alternative to, the above, let the NeuralVision software self-organize the entire population of untagged documents on the fly. Users can browse these clusters of same document types and select and tag those documents as appropriate. Tagged documents can be added to the training sets and used for future scans as necessary.
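The second mode, in which untagged documents organize themselves into groups, can be illustrated with a simple greedy grouping. This is a stand-in technique under stated assumptions, not the product's algorithm: each document is represented by a hypothetical 16-bit fingerprint, and two documents are treated as the same type when their fingerprints differ by at most two bits.

```python
def hamming(a: int, b: int) -> int:
    """Number of differing bits between two integer fingerprints."""
    return bin(a ^ b).count("1")

def self_organize(fingerprints: dict, max_distance: int = 2) -> list:
    """Greedy single-pass grouping: each document joins the first cluster whose
    representative (its first member) is within max_distance bits; otherwise
    it starts a new cluster."""
    clusters = []  # list of (representative fingerprint, [document names])
    for name, fp in fingerprints.items():
        for rep, members in clusters:
            if hamming(rep, fp) <= max_distance:
                members.append(name)
                break
        else:
            clusters.append((fp, [name]))
    return [members for _, members in clusters]

# Hypothetical fingerprints for five untagged files.
docs = {
    "a.tif": 0b1111000011110000,
    "b.tif": 0b1111000011110001,  # 1 bit away from a.tif
    "c.pdf": 0b0000111100001111,
    "d.pdf": 0b0000111100001011,  # 1 bit away from c.pdf
    "e.tif": 0b1010101010101010,
}
print(self_organize(docs))
# [['a.tif', 'b.tif'], ['c.pdf', 'd.pdf'], ['e.tif']]
```

A reviewer would then browse each group and apply one tag to the whole cluster, rather than opening documents one by one.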
NeuralVision technology works equally well on small and large document collections, and persists tags to future data sets.
NeuralVision technology can be used either as a managed cloud service, leveraging our SOC 2-certified cloud computing environment, or as an on-prem, command-line-driven sub-process. Visual models are binary entities and are portable within or across projects. For email systems, NeuralVision can process the attachments to messages and determine whether an attachment matches the visual signature of a known security threat.
About NeuralVision Technologies LLC
NeuralVision Technologies is a software company headquartered in Boston, Massachusetts whose founders have decades of experience in unstructured data applications.
For more information, please visit our website at www.neuralvision.net or contact Brent Stanley, CEO, at 617.455.8184 or email@example.com.