How to create a knowledge repository based on PDF reports

New approaches in data storage allow organisations to move to an electronic environment capturing all raw data from instruments. The use of an industry standard format such as the portable document format (PDF) not only allows capturing reports in an electronic format, but also can improve compliance where necessary. PDF reports, says Freek Varossieau, can build a knowledge repository.

In a previous article in eLab magazine, we addressed the tremendous growth in data, primarily in unstructured data sets (reports, analytical data, images, etc.). Applications generating these data sets can often also print reports based on the interpretation of the experiment, which is yet another unstructured representation of that experiment. People value an immediate and comprehensive view of that information in printed form: we recognise complex structures in the blink of an eye.

Having said this, we immediately run into the limitations of a paper-based system or organisation. Paper is acceptable as long as you know exactly where you have stored the reports and what they contain. Can you still find an old report after a few months? Can you easily compile a batch record from all the paper records? Even worse, what do you know of your neighbour's work? Indexing and retrieving information is not a trivial task with paper records.

A solution to that problem lies in the correct use of reports in PDF format. This approach is a good marriage between printed format and the ability to find and retrieve reports.

An application that allows you to store not only raw data, but also capture, index and store PDF reports is the CyberLAB Knowledge Engineering System. It provides a central repository for all generated data in the organisation while maintaining data integrity.

Report creation

The creation of PDF reports is an easy task, and there are many PDF writers available. However, these do not automatically create a knowledge repository. CyberLAB provides a software application named CyberPrinter, which creates PDF files that can be automatically stored and indexed in the CyberLAB system. Users can either create PDF reports manually or set up CyberPrinter so that reports are automatically captured from instruments and sent to the repository. The latter is often used for applications that handle batches or sequences of samples. Automating these steps makes the creation of your knowledge repository simple and effective.

One of the characteristics of unstructured data is that it is difficult to search. Unlike a strict database environment, where information is neatly stored in tables and records, unstructured data can be searched only with difficulty and sometimes not at all. Searching for batch number AB-1234 in reports manually requires opening and reading each report. This is not the 'promised land' of the paperless laboratory. So, the challenge lies in creating structure out of 'chaos'.

One way is to dump the complete text of the report into a single record in the database. It is simple and straightforward, and all the information is immediately available for searching. It is an excellent approach for large text reports containing extensive information. However, we have still not given the document any structure; we have only created a new, unstructured dataset.
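The full-text approach can be illustrated with a minimal sketch. It assumes the report text has already been extracted from the PDF, and it uses Python's built-in sqlite3 module as a stand-in database; the table layout, function names and sample reports are hypothetical and are not part of CyberLAB.

```python
import sqlite3

def build_repository(reports):
    """Store each report's full text as one record in an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE report (name TEXT PRIMARY KEY, full_text TEXT)")
    conn.executemany("INSERT INTO report VALUES (?, ?)", reports.items())
    conn.commit()
    return conn

def search_reports(conn, term):
    """Return the names of all reports whose text contains the search term."""
    rows = conn.execute(
        "SELECT name FROM report WHERE full_text LIKE ? ORDER BY name",
        (f"%{term}%",),
    )
    return [name for (name,) in rows]

# Hypothetical report texts, as they might look after text extraction from PDF.
SAMPLE_REPORTS = {
    "hplc_run_01.pdf": "Assay report. Batch ID: AB-1234. Result: 99.2 %",
    "hplc_run_02.pdf": "Assay report. Batch ID: AB-1235. Result: 98.7 %",
}

conn = build_repository(SAMPLE_REPORTS)
print(search_reports(conn, "AB-1234"))  # ['hplc_run_01.pdf']
```

The search finds the report that mentions the batch number, but the database still has no notion of a 'batch' field: the number is just a substring of one large text record.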

To obtain a structured dataset it must be possible to extract specific information from a PDF report. CyberLAB allows specific data extraction from PDF reports in various ways. It provides a template extraction tool that lets users define regions of interest containing the information to be extracted (Fig. 1). The layout of information in PDF reports may vary from very simple to complex: information is not necessarily in the same location every time and may vary in size. The CyberLAB extraction template therefore uses key words to identify and extract the necessary information in a structured manner. Once an area has been defined, the extraction template automatically extracts the information and stores it in the CyberLAB database. In the example given in Fig. 2, an extraction template is defined to obtain the Batch ID AB-1234. It demonstrates how easily information can be extracted from any type of report in PDF format. In this way, information coming from any source can be stored and indexed automatically.
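The idea of key-word-driven extraction can be sketched in a few lines. This is not the CyberLAB template tool, whose internals are not described here; it is a simplified illustration that assumes the report text is already available and that each value follows its key word and a colon on the same line. The template, field names and sample report are hypothetical.

```python
import re

# Hypothetical template: field name -> key word that precedes the value.
TEMPLATE = {
    "batch_id": "Batch ID",
    "analyst": "Analyst",
    "result": "Result",
}

def extract_fields(text, template):
    """Find each key word and capture the rest of its line as the field value."""
    fields = {}
    for field, keyword in template.items():
        # Key word, optional spaces, a colon, then the value up to end of line.
        pattern = re.escape(keyword) + r"[ \t]*:[ \t]*(.+)"
        match = re.search(pattern, text)
        if match:
            fields[field] = match.group(1).strip()
    return fields

# Hypothetical report text after extraction from PDF.
REPORT_TEXT = """Assay report
Batch ID: AB-1234
Analyst:   J. Smith
Result: 99.2 %
"""

print(extract_fields(REPORT_TEXT, TEMPLATE))
# {'batch_id': 'AB-1234', 'analyst': 'J. Smith', 'result': '99.2 %'}
```

Because the values are located by key word rather than by fixed position, the same template still works when a line moves or a value changes length. Fields whose key word is absent are simply left out of the result.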

Data mining

Based on the data stored in the database, relationships between data can be investigated. In addition, the database provides a single source for all information created in the organisation. This is in contrast to the legacy systems in use today, where information is stored in different databases, forcing users to collect data from different sources manually. Fig. 3 shows a list of metadata extracted with a PDF template. All information is readily available to internal as well as external programs. By browsing through the database, the user gains access to the information available in the organisation.

Some examples are:

* Traceability of batches through production.

* Trend analyses of key performance indicators.

* Multivariate analysis.

* Graphical display of relations.

* Immediate access to original data as often requested for drug registration.

* Enhanced text searching with, for example, Oracle interMedia, for 'fuzzy logic' and 'sounds like' queries.
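To give a feel for what a fuzzy query does, the sketch below matches a misspelled search term against a list of indexed terms. It uses Python's standard difflib module purely as a stand-in; it does not reproduce how Oracle or CyberLAB implement such queries, and the indexed terms and function name are hypothetical.

```python
import difflib

# Hypothetical compound names extracted from reports into an index.
INDEXED_TERMS = ["paracetamol", "ibuprofen", "acetylsalicylic acid", "caffeine"]

def fuzzy_lookup(query, terms, cutoff=0.8):
    """Return indexed terms that closely resemble the query, best match first."""
    return difflib.get_close_matches(query.lower(), terms, n=3, cutoff=cutoff)

print(fuzzy_lookup("paracetemol", INDEXED_TERMS))  # ['paracetamol']
```

Lowering the cutoff makes the search more tolerant of spelling errors at the cost of more false matches.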

The information stored should not be limited; the dynamic database structure of CyberLAB allows any data in any quantity to be stored and retrieved. Based on the available information, cross-references can be made and the information displayed in a suitable form.

Advantages

In an environment as diverse as the laboratory, one can easily imagine that the information generated by many instruments will give rise to an archiving problem in the long term.

Capturing data in a neutral format like PDF brings many advantages to the organisation. It creates a single point of access to all data generated in the organisation.

And with the new FDA interpretation of 21 CFR Part 11, the use of paper is still allowed for those records not (often) used in regulatory affairs. Allowing the organisation to use PDF reports for regulatory affairs, combined with the powerful indexing capabilities of CyberLAB, improves efficiency while maintaining compliance.

Freek Varossieau is with Scientific Software International BV, Willemstad, The Netherlands. www.scisw.com
