The Word On Optical Character Recognition With PixelFlow
At The Phillips Collection Archive
November 19, 2020 | by Hannah Storch
Pixel Acuity has offered the cultural heritage community unparalleled imaging and digitization services for the better part of a decade. Recently, we have added new automations and related offerings to our repertoire. One of the most impactful innovations in cultural heritage imaging technology has been the next-generation Optical Character Recognition (OCR) in our DT PixelFlow software, which turns typed and handwritten documents into searchable text. Pixel Acuity can now not only generate the highest-quality digital images for cultural heritage collections but also create searchable text for the researchers and scholars who access these collections, revolutionizing the way they conduct research.
The Phillips Collection Archive
One of our ongoing projects that leverages PixelFlow’s OCR capabilities is our work with The Phillips Collection Archive in Washington, DC. The Phillips Collection houses modern and contemporary art, while The Phillips Collection Archive contains materials pertaining to the museum’s founding director, Duncan Phillips, and his wife Marjorie. The Archive holds materials documenting the purchase of important pieces of modern and contemporary art from the 1920s to the present. The current project consists of digitizing approximately 100,000 personal photographs, pieces of correspondence, pamphlets, and documents relating to the family and their work with various directors, artists, and galleries. By using PixelFlow’s OCR capabilities, The Phillips Collection Archive is able to transform its collection of typed and handwritten material into fully searchable documents.
Optical Character Recognition (OCR) Application and Process
For our project with The Phillips Collection Archive, we use our OCR technology to create two different types of readable and searchable text files from our digital images: PDF/A and .txt files. We start by capturing the highest-quality and most consistent images of the material, because the better the input, the better the output. To that end, we use RAW rapid capture photography and surpass preservation-grade digitization standards such as Metamorfoze-strict, FADGI 4-star, and ISO 19264. Capturing in RAW preserves all of the information recorded by the camera sensor at the time of capture, without applying compression or discarding any data.
Once all of the images have been captured in the RAW format, they are ready to be run through PixelFlow to create the OCR’d derivatives. Thanks to our modern machine-learning approach, we are able to generate highly accurate OCR’d text in multiple languages and output formats. We also have the flexibility to create a controlled, topic-specific vocabulary, depending on the needs of the collection, which can be used to further increase the specificity and accuracy of the resulting text.
The data produced during the OCR process is then encoded into an hOCR file, which can be converted into the deliverables requested by the client. Our unique approach enables us to offer a wide range of deliverables, including but not limited to PDF, PDF/A, a METS/ALTO sidecar XML, and .txt files.
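As a rough illustration of the hOCR-to-text step, the sketch below pulls the recognized words out of an hOCR document and joins them into plain text, one line of output per recognized line. It is not PixelFlow's implementation, which is not described here. It assumes the common hOCR convention of `ocr_line` spans containing `ocrx_word` spans, and the `HocrTextExtractor` class, the `hocr_to_text` function, and the sample markup are all made up for this example.

```python
from html.parser import HTMLParser


class HocrTextExtractor(HTMLParser):
    """Collects recognized words from hOCR markup, grouped by text line.

    Assumption: lines are <span class="ocr_line"> elements and words are
    <span class="ocrx_word"> elements nested inside them.
    """

    def __init__(self):
        super().__init__()
        self.lines = []          # one list of words per recognized line
        self._span_is_word = []  # one flag per currently open <span>

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        classes = (dict(attrs).get("class") or "").split()
        if "ocr_line" in classes:
            self.lines.append([])
        self._span_is_word.append("ocrx_word" in classes)

    def handle_endtag(self, tag):
        if tag == "span" and self._span_is_word:
            self._span_is_word.pop()

    def handle_data(self, data):
        # Keep text only while inside a word span.
        if any(self._span_is_word) and data.strip():
            if not self.lines:
                self.lines.append([])
            self.lines[-1].append(data.strip())


def hocr_to_text(hocr):
    """Return the plain text of an hOCR string, one output line per OCR line."""
    parser = HocrTextExtractor()
    parser.feed(hocr)
    parser.close()
    return "\n".join(" ".join(words) for words in parser.lines if words)


# Made-up sample markup, not taken from any real collection.
SAMPLE_HOCR = """
<div class='ocr_page' title='bbox 0 0 800 600'>
  <span class='ocr_line' title='bbox 10 10 400 40'>
    <span class='ocrx_word' title='bbox 10 10 90 40'>Sample</span>
    <span class='ocrx_word' title='bbox 100 10 170 40'>typed</span>
    <span class='ocrx_word' title='bbox 180 10 260 40'>letter</span>
  </span>
  <span class='ocr_line' title='bbox 10 50 400 80'>
    <span class='ocrx_word' title='bbox 10 50 90 80'>second</span>
    <span class='ocrx_word' title='bbox 100 50 150 80'>line</span>
  </span>
</div>
"""

plain_text = hocr_to_text(SAMPLE_HOCR)
print(plain_text)
```

Because hOCR also records a bounding box for every word in the `title` attribute, the same file can drive layered formats such as PDF/A, where the text needs to sit over the matching region of the image. The sketch above ignores those coordinates, since a .txt deliverable only needs the words themselves.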
Derivatives and Deliverables
Since The Phillips Collection Archive aims to make the documents and correspondence of Duncan and Marjorie Phillips more accessible to researchers and scholars, they have opted for both PDF/A and .txt files. The PDF/A format layers the OCR’d text over the image of the object and produces a document that researchers can search on their own devices, seeing each match in its original visual context within the document itself (examples of typed and handwritten applications are pictured above). The .txt file (one example is pictured right) holds the text extracted from the image as a separate file, which can be used by other institutional systems such as text-analysis tools or word-cloud generators. These OCR’d deliverables, along with the highest-quality preservation-grade digital images, will allow researchers to delve deeper into The Phillips Collection Archive and learn more about the history of the Museum and the relationships that formed its foundations. Where it once might have taken hours of painstaking research to explore the relationship between The Phillips Collection and The American Federation of the Arts, a researcher can now find all of the documents, both typed and handwritten, pertaining to the Federation or The Phillips with a simple keyword search.
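To give a sense of how the .txt deliverables can feed a keyword search, here is a minimal sketch that scans a folder of text files and reports every line containing a search term, ignoring case. It assumes one UTF-8 .txt file per digitized item in a single folder. The `search_transcripts` function, the file names, and the file contents are hypothetical and exist only for this example; an institution's own search system would likely use a proper index instead.

```python
import tempfile
from pathlib import Path


def search_transcripts(folder, keyword):
    """Return (file name, line number, line text) for each matching line.

    Matching is a case-insensitive substring test, which is the simplest
    possible form of keyword search.
    """
    needle = keyword.casefold()
    hits = []
    for path in sorted(Path(folder).glob("*.txt")):
        text = path.read_text(encoding="utf-8", errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            if needle in line.casefold():
                hits.append((path.name, number, line.strip()))
    return hits


def build_demo_folder(folder):
    """Write two made-up text files so the search has something to scan."""
    Path(folder, "item_0001.txt").write_text(
        "Letter regarding the Federation\nSecond line\n", encoding="utf-8"
    )
    Path(folder, "item_0002.txt").write_text(
        "Notes on a gallery visit\n", encoding="utf-8"
    )


with tempfile.TemporaryDirectory() as demo_folder:
    build_demo_folder(demo_folder)
    demo_hits = search_transcripts(demo_folder, "federation")

for name, number, line in demo_hits:
    print(f"{name}:{number}: {line}")
```

Returning the file name and line number with each hit lets a researcher jump from the search result back to the matching PDF/A or image for that item.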
Opportunities and projects like these allow Pixel Acuity, as a company, to innovate new workflows and adapt new technologies to give our clients the best possible digitization services and imaging experience. We continue to promote advancements, such as machine-learning-powered OCR, within the cultural heritage community because the bottom line is that the best deserves the best.
To learn more about how Pixel Acuity and Digital Transitions can help you with digitization services, software, and consultations, please contact us.