For the past few months, Nicole Friedman, until recently the data analyst at the Open Society Archives, has helped a Matchbox partner tackle the seemingly insurmountable: getting 50,000+ physical documents in order and digitized. It has been no small task, but thanks to Nicole, we've learned a lot along the way. So if you've got a digitization challenge, read on!
Digitizing documents can be incredibly useful for organizations to make physical documents accessible to others (your team, the public, etc). Some organizations receive physical documents when they request public information from their governments. This could be information on spending, allocation, etc. If the organization wants to make this information accessible to the public for stronger transparency, the documents can be scanned and uploaded to the Internet (hopefully in an organized way). This is what we mean by “digitizing” physical documents. Other organizations may want to digitize physical documents to archive evidence of abuse for future use in achieving accountability.
There are many reasons for organizations to digitize their physical documents. Below are some considerations and steps we’ve learned about this process.
Before you start: determine what information will be recorded
Start by determining exactly what information you want to capture. Think about what metadata is useful for your organization to record. Is it important to keep track of the date that the document was received? Do you need to keep track of where the physical version of this document is stored (for example, in a file cabinet in the basement)? Make a list of these types of metadata.
Then think about the type of information that is important for your audience. What information will they want to know? What questions will they want to answer with these documents? Identify and list these types of information. (Bonus: having this information clear at the start will help you when it comes to designing a file-naming system. See section further down on file naming!)
The importance of scanners
Choosing the correct scanner will save you time. A lot of time. There are a variety of high-quality scanners available (here’s a spreadsheet of our own scanner research) and it is important that the scanner is compatible with the work you want to do.
To help you make the right decision you may want to consider the following:
- Do you need to batch scan? Batch scanning allows you to scan a large number of documents at once.
- Think about the shape and form of the documents (is it A4 or A3 size?). Is your scanner going to be able to scan to the size you need?
- Think about why you are scanning. Are you scanning the documents to preserve them in a high-quality format (for historical archiving, etc)? If you are, then you will need a high-resolution scanner.
Scanning and pre-sets
You will need to connect your scanner to your computer and make sure you have the correct software for the scanning process. This software often comes with the scanner, but if not you can use a high-quality program called IrfanView (only available for Windows). Note that IrfanView is freeware rather than open source: it is free for non-commercial use.
You will then need to establish the scanning pre-sets. These settings tell your scanner how you want your document scanned. The pre-sets commonly include the following: capture resolution, bit depth, file format and paper size. The pre-sets you choose will depend on your scanning needs. Do you want a high-quality image or will a lower quality do? We have included a brief outline on scanning pre-sets below:
- Capture resolution: this is measured in dots per inch (dpi). The lower the dpi, the lower the resolution and the lower the quality of your scan.
- Bit depth: this tells your scanner how many colors are available when scanning the document. The higher the bit depth, the more colors and shades the image will have.
- File format: this is the way in which you will store your scanned document. Common formats include PDF, TIFF and JPEG. The format you choose will depend on what you want to do with the scanned document and the size you need the file to be.
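To get a feel for how these pre-sets interact, here is a minimal Python sketch that estimates the pixel dimensions and uncompressed size of a single scanned page. The function name `estimate_scan` and the paper-size table are our own illustration, not part of any scanner software, and real files will usually be smaller once the chosen file format compresses them.

```python
# Paper sizes in millimetres (width, height). A4 and A3 are the sizes
# mentioned earlier; add others as needed.
PAPER_SIZES_MM = {"A4": (210, 297), "A3": (297, 420)}

MM_PER_INCH = 25.4


def estimate_scan(paper: str, dpi: int, bit_depth: int) -> dict:
    """Estimate pixel dimensions and uncompressed size of one scanned page.

    bit_depth is the number of bits per pixel, e.g. 1 for black and white,
    8 for grayscale, 24 for full color.
    """
    width_mm, height_mm = PAPER_SIZES_MM[paper]
    width_px = round(width_mm / MM_PER_INCH * dpi)
    height_px = round(height_mm / MM_PER_INCH * dpi)
    total_bits = width_px * height_px * bit_depth
    return {
        "width_px": width_px,
        "height_px": height_px,
        # Round up to whole bytes.
        "uncompressed_bytes": (total_bits + 7) // 8,
    }


if __name__ == "__main__":
    for depth in (1, 8, 24):
        print(depth, estimate_scan("A4", 300, depth))
```

Under these assumptions, an A4 page at 300 dpi and 24-bit color comes to roughly 26 MB before compression, while the same page in 1-bit black and white is about 1 MB. Multiplied across tens of thousands of pages, that difference is worth weighing before you settle on your pre-sets.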
Figuring out a file-naming convention
In order to store your scanned documents efficiently it is important to develop a clear file-naming system and stick to it. The filename is the name you give to each individual document upon scanning. An organization may scan thousands of documents. If it isn’t clear what is in a particular file, you could waste days opening each document to find what you’re looking for.
The filename should not be too long or too complicated and should include information about the document that you consider important. The filename should allow you to find a file easily when conducting a search. Types of information to include could be: the year the document was issued, a one-word description of the content and the name of the organization that issued it.
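As one possible way to apply such a convention consistently, the Python sketch below builds a filename from the three pieces of information suggested above. The `year_description_organization` pattern, the `build_filename` helper and the example values are our own assumptions for illustration; adapt the order and separators to whatever your team agrees on.

```python
import re


def _slug(text: str) -> str:
    """Lowercase the text and replace runs of other characters with hyphens.

    Note: this simple version keeps only a-z and 0-9, so accented or
    non-Latin characters are dropped.
    """
    cleaned = re.sub(r"[^a-z0-9]+", "-", text.strip().lower()).strip("-")
    if not cleaned:
        raise ValueError(f"nothing usable left in {text!r}")
    return cleaned


def build_filename(year: int, description: str, organization: str,
                   extension: str = "pdf") -> str:
    """Build a filename such as 2014_budget_ministry-of-finance.pdf."""
    return f"{year}_{_slug(description)}_{_slug(organization)}.{extension}"


if __name__ == "__main__":
    # Hypothetical example document.
    print(build_filename(2014, "Budget", "Ministry of Finance"))
```

Generating names with a small helper like this, rather than typing them by hand, keeps spelling and separators identical across thousands of files, which is what makes searching reliable later.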
Any extra information that you want to save, but not include in the filename, can be recorded in a scan log. This is a spreadsheet where you can keep a detailed record of information relating to a document or batch of documents.
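A scan log does not have to be complicated. The following Python sketch appends one row per document to a CSV file that any spreadsheet program can open. The column names are assumptions based on the metadata examples mentioned earlier (date received, physical storage location), and the function name `append_to_scan_log` is hypothetical; use whichever fields you listed during planning.

```python
import csv
from pathlib import Path

# Assumed columns; replace with the metadata your organization decided on.
LOG_FIELDS = ["filename", "date_received", "physical_location", "notes"]


def append_to_scan_log(log_path, entry: dict) -> None:
    """Append one document's details to the scan log, creating it if needed."""
    log_path = Path(log_path)
    is_new = not log_path.exists() or log_path.stat().st_size == 0
    with log_path.open("a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=LOG_FIELDS)
        if is_new:
            writer.writeheader()  # column headings go in only once
        writer.writerow(entry)    # missing fields are left blank


if __name__ == "__main__":
    # Hypothetical entry for illustration.
    append_to_scan_log("scan_log.csv", {
        "filename": "2014_budget_ministry-of-finance.pdf",
        "date_received": "2014-03-12",
        "physical_location": "Basement, cabinet 3",
    })
```

Because the log is plain CSV, it can be shared, sorted and filtered without any special software.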
Getting the best quality scan
Before scanning, inspect each document for folded pages and staples. If you have any poor-quality documents, do not include them in a batch scan, as the resulting scans will not all be of the same quality. Instead, put them aside and scan them later. If a document is of low quality and difficult to read, you might consider using a flatbed scanner (as opposed to a batch scanner).
Once your documents are scanned, check to make sure they are readable. Make sure each page is aligned and has no large margins. Double-check the file format and filename, and make sure that the number of files corresponds to the number of physical documents.
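Some of these checks can be partly automated. The Python sketch below compares the number of files in a folder with the number of physical documents you counted, and flags files with an unexpected format or no content. The `check_batch` function and the folder name are hypothetical, and a script like this cannot judge readability or alignment, so it complements a visual check rather than replacing it.

```python
from pathlib import Path


def check_batch(folder, expected_count: int, extension: str = "pdf") -> list:
    """Return a list of problems found in a folder of scanned files."""
    files = sorted(p for p in Path(folder).iterdir() if p.is_file())
    problems = []
    if len(files) != expected_count:
        problems.append(
            f"expected {expected_count} files, found {len(files)}"
        )
    for path in files:
        if path.suffix.lower() != f".{extension}":
            problems.append(f"{path.name}: unexpected file format")
        elif path.stat().st_size == 0:
            problems.append(f"{path.name}: file is empty")
    return problems


if __name__ == "__main__":
    # Hypothetical folder and document count.
    for problem in check_batch("scans/batch_01", expected_count=50):
        print(problem)
```

An empty list means the counts and formats match; anything else tells you which batch to recheck before the paper goes back into storage.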
Storing your documents
Finally, think about storage. Make sure you return your physical documents to the correct storage place. It is a good idea to use air-tight boxes and acid-free folders or envelopes. Containers should be clearly labeled and stored off the ground, preferably high up, to avoid possible water damage.
Are you or your organization facing any challenges with archiving and digitization? Do you have any suggestions or tips? Feel free to share your experiences with us below! Also, check out our list of additional resources. We love feedback so let us know what you think or if you have any resources to add.
- Sustainability of Digital Formats Planning by Library of Congress
- Recommended Formats Doc by Library of Congress
- List of Toolkits, Guides, Manuals and Guidelines from the International Council on Archives
- Collection of resources on archiving from the Responsible Data Forum