by Lin Yangchen
        Laboratory of Computational Philately
        Coconut Academy of Sciences





This research is in press.

The catalogue listings were digitized by optical character recognition (OCR) using the Tesseract library in the R language for statistical computing. Text string matching was then used to find and correct systematic errors, remove unneeded text, and extract the denominations and catalogue values. The output was quite accurate, with only minor errors such as an occasional missing decimal point or a misread character, for example “S” instead of “5” or “3” instead of “8”. The catalogues came in different formats and fonts, each requiring slightly different code and producing its own kinds of error. Every listing still had to be cross-checked by eye, but automating much of the process saved time across the many sets of coconut definitives.
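The string-matching step can be illustrated with a minimal sketch. The original work was done in R and its code is not shown here, so what follows is a Python stand-in under stated assumptions, not the author's implementation. The line layout (denomination, description, catalogue value), the function name `parse_listing`, the "O" for "0" substitution and the two-decimal value format are all hypothetical. Only the "S" for "5" confusion comes from the text above.

```python
import re

# Assumed OCR confusions inside numeric tokens. "S" for "5" is mentioned in
# the text; "O" for "0" is an added guess for illustration.
LETTER_TO_DIGIT = str.maketrans({"S": "5", "O": "0"})

# Assumed line layout: denomination, free-text description, catalogue value.
LINE = re.compile(r"^(?P<denom>\S+)\s+(?P<desc>.+?)\s+(?P<value>\S+)$")
# Assumed denomination formats: cents such as "5c" or dollars such as "$1".
DENOM = re.compile(r"^(?:\$\d+|\d+c)$")
# Assumed value format: always printed with two decimal places.
VALUE = re.compile(r"^\d+\.\d{2}$")


def parse_listing(line):
    """Correct systematic OCR errors in one line and extract its fields.

    Returns None if the line does not fit the assumed layout. Otherwise
    returns a dict whose "needs_review" flag marks lines that still look
    wrong after correction.
    """
    match = LINE.match(line.strip())
    if match is None:
        return None
    # Only the numeric tokens are corrected; the description is left alone
    # because a capital "S" there may be legitimate.
    denom = match.group("denom").translate(LETTER_TO_DIGIT)
    value = match.group("value").translate(LETTER_TO_DIGIT)
    # A value without a decimal point may have lost it during OCR, so it is
    # flagged for a manual check instead of being guessed at.
    suspicious = DENOM.match(denom) is None or VALUE.match(value) is None
    return {
        "denomination": denom,
        "description": match.group("desc"),
        "value": value,
        "needs_review": suspicious,
    }


if __name__ == "__main__":
    for raw in ["Sc deep green 1.S0", "10c red 150"]:
        print(parse_listing(raw))
```

A misreading such as “3” for “8” yields a perfectly valid number, so no pattern of this kind can detect it. This is why a check by eye remains necessary even when every line parses cleanly.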








