Qanat / قناة

Creating an Open-Source OCR Pipeline for the Textual Traditions of the Islamicate World

Introducing OpenITI's Open-Source OCR Pipeline: Qanat / قناة

The textual production of the diverse “Islamicate” cultures stretching from North Africa to South Asia is one of the most prolific in human history. These vast textual traditions, however, remain woefully understudied and are particularly underrepresented in the Digital Humanities. One of the primary reasons for this lacuna is the lack of an open-source, user-friendly platform that delivers high-quality OCR results. The Qanat project will address this issue by creating an open-source OCR pipeline focused on Islamicate languages (beginning with Persian and Arabic in this grant period, and expanding to Syriac, Turkish, and Urdu in the coming years). Our Qanat prototype combines three components: the Kraken ibn Ocropus OCR program, the Nidaba OCR pipeline, and the crowdsourcing and distributed-task platform Pybossa. We are currently testing the Qanat pipeline, and we hope to release a beta version for limited public use in the summer of 2017. For more information on our OCR tests, please see the working paper that we released in the fall of 2016.