
Documents analysis and automatic content extraction

Course Instructor: Costin Boiangiu

Syllabus:

Document Image Analysis Systems transform “on-paper” legacy documents into digital ones so that they can be stored, duplicated, and made more accessible, and so that their content becomes available to applications such as text mining and text-to-speech.

Students will learn the algorithms used at each stage of document processing: importing and preprocessing the image (noise removal, contrast enhancement, skew removal, binarization); layout analysis, which classifies blocks (letter, word, paragraph, title, subtitle, image, white space, page number, …); hierarchy analysis, which arranges the objects in a logical order (book title, author, preface, chapter, table of contents, ...); OCR, which converts the raster image to digital text, followed by post-processing to correct spelling and grammar mistakes; and export to a layered format such as PDF or MRC.
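As an illustration of one preprocessing step named above, binarization, here is a minimal sketch in Python. It assumes Otsu's global thresholding method, which the syllabus does not name; the course may cover other techniques. The function names `otsu_threshold` and `binarize` and the tiny sample image are hypothetical, chosen only for this example.

```python
def otsu_threshold(pixels):
    """Return the gray level t (0..255) that maximizes the between-class
    variance when pixels <= t form one class and pixels > t the other.
    Assumption: pixels is a flat list of integers in 0..255."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0      # number of pixels in the dark class so far
    sum0 = 0    # sum of gray levels in the dark class so far
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue            # dark class still empty
        w1 = total - w0
        if w1 == 0:
            break               # bright class empty from here on
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        between_var = w0 * w1 * (mean0 - mean1) ** 2
        if between_var > best_var:
            best_t, best_var = t, between_var
    return best_t


def binarize(image, t):
    """Map a grayscale image (list of rows) to 1 for dark pixels (ink)
    and 0 for bright pixels (background)."""
    return [[1 if p <= t else 0 for p in row] for row in image]


# Hypothetical 2x3 grayscale image: three dark and three bright pixels.
image = [[20, 30, 200],
         [25, 210, 220]]
flat = [p for row in image for p in row]
t = otsu_threshold(flat)
binary = binarize(image, t)
```

A single global threshold works well on evenly lit scans; unevenly lit or degraded pages usually call for locally adaptive methods instead.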

The course uses three main teaching methods: theory lectures given by the instructor; student presentations based on cutting-edge research articles; and a project developed by the students, which can start from a proposed idea, continue an in-progress project, or follow a new idea that the students propose themselves.