ABSTRACT

In the era of big data, textual information is growing rapidly and is available in many different languages. Because of time constraints, readers often cannot go through all of the text available to them, which makes text summarization necessary. A summary condenses a document so that it is easier to read and understand while preserving its essential content. Many text summarization approaches have been proposed over the years for English and other European languages, but surprisingly few exist for the regional languages of India. This paper presents a study of extractive text summarization methods for multiple Indian and international languages, including Hindi, Kannada, Telugu, Marathi, German, and French. It proposes a system that uses Optical Character Recognition (OCR) to extract text from an uploaded image. The main purpose of OCR is to create editable text from existing documents or image files. The OCR stage also performs sentence detection to preserve the document's structure. The paper further presents a method for automatic sentence extraction using the TextRank algorithm. This approach assigns scores to sentences by weighting features such as term frequency, word occurrence, noun weight, and phrases. The results indicate that the proposed approach achieves higher accuracy than existing methods while maintaining coherence. The system also provides text-to-speech output and translation from one language to another.

Keywords: Natural Language Processing, Optical Character Recognition, Summarization, TextRank algorithm, Text-to-speech.
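
To make the sentence-scoring step named in the abstract concrete, the following is a minimal sketch of TextRank-style extractive summarization. It is not the system described in this paper: it assumes a naive whitespace tokenizer, uses plain word overlap as the sentence similarity, and omits the term-frequency and noun-weight features mentioned above. The function names (split_sentences, tokenize, similarity, textrank, summarize) and the parameter values are illustrative assumptions.

```python
import math
import re


def split_sentences(text):
    # Naive splitter (assumption): break after ., !, ? or the Devanagari danda.
    parts = re.split(r"(?<=[.!?\u0964])\s+", text.strip())
    return [p for p in parts if p]


def tokenize(sentence):
    # Whitespace tokens, lowercased, with surrounding punctuation stripped.
    words = (w.strip(".,;:!?\u0964\"'()").lower() for w in sentence.split())
    return {w for w in words if w}


def similarity(a, b):
    # Word overlap normalised by the log of each sentence's distinct-word count.
    # Sentences with fewer than two words get similarity 0 to avoid log(1) = 0.
    if len(a) < 2 or len(b) < 2:
        return 0.0
    return len(a & b) / (math.log(len(a)) + math.log(len(b)))


def textrank(sentences, damping=0.85, iterations=50):
    # Build a weighted sentence graph and run a fixed number of
    # PageRank-style update rounds over it.
    n = len(sentences)
    tokens = [tokenize(s) for s in sentences]
    weights = [[similarity(tokens[i], tokens[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            rank = sum(weights[j][i] / out_sum[j] * scores[j]
                       for j in range(n) if out_sum[j] > 0)
            new_scores.append((1 - damping) + damping * rank)
        scores = new_scores
    return scores


def summarize(text, k=2):
    # Keep the k highest-scoring sentences, returned in document order.
    sentences = split_sentences(text)
    scores = textrank(sentences)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]


if __name__ == "__main__":
    sample = ("Text summarization shortens long documents. "
              "Extractive summarization selects important sentences from documents. "
              "Bananas are yellow. "
              "Graph ranking scores sentences for extractive summarization.")
    for sentence in summarize(sample, k=2):
        print(sentence)
```

In a pipeline like the one outlined above, the input text would come from the OCR stage and the selected sentences would be passed on to translation and text-to-speech. Those stages are outside the scope of this sketch.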