Fw: [eGovINDIA] Indian Language Processing Tools

Friends (especially those in Africa):
India has made some advances in local language software. Please see the
material appended below. Africa can benefit from India's knowledge, rich
experience and expertise. At the Africa Regional Meeting held at Accra
(preparatory to the WSIS at Tunis), there was a session on language software
and I drew the attention of Prof. Adama Samasekou to the possibility of
Africa and India working together in this area.
I also suggested that C-DAC and other Indian institutions could be
contacted. Incidentally, as part of the Million Books project, Prof. N
Balakrishnan of the Indian Institute of Science, Bangalore, and his team
have developed some very good translation software. I-Network Uganda might
wish to discuss these issues in one of its meetings.
Best wishes.
Arun
[Subbiah Arunachalam]
----- Original Message -----
From: "eindiagov"
Indian Language Processing Tools http://tdil.mit.gov.in/nlptools/ach-nlptools.htm
Internet Tools and Technologies Developed by the respective Institutes/Organizations
Sanskrit Processing Tools Developed by the respective Institutes/Organizations
Sanskrit Authoring System
Desika Software
Shabadbodha Software
Spell checkers
Internet Tools and Technologies Developed by the respective Institutes/Organizations
Java based Solutions : IIT Kanpur
Displaying Web Documents through Negotiation and Dynamic Rendering : Web authors create documents in a variety of languages using a variety of character sets and fonts. A viewer cannot be expected to have all those fonts and character sets on his/her system, so the client must either download and install the fonts or install some software to help in the process. In a truly portable solution, the client should not need to install any fonts or software at all. A Java-centric solution for displaying Devanagari documents has been developed. A Java applet using the public domain font `Bhagwan' displays the Devanagari documents in a TrueType font. The applet and the font related information total around 100k. The server-side Java applet encodes the glyphs and sends them with the document, so that a Hindi font is not required at the client side for browsing the document. A Hindi search engine has been developed on the Linux platform.
Automatic Font Installer : Users generally have to download fonts to view non-roman web sites, or install the fonts manually, which is a cumbersome task. A single executable has been developed to carry out the process whenever the user chooses the font installation option. The font installer program runs on the server and installs the fonts on the client machine.
URL Addresses in Devanagari : Current browsers do not allow URLs or form fields in HTML pages to be entered in Hindi. The world wide web therefore needs to be multi-lingualised. This involves converting a URL in the local language to an internationally agreed format, such as UNICODE with UTF-8 encoding. The user will be able to fill in the forms in a document fetched in a local language. Further, if the document is available in several languages, the user can choose the language in which to view it. The solution is based on the Swing components in JDK1.2 and assumes the font to be present on the local machine.
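The conversion step described above can be sketched as follows. This is only an illustration in Python, not the Swing-based solution from IIT Kanpur: it assumes the "internationally agreed format" means percent-encoding the UTF-8 bytes of the URL path, and the function name and the sample path are made up for the example.

```python
from urllib.parse import quote, unquote

def encode_devanagari_path(path: str) -> str:
    """Percent-encode the UTF-8 bytes of a URL path, leaving '/' intact.

    Hypothetical helper for illustration; not part of the original tool.
    """
    return quote(path, safe="/", encoding="utf-8")

# Sample path (an assumption): two Devanagari path segments.
encoded = encode_devanagari_path("/हिन्दी/समाचार")
print(encoded)           # ASCII-only form that can travel in a request
print(unquote(encoded))  # decodes back to the Devanagari path
```

The encoded form contains only ASCII characters, so it can be passed through software that knows nothing about Devanagari, and decoding it restores the original text.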
Indian Language Search Engine : The search engine should allow indexing and searching of Devanagari HTML documents. The basic components are the gatherer, the indexer, and the search processor. The indexer and search processor are being designed first, as these two modules deal with the syntax and semantics of the language of the text. The indexer will perform processing such as keyword extraction, stop word removal, stemming (handling different forms of a word), handling of word synonyms, and term weight calculation. The search processor looks up the index to find the documents containing the query keywords, calculates a relevance score, and ranks the documents according to the score. The search engine will also find keywords occurring inside a composite word (combined according to `SANDHI' rules; for example, the s/w will give a match for the keyword `ram' if it finds `rameshwar' in the document). It is assumed that the documents are in ISCII.
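The indexer and search processor described above can be sketched roughly as below. This is a toy illustration, not the IIT Kanpur design: the stop list, the suffix list, the romanised sample documents, the weighting formula and the half-weight for a keyword found inside a longer word are all assumptions. Real sandhi often changes the sounds at the junction, so a plain substring test only covers cases such as `ram' in `rameshwar'.

```python
import math
from collections import Counter, defaultdict

STOP_WORDS = {"ka", "ki", "ke", "hai", "aur"}  # hypothetical romanised stop list
SUFFIXES = ("on", "en")                        # toy stemmer suffixes (assumption)

def stem(word):
    """Strip one known suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def tokenize(text):
    """Keyword extraction: lowercase, drop stop words, stem."""
    return [stem(w) for w in text.lower().split() if w not in STOP_WORDS]

def build_index(docs):
    """Indexer: map each term to per-document term frequencies."""
    index = defaultdict(Counter)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term][doc_id] += 1
    return index

def search(query, index, n_docs):
    """Search processor: score documents and rank them by relevance."""
    scores = defaultdict(float)
    for q in tokenize(query):
        for term, postings in index.items():
            if term == q:
                match = 1.0   # exact keyword match
            elif len(q) >= 3 and q in term:
                match = 0.5   # keyword inside a composite word
            else:
                continue
            idf = math.log(1 + n_docs / len(postings))  # rarer terms weigh more
            for doc_id, tf in postings.items():
                scores[doc_id] += match * tf * idf
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

docs = {
    "d1": "rameshwar ka mandir",
    "d2": "ram aur sita ki katha",
    "d3": "sita ka ghar",
}
index = build_index(docs)
print(search("ram", index, len(docs)))
```

With these sample documents, a query for `ram' ranks the document containing the word itself ahead of the one that only contains `rameshwar'.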
Heritage Website : The website has been developed containing material concerning traditional Indian texts centred around the `Upnishads and the Bhaagwadgita'. The functionality provided is:
- There are font-downloading applications (.exe files). Users can download any Indian language font and install the font.
- The commentaries include some technical words, which have Sanskrit origin. The definitions for such technical words are displayed with a mouse click.
- Links between related shlokas within the same Upnishad/Gita for cross referencing.
- Each hyperised text has a search mechanism by which the user can locate the occurrence of any word and view that particular shloka with its translation and commentary.
- User selected language for viewing the Mool Shlokas.
CD Authoring Tools for Indian Language Documents : The technologies that are being used for publishing on the www viz. HTML, XML, Java, Javascript etc. are also being increasingly used for document delivery over CDs. These days the entire computer related documentation is accessible using the web browsers. It is expected that this trend will pervade into non-technical publishing also. The development of Indian Language CD Publishers ToolBox, `site management' tools and searches integrated with a dictionary are underway.
ActiveX based Solutions - C-DAC, Pune
Web based E-mail : A Hindi e-mail service has been developed which uses the ActiveX technologies available in Internet Explorer 4.0 and later browser versions to enable keyboarding and fonts for Indian languages on the client PC. The service lets the user type text in Hindi for sending as e-mail; the text is converted into HTML format. The converted Hindi text, in HTML with font codes, is delivered to the e-mail address of the user, who can place it on any Web page using any standard HTML editor such as Netscape Composer. The software components, namely the ActiveX controls and Hindi fonts, are downloaded and installed on the client's computer the first time the user accesses the system. Every time the user accesses the e-mail server, a check is made for the installed components. To send or view mail, the user must first create an account on the system by defining a login name and password. An account holder can send a message to any other account holder. The user types the message using the Inscript keyboard overlay. The message is stored in a database; the data on the server, including Hindi e-mail documents, is stored in ISCII. When a request to read an e-mail message is received, the server retrieves the message from the database, creates an HTML file containing the message with the Hindi font information on the fly, and delivers it. Microsoft Visual InterDev is the IDE; it uses ASP (Active Server Pages) to build the web pages and connects to the back-end database using ODBC drivers for MS SQL Server. Queries to the database are made from the web pages using ASP. ActiveX technology based Hindi e-mail, search engine and Bulletin Board System have been developed.
Hindi Bulletin Board System : It is under development. This web based application allows users to create topics for discussion and maintains threads within a topic.
Hindi Search Engine : Development is underway which involves the following:
- Manually surf the net and build indexes for documents in Hindi.
- Invite Hindi language document creators to submit web page URLs with page description and keywords in Hindi to the search engine, i.e. build a web based application to collect data in Hindi.
- Build special search techniques for Hindi based on word morphology/thesaurus/sandhi etc.
- Deliver HTML document index descriptions in Hindi for search results.
- Define standards for Meta-tags etc. for Indian languages such that future spiders can retrieve documents for a particular language.
Multi-lingual E-mail Client : A working prototype has been developed that lets clients send and receive e-mails in Hindi without needing an Internet connection, provided sender and receiver both have this s/w. The application will use technologies like MAPI, Extended MAPI and COM to communicate with the interfaces provided by MS Exchange. The application downloads all the mails received via the POP3 server and stores them locally on the machine. This storage of mails is taken care of by MS Exchange. The application provides access to various folders like InBox, Sent Mails, Deleted Mails and OutBox for the convenience of the user.
Sanskrit Processing Tools Developed by the respective Institutes/Organizations
DESIKA - Centre for Development of Advanced Computing (C-DAC), Bangalore
The software package DESIKA is a Natural Language Understanding System for Sanskrit. It incorporates language generation and analysis modules for plain and accented written Sanskrit texts, and is based on the principles of ancient Indian Sciences. DESIKA aims to process all the words of Sanskrit. It includes generation and analysis (parsing), an exhaustive database based on Amarakosha, the most popular Sanskrit lexicon, a rule base using the grammar rules of Panini's Ashtadhyayi, and heuristics based on Nyaya & Mimamsa sastras for semantic and contextual processing. The software can also analyse Vedic (scriptural) texts. The highlight of DESIKA is the analysis module, a general purpose Sanskrit parser that is currently being extended to handle the dissolution and identification of compound and combined word forms. Vedic analysis is also under way: analysis of the Rigveda and the Taittiriya branch of the Krishna Yajur Veda using the Taittiriya pratishakya and the Vaidika Prakriya of Ashtadhyayi. The DESIKA software is intended to help in understanding a natural language input (typically an isolated sentence) through paraphrasing, voice change, query answering or summarising; to develop a language-independent knowledge representation scheme based on ancient Indian Sciences; to develop tools for linguistic analysis; and to assist in the analysis and presentation of scriptural (accented text) knowledge, phonetic and language research, teaching etc. It was developed on the DOS platform and has now been ported to the Windows platform.
Sanskrit Authoring System - C-DAC, Bangalore
A Sanskrit word processor is under development which will also handle special Sanskrit conjuncts. The requirements to be met by this s/w are:
- Word Processing in Sanskrit
- Statistical Tools like concordance, thesauri, electronic dictionaries etc.
- Transliteration Facility
- Search/Sort Algorithms
- Word Split Programs for Sandhi and Samasa
- Fonts for various scripts, web access, web hosting, publishing etc.
- Poetry Analysis (Textual/metric/statistical)
- Manual content for Amarakosha, Grammar rules, Derivations, Quotes from Vedas (scriptures)
- Epic like Ramayana, Mahabharata, other Puranas, Shastraic texts in sutra and Authentic Reference
- On-line readers/primers of Indian Shastraic texts
- Tools for morphological, syntactic and semantic analysis
- Tools for linguistic analysis like tagging, lemmatising, statistical studies etc.
Syntactic and Semantic Analysis of Sanskrit Sentences - Academy of Sanskrit Research, Melkote
Software for syntactic and semantic analysis of Sanskrit sentences has been developed on the DOS platform with GIST card and is being ported to the Windows platform. The sentence has been taken as the basic unit for analysis since it is the backbone of verbal communication between human beings. The importance of words will be known only when the meaning of the sentence is known. Systematic classification of words and a robust grammar can help in deriving knowledge from Sanskrit and in building a system which will help in the development of Natural Language Processing Systems. The various modules of the system are:
Subanta: It can handle generation and analysis of all the case inflected forms of more than 26,000 stems.
Tinanta: It can handle the conjugational forms of roots, in two voices, ten lakaras and three modes viz. Kevala Tiganta, Nijanta and Sannanta.
Krdanta: It is capable of handling generation, analysis and identification of case inflected forms of 11 types of krdantas of 150 roots.
Databases: 690 Avyayas, 26,000 Nominal stems, 600 Verbal roots, krdanta forms of 600 verbal roots, 5 Taddhita suffixes.
The parts of speech handled for analysis are nouns, pronouns, adjectives, participles, indeclinables, indeclinable participles and verbs. Sentences with multiple adjectives and participles can also be analysed. Sentences constructed by picking any words from the database can be syntactically analysed, but semantic analysis is done only within a limited domain. For the semantic analysis, a matrix has been prepared which consists of 52 sets of nouns with their synonyms, amounting to 300 nouns, and 27 actions denoted by nearly 200 verbs. Syntactic and semantic analysis of a simple passage consisting of not more than 10 simple sentences has been done successfully.
Computer Assisted Sanskrit Teaching & Learning Environment (CASTLE) - Jawahar Lal Nehru University, New Delhi
CASTLE s/w on DOS platform with GIST card has been developed for Sanskrit teaching and learning as a stand-alone application. Under this project, the synthesis aspect of Sanskrit phonology and word morphology has been handled. The various modules developed under this system are:
Pratyahara: It deals with the sound classes of Paninian grammar. It may be described as a shorthand notation to refer to a group of items.
Sandhi: Euphonic combination relating to sound units is called `sandhi'. It is a common module for various types of word formation. A sandhi type depends on the final phoneme of the first word and the initial phoneme of the second word. It also includes a program for internal sandhi called Natva-satva vidhana.
Subanta: This module deals with the nominal inflexion. The System inputs are noun base with its attributes and the output is the 21+3 inflected forms of the noun.
Tiganta: Verbal conjugation is called tiganta. This module takes the verb and lakara (tense/mood) as inputs, and generates 9 conjugated forms of the verb in each pada.
Kridanta: The primary derivatives are called kridanta. The inputs to the system are the semantic condition, verb root and krit suffix. The kridanta form is the output.
Taddhita: The secondary derivatives are called taddhita. System inputs are the semantic condition, noun base and taddhita suffix. Taddhita form is the output.
Samasa: Compound formation is known as samasa. Two or more words are joined to form a new word. The inputs to the system are two or more noun bases, which are characterized by a semantic condition, and the normal suffixes.
Sri-pratyayas: These suffixes are added to primary verbal roots to derive secondary verbal roots. The derived verb is again sent to the tiganta module to generate 9 conjugated forms of the verb in each pada.
The following demonstration modules for learning/teaching of Sanskrit have also been developed:
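As the Sandhi module description says, the rule that applies depends on the final phoneme of the first word and the initial phoneme of the second. A minimal sketch of that lookup is given below. It is not the CASTLE code: it assumes words written in IAST transliteration, covers only a handful of vowel rules (a/ā followed by a, i, u, ai or au), and simply concatenates in every other case, which is wrong for many real word pairs.

```python
import unicodedata

# (final phoneme of first word, initial phoneme of second word) -> result
VOWEL_SANDHI = {}
for final in ("a", "ā"):
    for initial, result in (("a", "ā"), ("ā", "ā"),   # a + a -> ā
                            ("i", "e"), ("ī", "e"),   # a + i -> e
                            ("u", "o"), ("ū", "o")):  # a + u -> o
        VOWEL_SANDHI[(final, initial)] = result

def sandhi_join(first: str, second: str) -> str:
    """Join two IAST words using a small subset of vowel sandhi rules."""
    # Normalise so that a letter like 'ā' is a single code point.
    first = unicodedata.normalize("NFC", first)
    second = unicodedata.normalize("NFC", second)
    # a/ā followed by the diphthongs ai/au: the diphthong remains.
    if first[-1] in ("a", "ā") and second[:2] in ("ai", "au"):
        return first[:-1] + second
    key = (first[-1], second[0])
    if key in VOWEL_SANDHI:
        return first[:-1] + VOWEL_SANDHI[key] + second[1:]
    # No rule in this sketch applies: plain concatenation (an assumption).
    return first + second

print(sandhi_join("deva", "ālaya"))   # a + ā
print(sandhi_join("rāma", "īśvara"))  # a + ī
print(sandhi_join("sūrya", "udaya"))  # a + u
```

A table keyed on the two boundary phonemes keeps the rules as data, so further rules (consonant sandhi, visarga sandhi) could be added without changing the joining function.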
Teaching Varnmala: This module deals with the teaching of the Sanskrit alphabet along with its characteristics. Exercises for testing knowledge of Varnmala have also been prepared.
Sandhi Viccheda: The system takes a word as input, and returns the constituent words.
Subanta Viccheda: The input word is split into the root word and suffix. Besides, the grammatical attributes associated with the root word, i.e. the noun-base, are also displayed.
Tiganta Viccheda: The input word is split into the root word and suffix. Besides, the grammatical attributes associated with the root, which is a verbal root, are also displayed.
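The Sandhi Viccheda step, taking a combined word and returning its constituents, can be illustrated by running a few vowel sandhi rules in reverse and checking the candidates against a word list. This is only a sketch under stated assumptions, not the CASTLE module: the words are in IAST transliteration, the rule table covers only a/ā followed by a, i or u, and the tiny lexicon is made up for the example.

```python
import unicodedata

# Merged vowel -> the (final, initial) vowel pairs that could have produced it.
REVERSE_RULES = {
    "ā": [(f, i) for f in ("a", "ā") for i in ("a", "ā")],
    "e": [(f, i) for f in ("a", "ā") for i in ("i", "ī")],
    "o": [(f, i) for f in ("a", "ā") for i in ("u", "ū")],
}

# Hypothetical lexicon; a real system would use a large stem database.
LEXICON = {"rāma", "īśvara", "deva", "ālaya", "sūrya", "udaya"}

def sandhi_split(word, lexicon=LEXICON):
    """Return every (first, second) pair in the lexicon that yields `word`."""
    word = unicodedata.normalize("NFC", word)
    results = []
    for pos, char in enumerate(word):
        # Case 1: the words were simply concatenated at this position.
        left, right = word[:pos], word[pos:]
        if left in lexicon and right in lexicon:
            results.append((left, right))
        # Case 2: this vowel is the merged result of two boundary vowels.
        for final, initial in REVERSE_RULES.get(char, ()):
            left = word[:pos] + final
            right = initial + word[pos + 1:]
            if left in lexicon and right in lexicon:
                results.append((left, right))
    return results

print(sandhi_split("rāmeśvara"))
print(sandhi_split("devālaya"))
```

Checking candidates against a lexicon is what keeps the number of splits manageable; without it, every `ā`, `e` or `o` in the word would yield several spurious analyses.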
Sanskrit Authoring System
Sanskrit Authoring System including a Sanskrit word processor for use by Sanskrit scholars in text processing etc is being developed at C-DAC, Bangalore.
Desika Software
This software package is a Natural Language Understanding System for Sanskrit, developed at the Indian Heritage Group of the Centre for Development of Advanced Computing (C-DAC), Bangalore, a Scientific Society of the Ministry of Information Technology, Government of India, Ramanashree Plaza, 2/1, Brunton Road, Bangalore (Karnataka). The software incorporates language generation and analysis modules for plain and accented written Sanskrit texts. It is based on the principles of ancient Indian Sciences. DESIKA aims to process all the words of Sanskrit. It includes generation and analysis (parsing), an exhaustive database based on Amarakosha, the most popular Sanskrit lexicon, a rule base using the grammar rules of Panini's Ashtadhyayi, and heuristics based on Nyaya & Mimamsa sastras for semantic and contextual processing. The software can also analyse Vedic (scriptural) texts.
Shabadbodha Software
Shabdhabodha is an interactive application built to analyse the semantic and syntactic structure of Sanskrit sentences. It works on MS-DOS Platform version 6.0 or higher with GIST shell. It has been developed at ASR, Melkote.
Spell checkers
Spell checkers are useful for word processing and are mostly integrated with word processing software. Spell checkers are available in a few Indian languages. The development of spell checkers is covered within the scope of the current projects for corpora development. A Punjabi spell checker has been developed at CEDTI, Mohali.
Special requirements for Indian Language Processing (ILP)
India is a large multilingual society with as many as eighteen constitutionally recognised languages, including English; the national language is Hindi. There are multiple scripts for these languages. With the increase in trade and development across the country, it becomes necessary for people to communicate in more than one language. In such circumstances, Information Technology (IT) appears to be a promising tool for the development of ILP systems which aim at overcoming the language barrier. These ILP tools could be designed using many approaches, such as:
- Natural Language Interface/Environment for Data Input/Output support.
- Operating System level support at the native level for the Indian languages.
- Indian Language shell over the existing operating systems and applications.
- Localising existing applications.
- Developing specific applications.
- Designing language compilers in natural languages.
____________________________
Developed by CDAC,Noida Maintained by Department of Information Technology