[LIS-Forum] Fw: [eGovINDIA] Indian Language Processing Tools

Subbiah Arunachalam arun at mssrf.res.in
Sun Apr 17 14:19:32 IST 2005


Friends (especially those in Africa):

India has made some advances in local language software. Please see the 
material appended below. Africa can benefit from India's knowledge, rich 
experience and expertise. At the Africa Regional Meeting held at Accra 
(preparatory to the WSIS at Tunis), there was a session on language software 
and I drew the attention of Prof. Adama Samasekou to the possibility of 
Africa and India working together in this area.

I also suggested that C-DAC and other Indian institutions could be 
contacted. Incidentally, as part of the Million Books project, Prof. N 
Balakrishnan of the Indian Institute of Science, Bangalore, and his team 
have developed some very good translation software. I-Network Uganda might 
wish to discuss these issues in one of its meetings.

Best wishes.

Arun
[Subbiah Arunachalam]


----- Original Message ----- 
From: "eindiagov" <eindiagov at yahoo.com>
To: <eGovINDIA at yahoogroups.com>
Sent: Sunday, April 17, 2005 9:32 AM
Subject: [eGovINDIA] Indian Language Processing Tools


>
>
> Indian Language Processing Tools
> http://tdil.mit.gov.in/nlptools/ach-nlptools.htm
>
> Internet Tools and Technologies Developed by the respective
> Institutes/Organizations
> Sanskrit Processing Tools Developed by the respective
> Institutes/Organizations
> Sanskrit Authoring System
> Desika Software
> Shabadbodha Software
> Spell checkers
>
>
> Internet Tools and Technologies Developed by the respective
> Institutes/Organizations
>
> Java based Solutions : IIT Kanpur
>
>    Displaying Web Documents through Negotiation and Dynamic
> Rendering : Web authors create documents in a variety of languages
> using a variety of character sets and fonts. The viewer of a
> document cannot be expected to have all those fonts and character
> sets on his/her system. Thus the client must either download and
> install the fonts or install some software to help in the process.
> A truly portable solution should not require the client to install
> any special fonts or software.
>     A Java-centric solution for displaying Devanagari documents has
> been developed. A Java applet using the public domain font `Bhagwan'
> displays Devanagari documents in a TrueType font. The applet and the
> font-related information together are around 100k. The Server Java
> Applet encodes the glyphs and sends them with the document so that a
> Hindi font is not required at the client side for browsing the
> document. A Hindi search engine has been developed on the Linux
> platform.
>
>    Automatic Font Installer : Users generally have to download
> fonts to view non-Roman web sites, or install the fonts manually,
> which is a cumbersome task. A single executable has been developed
> to carry out the process whenever the user chooses the font
> installation option. The font installer program runs on the server
> and installs the fonts on the client machine.
>
>    URL Addresses in Devanagari : Current browsers do not allow URLs
> or form fields in HTML pages to be entered in Hindi. The World Wide
> Web therefore needs to be made multilingual. This involves
> converting a URL in the local language to an internationally agreed
> format, such as Unicode with UTF-8 encoding. The user will be able
> to fill in the forms of a document fetched in a local language.
> Further, the user can choose the language in which to view the
> document, if the document is available in several languages. The
> solution is based on the Swing components in JDK1.2 and assumes the
> font to be present on the local machine.
>
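As a rough illustration of the URL conversion described above, here is a minimal Python sketch (not the IIT Kanpur Swing-based solution itself) that percent-encodes a Devanagari path as UTF-8 bytes with the standard library. The host name `example.org`, the sample path and the helper name `encode_devanagari_url` are assumptions made purely for the example.

```python
from urllib.parse import quote, unquote


def encode_devanagari_url(scheme_and_host, path):
    """Percent-encode a Unicode path as UTF-8 bytes, leaving '/' intact."""
    return scheme_and_host + quote(path, safe="/", encoding="utf-8")


# "example.org" and the path are made-up values for illustration only.
encoded = encode_devanagari_url("http://example.org", "/राम")
print(encoded)           # ASCII-only form that can travel over HTTP
print(unquote(encoded))  # decodes back to the Devanagari form
```

Decoding with `unquote` recovers the original text, so a browser or server could show the local-language form while transmitting the ASCII one.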
>    Indian Language Search Engine : The search engine should allow
> indexing and searching of Devanagari HTML documents. The basic
> components are the gatherer, the indexer, and the search processor.
> The indexer and the search processor are being designed, as these
> two modules deal with the syntax and semantics of the language of
> the text. The indexer will perform processing such as keyword
> extraction, stop word removal, stemming (handling different forms of
> a word), handling of word synonyms, and term weight calculation. The
> search processor looks up the index to find the documents containing
> the query keywords, calculates a relevance score, and ranks them
> according to the score. The search engine will also find keywords
> that occur inside a composite word (combined according to `SANDHI'
> rules; for example, the software will give a match for the keyword
> `ram' if it finds `rameshwar' in the document). It is assumed that
> the documents are in ISCII.
>
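The indexer and search processor described above can be sketched roughly as follows. This is only an illustration under stated assumptions: it works on romanised text rather than ISCII, the stop word list and sample documents are invented, the term weight is a simple tf-idf style score, and plain substring matching stands in for real sandhi-aware splitting of composite words.

```python
import math
from collections import Counter, defaultdict

# Hypothetical romanised stop words; a real list would be language specific.
STOP_WORDS = {"ka", "ki", "ke", "hai", "aur"}


def tokenize(text):
    """Lower-case, split on whitespace and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def build_index(docs):
    """Indexer: map each term to {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(tokenize(text)).items():
            index[term][doc_id] = tf
    return index


def search(index, n_docs, query):
    """Search processor: score documents and return ids ranked by score."""
    scores = defaultdict(float)
    for keyword in tokenize(query):
        for term, postings in index.items():
            # Crude stand-in for sandhi splitting: the keyword matches any
            # indexed term that contains it ('ram' matches 'rameshwar').
            if keyword in term:
                idf = math.log(1 + n_docs / len(postings))
                for doc_id, tf in postings.items():
                    scores[doc_id] += tf * idf
    return sorted(scores, key=lambda d: (-scores[d], d))


# Invented sample documents.
DOCS = {
    "d1": "rameshwar ka mandir",
    "d2": "ram aur sita",
    "d3": "ganga ki dhara",
}
INDEX = build_index(DOCS)
RESULTS = search(INDEX, len(DOCS), "ram")
print(RESULTS)
```

Substring matching will also produce false matches on unrelated words, which is why a rule-based sandhi splitter would be needed in practice.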
>    Heritage Website : A website has been developed containing
> material on traditional Indian texts centred around
> the `Upnishads and the Bhaagwadgita'. The functionality provided is:
>               Font-downloading applications (.exe files), with which
> users can download and install any Indian language font.
>               The commentaries include some technical words of
> Sanskrit origin. The definitions of such technical words are
> displayed with a mouse click.
>               Links between related shlokas within the same
> Upnishad/Gita for cross referencing.
>               Each hyperised text has a search mechanism by which the
> user can locate the occurrences of any word and view the particular
> shlokas with their translation and commentary.
>               User-selected language for viewing the Mool Shlokas.
>
>     CD Authoring Tools for Indian Language Documents : The
> technologies used for publishing on the WWW, viz. HTML, XML, Java,
> Javascript etc., are also increasingly being used for document
> delivery on CDs. These days the entire body of computer-related
> documentation is accessible using web browsers. It is expected that
> this trend will extend to non-technical publishing also. Development
> of an Indian Language CD Publishers ToolBox, `site management' tools
> and searches integrated with a dictionary is underway.
>
> ActiveX based Solutions - C-DAC, Pune
>
>    Web based E-mail : A Hindi e-mail service has been developed
> which uses the advanced ActiveX technologies available with Internet
> Explorer 4.0 and later versions to enable keyboarding and fonts for
> Indian languages on the client PC. The service lets the user type
> text in Hindi for sending an e-mail in Hindi, which gets converted
> into HTML format. The converted Hindi text, in HTML and with font
> codes, is delivered to the e-mail address of the user, who can just
> place it on any Web page using any standard HTML editor such as
> Netscape Composer.
>             The software components, namely the ActiveX controls and
> Hindi fonts, get downloaded and installed on the client's computer
> the first time the user accesses the system. Every time the user
> accesses the e-mail server, a check is made for the installed
> components.
>             To be able to send/view mail, the user must first
> create an account on the system by defining login name and password.
> An account holder on the system can send a message to any other
> account holder. The user types in the message using the Inscript
> keyboard overlay. The message is stored in a database. The data on
> the server is stored in ISCII. When a request for reading an e-mail
> message is received, the server retrieves the message from the
> database and creates an HTML file containing the message with the
> Hindi font information on the fly and delivers it.
>             Microsoft Visual Interdev is the IDE, which uses the
> power of ASP (Active Server Pages) to make web pages and connect to
> the back end Database using ODBC drivers for MS SQL server. Using
> ASP, queries have been made to the database from the webpage.
> ActiveX technology based Hindi e-mail, search engine and Bulletin
> Board System have been developed. Hindi e-mail also stores documents
> in ISCII format.
>
>    Hindi Bulletin Board System : It is under development. This web
> based application allows users to create topics for discussion and
> maintains threads within a topic.
>
>    Hindi Search Engine : Development is underway which involves the
> following:
>
>                Manually surf the net and build indexes for
> documents in Hindi.
>                Invite creators of Hindi language documents to
> submit web page URLs, with page descriptions and keywords in Hindi,
> to the search engine, i.e. build a web based application to collect
> data in Hindi.
>                Build special search techniques for Hindi based on
> word morphology/thesaurus/sandhi etc.
>                Deliver HTML document index descriptions in Hindi for
> search results.
>                Define standards for meta-tags etc. for Indian
> languages such that future spiders can retrieve documents for a
> particular language.
>
>     Multi-lingual E-mail Client : A working prototype has been
> developed that lets clients send and receive e-mails in Hindi
> without needing an Internet connection, provided that sender and
> receiver both have this software. The application will use
> technologies like MAPI, Extended MAPI and COM to communicate with
> the interfaces provided by MS Exchange. The application downloads
> all the mails received via the POP3 server and stores them locally
> on the machine. This storage of mails is taken care of by MS
> Exchange. The application provides access to various folders like
> InBox, Sent Mails, Deleted Mails and OutBox for the convenience of
> the user.
>
> Sanskrit Processing Tools Developed by the respective
> Institutes/Organizations
>
> DESIKA - Centre for Development of Advanced Computing (C-DAC),
> Bangalore
>
>       The software package DESIKA is a Natural Language
> Understanding System for Sanskrit. The software incorporates
> language generation and analysis modules for plain and accented
> written Sanskrit texts. It is based on the principles of ancient
> Indian Sciences. DESIKA aims to process all the words of Sanskrit.
> It includes generation and analysis (parsing), an exhaustive
> database based on Amarakosha, the most popular Sanskrit lexicon, a
> rule base using the grammar rules of Panini's Ashtadhyayi, and
> heuristics based on Nyaya & Mimamsa sastras for semantic and
> contextual processing. The software can also analyse Vedic
> (scriptural) texts.
>       The highlight of DESIKA is the analysis module, a general
> purpose Sanskrit parser that is currently being extended to handle
> the dissolution and identification of compound and combined word
> forms. Vedic analysis is also under way, covering the Rigveda and
> the Taittiriya branch of the Krishna Yajur Veda and using Taittiriya
> pratishakya and the Vaidika Prakriya of Ashtadhyayi.
>       The DESIKA software helps in understanding a natural language
> input (typically an isolated sentence) through paraphrasing, voice
> change, query answering or summarising. It also aims to develop a
> language-independent knowledge representation scheme based on
> ancient Indian Sciences, to develop tools for linguistic analysis,
> and to assist in the analysis & presentation of scriptural (accented
> text) knowledge, phonetic and language research, teaching etc. It
> was developed on the DOS platform and has now been ported to the
> Windows platform.
>
> Sanskrit Authoring System - C-DAC, Bangalore
>
>       A Sanskrit word processor is under development which will even
> handle special Sanskrit conjuncts. The requirements this software
> will cater for are:
>
> Word Processing in Sanskrit
> Statistical Tools like concordance, thesauri, electronic
> dictionaries etc.
> Transliteration Facility
> Search/Sort Algorithms
> Word Split Programs for Sandhi and Samasa
> Fonts for various scripts, web access, web hosting, publishing etc.
> Poetry Analysis (Textual/metric/statistical)
> Manual content for Amarakosha, Grammar rules, Derivations, Quotes
> from Vedas (scriptures)
> Epic like Ramayana, Mahabharata, other Puranas, Shastraic texts in
> sutra and Authentic Reference
> On-line readers/primers of Indian Shastraic texts
> Tools for morphological, syntactic and semantic analysis
> Tools for linguistic analysis like tagging, lemmatising,
> statistical studies etc.
>
> Syntactic and Semantic Analysis of Sanskrit Sentences - Academy of
> Sanskrit Research, Melkote
>
>   Software for syntactic and semantic analysis of Sanskrit
> sentences has been developed on the DOS platform with GIST card and
> is being ported to the Windows platform. The sentence has been taken
> as the basic unit of analysis since it is the backbone of verbal
> communication between human beings. The importance of words is
> known only when the meaning of the sentence is known. Systematic
> classification of words and a robust grammar can help in deriving
> knowledge from Sanskrit and in building a system which will help in
> the development of Natural Language Processing Systems. The various
> modules of the system are:
>
>       Subanta: It can handle generation and analysis of all the
> case inflected forms of more than 26,000 stems.
>       Tinanta: It can handle the conjugational forms of roots, in
> two voices, ten lakaras and three modes viz. Kevala Tiganta, Nijanta
> and Sannanta.
>       Krdanta: It is capable of handling generation, analysis and
> identification of case inflected forms of 11 types of krdantas of
> 150 roots.
>       Databases: 690 Avyayas, 26,000 Nominal stems, 600 Verbal
> roots, krdanta forms of 600 verbal roots, 5 Taddhita suffixes.
>
> The parts of speech handled in analysis are nouns, pronouns,
> adjectives, participles, indeclinables, indeclinable participles and
> verbs. Sentences with multiple adjectives and participles can also
> be analysed. Sentences constructed by picking any words from the
> database can be syntactically analysed, but semantic analysis is
> done only within a limited domain. For semantic analysis, a matrix
> has been prepared which consists of 52 sets of nouns with their
> synonyms, amounting to 300 nouns, and 27 actions denoted by nearly
> 200 verbs. Syntactic and semantic analysis of a simple passage of
> not more than 10 simple sentences has been done successfully.
>
> Computer Assisted Sanskrit Teaching & Learning Environment
> (CASTLE) - Jawaharlal Nehru University, New Delhi
>
>
>       CASTLE s/w on DOS platform with GIST card has been developed
> for Sanskrit teaching and learning as a stand-alone application.
> Under this project, the synthesis aspect of Sanskrit phonology and
> word morphology has been handled. The various modules developed
> under this system are:
>
>        Pratyahara: It deals with the sound classes of Paninian
> grammar. It may be described as a shorthand notation to refer to a
> group of items.
>        Sandhi: Euphonic combination relating to sound units is
> called `sandhi'. It is a common module for various types of word
> formation. A sandhi type depends on the final phoneme of the first
> word and the initial phoneme of the second word. It also includes a
> program for internal sandhi called Natva-satva vidhana.
>       Subanta: This module deals with the nominal inflexion. The
> System inputs are noun base with its attributes and the output is
> the 21+3 inflected forms of the noun.
>       Tiganta: Verbal conjugation is called tiganta. This module
> takes the verb and lakara (tense/mood) as inputs, and generates 9
> conjugated forms of the verb in each pada.
>       Kridanta: The primary derivatives are called kridanta. The
> inputs to the system are the semantic condition, verb root and krit
> suffix. The kridanta form is the output.
>       Taddhita: The secondary derivatives are called taddhita.
> System inputs are the semantic condition, noun base and taddhita
> suffix. Taddhita form is the output.
>       Samasa: Compound formation is known as samasa. Two or more
> words are joined to form a new word. The inputs to the system are
> two or more noun bases, which are characterized by a semantic
> condition, and the normal suffixes.
>       Sri-pratyayas: These suffixes are added to primary verbal
> roots to derive secondary verbal roots. The derived verb is again
> sent to the tiganta module to generate 9 conjugated forms of the
> verb in each pada.
>
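To make two of the modules listed above more concrete, here is a small Python sketch of pratyahara expansion over the Shiva Sutras and of a few vowel sandhi rules chosen from the final phoneme of the first word and the initial phoneme of the second. It is an illustration only and not the CASTLE code: the function names are invented, text is assumed to be in IAST romanisation, and only a handful of sandhi rules are included.

```python
import unicodedata

# Shiva Sutras as (sounds, marker) pairs. Consonants are listed without the
# inherent 'a'; the marker is the closing "it" letter of each sutra.
SHIVA_SUTRAS = [
    (["a", "i", "u"], "Ṇ"), (["ṛ", "ḷ"], "K"), (["e", "o"], "Ṅ"),
    (["ai", "au"], "C"), (["h", "y", "v", "r"], "Ṭ"), (["l"], "Ṇ"),
    (["ñ", "m", "ṅ", "ṇ", "n"], "M"), (["jh", "bh"], "Ñ"),
    (["gh", "ḍh", "dh"], "Ṣ"), (["j", "b", "g", "ḍ", "d"], "Ś"),
    (["kh", "ph", "ch", "ṭh", "th", "c", "ṭ", "t"], "V"),
    (["k", "p"], "Y"), (["ś", "ṣ", "s"], "R"), (["h"], "L"),
]


def expand_pratyahara(first, marker):
    """List the sounds from `first` up to the first sutra closed by `marker`."""
    result, started = [], False
    for sounds, it_letter in SHIVA_SUTRAS:
        for sound in sounds:
            started = started or sound == first
            if started:
                result.append(sound)
        if started and it_letter == marker:
            return result
    raise ValueError(f"no pratyahara {first}-{marker}")


# A few vowel sandhi rules: (final of word 1, initial of word 2) -> result.
VOWEL_SANDHI = {
    ("a", "a"): "ā", ("a", "ā"): "ā", ("ā", "a"): "ā",
    ("a", "i"): "e", ("a", "ī"): "e", ("a", "u"): "o", ("a", "ū"): "o",
}
DIPHTHONGS = ("ai", "au")


def apply_sandhi(first, second):
    """Join two words, applying a rule if one matches; else keep them apart."""
    first = unicodedata.normalize("NFC", first)
    second = unicodedata.normalize("NFC", second)
    if first.endswith(DIPHTHONGS) or second.startswith(DIPHTHONGS):
        return first + " " + second  # diphthongs are outside this sketch
    merged = VOWEL_SANDHI.get((first[-1], second[0]))
    if merged is None:
        return first + " " + second  # no rule in this small table
    return first[:-1] + merged + second[1:]


print(expand_pratyahara("a", "C"))    # all vowels
print(apply_sandhi("deva", "indra"))  # devendra
```

The marker Ṇ closes two sutras; this sketch always stops at the first one reached, which is a simplification.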
>        The following demonstration modules for learning/teaching of
> Sanskrit have also been developed:
>
>        Teaching Varnmala: This module deals with the teaching of
> the Sanskrit alphabet along with its characteristics. Exercises for
> testing knowledge of Varnmala have also been prepared.
>        Sandhi Viccheda: The system takes a word as input, and
> returns the constituent words.
>        Subanta Viccheda: The input word is split into the root word
> and suffix. Besides, the grammatical attributes associated with the
> root word, i.e. the noun-base are also displayed.
>        Tiganta Viccheda: The input word is split into the root word
> and suffix. Besides, the grammatical attributes associated with the
> root, which is a verbal root, are also displayed.
>
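Sandhi Viccheda, as described above, takes a word and returns its constituents. One way to picture it is to run a few sandhi rules in reverse and accept a split only when both halves are known words. The Python sketch below does just that; it is not the CASTLE implementation, and the rule table, the tiny lexicon and the name `split_sandhi` are assumptions for illustration (IAST romanisation, precomposed characters).

```python
# A few vowel sandhi rules: (final of word 1, initial of word 2) -> result.
VOWEL_SANDHI = {
    ("a", "a"): "ā",
    ("a", "ā"): "ā",
    ("a", "i"): "e",
    ("a", "u"): "o",
}

# Hypothetical lexicon; a real system would use a large word database.
LEXICON = {"deva", "indra", "sūrya", "udaya", "ālaya"}


def split_sandhi(word, lexicon=LEXICON):
    """Return every (left, right) split whose halves are both in the lexicon."""
    splits = []
    for pos, char in enumerate(word):
        for (final, initial), merged in VOWEL_SANDHI.items():
            if char != merged:
                continue
            # Undo the rule: restore the two vowels that were merged.
            left = word[:pos] + final
            right = initial + word[pos + 1:]
            if left in lexicon and right in lexicon:
                splits.append((left, right))
    return splits


print(split_sandhi("devendra"))   # [('deva', 'indra')]
print(split_sandhi("sūryodaya"))  # [('sūrya', 'udaya')]
```

Checking candidates against a lexicon is what keeps the reversed rules from producing meaningless splits; with a larger lexicon, several valid splits may be returned for one word.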
>
> Sanskrit Authoring System
>
>       Sanskrit Authoring System including a Sanskrit word processor
> for use by Sanskrit scholars in text processing etc is being
> developed at C-DAC, Bangalore.
>
> Desika Software
>
>        This software package is a Natural Language Understanding
> System for Sanskrit, developed at the Indian Heritage Group of the
> Centre for Development of Advanced Computing (C-DAC), Bangalore, a
> Scientific Society of the Ministry of Information Technology,
> Government of India, Ramanashree Plaza, 2/1, Brunton Road, Bangalore
> (Karnataka). The software incorporates language generation and
> analysis modules for plain and accented written Sanskrit texts. It
> is based on the principles of ancient Indian Sciences. DESIKA aims
> to process all the words of Sanskrit. It includes generation and
> analysis (parsing), an exhaustive database based on Amarakosha, the
> most popular Sanskrit lexicon, a rule base using the grammar rules
> of Panini's Ashtadhyayi, and heuristics based on Nyaya & Mimamsa
> sastras for semantic and contextual processing. The software can
> also analyse Vedic (scriptural) texts.
>
> Shabadbodha Software
>
>        Shabdhabodha is an interactive application built to analyse
> the semantic and syntactic structure of  Sanskrit sentences.  It
> works on MS-DOS Platform version 6.0 or higher with GIST shell.  It
> has been developed at ASR, Melkote.
>
> Spell checkers
>
>        Spell checkers are useful for word processing and are
> mostly integrated with word processing software. Spell checkers
> are available in a few Indian languages. The development of spell
> checkers is covered within the scope of the current projects for
> corpora development.
> A Punjabi spell checker has been developed at CEDTI, Mohali.
>
>
> Special requirements for Indian Language Processing (ILP)
>
> India is a large multilingual society with as many as eighteen
> constitutionally recognised languages, including English, and with
> Hindi as the national language. There are multiple scripts for these
> languages. With the increase in trade and development across the
> country, it becomes necessary for people to communicate in more than
> one language. In such circumstances, Information Technology (IT)
> appears to be a promising tool for the development of ILP systems
> which aim at overcoming the language barrier. These ILP tools could
> be designed using many approaches, such as:
>
> Natural Language Interface/Environment for Data Input/Output
> support.
> Operating System level support at the native level for the Indian
> languages.
> Indian Language shell over the existing operating systems and
> applications.
> Localising existing applications.
> Developing specific applications.
> Designing language compilers in natural languages.
>
> ____________________________
>
> Developed by C-DAC, Noida
> Maintained by Department of Information Technology
>
>
>
>
>
> 



