Fw: [eGovINDIA] Indian Language Processing Tools

17 Apr 2005

      Friends (especially those in Africa):

India has made some advances in local language software. Please see the 
material appended below. Africa can benefit from India's knowledge, rich 
experience and expertise. At the Africa Regional Meeting held at Accra 
(preparatory to the WSIS at Tunis), there was a session on language software 
and I drew the attention of Prof. Adama Samasekou to the possibility of 
Africa and India working together in this area.

I also suggested that C-DAC and other Indian institutions could be 
contacted. Incidentally, as part of the Million Books project, Prof. N 
Balakrishnan of the Indian Institute of Science, Bangalore, and his team 
have developed some very good translation software. I-Network Uganda might 
wish to discuss these issues in one of its meetings.

Best wishes.

Arun
[Subbiah Arunachalam]

----- Original Message ----- 
From: "eindiagov" <eindiagov@yahoo.com>
To: <eGovINDIA@yahoogroups.com>
Sent: Sunday, April 17, 2005 9:32 AM
Subject: [eGovINDIA] Indian Language Processing Tools
...
Indian Language Processing Tools
http://tdil.mit.gov.in/nlptools/ach-nlptools.htm
Internet Tools and Technologies Developed by the respective
Institutes/Organizations
Sanskrit Processing Tools Developed by the respective
Institutes/Organizations
Sanskrit Authoring System
Desika Software
Shabadbodha Software
Spell checkers
Internet Tools and Technologies Developed by the respective
Institutes/Organizations
Java based Solutions : IIT Kanpur
Displaying Web Documents through Negotiation and Dynamic
Rendering : Web authors create documents in a variety of languages
using a variety of character sets and fonts. It is not possible for
the viewer of the document to have all those fonts and character
sets present on his system. Thus either the client is required to
download the fonts or install these on his/her system or install
some software on his system to help in the process. For a truly
portable solution, the client need not specially install any fonts
or software on his system.
    A Java-centric solution for displaying Devanagari documents
has been developed. A Java applet using the public domain font
`Bhagwan' displays the Devanagari documents in a TrueType font.
The applet and the font-related information total around 100k. The
Server Java Applet encodes the glyphs and sends them with the
document so that no Hindi font is required on the client side for
browsing the document. A Hindi search engine has been developed on
the Linux platform.
Automatic Font Installer : Users generally have to download fonts to
view non-Roman web sites, or install the fonts manually, which is a
cumbersome task. A single executable has been developed to carry out
the process whenever the user chooses the font installation option.
The font installer program runs on the server and installs the fonts
on the client machine.
URL Addresses in Devanagari : Current browsers do not allow URLs
or HTML form fields to be entered in Hindi. Therefore, the World
Wide Web needs to be made multilingual. It
involves conversion of URL in the local language to the
internationally agreed format, such as UNICODE, with UTF-8 encoding.
The user will be able to fill the forms in the document fetched in a
local language. Further, the user is allowed to choose between a set
of languages in which to view the document, if the document is
available in various languages. The solution is based on the Swing
components in JDK1.2 and assumes the font to be present on the local
machine.
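As a minimal sketch of the URL conversion described above, the following Python snippet percent-encodes the UTF-8 bytes of a Devanagari URL path and decodes it back. The function names and the example path are hypothetical and used only for illustration; the solution described above was built with Swing components in JDK 1.2, not Python.

```python
from urllib.parse import quote, unquote


def encode_devanagari_path(path: str) -> str:
    """Percent-encode the UTF-8 bytes of a URL path, keeping '/' as-is."""
    return quote(path, safe="/", encoding="utf-8")


def decode_devanagari_path(encoded: str) -> str:
    """Reverse the encoding so the path can be shown in Devanagari again."""
    return unquote(encoded, encoding="utf-8")


if __name__ == "__main__":
    path = "/समाचार/हिंदी"  # hypothetical path, for illustration only
    encoded = encode_devanagari_path(path)
    print(encoded)                                  # ASCII-only form of the path
    print(decode_devanagari_path(encoded) == path)  # round trip check
```

The encoded form contains only ASCII characters, so it can travel through software that does not understand Devanagari, while the user continues to see and type the local-language form.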
Indian Language Search Engine : The search engine should allow
indexing and searching of Devanagari HTML documents. The basic
components are gatherer, indexer, and Search processor. Indexer and
Search processor are being designed as these two modules deal with
syntax and semantics of the language of the text. Indexer will
perform processing such as keyword extraction, stop word removal,
stemming (handling different forms of a word), handling of word
synonyms, and term weight calculation. The Search Processor looks up
the index to find the documents containing the query keywords,
calculates a relevance score, and ranks them according to the score.
This search engine will also match keywords occurring inside a
composite word (combined according to `SANDHI' rules; for example, the
s/w will give a match for the keyword `ram' if it finds `rameshwar'
in the document). It is assumed that the documents are in ISCII.
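The indexer and search processor described above can be sketched roughly as follows. This is an illustrative Python outline under several assumptions, not the IIT Kanpur design: it works on romanised text rather than ISCII, the stop word list is hypothetical, stemming and synonym handling are omitted, and a plain substring test stands in for real sandhi-aware matching of composite words.

```python
import math
from collections import Counter, defaultdict

# Hypothetical stop word list (romanised), for illustration only.
STOP_WORDS = {"ka", "ki", "ke", "hai", "aur"}


def tokenize(text):
    """Lower-case, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def build_index(docs):
    """Indexer: map each term to {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, count in Counter(tokenize(text)).items():
            index[term][doc_id] = count
    return index


def search(index, n_docs, query):
    """Search processor: score documents and return ids, best first."""
    scores = Counter()
    for keyword in tokenize(query):
        for term, postings in index.items():
            # Substring test: `ram' also matches inside `rameshwar'.
            if keyword in term:
                idf = math.log(1 + n_docs / len(postings))
                for doc_id, tf in postings.items():
                    scores[doc_id] += tf * idf  # simple TF-IDF term weight
    return [doc_id for doc_id, _ in scores.most_common()]


if __name__ == "__main__":
    docs = {1: "rameshwar mandir", 2: "ganga nadi", 3: "ram aur sita"}
    index = build_index(docs)
    print(search(index, len(docs), "ram"))
```

A substring test will also produce false matches that a sandhi-aware splitter would reject, so it should be read only as a placeholder for that component.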
Heritage Website : The website has been developed containing
material concerning traditional Indian texts centred around
the `Upnishads and the Bhaagwadgita'. The functionality provided is:
      There are font-downloading applications (.exe files). Users
        can download any Indian language font and install it.
      The commentaries include some technical words of Sanskrit
        origin. The definitions of such technical words are displayed
        with a mouse click.
      Links between related shlokas within the same Upnishad/Gita for
        cross-referencing.
      Each hyperised text has a search mechanism by which the user
        can locate the occurrence of any word and view that particular
        shloka with its translation and commentary.
      User-selected language for viewing the Mool Shlokas.
CD Authoring Tools for Indian Language Documents : The
technologies that are being used for publishing on the www viz.
HTML, XML, Java, Javascript etc. are also being increasingly used
for document delivery over CDs. These days the entire computer
related documentation is accessible using the web browsers. It is
expected that this trend will pervade into non-technical publishing
also. The development of Indian Language CD Publishers
ToolBox, `site management' tools and searches integrated with a
dictionary are underway.
ActiveX based Solutions - C-DAC, Pune
Web based E-mail : A Hindi e-mail service has been developed which
uses advanced ActiveX technologies available with Internet Explorer
4.0 and later versions to enable keyboarding and fonts for Indian
languages on the client PC. This service lets the user type text in
Hindi for sending an e-mail, which gets converted into HTML format.
The converted Hindi text, in HTML with font codes, is delivered to
the e-mail address of the user, who can place it on any Web page
using any standard HTML editor such as Netscape Composer.
            The software components, namely ActiveX Controls and
Hindi Fonts, get downloaded and installed on the client's computer
the first time the user accesses the system. Every time the user
accesses the e-mail server, a check is made for the installed
components.
            To be able to send/view mail, the user must first
create an account on the system by defining login name and password.
An account holder on the system can send a message to any other
account holder. The user types in the message using the Inscript
keyboard overlay. The message is stored in a database. The data on
the server is stored in ISCII. When a request for reading an e-mail
message is received, the server retrieves the message from the
database and creates a HTML file containing the message with the
Hindi font information on the fly and delivers it.
            Microsoft Visual Interdev is the IDE, which uses the
power of ASP (Active Server Pages) to make web pages and connect to
the back end Database using ODBC drivers for MS SQL server. Using
ASP, queries have been made to the database from the webpage.
ActiveX technology based Hindi e-mail, search engine and Bulletin
Board System has been developed. Hindi e-mail also stores documents
in ISCII format.
Hindi Bulletin Board System : It is under development. This web
based application allows users to create topics for discussion and
maintains threads within a topic.
Hindi Search Engine : Development is underway which involves the
following:
      Manually surf the net and build indexes for documents in Hindi.
      Invite Hindi language document creators to submit web page URLs
        with page description and keywords in Hindi to the search
        engine, i.e. build a web based application to collect data in
        Hindi.
      Build special search techniques for Hindi based on word
        morphology/thesaurus/sandhi etc.
      Deliver HTML document index description in Hindi for search
        results.
      Define standards for Meta-tags etc. for Indian languages such
        that future spiders can retrieve documents for a particular
        language.
Multi-lingual E-mail Client : A working prototype has been
developed that lets clients send and receive e-mails in Hindi
without needing an Internet connection, provided sender and receiver
both have this s/w. The application will use technologies like MAPI,
Extended MAPI and COM to communicate with the interfaces provided by
MS Exchange. The application downloads all the mails received via
the POP3 server and stores them locally on the machine. This storage
of mails is taken care of by MS Exchange. The application provides
access to various folders like InBox, Sent Mails, Deleted Mails and
OutBox for the convenience of the user.
Sanskrit Processing Tools Developed by the respective
Institutes/Organizations
DESIKA - Centre for Development of Advanced Computing (C-DAC),
Bangalore
The Software package, DESIKA is a Natural Language
Understanding System for Sanskrit. This software incorporates
language generation and analysis modules for plain and accented
written Sanskrit texts. It is based on the principles of ancient
Indian Sciences. DESIKA aims to process all the words of Sanskrit,
includes generation and analysis (parsing), has an exhaustive
database based on Amarakosha, the most popular Sanskrit lexicon,
rule base using the grammar rules of Panini's Ashtadhyayi and
heuristics based on Nyaya & Mimamsa sastras for semantic and
contextual processing. This software can also analyse Vedic
(scriptural) texts.
      The highlight of DESIKA is the analysis module, which is a
general purpose Sanskrit parser currently being extended to handle
the dissolution and identification of compound and combined word
forms. Vedic analysis is also under way, covering the Rigveda and the
Taittiriya branch of the Krishna Yajur Veda, using the Taittiriya
pratishakya and the Vaidika Prakriya of the Ashtadhyayi.
      The DESIKA software helps in understanding a natural language
input (typically an isolated sentence) through paraphrasing, voice
change, query answering or summarising, to develop a language-
independent knowledge representation scheme based on ancient Indian
Sciences, to develop tools for linguistic analysis and to assist in
analysis & presentation of scriptural (accented text) knowledge,
phonetic and language research, teaching etc. It was developed on
the DOS platform and has now been ported to the Windows platform.
Sanskrit Authoring System - C-DAC, Bangalore
A Sanskrit word processor is under development which will even
handle special Sanskrit conjuncts. The requirements which will be
catered for by this s/w are:
Word Processing in Sanskrit
Statistical Tools like concordance, thesauri, electronic
dictionaries etc.
Transliteration Facility
Search/Sort Algorithms
Word Split Programs for Sandhi and Samasa
Fonts for various scripts, web access, web hosting, publishing etc.
Poetry Analysis (Textual/metric/statistical)
Manual content for Amarakosha, Grammar rules, Derivations, Quotes
from Vedas (scriptures)
Epic like Ramayana, Mahabharata, other Puranas, Shastraic texts in
sutra and Authentic Reference
On-line readers/primers of Indian Shastraic texts
Tools for morphological, syntactic and semantic analysis
Tools for linguistic analysis like tagging, lemmatising,
statistical studies etc.
Syntactic and Semantic Analysis of Sanskrit Sentences - Academy of
Sanskrit Research, Melkote
Software for syntactic and semantic analysis of Sanskrit
sentences has been developed on DOS platform with GIST card and is
being ported to Windows platform. The sentence has been considered
the basic unit for analysis since it is the backbone of verbal
communication between the human beings. The importance of words will
be known only when the meaning of sentence is known. Systematic
classification of words and a robust grammar can help in deriving
the knowledge from Sanskrit and build a system which will help in
the development of Natural Language Processing Systems. The various
modules of the system are:
      Subanta: It can handle generation and analysis of all the
case inflected forms of more than 26,000 stems.
      Tinanta: It can handle the conjugational forms of roots, in
two voices, ten lakaras and three modes viz. Kevala Tiganta, Nijanta
and Sannanta.
      Krdanta: It is capable of handling generation, analysis and
identification of case inflected forms of 11 types of krdantas of
150 roots.
      Databases: 690 Avyayas, 26,000 Nominal stems, 600 Verbal
roots, krdanta forms of 600 verbal roots, 5 Taddhita suffixes.
The parts of speech handled for analysing are nouns, pronouns,
adjectives, participles, Indeclinables, Indeclinable participle and
verbs. Sentences with multiple adjectives and participles can also
be analysed. Sentences constructed by picking up any words from the
database can be syntactically analysed. But semantic analysis is
done within a limited domain. For handling the semantic analysis, a
matrix has been prepared which consists of 52 sets of nouns with
their synonyms amounting to 300 nouns, 27 actions denoted by nearly
200 verbs. Syntactic and semantic analysis of simple passage
consisting of not more than 10 simple sentences has been done
successfully.
Computer Assisted Sanskrit Teaching & Learning Environment
(CASTLE) - Jawaharlal Nehru University, New Delhi
CASTLE s/w on DOS platform with GIST card has been developed
for Sanskrit teaching and learning as a stand-alone application.
Under this project, the synthesis aspect of Sanskrit phonology and
word morphology has been handled. The various modules developed
under this system are:
Pratyahara: It deals with the sound classes of Paninian
grammar. It may be described as a shorthand notation to refer to a
group of items.
       Sandhi: Euphonic combination relating to sound units is
called `sandhi'. It is a common module for various types of word
formation. A sandhi type depends on the final phoneme of the first
word and the initial phoneme of the second word. It also includes a
program for internal sandhi called Natva-satva vidhana.
      Subanta: This module deals with the nominal inflexion. The
System inputs are noun base with its attributes and the output is
the 21+3 inflected forms of the noun.
      Tiganta: Verbal conjugation is called tiganta. This module
takes the verb and lakara (tense/mood) as inputs, and generates 9
conjugated forms of the verb in each pada.
      Kridanta: The primary derivatives are called kridanta. The
inputs to the system are the semantic condition, verb root and krit
suffix. The kridanta form is the output.
      Taddhita: The secondary derivatives are called taddhita.
System inputs are the semantic condition, noun base and taddhita
suffix. Taddhita form is the output.
      Samasa: Compound formation is known as samasa. Two or more
words are joined to form a new word. The inputs to the system are
two or more noun bases, which are characterized by a semantic
condition, and the normal suffixes.
      Sri-pratyayas: These suffixes are added to primary verbal
roots to derive secondary verbal roots. The derived verb is again
sent to the tiganta module to generate 9 conjugated forms of the
verb in each pada.
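The Sandhi module above combines two words according to the final phoneme of the first and the initial phoneme of the second. A minimal sketch of that idea in Python follows. It assumes romanised (IAST) input and covers only a handful of vowel sandhi rules (lengthening, guna and vrddhi); the function name and rule tables are illustrative and are not taken from CASTLE.

```python
import unicodedata

# Long vowels written with escapes: \u0101 = a-macron, \u012b = i-macron,
# \u016b = u-macron.
SHORT = {"\u0101": "a", "\u012b": "i", "\u016b": "u"}  # long -> short
LONG = {"a": "\u0101", "i": "\u012b", "u": "\u016b"}   # short -> long
GUNA = {"i": "e", "u": "o"}      # a/aa + i/ii -> e,  a/aa + u/uu -> o
VRDDHI = {"e": "ai", "o": "au"}  # a/aa + e    -> ai, a/aa + o    -> au


def join_vowel_sandhi(first: str, second: str) -> str:
    """Join two romanised words using a few simplified vowel sandhi rules."""
    first = unicodedata.normalize("NFC", first)
    second = unicodedata.normalize("NFC", second)
    if not first or not second:
        return first + second
    f = SHORT.get(first[-1], first[-1])    # final phoneme, length ignored
    s = SHORT.get(second[0], second[0])    # initial phoneme, length ignored
    stem, rest = first[:-1], second[1:]
    if second.startswith(("ai", "au")):    # diphthong survives after a/aa
        return stem + second if f == "a" else first + second
    if f in LONG and f == s:               # similar vowels merge into long
        return stem + LONG[f] + rest
    if f == "a" and s in GUNA:
        return stem + GUNA[s] + rest
    if f == "a" and s in VRDDHI:
        return stem + VRDDHI[s] + rest
    return first + second                  # no rule in this sketch applies


if __name__ == "__main__":
    print(join_vowel_sandhi("deva", "ālaya"))
    print(join_vowel_sandhi("sūrya", "udaya"))
```

Consonant and visarga sandhi, and internal sandhi such as Natva-satva vidhana, need further rule tables and are left out of this sketch.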
The following demonstration modules for learning/teaching of
Sanskrit have also been developed:
       Teaching Varnmala: This module deals with the teaching of
the Sanskrit alphabet along with the characteristics of its letters.
Exercises for testing knowledge of Varnmala have also been prepared.
       Sandhi Viccheda: The system takes a word as input, and
returns the constituent words.
       Subanta Viccheda: The input word is split into the root word
and suffix. Besides, the grammatical attributes associated with the
root word, i.e. the noun-base are also displayed.
       Tiganta Viccheda: The input word is split into the root word
and suffix. Besides, the grammatical attributes associated with the
root, which is a verbal root, are also displayed.
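The Viccheda modules above split an input word into a root part and a suffix and report the associated grammatical attributes. A rough Python sketch of a Tiganta-style split follows. It is an assumption-laden illustration: it covers only the nine present-tense active endings of a thematic stem such as bhav-, treats the thematic vowel as part of the ending, uses Western person labels, and the table and function names are hypothetical rather than taken from CASTLE.

```python
import unicodedata


def _nfc(text: str) -> str:
    return unicodedata.normalize("NFC", text)


# Nine endings = three persons x three numbers (one pada, one lakara).
ENDINGS = {
    _nfc(ending): attrs
    for ending, attrs in {
        "ati": ("third", "singular"),
        "ataḥ": ("third", "dual"),
        "anti": ("third", "plural"),
        "asi": ("second", "singular"),
        "athaḥ": ("second", "dual"),
        "atha": ("second", "plural"),
        "āmi": ("first", "singular"),
        "āvaḥ": ("first", "dual"),
        "āmaḥ": ("first", "plural"),
    }.items()
}


def split_verb(word: str):
    """Return (stem, ending, person, number), or None if nothing matches."""
    word = _nfc(word)
    # Try longer endings first so "athaḥ" is preferred over shorter ones.
    for ending in sorted(ENDINGS, key=len, reverse=True):
        if word.endswith(ending) and len(word) > len(ending):
            person, number = ENDINGS[ending]
            return word[: -len(ending)], ending, person, number
    return None


if __name__ == "__main__":
    for form in ("bhavati", "bhavanti", "bhavāmaḥ"):
        print(form, "->", split_verb(form))
```

A full analyser would also need the other lakaras, both padas, and a lookup from the stem back to the verbal root, none of which this sketch attempts.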
Sanskrit Authoring System
Sanskrit Authoring System including a Sanskrit word processor
for use by Sanskrit scholars in text processing etc is being
developed at C-DAC, Bangalore.
Desika Software
This Software package is a Natural Language Understanding
System for Sanskrit, developed at the Indian Heritage Group of the
Centre for Development of Advanced Computing (C-DAC), Bangalore, a
Scientific Society of the Ministry of Information Technology,
Government of India, Ramanashree Plaza, 2/1, Brunton Road, Bangalore
(Karnataka). This software incorporates language generation and
analysis modules for plain and accented written Sanskrit texts. It
is based on the principles of ancient Indian Sciences. DESIKA aims
to process all the words of Sanskrit, includes generation and
analysis (parsing), has an exhaustive database based on Amarakosha,
the most popular Sanskrit lexicon, rule base using the grammar
rules of Panini's Ashtadhyayi and heuristics based on Nyaya &
Mimamsa sastras for semantic and contextual processing. This software
can also analyse Vedic (scriptural) texts.
Shabadbodha Software
Shabdhabodha is an interactive application built to analyse
the semantic and syntactic structure of Sanskrit sentences. It
works on MS-DOS Platform version 6.0 or higher with GIST shell. It
has been developed at ASR, Melkote.
Spell checkers
Spell checkers are useful for word processing and are
mostly integrated with word processing software. Spell checkers
are available in a few Indian languages. The development of spell
checkers is covered within the scope of the current projects for
corpora development.
A Punjabi spell-checker has been developed at
CEDTI, Mohali.
Special requirements for Indian Language Processing (ILP)
India is a large multilingual society with as many as eighteen
constitutionally recognised languages, including English; the
national language is Hindi. There are multiple scripts for these
languages. With increase in trade and development across the country
it becomes necessary for the people to communicate in more than one
language. In such circumstances, Information Technology(IT) appears
to be a promising tool for the development of ILP systems which aim
at overcoming the language barrier. These ILP tools could be
designed using many approaches such as :
Natural Language Interface/Environment for Data Input/Output
support.
Operating System level support at the native level for the Indian
languages.
Indian Language shell over the existing operating systems and
applications.
Localising existing applications.
Developing specific applications.
Designing language compilers in natural languages.
____________________________
Developed by CDAC,Noida
Maintained by Department of Information Technology

Subbiah Arunachalam
