Wednesday, 11 November 2015

Centre for Computational Linguistics (CCL)

This proposal for the establishment of a Centre for Computational Linguistics (CCL) in Chennai by the Government of Tamilnadu is submitted by Prof. N. Deiva Sundaram (former Director, Linguistic Studies Unit, University of Madras).
The proposal contains the following parts:
1.      A brief note on the importance of Computational Linguistics in the development of Indian languages into e-languages (electronic languages)
2.      Vision and Mission Statement and Objectives of the proposed CCL.

1.    Importance of Computational Linguistics in the development of Indian languages
Information Technology, Communication and Language:
The role of Information Technology in the modern stage of social development (the “Information Society”) is indisputable. Its importance keeps growing with the advance of globalization. Man – Machine (computer) communication is needed for most tasks in every field, from day-to-day activities to large industrial and business processes.
In this communication, alongside non-verbal means such as charts, pictures and diagrams, the verbal means, that is, language, plays a key role. Language is the prime vehicle in which information is encoded, by which it is accessed and through which it is disseminated.
Language is represented in both graphical (“written”) and sound (“spoken”) media. Thus, to interact with the computer, we may have to use both the written and the spoken forms of a language. That is, we should be able to send our request to the computer and receive its response through the written language, the spoken language, or both.
For the computer to process natural language, both to understand it and to generate it, knowledge of natural language must be provided to it. The capacity for natural language cognition that the human brain possesses should be given to the computer. The science which deals with this is called Natural Language Processing (NLP).

Natural Language Processing (NLP):
Natural Language Processing refers to the interaction between the computer and natural languages, that is, to making the computer identify, understand, process and produce utterances of natural human language, as the human brain does.
The study of language processing within the human brain is part of the Science of Cognition or Human Intelligence. Likewise, language processing by the computer, that is, NLP, is part of the Science of Artificial Intelligence (AI).
The components of Natural Language Understanding (NLU) systems convert samples of human language into more computer-readable representations, such as parse trees or first-order logic, which are easier for computers to manipulate. Natural Language Generation (NLG) systems convert information from computer databases into readable human language.
The knowledge of human language needed by the human brain to engage in complex language behavior can be separated into the following six categories:
1.      Phonetics and Phonology – the study of linguistic sounds
2.      Morphology – the study of the meaningful components of words
3.      Syntax – the study of the structural relationship between words
4.      Semantics – the study of meaning
5.      Pragmatics – the study of how language is used to accomplish goals
6.      Discourse – the study of linguistic units larger than a single utterance
When the computer uses the above-mentioned linguistic knowledge in NLP, its most important task is to resolve ambiguity at each of these levels.
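As a small illustration of ambiguity at the word-class level, the English word “book” can be a noun or a verb, and the neighbouring word usually settles the choice. The Python sketch below is only a toy under stated assumptions: the lexicon, the tag names (DET, PART, NOUN, VERB and so on) and the function disambiguate are hypothetical and are not part of this proposal or of any existing tool.

```python
# Toy lexicon: each word maps to the set of word classes it may belong to.
# Entries and tag names are illustrative assumptions, not a real resource.
LEXICON = {
    "i": {"PRON"},
    "the": {"DET"},
    "a": {"DET"},
    "to": {"PART"},
    "want": {"VERB"},
    "read": {"VERB"},
    "flight": {"NOUN"},
    "book": {"NOUN", "VERB"},  # the ambiguous word
}


def disambiguate(words):
    """Choose one word class per word, using the class of the previous word."""
    tags = []
    for word in words:
        candidates = LEXICON.get(word.lower(), {"UNK"})
        if len(candidates) == 1:
            tag = next(iter(candidates))      # unambiguous word
        elif tags and tags[-1] == "DET" and "NOUN" in candidates:
            tag = "NOUN"                      # a determiner is followed by a noun
        elif tags and tags[-1] == "PART" and "VERB" in candidates:
            tag = "VERB"                      # infinitival "to" is followed by a verb
        else:
            tag = sorted(candidates)[0]       # fallback: arbitrary but deterministic
        tags.append(tag)
    return list(zip(words, tags))


print(disambiguate("I read the book".split()))
print(disambiguate("I want to book a flight".split()))
```

In the first sentence “book” follows a determiner and is tagged as a noun; in the second it follows “to” and is tagged as a verb. Real systems replace such hand-written rules with the theories, models and algorithms discussed in the next section.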
Computational Linguistics (CL):
The science which deals with the various theories, models and algorithms to resolve the different ambiguities in Natural Language Processing is called Computational Linguistics (CL). 
The above-mentioned theories, models and algorithms are all drawn from the standard toolkits of Computer Science, Mathematics and Linguistics. Thus, the science of Computational Linguistics is a multi-disciplinary one. It is concerned with the computational aspects of the human language faculty. It has applied and theoretical components.
Theoretical Computational Linguistics:
It develops formal models simulating aspects of the human language faculty and implements them as computer programmes. These programmes constitute the basis for the evaluation and further development of the theories. This forms the theoretical component of Computational Linguistics.
Applied Computational Linguistics / Language Technology (LT):
With the help of the theories, models and algorithms of CL, various applications such as word processors, Text-to-Speech (TTS) systems, Automatic Speech Recognizers (ASR), Optical Character Recognizers (OCR) and automatic Machine Translation systems are being developed. This comes under the field of Language Technology (LT), the applied component of Computational Linguistics.
The science of Computational Linguistics and its applied part, “Language Technology”, thus help both to verify the linguistic theories that attempt to represent the human language faculty and to let us communicate with the computer (“Man – Machine Interface”) in order to carry out various tasks.
We need the help of the Language Technology enterprise for the following:
-        to secure “the rights of the people to benefit from the opportunity to easily access and effectively process information”,
-        to help the industries “in the globalization of the economy to effectively communicate and manage information in an international context”,
-        to help people “to better communicate, to provide them with the possibility of accessing information in a more natural way, to support more effective ways of exchanging information and control its growing mass”,
-        to provide easy access to multilingual information systems and to offer the possibility of handling the information they carry in a meaningful way.

Language Technology helps not only with Man – Machine communication but also with communication among ourselves through the computer (“Man – Machine – Man”). The latter communication can take place through written language, spoken language, or both. Automatic Speech Recognizers (ASR), Text-to-Speech (TTS) systems, Optical Character Recognizers (OCR) and Machine Translation systems are some of the software that support these types of communication.
Such communication software saves time and energy. Moreover, it is helpful to differently abled persons. OCR and TTS may help visually challenged persons to hear digital materials without the help of human readers. ASR may help hearing-impaired persons to read digital speech.
E-language planning for Tamil:
(E-language) Status Planning:
With the advent of globalization in the later part of the 20th century, distances in time and space shrank considerably, territorial boundaries lost their literal relevance, and planners were compelled to think globally and act locally. To meet the urgent needs of the globalized environment, language planners have come to realize that attaining e-language status is mandatory for any modern language, since it enables the speech community to participate effectively in the world community in the era of globalization.
A language attains e-language status only when it is ready for all the communication and information exchange activities through electronic media that the pressures of globalization make necessary. To reach that stage, the language should be equipped with all computing tools and systems, from Word Processor to Man – Machine Interface.
Attaining e-language status without further loss of time has become obligatory in order to avail the benefits of globalization. Otherwise the language and its nation state would be pushed back by decades, depriving its citizens of the scientific and technological advancements that globalization brings.


(E-language) Corpus Planning:
Once e-language status is planned, the natural next step is Corpus Planning, from the phonological level to the discourse level. All the tools and facilities, namely computer encoding, font development, keyboard development, corpus development, Morphological Parsers, Syntactic Parsers, Semantic Analyzers, and Pragmatic and Discourse Analyzers, should be developed in the target language.
(E-language)  Acquisition Planning:
Under Acquisition Planning, e-learning facilities, namely e-dictionaries, e-thesauri, e-grammars and e-lessons, should be developed in the target language to enable non-native learners to study the language or apply it in the desired field.
E-language Planning and Computational Linguistics:
‘E-language Planning’ here refers to the planning of the various activities that make a natural language suitable for Language Technology.
‘E-language Planning’ activities may be divided into two parts, namely:
1.      To undertake the necessary steps to develop various real-world applications for the target language, using the avenues offered by Language Technology;
2.      To undertake the necessary research, analysis and development of tools for the target language from the perspective of Natural Language Processing and Computational Linguistics.

Computational Linguistics Centres in other parts of the world:
Computational Linguistics and Language Technology have become an important branch of Applied Linguistics. Various theories and models have been developed and applied to various natural languages. In most of the universities in the developed and developing countries, separate centres for this branch have been started.
This development has led to the emergence of many innovative applications, such as Automatic Speech Recognizers, Speech Synthesizers and Machine Translation systems, for many natural languages. These all contribute much to the present globalization process. Multinational bodies such as the European Parliament and multinational industries such as Ford enjoy the benefits of the achievements of Computational Linguistics and Language Technology. The problems arising from multilingual situations are also being solved with the help of these developments.
Computational Linguistics and Language Technology also help us to solve the problems arising from the “Digital Divide”. If computer applications are available in local languages, local people can enjoy the benefits of computers and gain access to the enormous body of knowledge available on the Internet.
To achieve the Vision and Mission statements and the Objectives set out in this proposal, the understanding and adoption of the science of Computational Linguistics and Language Technology are essential. Realizing this necessity, the present proposal is submitted for the establishment of a Centre for Computational Linguistics in Chennai, Tamilnadu. It should be noted that at present there is no such separate centre for Computational Linguistics in Tamilnadu. Thus, the proposed Centre for CL will be a unique one.


Centre for Computational Linguistics (CCL)

Vision Statement:
To make the Tamil language an “e-language” suitable for all language computing tasks, so as to meet the expectations and challenges of globalization and to help bridge the gap in Tamil society caused by the “Digital Divide”.
Mission statement:
1.      To undertake the study of the linguistic system of the Tamil language from the perspective of Computational Linguistics.
2.      To present the linguistic system of the Tamil language in a form which can be processed by the computer.
3.      To develop various language computing tools, from Word Processor to Man – Machine Interface, adopting the present state of the art of Language Technology.
Objectives:
1.      To test and verify various modern linguistic formalisms with the help of computational linguistic theories.
2.      To develop a computational grammar for the Tamil language.
3.      To develop various types of linguistically annotated electronic corpora for the Tamil language.
4.      To develop various language computing tools such as Morphological Parsers, Word-class Taggers, Syntactic Parsers and tools for semantic analysis.
5.      To develop various types of lexical databases for the Tamil language, adopting different formalisms such as WordNet and Generative Lexicon.
6.      To develop various application software for Indian languages, from automatic text-checking tools to automatic Machine Translation systems.
7.      To develop language teaching materials for e-learning facilities.

Major activities of the proposed Centre for Computational Linguistics:

The proposed Centre for CL will be engaged in the following activities:
1.      Research and Development (R & D)
2.      Teaching & Training
3.      Industrial Collaboration
4.      Coordination with other Institutions



1.      Research and Development (R & D):
In the initial stage, the R & D wing of the CCL will mainly concentrate on the following:
a)      Study of the development of various formalisms and algorithms in Computational Linguistics
b)      Application of Computational Linguistic formalisms to the Tamil language
c)      Development of an Electronic Corpus with proper linguistic and non-linguistic annotations
d)      Development of various linguistic analysis tools such as Morphological Parser, Word-class Tagger, Syntactic Parser, Semantic Analyzer, Word Sense Disambiguator and Concordancer
e)      Development of a Lexical Database for the Tamil language based on WordNet, Generative Lexicon etc.
f)       Development of the phonological analysis tools (including acoustic phonetic analysis) needed for Text-to-Speech (TTS) and Automatic Speech Recognizer (ASR) systems, as well as of TTS and ASR for the Tamil language
g)      Development of automatic text checkers (spell-checking, Sandhi checking, grammar checking etc.)
h)      Development of a Machine Translation System for Indian languages
i)       Development of an Optical Character Recognizer (both off-line and on-line)
j)       Development of Electronic dictionaries
k)      Development of Computer-aided Language Teaching (CALT) / Learning (CALL) materials
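
To give a concrete sense of the simplest tool named in item d), a concordancer lists every occurrence of a word together with its neighbouring words. The following Python sketch is a minimal illustration, not the tool the Centre would build; the function name concordance, the width parameter and the sample sentence are assumptions made for the example.

```python
def concordance(tokens, keyword, width=2):
    """Return (left context, hit, right context) for every occurrence of keyword.

    `width` is the number of tokens kept on each side of the hit.
    """
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines


# Hypothetical sample text, whitespace-tokenized for simplicity.
text = "the centre will build a corpus and annotate the corpus for research"
for left, hit, right in concordance(text.split(), "corpus"):
    # Right-align the left context so that the hits line up in one column.
    print(f"{left:>20} [{hit}] {right}")
```

A real concordancer for Tamil would work on a properly tokenized and annotated corpus rather than on whitespace-separated words.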

2.      Teaching and Training:
Under this, the following tasks will be undertaken:
a)      M.A. and M.Phil. courses in Computational Linguistics
b)      Part-time Certificate and Diploma courses in Computational Linguistics
c)      Ph.D. Programmes in Computational Linguistics
d)      Short-term Training courses for researchers and teachers

3.      Industrial collaboration:
Under this, the CCL will assist software companies interested in NLP / CL / LT with their ventures in developing language software systems.
4.      Coordination with other Institutions:
Many institutions in various parts of India are involved in Computational Linguistics and Language Technology. In many IITs and technology universities, the Computer Science or Electronics departments have been involved in developing some of the language computing tools. There are also some consortia formed by the Ministry of Human Resources / Ministry of Information Technology for specific tasks such as Corpus Development and Machine Translation. The proposed Centre at Chennai will make efforts to coordinate with them for further development and to avoid duplication.

Infrastructure:
The proposed Centre for Computational Linguistics will have the following labs and other service units.
1.      Language Technology Lab - 3 (Corpus Lab, Research Lab, Students Lab)
2.      Speech Technology Lab - 1
3.      CALL Lab (Computer-aided Language Learning Lab) - 1
4.      Sound Recording Lab - 1
5.      Library - 1
6.      Video Conferencing Hall - 1
7.      Smart class rooms - 6
8.      Meeting Hall - 1
9.      Conference Hall - 1
10.   Faculty members rooms -
11.   Research Scholars rooms - 2
12.   Recreation rooms - 2
13.   Administrative staff rooms - 5

Academic Staff: (1 + 11 + 11 + 32 = 55)
Faculty (56)
1.      Chair – Director – 1
2.      Senior Research Fellow (Professor cadre) – 4
a.      Computational Linguistics – 1
b.      Language Technology – 1
c.      Speech Technology – 1
d.      Corpus Linguistics – 1
3.      Fellow (Associate Professor cadre) – 8
a.      Computational Linguistics – 2
b.      Language Technology – 2
c.      Speech Technology – 2
d.      Corpus Linguistics – 2
4.      Associate (Assistant Professor cadre) – 18
a.      Computational Linguistics – 3
b.      Language Technology – 3
c.      Speech Technology – 3
d.      Corpus Linguistics – 3
e.      Statistical Computational Linguistics – 3
f.       Tamil Linguistics – 3
5.      Project Fellow – 25
a.      Computational Linguistics – 5
b.      Language Technology – 5
c.      Speech Technology – 5
d.      Corpus Linguistics – 5
e.      Tamil Linguistics – 5
Adjunct Faculty (8)
a.      Senior Research Fellow – 4
b.      Fellow – 4

Technical Staff: (31)
a.      System Administrator – 5
b.      System Analyst – 5
c.      Speech Lab Engineer – 1
d.      Senior Hardware Engineer – 1
e.      Junior Hardware Engineer – 2
f.       Programmers – 5
g.      Electrical Engineer – 1
h.      Data Entry operator – 5
i.       Speech Lab Technician – 1
j.       Electrician – 2
k.      Librarian – 1
l.       Deputy Librarian – 1
m.    Assistant Librarian – 1

Administrative staff: (43)
a.      Administrative officer – 1
b.      Accountant – 1
c.      Personal Secretary – 2
d.      Section Officer – 5
e.      Assistant Section Officer – 5
f.       Assistant – 5
g.      Steno – 4
h.      Typist – 5
i.       Sergeant – 2
j.       Lab Attender – 5
k.      Office Assistant Watchman – 4
l.       Watchman – 4
m.    Sweeper – 3
n.      Driver – 2


Output at the end of the fifth year:
At the end of the fifth year, the following work for the Tamil language will be completed.
Resource and Research Tools:
1.      Linguistically annotated Corpus
2.      Parallel Corpus
3.      Computational Lexicon
4.      Morphological Parser
5.      Word-class Taggers
6.      Syntactic Parser
7.      Semantic Analyzer
8.      Acoustic Phonetic analysis
9.      Character (graphemic) analysis
Application software:
1.      Unicode Fonts
2.      Keyboards
3.      Spell-checkers
4.      Grammar-checkers
5.      Automatic Speech Recognizer (ASR)
6.      Text-to-Speech (TTS)
7.      Optical Character Recognizer (OCR)
8.      Information Retrieval and Extraction
9.      Machine Translation System (Tamil – Hindi – English)


The total financial estimate for all of the above is around 50 Crores over five years.

3 comments:

Anonymous said…

Great plan. May I suggest that you incorporate Sociolinguistics, Tribal Linguistics, the development of ASR with special reference to Tamil and Tamil dialects, the application of Educational Technology to teach and train Linguistics, and Unicode fonts in a wider spectrum.

N. Deiva Sundaram said…

We can include anything related to computational linguistics and language technology. First, let an institution like this be established. But we must keep one thing in mind: this is not a linguistics institute. It should be an institution for projects that bring Tamil together with language technology.

G. Balasubramanian said…

Let us hope that the effort will bear fruit one day.
