Its0908LinguisticMarkup
ITS WG Collaborative editing page
Status: Initial Draft, i.e., please focus on technical content rather than wordsmithing at this stage.
Author: Goutam Kumar Saha
Computational Linguistic Markup (CLM)
Summary
The proposed scheme demonstrates how to embed syntactic, semantic and other computational-linguistic metadata in the structure of an XML document, so that markup useful to both the Internationalization (I18N) and Localization (L10N) processes leads to faster and more meaningful translation.
[R023] To improve the translation process, it should be easy to take advantage of the capability of XML to embed linguistic-related metadata information in the structure of a document.
[[Yves Savourel- This is just a first try to get things rolling. Goutam, Andrzej, anyone, please comment/add/etc.]]
[[Andrzej Zydron- This is a tremendous piece of work by Goutam. It is also an area that I am very interested in and would like to assist Goutam in developing this topic when we are ready to deliberate it in detail. Well done!]]
[[Martin J. Duerst- I'm totally amazed by the amount of work that is going into this page. However, this shows that the whole field is huge. The properties/elements proposed in this page already seem to be as many as those proposed in all the rest together. In my view, we should continue working on linguistic markup, but for the actual work on a Recommendation concentrate on the other, simpler problems first, and maybe save linguistic markup for Version 2 of ITS.]]
[[Goutam Saha- Excellent remarks/comments indeed. It will be nice if AZ, MJD, FS, RI,YS or anyone add and share their knowledge, at any suitable time, on this recently proposed scheme. Readers may please give language specific feedback or any suggestions also to goutam.k.saha@kolkatacdac.in, sahagk@gmail.com ]]
[[Felix Sasaki- Goutam, as I said before, and - sharing AZ's comment - I think this is a tremendous piece of work. Nevertheless, I share MJD view that currently we are very busy with the preparation of the first version of the ITS tag set. We do not want to stop your work on the topic of linguistic markup, but we can not take it into account in the near future. I tried to make that clear before, I hope that you understand that.]]
[[Felix Sasaki- Maybe it was not clear in the minutes of the ITS f2f at ERCIM in September, but that was what we decided to do. Goutam, I hope that I understood you correctly that you would agree on what Martin formulated - that we solve the current simple (they are hard enough) problems of ITS first and come back to linguistic markup later.]]
The 3-Tier XML Schema approach lets an XML content author embed metadata specific to the source human language in an XML document, which is a significant step toward internationalization and localization. An author does not need to mark up every part of his or her document. Such markup is needed only at the most language-specific parts, so the content is not overloaded with extra markup.
In this 3-Tier (or 3-Layer) XML Schema, the 1st Layer, the Content Domain Schema, adds metadata about the context of a paragraph or of the whole text of a web content. This content_domain metadata helps the translation process to understand the context, so that appropriate terminology (e.g., the right choice among the synonyms of a source-language word) can be looked up in the lexicon or dictionary at the localizer's end. '''Content Domain Metadata''' also guides the localization process on writing style. For example, sports content should be translated or localized as sports content, using sports terminology; it should not be rendered with funeral terminology, which would make it read like funeral content.
The 2nd Layer, the Sentence Level Schema, covers the sentences in a paragraph or in a full text for a particular content domain. Sentence-level markup provides syntactic and semantic metadata about a sentence. Metadata on Sentence Syntax / Formation (for example, simple, complex or compound) and on Sentence Semantics (for example, demonstrative, interrogative or taunting) is of great importance to a translator. Syntactic metadata eases and speeds up both lexical and syntax analysis when a source-language sentence is parsed. Semantic metadata is useful both in semantic analysis and in generating a more meaningful sentence in a target (human) language.
The 3rd Layer, the Word Level Schema, adds markup on word category (e.g., part of speech, or POS) and provides morphological information (e.g., person, gender, number, suffix or prefix, and stem of a word) for some specific languages. Word-level metadata is of immense help in POS disambiguation and is also useful for word chunking. For a very specific word, the meaning of the word can also be given to a localizer as an attribute value in this word-level markup. Both sentence-level and word-level metadata also support Word Sense Disambiguation, toward more meaningful internationalization and localization.
This is a write-up on how to embed syntactic, semantic and computational linguistic related metadata information in the structure of an XML document towards better translation.
XML Schema authors should prefer attributes for metadata because of their better flexibility and portability. XML content authors with school-level knowledge of language grammar will not find it difficult to mark up such language-specific information. An author is not required to add finely classified metadata at all; he or she only has to add metadata at those parts of the content that are exceptional with respect to the source language. The proposed approach requires three XML schemas: the 1st specifies the categories of content domain, the 2nd specifies the categories of sentences, and the 3rd specifies the word-level categories (Parts-of-Speech etc).
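Until the three schemas exist, the idea of one category list per tier can be sketched in a few lines of Python. This is only an illustration: the `ALLOWED` table and the function name `unknown_categories` are assumptions made here, and the category names are a small, incomplete sample taken from the examples on this page, not the proposed schemas themselves.

```python
import xml.etree.ElementTree as ET

# Hypothetical, deliberately incomplete category lists (one per tier).
# The real lists would come from the three XML schemas described above.
ALLOWED = {
    "content_domain": {"dialect", "factory", "office", "sport"},
    "sentence_cat": {"assertive", "imperative", "interrogative",
                     "phrases_idioms", "link"},
    "pos_cat": {"noun", "verb", "adjective", "adverb", "pronoun",
                "interjection", "link"},
}


def unknown_categories(xml_text):
    """Return (tag, name) pairs whose name is not in that tier's category list."""
    root = ET.fromstring(xml_text)
    problems = []
    for elem in root.iter():
        allowed = ALLOWED.get(elem.tag)
        # Elements outside the three tiers (e.g. <text>) are ignored.
        if allowed is not None and elem.get("name") not in allowed:
            problems.append((elem.tag, elem.get("name")))
    return problems


# A sample with one misspelt word-level category ("vreb").
sample = (
    '<text><content_domain name="sport">'
    '<sentence_cat name="imperative">'
    '<pos_cat name="vreb">Run</pos_cat>'
    '</sentence_cat></content_domain></text>'
)
print(unknown_categories(sample))
```

A real deployment would more likely express these lists as enumerations in the schemas and let a validating parser do the checking; the table above only makes the three-tier split concrete.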
For example, the English idiom "cats and dogs" (as in "raining cats and dogs") can be marked up as
<!-- Markup for Phrases and Idioms --> <sentence_cat name="phrases_idioms" meaning="heavily"> cats and dogs </sentence_cat>
Such metadata will be of immense help to the localization process (in finding an appropriate phrase or idiom in the target language), even without knowing the source language, English, well.
Similarly, with Bangla as the source language, the idiom "Dumurer (English meaning: Fig's) Fool (English meaning: Flower)" can be marked up as
<!-- Markup for phrases and idioms --> <sentence_cat name="phrases_idioms" meaning="rarely visible"> Dumurer Fool </sentence_cat>
Such metadata is very useful as a semantic markup to a localization process, irrespective of a target language.
Again, for a link inside XML PCDATA (text content), we might distinguish the link from the surrounding text with the following markup so that it can be treated separately, for example, Click Here for Sign Up
<!-- Markup for a Link --> <sentence_cat name="link"> __Click Here for Sign Up__ </sentence_cat>
or, for the link word, say, Here, we might mark it up in the following way:
<!-- Markup for a Link Word --> <pos_cat name="link"> __Here__ </pos_cat>
For the following sentence in a Bengali (Bangla) dialect, "Kaam (Kaaj in Bangla or Work in English) Saira Falo (Shesh Koro in Bangla or Complete in English)," we should mark up the text with the three-layer metadata information in the following way:
<!-- Markup for Dialect -->
<text xml:lang="bn">
  <content_domain name="dialect"> <!-- content domain metadata -->
    .... other sentences ....
    <sentence_cat name="imperative"> <!-- sentence level metadata is optional here -->
      <pos_cat name="noun" meaning="work"> Kaam </pos_cat>
      <pos_cat name="verb" meaning="complete"> Saira Falo </pos_cat> <!-- word level parts-of-speech -->
    </sentence_cat>
    ......
  </content_domain>
</text>
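To make the three layers concrete, the following Python sketch shows how a translation tool might read them back out of such a document. It is a minimal illustration, not part of the proposal: the function name `layered_glosses` is hypothetical, the elided sentences of the example are left out, and the `xml:lang` attribute is omitted from the sample for brevity.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the dialect example (elided sentences removed).
doc = """<text>
  <content_domain name="dialect">
    <sentence_cat name="imperative">
      <pos_cat name="noun" meaning="work">Kaam</pos_cat>
      <pos_cat name="verb" meaning="complete">Saira Falo</pos_cat>
    </sentence_cat>
  </content_domain>
</text>"""


def layered_glosses(xml_text):
    """Return (domain, sentence type, part of speech, source text, gloss) tuples."""
    root = ET.fromstring(xml_text)
    rows = []
    for domain in root.iter("content_domain"):          # 1st layer
        for sentence in domain.iter("sentence_cat"):    # 2nd layer
            for word in sentence.iter("pos_cat"):       # 3rd layer
                rows.append((
                    domain.get("name"),
                    sentence.get("name"),
                    word.get("name"),
                    (word.text or "").strip(),
                    word.get("meaning"),
                ))
    return rows


for row in layered_glosses(doc):
    print(row)
```

Each tuple carries the context a localizer needs for one word: the domain and sentence type it appears in, its part of speech, and the author-supplied gloss.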
Metadata about the domain, the sentence type or specific words helps translators to produce better-quality work or to work more quickly. If translators know that a word belongs to a specific domain, they can look it up in a terminology database; thus this 3-Tier or 3-layer schema is helpful even for human translators. Accurate translation is difficult without such information.
For example:
<!-- Markup for Word/Phrase Sense Disambiguation or for Context Dependent Usage -->
<content_domain name="factory">
  Paul works in a factory. There are many electromechanical machines in this factory. ...sentence_i
  Give <pos_cat name="noun" type="material"> oil </pos_cat> . ...sentence_k
</content_domain>
<content_domain name="office">
  Paul works in a government office. He advises a new employee.
  <pos_cat name="verb" type="joining" meaning="please"> Give oil </pos_cat> to your senior.
  <!--such information via attributes aim to provide semantic as well as language specific transformation constructs to a translator-->
  ...sentence_m
</content_domain>
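The effect of the content domain on word sense can be sketched in Python as follows. This is an illustrative reading of the example, not a prescribed algorithm: the wrapper element `<text>` is added here only to obtain a single root, the narrative sentences are trimmed, and the function name `senses_by_domain` and the fallback rule (use the literal text when no `meaning` is given) are assumptions.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the factory/office example, wrapped in a single root.
doc = """<text>
  <content_domain name="factory">
    Give <pos_cat name="noun" type="material">oil</pos_cat>.
  </content_domain>
  <content_domain name="office">
    <pos_cat name="verb" type="joining" meaning="please">Give oil</pos_cat> to your senior.
  </content_domain>
</text>"""


def senses_by_domain(xml_text):
    """Map each content domain to the word-level hints found inside it."""
    root = ET.fromstring(xml_text)
    senses = {}
    for domain in root.iter("content_domain"):
        hints = []
        for word in domain.iter("pos_cat"):
            literal = (word.text or "").strip()
            hints.append({
                "text": literal,
                "pos": word.get("name"),
                "type": word.get("type"),
                # Assumed rule: fall back to the literal text when the
                # author supplied no gloss.
                "gloss": word.get("meaning") or literal,
            })
        senses[domain.get("name")] = hints
    return senses


senses = senses_by_domain(doc)
print(senses["factory"][0]["gloss"])  # literal reading of "oil"
print(senses["office"][0]["gloss"])   # idiomatic reading supplied by the author
```

The same surface words ("oil", "Give oil") therefore yield a material noun in the factory domain and an idiomatic verb in the office domain.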
Background Knowledge for Word-Level Parts-of-Speech Markups:-
Verb :- The verb plays an important role in word-level (parts-of-speech) markup. A verb is a doing word. The kinds of verb are:
Joining Verb: a combination of a noun followed by a verb.
Compound Verb: a combination of two verbs.
Primary Verb: can express a complete meaning without the need of any other verb.
Auxiliary Verb: helps to express tense, mood and voice.
Transitive Verb: a verb that has an object.
Intransitive Verb: has no object.
Noun Verb: derived from a noun.
Imperative Verb: denotes an order, request or command.
Non-Finite Verb or Verbal (think "unfinished"): cannot, by itself, be a main verb. For example, the verb "broken" in "The broken window fell down."
Causative Verb: designates the action necessary to cause another action to happen. For example, in "He made me do it." the verb "made" causes the "do" to happen. Other causative verbs are, say, help, allow, motivate, make and force.
Phrasal/Group Verb: consists of a verb and a preposition; the resulting combination creates a new verb. For example, "called off" in "The Chairman called off the meeting." Other examples of group verbs are: turn up, put on, put off, get down, call on etc.
Linking Verb: connects a subject and its complement. Sometimes called copulas, linking verbs are often forms of the verb to be.
Finite Verb: a verb that has tense. It has to agree with its subject in person and number, so its form changes accordingly: "She loves Paul.", "We love Ram." A plural subject needs a plural verb and, similarly, a plural finite verb needs a plural subject; in other words, a subject and its finite verb are interdependent. A finite verb is an essential ingredient of a sentence. Even the simplest sentence requires Subject + Finite Verb: "Birds fly.", "Fire burns."
The form of a Non-finite Verb is invariant because it is not affected by the (subject-verb) concord system: "He likes to swim.", "They like to swim.", "He likes eating.", "Having worked hard he felt tired." Non-finite verbs are not essential in a sentence. They are needed only to expand a sentence in order to express various kinds of meaning, so a sentence cannot consist of subject + non-finite verb without a finite verb. For example, we don't say: "Children to fly kites." Instead we say: "Children like to fly kites." Here, like is a finite verb and to fly is a non-finite verb. A non-finite verb has the structures: (i) to + verb, (ii) anaphoric to (to without a verb, e.g., "Yes, I would love to."; the omitted verb after to, say "dance" here, has to be learnt through discourse analysis).
<!-- Markup for Anaphoric --> <!-- Markup for "Yes, I Would love to. " --> <sentence_cat name="assertive"> <!-- sentence level metadata is optional here --> Would you like to dance with her ? Yes, I would love <pos_cat name="verb" type="anaphoric"> to </pos_cat> </sentence_cat>
A Joining or Conjunct Verb is a verb formed by a noun or adjective followed by a verb. Such verbs are very common in Indian languages, e.g., in Bengali, Sanchai (savings) Koro (do), i.e., "to save" in English, and Manush (man) Kora (do), i.e., "to bring up" in English.
However, many such verbs are used with a special (pragmatic) sense, e.g., Mukh (mouth) Kora (do), i.e., "to rebuke" in English, and Mukh (mouth) Kholo (open), i.e., "to protest" in English.
Markup for such special usage of verbs is shown below.
<!-- Markup to express pragmatics of special usage of Verbs (in Bengali) -->
<pos_cat name="noun" meaning="office's">Officer</pos_cat>
<pos_cat name="noun" meaning="prag_livelihood">Anno</pos_cat> <!-- Dictionary meaning of "Anno" is Rice (or bread) -->
<pos_cat name="noun" meaning="my">Aamar</pos_cat>
<pos_cat name="verb" meaning="prag_about to lose">Utthte Cholechhe </pos_cat> <!-- Dictionary meaning of "Utthte" is "to rise" and "Cholechhe" means "to run or go" --> |
<!-- English meaning of the sentence- "(I am) about to lose my (office) job" -->
<!-- Another example of Pragmatic Usage -->
<pos_cat name="noun" meaning="Boss's">Bosser</pos_cat>
<pos_cat name="verb" type="conjunct" meaning="prag_is not pleased">Mon Otthe Ni </pos_cat> | <!--Lexical meaning: Mon (mind) Otthe (rise) Ni (no)--> |
<!-- Another example of Pragmatic Usage (in Bengali: Ronaldo Dhare Kaatbe Na Bhaare ?) Here, Dhaare (on credit or sharp edge), Kaatbe (will book or will cut) and Bhaare (by weight). In such a case, unless we know who Ronaldo is, we cannot catch the pragmatic sense of the sentence. -->
<pos_cat name="noun" type="proper" meaning="a player">Ronaldo</pos_cat>
<pos_cat name="verb" type="conjunct" meaning="prag_whether will be able to play up to his reputation">Dhare Kaatbe Na Bhaare </pos_cat> ? <!-- Na (no), here "Na" (whether) indicates a sense of uncertainty -->
Noun:-
A noun is a naming word.
Proper Noun: names a specific person, place or thing (e.g. Goutam, Kolkata, India).
Common Noun: refers to a class of objects or a concept as opposed to a particular individual (e.g. boy, cow).
Collective Noun: denotes a group of individuals (e.g. army, assembly, family).
Abstract Noun: denotes an abstract or intangible concept, such as happiness, envy or joy.
Material Noun: denotes the matter from which something is or can be made (e.g. cloth, oil).
Compound Noun: a noun made up of two or more lexemes, such as flowerpot, southeast. Here, nouns are combined into compound structures.
Verbal Noun: a noun which is formed as an inflection of a verb and partly shares its constructions, such as smoking in "Smoking is injurious to health".
Numerals: include all numbers, whether as words or as digits. They may be divided into two major types.
CARDINAL Numbers include words like: nought, zero, one, two, fifty-six, a thousand.
ORDINAL Numbers include first, 2nd, third, fourth, 500th.
Numbers Noun: 20, 567. We classify numerals as a subclass of nouns because in certain circumstances they can take plurals: five twos are ten; he's in his forties; How many 5s in 20? They may also take "the": the third of October; a product of the 2004s.
Fractional Number Noun: one-half, two-thirds (e.g. Four one-fourths make one.)
Preceding Noun of Title: Dr., Mr., Ms.
Noun - Unit of Measurement: K.M., K.G.
Negative Noun: He says "no".
Hyphenated Numbers: 30-40, 1990-2005.
Following Noun of Title: M.B.A., B.S., M.S., Ph.D.
In Indian languages (e.g., Bengali, Malayalam and Hindi) we often find the usage of Repetitive Nouns and Echo Nouns; English does not use Repetitive Nouns as often. Examples of Echo Nouns are the Bengali words Cha-Ta (tea etc.) and Kapor (cloth)-Chopor (meaning cloth, shirts etc.). The second part of an Echo Noun (e.g., Chopor) has no meaning of its own, but it carries important pragmatic value: it points to the meaning of the first part plus additional related items (in other words, (first part of the ECHO Noun)++). A Repetitive Noun carries various pragmatic values, for example, Ghantai-Ghantai (almost every hour, showing repetition), Ghare-Ghare (almost every house, showing plenty), Chokhe (eye)-Chokhe (eye) (to keep under close watch), Sheet (coldness)-Sheet (a little cold) and Paye (leg)-Paye (leg) (to walk slowly, with hesitation).
<!-- Examples on Markup for Bengali Repetitive Word -->
<!-- "Ghare (house) Ghare (house) Computer (Achhe)" OR Almost every house has a computer -->
<pos_cat name="noun" type="repetitive" meaning="almost every house">Ghare Ghare</pos_cat> <pos_cat name="noun" type="common">Computer</pos_cat> <pos_cat name="verb" meaning="has"> </pos_cat> |
<!-- "Ghure (travel) Ghure (travel) Aami (I) Klanto (tired)" OR I got tired because of a prolonged tour -->
<!-- Here, the repetitive word expresses prolongation (a pragmatic issue) -->
<pos_cat name="adverb" meaning="prolonged tour">Ghure Ghure</pos_cat> <pos_cat name="pronoun" meaning="I"> Aami </pos_cat> <pos_cat name="adjective" meaning="got tired"> Klanto </pos_cat> |
<!-- "Sakal (morning) Sakal (morning) Eso (come)" OR Come before the scheduled time -->
<pos_cat name="adverb" meaning="before scheduled time">Sakal Sakal</pos_cat> <pos_cat name="verb" meaning="come"> Eso </pos_cat> |
<!-- "Vikharitir (the beggar's) Jai (go) Jai (go) Abostha (state or condition)" OR The beggar is almost dying -->
<pos_cat name="noun" type="common" meaning="The beggar's"> Vikharitir </pos_cat> <pos_cat name="noun" type="abstract" meaning="condition"> Abostha </pos_cat> <pos_cat name="verb" type="repetitive" meaning="imminent death"> Jai Jai </pos_cat> |
<!-- here, this repetitive word expresses imminence -->
<!-- "Chhelera (boys) Pulish (Police) Pulish (Police) Khelchhe (are playing)" OR The boys are playing an "imitative" game of being Police --> |
<!-- "Murgi (Chicken) Turgi (?) Khabo (will eat)" OR I will eat chicken (here, the echo word "Turgi" has no meaning, but it expresses eagerness to have foods like chicken) -->
<pos_cat name="noun" meaning="chicken"> Murgi</pos_cat> <pos_cat name="noun" type="echo" meaning="pragmatic_eagerness"> Turgi </pos_cat> Khabo |
<!-- "Murgi (Chicken) Furgi (?) Khabo Na (will not eat)" OR I won't eat chicken (here, the echo word "Furgi" has no meaning, but it expresses irritation or annoyance) -->
<pos_cat name="noun" meaning="chicken"> Murgi</pos_cat> <pos_cat name="noun" type="echo" meaning="pragmatic_irritation"> Furgi </pos_cat> Khabo Na |
<!-- "Cha (Tea) Fa (?) Khabo Na (will not drink)" OR I won't drink tea (here, the echo word "Fa" has no meaning, but it expresses irritation or annoyance) -->
<!-- "Mangso (Mutton) Fangso (?) Khai (eat) Na (no)" OR I don't eat mutton (here, the echo word "Fangso" has no meaning, but it expresses irritation or annoyance) -->
<pos_cat name="noun" meaning="mutton"> Mangso </pos_cat> <pos_cat name="noun" type="echo" meaning="pragmatic_irritation"> Fangso </pos_cat> Khai Na |
<!-- There are many repetitive words that are formed by two meaningful words but in which the meaning of the second word is dropped, e.g., BhaloMondo, where Bhalo means "good" and Mondo means "bad" but the overall meaning of BhaloMondo is "good" only. The second word's meaning is not considered here -->
Acronyms and abbreviations need to be addressed for better Internationalization & Localization of Web content or software messages.
An acronym is a word formed from the initial letters of a series of words (e.g., IEEE is an acronym for Institute of Electrical and Electronics Engineers).
An abbreviation is a shortened form of a word or words, using either initials instead of whole words, for example TTI (Teacher Training Institute), or the first few letters, for example approx. for approximately.
<!-- Markup for ACRONYM or Abbreviation -->
<pos_cat name="noun" type="proper" meaning="acronym_Hindustan Aeronautics Limited"> HAL </pos_cat>
<!-- Markup for Abbreviation -->
<pos_cat name="adjective" meaning="abbreviation_approximate"> approx </pos_cat>
<!-- Markup for Abbreviation for Bengali words বিশেষ দ্রষ্টব্য or বি দ্র (Special Note)-->
<pos_cat name="adjective" meaning="abbreviation_special note">বি দ্র </pos_cat>
<!-- Markup for Abbreviation for Bengali words পোষ্ট অফিস or পো (Post Office)-->
<pos_cat name="adjective" meaning="abbreviation_post office"> পো </pos_cat>
<!-- Markup for Abbreviation for Bengali word গ্রাম (GRAM or Village)-->
<pos_cat name="adjective" meaning="abbreviation_village"> গ্রা </pos_cat>
<!-- Markup for ACRONYM or Abbreviation in Bangla for HAL-->
<!-- For proper Browser viewing (for Indian Language scripts, e.g., Bangla, Hindi etc), use at least Medium Text Size -->
<pos_cat name="noun" type="proper" meaning="acronym_HAL"> এইচ এ এল </pos_cat>
<!-- without the above mentioned "meaning" attribute, a localizer may translate it to AICH A EL -->
Pronouns:-
Generally, pronouns stand for (pro + noun) or refer to a noun, an individual or individuals, or a thing or things.
Personal Pronouns stand for persons or things, e.g. I, me, my, you, he, they.
The Demonstrative Pronouns (this/that/these/those) point to a specific person or thing.
The Relative Pronouns (who/whoever/which/that) relate groups of words to nouns or other pronouns. (The student who studies hardest usually does the best.)
The Indefinite Pronouns (everybody/anybody/somebody/all/each/every/some/none/one) do not substitute for specific nouns but function themselves as nouns. (Everyone is wondering if any is left.)
The Intensive Pronouns (such as myself, yourself, herself, ourselves, themselves) consist of a personal pronoun plus self or selves and emphasize a noun. (I myself don't know the answer.)
The Reflexive Pronouns (which have the same forms as the intensive pronouns) indicate that the sentence subject also receives the action of the verb. (Students who cheat on this quiz are only hurting themselves.)
The Interrogative Pronouns (who/which/what) introduce questions. (What is that? Who will help me? Which do you prefer?)
The Reciprocal Pronouns are each other and one another. They are convenient forms for combining ideas. (They gave books to each other.)
Adverbs:- Adverbs are words that modify a verb (He drives slowly. How does he drive?), an adjective (He drives a very fast car. How fast is his car?) or another adverb (He moves quite slowly. How slowly does he move?). Types of Adverbs:-
Adverbs of Manner or Abstract Adverb: He moved slowly and spoke quietly.
Adverbs of Place: Go there. Come here.
Adverbs of Frequency: He often acts. He comes every day.
Adverbs of Time: Always tell the truth. When are you coming? He finished his tea first. He left early.
Adjective Adjective Adverbs (modifying an adjective): He is very intelligent. He is too good.
Adjective Adverb (modifying another adverb): Do not walk so fast.
Prepositions:- A preposition describes a relationship between other words in a sentence. You can sit before the table (or in front of the table). Other prepositions: on, behind, under, beneath, beside, next to, before, between, into, to, through, off, over, upon, across, of, about, in, for, without, toward, around, at, against, during, until, throughout and after.
Adjectives:- Adjectives are words that describe or modify a noun or pronoun in the sentence. Kinds of Adjectives:-
General / Quality Adjective: shows the kind or quality of a noun or pronoun, e.g. large, good, fresh, honest etc.
Abstract Adjective: wounded, rich, poor etc.
Quantitative Adjective: some, much, little, enough, whole, half etc.
Ordinal Adjective: first, second, third etc.
Numeral Adjective: five, few, many, most, several, sundry etc.
Proper Nounian / Nominal Adjective: formed from Proper nouns (e.g., Indian tea, French wines etc.)
Material Adjective: sandy, earthen, golden etc.
Compound Adjective: dew-wet, open-heart, clean ruddy farmer, long white basket etc.
Interrogative Adjective: Whose book is this? Which way shall we go? What manner of man is he?
Exclamatory Adjective: What an idea! What a blessing!
Emphasizing Adjective: I saw it with my own eyes. That is the very thing we want.
Comparative Adjective: better, smaller, taller etc.
Superlative Adjective: best, smallest, tallest, worst etc.
Demonstrative Adjective: This boy is strong. That boy is industrious. Don't be in such a hurry. I hate such things. These mangoes are sweet.
In most languages an adjective normally precedes a noun. However, when an adjective (of numeral type) follows a noun, it indicates only an approximate amount. For example, in Bangla, Aami (I) Panch (five) Lac Taka (rupees) Chaai (need), or "I need five lac rupees" in English, indicates exactly five lac. On the other hand, the Bangla sentence Aami Lac Panchek Taka Chai indicates an approximation, i.e., "I need about five lac rupees". Here, lac is a noun of the category unit of measurement and five is a numeral adjective. In such cases, we need to mark up the text in the following way.
<text xml:lang="bn">
  <sentence_cat name="assertive">
    Aami <pos_cat name="adjective" meaning="five">Panch</pos_cat> <pos_cat name="noun"> Lac</pos_cat>
    <!-- numeral adjective precedes a noun - indicates an exact amount -->
    Taka Chai
  </sentence_cat>
  <sentence_cat name="assertive">
    Aami <pos_cat name="noun"> Lac</pos_cat> <pos_cat name="adjective" meaning="approximately five"> Panchek</pos_cat>
    <!-- numeral adjective follows a noun - indicates an approximate amount -->
    Taka Chai
  </sentence_cat>
</text>
A typical example is shown below of how grammatical knowledge can be used as XML markup, for the sentence "This boy is strong." However, word-level markup may not be required for every word in a sentence; we need to mark up only the language-specific parts.
<!-- Example for Word- Level Markup --> <pos_cat name="adjective" type="demonstrative"> This </pos_cat> <pos_cat name="noun" type="common"> boy </pos_cat> <pos_cat name="verb" type="linking"> is </pos_cat> <pos_cat name="adjective" type="general"> strong </pos_cat> .
Examples on I18n / Localization sensitive Markups used in Postal Address etc.
<pos_cat name="noun" meaning="Post Office Box">P.O. Box</pos_cat>3625 <!-- P.O. Box or Postfach (Germany) Or Case Postale (France) in Mailing Address --> <pos_cat name="noun" meaning="Postal Index Number or ZIP">PIN</pos_cat>700091 <!-- PIN (India) or ZIP (USA) in Mailing Address --> <pos_cat name="noun" meaning="Village">Gram</pos_cat>Debogram <!-- here, Gram indicates a village or county (not the measurement unit) --> <!-- Markup example using Road and lane --> VIP Road, Shastribagan<pos_cat name="noun" meaning="lane">Goli</pos_cat> <!-- here, Goli indicates a narrow road or a lane -->
The same sentence "This boy is strong." can also be marked up in the following way without using finer parts-of-speech categories (depending on the requirements of a translation parser for a specific language-pair).
<!-- Example for Adding Word-Level Markup --> <pos_cat name="adjective"> This </pos_cat> <pos_cat name="noun"> boy </pos_cat> <pos_cat name="verb"> is </pos_cat> <pos_cat name="adjective"> strong </pos_cat> .
The sentence "Light a light light." can be marked up with word-level parts-of-speech metadata information in the following way without using finer parts-of-speech categories (depending on the requirements of a translation parser for a specific language-pair).
<!-- Markup for word-level POS Disambiguation --> <pos_cat name="verb"> Light </pos_cat> a <pos_cat name="adjective"> light </pos_cat> <pos_cat name="noun"> light </pos_cat> .
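How a parser might consume such word-level hints can be sketched in Python. The sketch below is an assumption-laden illustration: the wrapping `sentence_cat` element is added here only to give the fragment a single root, and the function name `tagged_tokens` is hypothetical. It pairs every token with the part of speech supplied by the author, and with `None` where the author left a word unmarked.

```python
import xml.etree.ElementTree as ET

# The markup from the example above, wrapped in one root element.
fragment = (
    '<sentence_cat name="imperative">'
    '<pos_cat name="verb"> Light </pos_cat> a '
    '<pos_cat name="adjective"> light </pos_cat> '
    '<pos_cat name="noun"> light </pos_cat> .'
    '</sentence_cat>'
)


def tagged_tokens(xml_text):
    """Return (token, pos) pairs; pos is None for words left unmarked."""
    root = ET.fromstring(xml_text)
    # Text before the first child element is unmarked.
    tokens = [(t, None) for t in (root.text or "").split()]
    for child in root:
        pos = child.get("name") if child.tag == "pos_cat" else None
        tokens += [(t, pos) for t in "".join(child.itertext()).split()]
        # Text following a child (its tail) is unmarked again.
        tokens += [(t, None) for t in (child.tail or "").split()]
    return tokens


print(tagged_tokens(fragment))
```

The three occurrences of "light" come out with three different tags, which is exactly the disambiguation the markup is meant to hand to the translation parser.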
Interjections are words or phrases used to exclaim, protest or command. They sometimes stand by themselves, but they are often contained within larger structures. Interjections are used to express some sudden feeling or emotion (e.g., Bravo!, Hurrah! etc.)
Wow! I won the lottery. Hush! Don't make a noise. Ah! Has he gone? Oh, I don't know about that. Alas! He is dead. Hello! What are you doing there?
A Conjunction is a word which merely joins together sentences, and sometimes words. Kinds of Conjunctions:-
Co-ordinating Conjunction: and, or, nor etc.
Adversative Co-ordinating Conjunction: but, still etc.
Disjunctive / Alternative Co-ordinating Conjunction: or, nor, else, neither etc.
Conclusive Co-ordinating Conjunction: for, so, therefore etc.
Subordinating Conjunction: because, if, though, till, as, unless, although, than etc.
Eternal Joined (correlative) Conjunction: if ... then, when ... then, either ... or, neither ... nor, though ... yet, not only ... but also, whether ... or, both ... and etc. (e.g., Either take it or leave it.)
Indeclinable: Indeclinables are the words (mostly used in Indian languages) that do not change their forms at all in a sentence [e.g., in Bangla: Pravriti (etc.), Sange (with), Ittyadi (etc.), Mato (like), Binaa (without), Pichhone (behind), Abdhi (up to), Theke (from), Hatthat (sudden), Jeno (as if), Maane (that is), Aboshyoi (certainly), Baye (left), Daine (right), Nyai (alike) etc. In Oriya: Ru (from), Paai (for), Nishchityovaabore (certainly) etc. In Hindi: Saath (with), Se (from), Ittyadi (etc.), Jaise (as if) etc.]
Post Position: Post Positions are the words that are used after nouns or pronouns [e.g., in Bangla: Theke (from), Hote (from), Hoite (Bangla formal form, English meaning is "from"), Upore (on), Bhitore (inside) etc., in Oriya: Bhitore (inside) etc., in Hindi: Se (from), Andor (inside) etc.] There are many human languages (e.g., Indian languages) that do not have prepositions. Rather these languages use postpositions.
Ending: Indian languages use various endings along with various words to express tense, case etc. Kinds of Ending in Bangla: (a) Tense Ending (e.g., chhilaam in "AAmi Korchhilaam" (I was doing), bo in "Aami Korbo" (I shall do), chhi, e, taam, i, o, te etc.), (b) Case Ending (e.g., ke in "Aamaake" (me), te in "Ghare" (at room), aar in "Tomaar" (your), der etc.), (c) Personal Ending (e.g., s in "Jaas" (go), o in "Jeo" (Please go) etc.), (d) Imperative Ending (e.g., o in "Toomi Gaao" (You sing), o in "Toomi Jaao" (you go) etc.), (e) Participle / Principal Ending (e.g., e in "Kheye" (after eating) etc.)
Determiners are words like a, an, the, this, that, these, those, every, each, some, any, my, his, one, two etc., which determine or limit the meaning of the nouns that follow. All determiners except a, an, the are generally classed among adjectives.
Link word: a word that denotes a link to a web page or a similar resource.
Punctuation: (a) Comma: , (b) Sentence Final: . ! ? | (c) Quote: ' " (d) Left Parenthesis: ( [ { < (e) Right Parenthesis: ) ] } > (f) Mid-Sentence Punctuation: : ; - (g) Others: + - % ^ & * / \ @ $
Usage/Dialogue specific Markups: in many languages (e.g., Bengali), a group {<verb> <negative verb>} sometimes implies repeated begging. For example, the literal translation of "Din (give) Naa (not)" is "Don't give," which is incorrect; the phrase actually conveys a sense of begging and means "give." The negative verb has to be ignored when the phrase is translated semantically. Similarly, {<verb> <negative verb>} may also convey the sense "to request earnestly," "to cajole" or "to persuade." For example, in the Bangla "Boloon (say) Na (no)" or "Balo (say) Na (no)," the negative verb "Na" has to be ignored for proper localization. Examples:
<!-- Markup showing some usage specific examples where we need to omit the negation part of a verb and to add the sense of begging -->
<!-- e.g. Bengali Usage: "Baloon Na, Sir" or in English: "Kindly Say, Sir" -->
<pos_cat name="verb" type="omit_negation" meaning="kindly say">Baloon Na</pos_cat>,Sir
<!-- e.g. Bengali Usage: "Aapni (you) Ebar Baloon" or in English: "You say now" -->
<pos_cat name="verb" meaning="say">Baloon</pos_cat>
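A consumer of the `omit_negation` hint might behave as in the following Python sketch. It is an illustration under stated assumptions only: the set `NEGATIVE_PARTICLES` is a hypothetical, romanised list made up for this example, and the function name `words_to_translate` is not part of the proposal.

```python
import xml.etree.ElementTree as ET

# Hypothetical list of romanised Bengali negative particles, for illustration only.
NEGATIVE_PARTICLES = {"na", "naa"}


def words_to_translate(pos_cat_xml):
    """Return (source words to render, author-supplied gloss) for one pos_cat element."""
    elem = ET.fromstring(pos_cat_xml)
    words = (elem.text or "").split()
    if elem.get("type") == "omit_negation":
        # The negation only signals an earnest request, so it is not translated.
        words = [w for w in words if w.lower() not in NEGATIVE_PARTICLES]
    return words, elem.get("meaning")


print(words_to_translate(
    '<pos_cat name="verb" type="omit_negation" meaning="kindly say">Baloon Na</pos_cat>'
))
print(words_to_translate('<pos_cat name="verb" meaning="say">Baloon</pos_cat>'))
```

Both calls keep only "Baloon" as the word to render; the difference in tone ("kindly say" versus "say") is carried entirely by the `meaning` attribute.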
We often observe a "hesitation phenomenon" in daily conversation. For example, in English we often say "well" to gain some time before talking about something (e.g., during a question and answer session). Words produced in hesitation, such as "Aya.. Aya .." in direct speech, also need to be handled properly for I18N and L10N.
In Bengali we often use ধর (to assume) গিয়ে (on going) or হোল (is) গিয়ে (on going) during conversation. Here " ধর গিয়ে " is intended to mean only "assume," and " হোল গিয়ে " is intended to mean only "is or are." The second part, গিয়ে, is to be omitted in such occurrences of the hesitation phenomenon. Similarly, the English word "well" used in hesitation needs to be ignored or otherwise handled correctly in the localization process. All human languages might have such issues. The markup for "Hesitation Phenomenon" is stated below:
<!-- Example for Word-level Hesitation Markup -->
<pos_cat name="verb" type="hesitation" meaning="to assume"> ধর গিয়ে </pos_cat>
<pos_cat name="verb" type="compound" meaning="go and catch" > ধর গিয়ে </pos_cat>
<pos_cat name="verb" type="hesitation" meaning="is or are"> হোল গিয়ে </pos_cat>
<pos_cat name="adverb" meaning="in good manner"> well </pos_cat>
<pos_cat name="adverb" type="hesitation" meaning=""> well </pos_cat>
<pos_cat name="adjective" meaning="in good health"> well </pos_cat>
<pos_cat name="interjection" type="hesitation" meaning=""> য়্যা য়্যা </pos_cat>
These markups also let a content author add disambiguation metadata to distinguish text / PCDATA that appears between "<" and ">" from element tags. For example, consider the text: Readers may refer to work in <Saha2005> for more information.
Please note that although <Saha2005> looks like an element tag, it is not intended to be one. Rather, it is a reference for readers. How can such disambiguation information be conveyed to an XML parser? A solution is to mark up the text in the following way to denote that <Saha2005> is not an element tag.
<!-- Markup to Disambiguate between an element-tag and a text/PCDATA in between "<" and ">" -->
Readers may refer to work in <pos_cat name="punctuation" type="left parenthesis"> &lt; </pos_cat> Saha2005 <pos_cat name="punctuation" type="right parenthesis"> &gt; </pos_cat> for more information.
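As a hedged aside, an XML parser only accepts a literal "<" in character data when it is written as the entity &lt;, so a tool that generates the markup above would also need to escape the brackets. The following is a minimal sketch using Python's standard library; the helper name wrap_reference and the surrounding <para> wrapper are assumptions made for illustration and are not part of the proposed scheme.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def wrap_reference(label):
    """Return pos_cat markup for a bracketed reference such as <Saha2005>.

    Hypothetical helper: the angle brackets are escaped so that the
    result stays well-formed XML.
    """
    left = ('<pos_cat name="punctuation" type="left parenthesis">'
            + escape("<") + "</pos_cat>")
    right = ('<pos_cat name="punctuation" type="right parenthesis">'
             + escape(">") + "</pos_cat>")
    return left + escape(label) + right


sentence = ("<para>Readers may refer to work in "
            + wrap_reference("Saha2005")
            + " for more information.</para>")

# The parser accepts the marked-up sentence, and the reader-visible text
# still contains the literal brackets.
root = ET.fromstring(sentence)
plain = "".join(root.itertext())
print(plain)
```

The pos_cat elements carry the disambiguation metadata for a translation tool, while the entity escaping is what keeps the parser from treating the reference as a tag.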
There are many kinds of calendars, e.g., the Bangla Calendar (Bangabdo), the English calendar, Shakabdo etc. A date that we see may not be an English year and date. In such cases, it is better to indicate first the kind of calendar being used, so that the date can be internationalized. We may use the following markup for date data along with the date format.
<!-- Markup for Date Internationalization -->
<pos_cat name="date" type="dd/mm/yy" meaning="bangla_date"> 26/06/12 </pos_cat> <!-- Present year is 1412 in Bangabdo -->
<pos_cat name="date" type="mm/dd/yyyy" meaning="bangla_date"> 06/27/1412 </pos_cat>
<pos_cat name="date" type="yy/mm/dd" meaning="english_date"> 05/10/17 </pos_cat> <!-- default may be English_date -->
<pos_cat name="date" type="dd-mm-yyyy" meaning="english_date"> 17-10-2005 </pos_cat>
<pos_cat name="date" type="dd MMM, yyyy" meaning="english_date"> 18 Oct, 2005 </pos_cat>
<pos_cat name="date" type="MMM dd, yyyy" meaning="english_date"> Oct 18, 2005 </pos_cat>
<pos_cat name="date" type="dd MMM, yyyy" meaning="bangla_date"> 29 Ash, 1412 </pos_cat> <!-- "Ash" stands for the Bangla Calendar Month: Ashwin -->
<pos_cat name="date" type="dd She MMMM, yyyy" meaning="bangla_date"> 22 She Ashwin, 1412 </pos_cat> <!-- 22nd Ashwin -->
<pos_cat name="date" type="dd i MMMM, yyyy" meaning="bangla_date"> 12 i Ashwin, 1412 </pos_cat> <!-- 12th Ashwin -->
<pos_cat name="date" type="dd MMMM, yyyy" meaning="malayalam_date"> 1 Madam, 1181 </pos_cat> <!-- 1st Madam (that is, the 1st month in the Malayalam Calendar, current year is 1181) -->
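To illustrate how a consumer might use the "type" pattern, the following minimal sketch converts the pattern letters into Python strptime directives and parses values marked as english_date. The token table and the function names to_strptime and parse_english_date are assumptions for illustration only; dates in the Bangla or Malayalam calendars would need a real calendar conversion, which is not attempted here, and month names are parsed according to the current locale.

```python
from datetime import datetime

# Hypothetical mapping from the pattern letters used in the "type"
# attribute to strptime directives. Longer tokens are listed first so
# that "yyyy" is tried before "yy" and "MMMM" before "MMM".
TOKENS = [("MMMM", "%B"), ("yyyy", "%Y"), ("MMM", "%b"),
          ("dd", "%d"), ("mm", "%m"), ("yy", "%y")]


def to_strptime(pattern):
    """Translate a pattern such as "dd-mm-yyyy" into "%d-%m-%Y"."""
    out = []
    i = 0
    while i < len(pattern):
        for token, directive in TOKENS:
            if pattern.startswith(token, i):
                out.append(directive)
                i += len(token)
                break
        else:
            # Not a pattern letter: keep separators such as "/" or ", ".
            out.append(pattern[i])
            i += 1
    return "".join(out)


def parse_english_date(pattern, text):
    """Parse the element text of an english_date using its type pattern."""
    return datetime.strptime(text.strip(), to_strptime(pattern)).date()


print(to_strptime("dd MMM, yyyy"))
print(parse_english_date("dd-mm-yyyy", " 17-10-2005 "))
print(parse_english_date("yy/mm/dd", " 05/10/17 "))
```

Both numeric examples from the markup above resolve to the same day, which shows why the explicit pattern matters: without it, "05/10/17" is ambiguous.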
In various languages, we express Time in various ways. The markup for time is shown below.
<!-- Markup for Time I18N -->
<pos_cat name="time" type="hh:mm" meaning="mm past hh"> 20:32 </pos_cat>
<pos_cat name="time" type="mm to hh" meaning="mm to hh"> 28 to 9 </pos_cat>
<pos_cat name="time" type="hh Bajkar mm" meaning="mm past hh"> 8 Bajkar 32 </pos_cat> <!-- 32 minutes past 8 (in Hindi) -->
<pos_cat name="time" type="hh Ta Beje mm" meaning="mm past hh"> 8 Ta Beje 32 </pos_cat> <!-- 32 minutes past 8 (in Bangla) -->
<pos_cat name="time" type="hh Ta Bajte mm" meaning="mm to hh"> 9 Ta Bajte 28 </pos_cat> <!-- 28 minutes to 9 (in Bangla) -->
<pos_cat name="time" type="poune hh" meaning="15 to hh"> poune 9 </pos_cat> <!-- 15 minutes to 9 (in Bangla, Hindi etc) -->
<pos_cat name="time" type="hh pou" meaning="15 to hh"> 9 pou </pos_cat> <!-- 15 minutes to 9 (in Telugu language) -->
<pos_cat name="time" type="soaa hh" meaning="15 past hh"> soaa 9 </pos_cat> <!-- 15 minutes past 9 (in Bangla, Hindi etc) -->
<pos_cat name="time" type="hh kal" meaning="15 past hh"> 9 kal </pos_cat> <!-- 15 minutes past 9 (in Malayalam) -->
<pos_cat name="time" type="saare hh" meaning="30 past hh"> saare 9 </pos_cat> <!-- 30 minutes past 9 (in Bangla, Hindi etc) -->
<pos_cat name="time" type="hh aara" meaning="30 past hh"> 9 aara </pos_cat> <!-- 30 minutes past 9 (in Telugu Language) -->
<pos_cat name="time" type="hh ara" meaning="30 past hh"> 9 ara </pos_cat> <!-- 30 minutes past 9 (in Malayalam Language) -->
<pos_cat name="time" type="der" meaning="30 past 1"> der baje </pos_cat> <!-- 30 minutes past 1 (in Bangla, Hindi etc) -->
<pos_cat name="time" type="dhai" meaning="30 past 2"> dhai baje </pos_cat> <!-- 30 minutes past 2 (in Hindi etc) -->
We often see that an image (along with an embedded ToolTip text) is inserted
in a sentence. We intend to translate the sentence as well as the ToolTip text.
We may use the following markup.
<!-- Word-Level Markup for ToolTip text word embedded inside an Image --> <para> Click here <image source="begin.jpg" alt="begin" /> <pos_cat name="alt_value"> begin </pos_cat> to play now. </para>
In all human languages, there are certain words that have the same spelling but different meanings and different origins (etymologies). Such words, known as homonyms, are handled by the word-level attributes "type" and "meaning". For example, the word "bank" in English: (a) a grassy bank (noun), (b) bank (verb) an aircraft (to tilt), (c) borrow from the bank (noun) and (d) bank (verb) the money.
The following Word-Level Markups can be used to handle personal names written in various conventions. This metadata is also useful for sorting personal names.
<pos_cat name="noun" type="proper" meaning="person_first_middle_surname"> Goutam Kumar Saha </pos_cat>
<pos_cat name="noun" type="proper" meaning="person_surname_first_middle"> Saha Goutam Kumar </pos_cat>
<pos_cat name="noun" type="proper" meaning="person_first_surname_middle"> Goutam Saha Kumar </pos_cat>
<!-- person's name with initials of the first and second name -->
<pos_cat name="noun" type="proper" meaning="person_firstInit__middleInit_surname"> G K Saha </pos_cat>
<!-- a typical example of a person's name in south India (i.e. in Kannada, Malayalam, Tamil and Telugu Languages) -->
<!-- first one is meant for the initial of the ancestor's place, second one is meant for initial of father's given name, third one is meant for the given name of the person and the fourth one is meant for the surname or family name of the person -->
<pos_cat name="noun" type="proper" meaning="person_ancestorplaceInit__fathernameInit_first_surname"> K M Rama Rao </pos_cat>
<!-- a typical example of a person's name in south India (i.e. in Kannada, Malayalam, Tamil and Telugu Languages) -->
<pos_cat name="noun" type="proper" meaning="person_ancestorplaceInit__fathernameInit_first"> K M Rama </pos_cat>
<!-- a typical example of a person's name (a modern convention). Here, first name stands for the given name of a person and the second one for his wife -->
<pos_cat name="noun" type="proper" meaning="person_first_wifename"> Rodger Ami </pos_cat>
<!-- a typical example of a person's name (a modern convention). Here, first name stands for the given name of a person and the second one for her husband -->
<pos_cat name="noun" type="proper" meaning="person_first_husbandname"> Ami Rodger </pos_cat>
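The remark about sorting can be sketched as follows. This is a minimal illustration, assuming that the "meaning" value lists the role of each name word in order after the "person" prefix; the <names> wrapper element and the functions name_parts and sort_key are hypothetical and are not defined by the scheme.

```python
import xml.etree.ElementTree as ET

# Three of the name examples from the text, wrapped in a hypothetical
# <names> element so that they form one well-formed document.
SAMPLE = """<names>
<pos_cat name="noun" type="proper" meaning="person_first_middle_surname"> Goutam Kumar Saha </pos_cat>
<pos_cat name="noun" type="proper" meaning="person_ancestorplaceInit__fathernameInit_first_surname"> K M Rama Rao </pos_cat>
<pos_cat name="noun" type="proper" meaning="person_surname_first_middle"> Saha Goutam Kumar </pos_cat>
</names>"""


def name_parts(elem):
    """Pair each role named in the meaning attribute with a name word."""
    # Drop the leading "person" and the empty strings produced by "__".
    roles = [r for r in elem.get("meaning").split("_")[1:] if r]
    words = elem.text.split()
    return dict(zip(roles, words))


def sort_key(elem):
    """Sort by surname, then by given name; fall back to the given name."""
    parts = name_parts(elem)
    first = parts.get("first", "")
    return (parts.get("surname", first), first)


root = ET.fromstring(SAMPLE)
ordered = sorted(root.findall("pos_cat"), key=sort_key)
for elem in ordered:
    print(elem.text.strip())
```

Because the role of each word is declared, "Goutam Kumar Saha" and "Saha Goutam Kumar" sort under the same surname even though the surname appears in different positions.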
MORPHOLOGICAL Metadata:
In many languages (for example, Hindi and Urdu), the verb form changes according to the gender, person and number of the subject of a sentence. In English, Bangla etc., a verb is not affected by gender at all: it has the same form for masculine and feminine subjects. In Bangla, moreover, the verb form does not depend on the number of the subject noun. This aspect is important in the Localization process.
For example:
(a) "Paul goes" takes the form "Paul Jata Hai" (Hindi); here, the subject- Paul is of masculine gender, third person and singular number, verb-form is "Jata Hai."
(b) "Ami goes" takes the form "Ami Jatee Hai" (Hindi); here, the subject- Ami is feminine gender, third person and singular number, verb-form is "Jatee Hai."
(c) "They go" takes the form "O Jate Hain" (Hindi); here, the subject- "They" is considered to be of masculine gender, third person and plural number.
(d) "They go" takes the form "O Jatee Hain" (Hindi); here, the subject- "They" is considered to be of feminine gender, third person and plural number.
(e) "I go" takes the form "Mai Jata Hoon" (Hindi); here, the subject- "I" is of masculine gender, first person and singular number.
(f) "I go" takes the form "Mai Jatee Hoon" (Hindi); here, the subject- "I" is considered to be of feminine gender (i.e. a girl says "I go"), first person, singular number.
(g) "You go" will be "Tum Jate Ho" (Hindi); here, "you" is meant to a male (second person, singular number).
(h) "You go" will be "Tum Jatee Ho" (Hindi); here, "you" refers to a female (second person, singular number).
So, we need to add such Morphological metadata for better internationalization & localization (e.g., localizing English content to Hindi).
At the same time, such Morphological Markups help to generate linguistic resources on source language also.
<!-- Morphological Metadata on grammatical Concord for I18N & L10N -->
<pos_cat name="noun" type="proper" meaning="male_per3_num1">Paul</pos_cat> <pos_cat name="verb">goes</pos_cat> <!-- Masculine Gender 3rd Person Singular Number -->
OR,
<pos_cat name="noun" type="proper" meaning="gpn_m_3_1">Paul</pos_cat> <!-- g(gender)p(person)n(number)_m(male)_3(3rdPerson)_1(singular number) --> <pos_cat name="verb">goes</pos_cat> <!-- Masculine Gender 3rd Person Singular Number -->
<pos_cat name="noun" type="proper" meaning="female_per3_num1">Ami</pos_cat> <pos_cat name="verb">goes</pos_cat> <!-- Feminine Gender 3rd Person Singular Number -->
OR,
<pos_cat name="noun" type="proper" meaning="gpn_f_3_1">Ami</pos_cat> <pos_cat name="verb">goes</pos_cat> <!-- Feminine Gender 3rd Person Singular Number -->
<pos_cat name="pronoun" meaning="male_per3_numN"> They</pos_cat> <pos_cat name="verb">go</pos_cat> <!-- Here, the word "they" refers to boys, Masculine Gender 3rd Person Plural Number -->
<pos_cat name="pronoun" meaning="female_per3_numN"> They</pos_cat> <pos_cat name="verb">go</pos_cat> <!-- Here, the word "they" refers to girls, Feminine Gender 3rd Person Plural Number -->
<pos_cat name="pronoun" meaning="male_per2_num1">You</pos_cat> <pos_cat name="verb">go</pos_cat> <!-- Masculine Gender 2nd Person Singular Number -->
<pos_cat name="pronoun" meaning="female_per2_num1">You</pos_cat> <pos_cat name="verb">go</pos_cat> <!-- Feminine Gender Second Person Singular Number -->
<!-- English Concord -->
<pos_cat name="noun" meaning="neuter_num1">Lots of Petrol</pos_cat> has been wasted.
<!-- In Bangla sentence "Aami Ghare Achhi," morphological markup for "Ghare" as an example -->
<pos_cat name="adverb" type="place" meaning="stem_Ghar_suffix_e_english_at+home">Ghare</pos_cat>
<!-- In Bangla sentence "Shikshakgan Echhechhen," morphological markup for "Shikshakgan" as an example, suffix "gan" expresses plural form with respect -->
<pos_cat name="noun" type="common" meaning="stem_Shikshak _suffix_gan_english_respected teachers">Shikshakgan</pos_cat>
<!-- suffix "gulo" or "ra" expresses no respect to human -->
<pos_cat name="noun" type="common" meaning="stem_Shikshak _suffix_gulo_english_ teachers">Shikshakgulo</pos_cat>
<!-- In Bengali, suffix e.g, "ei" is the exclusive suffix, "o" stands for the inclusive suffix -->
<pos_cat name="pronoun" type="personal" meaning="stem_Aami_ExclusiveSuffix_ei_english_ only I">Aamiei</pos_cat> Khelbo <!-- I only shall play -->
<pos_cat name="pronoun" type="personal" meaning="stem_Aami_InclusiveSuffix_o_english_ I also">Aamio</pos_cat> Khelbo <!-- I also shall play -->
<!-- Markup for the word in Bangla "Asantushto" (unhappy) -->
<pos_cat name="adjective" meaning="stem_Santushto_prefix_A_english_un+happy">Asantushto </pos_cat>
<!-- It is very important to add markup for the suffix along with the stem word for Indian languages to ease the translation or localization process -->
<pos_cat name="noun" type="common" meaning="stem_Badur_suffix_0_english_bat_bird"> Badur </pos_cat>
<!-- such markup helps an automatic morphological analyzer not to confuse by generating the stem as Badu and the most frequent suffix as R (i.e., Badu's ) -->
<pos_cat name="noun" type="proper" meaning="suffix_0"> Ganguly </pos_cat>
<!-- such markup helps an automatic morphological analyzer not to confuse by generating the stem as Gan and the most frequent suffix as Guly (i.e., Gans ) -->
<!-- English suffix markups -->
<pos_cat name="verb" meaning="stem_length_suffix_en">lengthen</pos_cat>
<!-- English prefix markups -->
<pos_cat name="adjective" meaning="stem_normal_prefix_ab">abnormal</pos_cat>
<pos_cat name="verb" meaning="stem_cast_prefix_fore">forecast</pos_cat>
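A tool that consumes the morphological markup has to split the "meaning" value back into its parts. The following is a minimal sketch of such a splitter; the regular expression and the function name parse_meaning are assumptions inferred from the examples above (stem, then a suffix or prefix label, then the affix, then an optional English gloss) and are not a normative grammar for the attribute.

```python
import re

# Assumed layout: stem_<stem>_<kind>_<affix>[_english_<gloss>], where
# <kind> ends in "suffix" or "prefix" (e.g. "ExclusiveSuffix").
MEANING = re.compile(
    r"^stem_(?P<stem>.+?)_(?P<kind>[A-Za-z]*(?:suffix|prefix))_(?P<affix>[^_]+)"
    r"(?:_english_(?P<gloss>.*))?$",
    re.IGNORECASE,
)


def parse_meaning(value):
    """Split a morphological meaning value into a dict, or return None."""
    match = MEANING.match(value)
    if match is None:
        return None
    parts = {key: (text.strip() if text is not None else None)
             for key, text in match.groupdict().items()}
    # In the examples, "0" marks a word that carries no affix at all.
    if parts["affix"] == "0":
        parts["affix"] = ""
    return parts


print(parse_meaning("stem_Ghar_suffix_e_english_at+home"))
print(parse_meaning("stem_normal_prefix_ab"))
print(parse_meaning("stem_Badur_suffix_0_english_bat_bird"))
```

Values that do not start with a stem, such as "suffix_0", fall outside this assumed layout and return None in the sketch.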
In many languages such as Hindi and Bangla, there are many Euphonic (Sandhi) rules that join words to form a new unified word. The spelling of the unified word is also altered to some extent at the join. Often the meaning of a unified word combines the meanings of its constituent words. For example, in Bangla, Hindi etc., the unified word "Bidyalay" (school) = Bidya (education) + Alay (house), Rabeendra = Rabi + Indra, and Debarshi = Deb + Rishi.
Markups for Euphonic rules are stated below.
<pos_cat name="noun" type="proper" meaning="euphonic_Rabi+Indra">Rabeendra </pos_cat> <pos_cat name="noun" meaning="euphonic_Bidya+Alay school">Bidyalay </pos_cat>
Again, in Bangla, Hindi and similar languages, Compounding (or Samas) joins two or more words into a single short unified word, and the suffix (Vibhakti) of the first word is removed in the process. The compounded word may not keep the meanings of all its constituent words intact. In Bahubrihi Samas, the compounded word may even denote a third thing altogether. Compounding Markups are shown below.
Both the Euphonic and Compounding Markups are very useful for understanding the syntax and the semantics (meaning) of the many commonly used unified words in Bangla, Hindi and similar languages.
<!-- Compounding Markups -->
<!-- in English, we see various compound words of various parts of speech -->
<pos_cat name="noun" meaning="compound_gas+light">gaslight </pos_cat>
<pos_cat name="adjective" meaning="compound_water+tight">watertight </pos_cat>
<pos_cat name="verb" meaning="compound_gun+fight">gunfight </pos_cat>
<!-- an example of blending through compounding -->
<pos_cat name="noun" meaning="compound_motor+pedal">moped </pos_cat>
<pos_cat name="noun" meaning="compound_medical+care">medicare </pos_cat>
<!-- English Compounding Nouns from Adverb + Verb or from Noun + Gerund -->
<pos_cat name="noun" meaning="compound_off+shoot">off-shoot </pos_cat>
<pos_cat name="noun" meaning="compound_book+keeping">bookkeeping </pos_cat>
<pos_cat name="noun" meaning="compound_day+dreaming">day-dreaming </pos_cat>
<!-- examples on compounding markups in Bangla and Hindi etc -->
<pos_cat name="noun" meaning="compoundDwandwa_Bhai_and_Bon Brother and Sister">Bhaibon </pos_cat>
<pos_cat name="noun" type="proper" meaning="compoundDwandwa_Ram_and_Seeta">RAMSEETA </pos_cat>
<!-- Ramseeta looks to be of singular number, but it is plural (as shown by above compounding). Bangla sentence "Ramseeta Jai" should be translated in English as "Ramseeta go" -->
<pos_cat name="noun" meaning="compoundTatpurush_Raja_Theke_Bhoi Afraid of King">Rajbhoi <!-- Raja (king) Theke (from) Bhoi (fear) --> </pos_cat>
<pos_cat name="noun" meaning="compoundDwigu_Sapto_Aher_Samahar week">Saptaho <!-- Sapto (seven) Aher (days') Samahar (collection) --> </pos_cat>
<pos_cat name="noun" meaning="compoundDwigu_Chotuh_Aksharer_Samahar a word of four letters">Choturakshar </pos_cat>
<pos_cat name="noun" meaning="compoundBahubrihi_Su_Hridoi_Jahar good hearted person">Suhrid <!-- Su (good) Hridoi (heart) Jahar (whose); this indicates a man who is good-hearted --> </pos_cat>
<pos_cat name="noun" meaning="compoundBahubrihi_Su_Arno_Jahar gold">Swarno <!-- Su (good) Arno (colour) Jahar (whose); this indicates gold only --> </pos_cat>
<pos_cat name="noun" meaning="compoundBahubrihi_Peeta_Ambar_Jahar ShreeKrishna">Peetambar <!-- Peeta (yellow) Ambar (garment) Jahar (whose); this indicates the Hindu God ShreeKrishna only --> </pos_cat>
<pos_cat name="noun" meaning="compoundKarmadharai_Dwi_Adhik_Dosh Twelve">Dwadosh <!-- Dwi (two) Adhik (more) Dosh (Ten) that is Twelve --> </pos_cat>
<pos_cat name="noun" meaning="compoundKarmadharai_Ghaner_Nyai_Shyam as black as cloud">Ghanashyam <!-- Ghaner (cloud) Nyai (like) Shyam (Black)--> </pos_cat>
We can also provide metadata on the class of an object (e.g., for nouns), which helps in meaningful translation, in preparing synsets etc. This provides semantic (or preliminary ontological) metadata. For example, an autorickshaw is a kind of vehicle.
<pos_cat name="noun" meaning="class_vehicle">Autorickshaw </pos_cat>
<pos_cat name="noun" type="proper" meaning="class_femalename">Shanti </pos_cat>
OR,
<pos_cat name="noun" type="proper" meaning="class_human gpn_f_3_1">Shanti <!-- i.e., gender person number_female_3rdPerson_SingularNumber --> </pos_cat>
<!-- Such semantic metadata helps in understanding the semantics of a sentence [e.g., in Bangla sentence: "Ghare Shanti Nei" (that is, in English: (a) Shanti (a girl) is not at home; Or, (b) There is no peace at home)] -->
<pos_cat name="noun" type="abstract" meaning="class_emotion peace">Shanti </pos_cat> <!-- Shanti (peace) or Shanti (girl's first name) -->
<!-- For English sentence "Go to Jamuna Bank," ontological markups -->
<sentence_cat name="imperative"> Go to <pos_cat name="noun" meaning="class_river">Jamuna</pos_cat> <pos_cat name="noun" meaning="class_land">Bank</pos_cat> <!-- i.e., go to the bank of Jamuna river --> </sentence_cat>
<!--Again, for English sentence "Go to Jamuna Bank," ontological markups -->
<sentence_cat name="imperative"> Go to <pos_cat name="noun" meaning="gpn_f_3_1">Jamuna</pos_cat> <pos_cat name="noun" meaning="class_financial-institution">Bank</pos_cat> <!-- i.e., go to the bank of Jamuna (for money) --> </sentence_cat>
OR,
<sentence_cat name="imperative"> Go to <pos_cat name="noun" type="compound" meaning="gpn_n_3_1 class_finance">Jamuna Bank <!-- neuter gender --> </pos_cat> <!-- i.e., go to the bank of Jamuna (for money) --> </sentence_cat>
Understanding Sentence-Level Markups:-
A sentence is a set or group of words which makes complete sense. Semantically, sentences are of four major kinds: (a) a Declarative or Assertive sentence (that makes a statement or assertion, e.g., He sat on a chair.), (b) an Interrogative sentence (that asks a question, e.g., Where do you go?), (c) an Imperative sentence (that expresses a command or an entreaty, e.g., Be quiet.), (d) an Exclamatory sentence (that expresses strong feeling, e.g., How cold the day is!). Other semantic classifications of sentences are: (e) a Praying sentence expresses a prayer, e.g., "May God bless you." (f) a Causative sentence expresses a cause and effect or a condition, e.g., "If you work hard you will definitely succeed." (g) a Suspicion sentence expresses a guess or suspicion, e.g., "It might rain now." (h) a Cursed sentence expresses an imprecation, e.g., "You devil, get ruined." (i) a Proverbial sentence denotes a proverbial expression, e.g., "Cut your coat according to your cloth." "Grapes are sour." (j) a Taunt sentence expresses a jeering, sarcastic or derisive remark, e.g., "That penniless boy behaves as if he is a king." (k) a Chant sentence expresses chanting, e.g., "Om Tot Sot.", that we don't intend to translate. (l) a Link sentence expresses a link to a web page or a similar resource.
A sentence is divided into two main parts: (a) the Subject (i.e., the person or thing about which something is said) and (b) the Predicate (i.e., what is said about the person or thing denoted by the subject). The subject may consist of one word or several words. The predicate may also consist of one or several words. In other words, we must have a subject to speak about and we must say or predicate something about that subject. For example, in the sentence "The sun gives light.", "The sun" is the subject and "gives light." is the predicate.
A Phrase is a group of words which makes sense, but not complete sense. For example, "in the east" in the sentence "The sun rises in the east." is a phrase. A Clause is a group of words which forms part of a sentence, and contains a Subject and a Predicate. For example, in the sentence "She has a chain which is made of gold", the group of words "which is made of gold" is a clause that contains a Subject (which) and a Predicate (is made of gold). There are three major formative categories of sentences: (a) a Simple Sentence has only one Subject and one Predicate (or, a simple sentence has only one Finite verb). For example, the sentence "He goes to school" is a simple sentence. (b) a Compound Sentence is made up of two or more Main or Principal Clauses (a Main Clause is an independent clause that makes good sense by itself and can stand by itself as a separate sentence). For example, the sentence "The moon was bright and we could see our way" is a compound sentence. Here, we have two main clauses: (i) "The moon was bright" and (ii) "we could see our way". These two clauses are joined by the co-ordinating conjunction "and". (c) a Complex Sentence consists of one Main Clause and one or more Subordinate Clauses. For example, the sentence "They rested when evening came" is a complex sentence. As the clause "They rested" makes good sense by itself, it is a Main Clause. The clause "when evening came", by contrast, cannot stand by itself and cannot make good sense. It is dependent on the clause "They rested". It is therefore called a Dependent or Subordinate Clause.
We can also use this 3-Tier Schema approach for indicating whether a content or a sentence or a word needs to be kept "as it is" (without translation) or not. The following markups can be used:
<!-- Markup to exclude a sentence from translation --> <content_domain name="religion" type="no_translation"> <!-- to add metadata information for not translating a content --> Om Ganga. <!-- default is to translate --> .... sentences ... </content_domain>
An example is given below how to add metadata information to skip translation process for a sentence in a Cultural content's chanting part.
<!-- Markup for Cultural Chant that need not be translated --> <content_domain name="cultural"> <!-- default is to translate --> ..... ..... sentences ..... <!-- above sentences need to be translated--> <sentence_cat name="chant" type="no_translation"> Om Ganeshaioh Namoh. </sentence_cat> </content_domain>
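The default-to-translate behaviour described above can be sketched as a small filter. This is a minimal illustration using Python's standard library, assuming that a type="no_translation" value on a content_domain or sentence_cat element also covers everything nested inside it; the function name translatable_sentences and the first sample sentence are placeholders invented for the example.

```python
import xml.etree.ElementTree as ET

# The chant comes from the example above; the first sentence is a
# placeholder standing in for the ordinary sentences to be translated.
DOC = """<content_domain name="cultural">
  <sentence_cat name="assertive">The festival begins today.</sentence_cat>
  <sentence_cat name="chant" type="no_translation">Om Ganeshaioh Namoh.</sentence_cat>
</content_domain>"""


def translatable_sentences(root):
    """Yield the text of every sentence_cat that should be translated."""
    def walk(elem, blocked):
        # Once an ancestor says no_translation, everything below is skipped.
        blocked = blocked or elem.get("type") == "no_translation"
        if elem.tag == "sentence_cat":
            if not blocked:
                yield "".join(elem.itertext()).strip()
            return
        for child in elem:
            yield from walk(child, blocked)
    yield from walk(root, False)


root = ET.fromstring(DOC)
to_translate = list(translatable_sentences(root))
print(to_translate)
```

Carrying the blocked flag down the tree lets one attribute on a content_domain element switch off translation for a whole passage, which matches the content-level example given earlier.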
Many sentences mix words from more than one language. For example, consider the Hindi sentence "Kaam Jaldi Start Kijiye" (in English: "Start the work immediately." Lexicons: Kaam/ Work, Jaldi/ immediately, Kijiye/ Do). Please note that the source language Hindi sentence contains the English word "start". Such multilingual wording is very common in urban areas. Because we provide the meaning of the foreign word (e.g., start) inside the sentence of the source language (here, Hindi), a translation parser should have no problem understanding a sentence that contains multilingual words.
<!-- Markup for a Sentence having Multilingual Words -->
<!-- Markup for the Hindi sentence "Kaam Jaldi Start Kijiye" -->
<text xml:lang="hin">
<sentence_cat name="imperative"> Kaam Jaldi <pos_cat name="verb" type="compound" meaning="start"> start kijiye </pos_cat> </sentence_cat>
</text>
<!-- Markup for a Taunt-Sentence --> <text xml:lang="en"> <sentence_cat name="taunt"> <pos_cat name="noun" type="proper">Ramen</pos_cat> is a joker at our office. </sentence_cat> </text>
<!-- Markup for a Taunt-Sentence in Bangla- Kaana Chheler Naam Padmalochon- in English- The blind boy's name is Padmalochon (eyes like lotus's petals) -->
<text xml:lang="ben">
<sentence_cat name="taunt">
<pos_cat name="adjective" meaning="blind">Kaana</pos_cat>
<pos_cat name="noun" type="common" meaning="boy's"> Chheler </pos_cat>
<pos_cat name="noun" meaning="name"> Naam </pos_cat>
<pos_cat name="verb" type="linking" meaning="is"> </pos_cat> <!--a missing verb -->
<pos_cat name="noun" type="proper"> Padmalochon </pos_cat>
</sentence_cat>
</text>
In many cases we need to skip slang words or slang sentences because we do not want them to be processed or presented. The markup to skip words or sentences is shown below.
<!-- Markup to skip a word --> <pos_cat name="skip"> slang_word </pos_cat>
<!-- Markup to skip a slang sentence --> <sentence_cat name="skip"> slang_sentence </sentence_cat>
We often see that an image (along with an embedded Tooltip text sentence) is inserted
in a sentence. We intend to translate the sentence as well as the ToolTip text sentence.
We may use the following markup.
<!-- Sentence-Level Markup for ToolTip text sentence embedded inside an Image --> <para> Click here <image source="begin.jpg" alt="Begin now." /> <sentence_cat name="alt_value"> Begin now. </sentence_cat> to play now. </para>
Sentence Level Markup for the HTML Title Attribute:
A "title" attribute can be inserted inside any HTML tag. This attribute gives the element a tooltip that pops up when the mouse moves over it (for example, on a link labelled W3C). For Internationalization and Localization, we should translate the value of the HTML Title Attribute. Such markup is also useful for translating VBScript / JavaScript tooltip text on events such as ONMOUSEOVER.
<!-- Sentence-Level Markup for HTML Title Attribute -->
<sentence_cat name="title_value">
<a href="http://www.w3.org" title="Click here for the W3C ">W3C </a>
</sentence_cat>
Another example of Markup for "HTML Title Attribute" for a form using the "Input Text Box."
<!-- Markup for HTML Title Attribute using Input Text Box -->
<sentence_cat name="title_value">
<form>
<input type="text" size=20 title="Enter your email address here">
<input type="button" value="Submit">
</form>
</sentence_cat>
An example on Markup for Javascript ToolTips Text.
<sentence_cat name="scripttitle_value">
<!-- Markup for Javascript ToolTips text on events like ONMOUSEOVER -->
<A HREF="/tips/page2.asp" ONMOUSEOVER="this._tip='It <FONT COLOR=red> <B>simplifies</B></FONT> DHTML with powerful library'"> DHTML Library </A>
</sentence_cat>
An example on Markup for Javascript document.write text.
<!-- Markup for Javascript document.write text -->
<sentence_cat name="script_document_write_value">
<script type="text/javascript">
var d = new Date();
var time = d.getHours();
if (time < 12) {
  document.write("<b>Good morning</b>");
}
</script>
</sentence_cat>
An example on Markup for Javascript document.title text that we may need to translate. Here the alert() function is to pop up a message box telling us the title of an HTML document by accessing the document object that contains all the information needed about the HTML document. The document object includes a property called title which returns the title of the current HTML document.
<!-- Markup for Javascript document.title text -->
<sentence_cat name="script_document_title">
<HTML>
<HEAD>
<TITLE>web page title</TITLE>
<SCRIPT LANGUAGE = JavaScript>
alert(document.title);
alert(document.title="Now it is changed");
</SCRIPT>
</HEAD>
<BODY> </BODY>
</HTML>
</sentence_cat>
Word-Level and Sentence-Level Markups for HTML Form Captions or Labels are stated below.
They alert the translator to text that needs localization.
<!-- Word Level Markup for HTML Form (Label) Captions that we may need to translate --> <pos_cat name="form_label"> <form> Given Name: <input type="text" name="givenname"> <br> Family Name: <input type="text" name="familyname"> </form> </pos_cat> <!-- Word Level Markup for HTML Form (Radio) Captions that we may need to translate --> <pos_cat name="form_label"> <form> <input type="radio" name="title" value="Mr"> Mr <br> <input type="radio" name="title" value="Ms"> Ms </form> </pos_cat> <!-- Sentence-Level Markup for Checkbox Captions that we may need to translate --> <sentence_cat name="form_label"> <form> <input type="checkbox" name="aged"> Age above 50 years <br> <input type="checkbox" name="young"> Age in between 20-35 </form> </sentence_cat>
Metadata on the order of Subject, Verb and Object (SVO), Subject, Object and Verb (SOV) or Object, Verb and Subject (OVS) in a sentence is often useful for parsing the sentence faster in the translation process. The order of words in a two-element sentence is the same in all languages, viz. subject + verb. In a three-element or four-element sentence, however, the order of words varies from language to language. For example, the usual order of words in Bengali, Hindi, Urdu etc. is: <subject> <object> <verb> or <subject> <complement> <verb>. Sentences in Indian languages such as Bangla and Hindi normally follow the SOV structure, whereas English sentences follow SVO. However, it is not uncommon to see other forms. Other sentence patterns in English are: <subject> <verb> <complement> or <subject> <verb> <object> <object> or <subject> <verb> <object> <complement>. Word order in English is relatively rigid compared to Indian languages, which are called "free word order" languages. We can add such markups in the following way.
<text xml:lang="ben">
<!-- Sentence-Level Markups for SOV or SVO or OVS for a Bangla sentence- Paul Apple Kheyechhe (i.e., Paul has eaten apple - in English) -->
<sentence_cat name="assertive" type="sov">
<pos_cat name="noun" type="proper"> Paul </pos_cat>
<pos_cat name="noun" type="common">Apple </pos_cat>
<pos_cat name="verb" type="transitive">Kheyechhe </pos_cat>
</sentence_cat>
<!-- examples on OVS or SVO for the same sentence without word-level markups -->
<sentence_cat name="assertive" type="ovs"> Apple Kheyechhe Paul <!-- such OVS markup prevents the above sentence from being translated as- Apple has eaten Paul (a bad translation output) --> </sentence_cat>
<sentence_cat name="assertive" type="svo"> Paul Kheyechhe Apple </sentence_cat>
</text>
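To show why the word-order attribute helps, the following minimal sketch reads the "type" value of a sentence_cat element and reorders the constituents into the order expected by the target language. It assumes, for simplicity, that each of subject, verb and object is a single word; the function name reorder is hypothetical, and a real translation step would of course also translate the words.

```python
import xml.etree.ElementTree as ET


def reorder(sentence_xml, target="svo"):
    """Return the words of a three-element sentence in the target order.

    Sketch only: the source order is read from the type attribute and
    each constituent is assumed to be exactly one word.
    """
    elem = ET.fromstring(sentence_xml)
    source = elem.get("type")              # e.g. "sov", "ovs" or "svo"
    words = "".join(elem.itertext()).split()
    if len(words) != len(source):
        raise ValueError("sketch assumes one word per constituent")
    by_role = dict(zip(source, words))     # e.g. {"o": "Apple", ...}
    return [by_role[role] for role in target]


ovs = '<sentence_cat name="assertive" type="ovs"> Apple Kheyechhe Paul </sentence_cat>'
sov = ('<sentence_cat name="assertive" type="sov">'
       '<pos_cat name="noun" type="proper"> Paul </pos_cat>'
       '<pos_cat name="noun" type="common">Apple </pos_cat>'
       '<pos_cat name="verb" type="transitive">Kheyechhe </pos_cat>'
       '</sentence_cat>')

print(reorder(ovs))
print(reorder(sov))
```

Both inputs give the subject "Paul" first in the English order, so the OVS sentence is no longer at risk of being read as "Apple has eaten Paul".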
We also need to add markup for noun phrases and verb phrases as a translation aid. For example, consider the Hindi sentence "Sisa (Lead - a metal) Rahit (a person's first name) Petrol (oil) Milta Hai (is available)." Here, the noun phrase "Sisa Rahit Petrol" needs to be treated differently. The noun-phrase and verb-phrase information helps a translator to do so when we mark up the sentence in the following way.
<text xml:lang="hin"> <!-- ISO 639-2 is followed here for Language coding -->
<!-- Markup for noun phrase, verb phrase -->
<sentence_cat name="assertive">
<pos_cat name="noun" type="phrase" meaning="leadless petrol"> Sisa Rahit Petrol</pos_cat> <!-- here, Rahit (as preceded by a material noun) does not mean a person but means- less or without -->
<pos_cat name="verb" type="phrase" meaning="is available"> Milta Hai </pos_cat>
</sentence_cat>
<!-- translated output in English- Leadless Petrol is available -->
<!-- the above markups along with the meaning attribute help the translator a lot even if the localizers' source language specific resources are not rich -->
</text>
We can also add the noun_phrase and verb_phrase sentence-level markups in the following way.
<text xml:lang="hin">
  <!-- Markup for noun phrase, verb phrase -->
  <sentence_cat name="noun_phrase" meaning="leadless petrol">Sisa Rahit Petrol</sentence_cat>
  <!-- here, Rahit (as preceded by a material noun) does not mean a person but means "less" or "without" -->
  <sentence_cat name="verb_phrase" meaning="is available">Milta Hai</sentence_cat>
  .
  <!-- translated output in English: Leadless Petrol is available -->
  <!-- the above markup, together with the meaning attribute, helps a translator considerably even if the localizers have few resources for the source language -->
</text>
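The effect of the phrase-level meaning attribute can be sketched in Python as follows. This is only an illustration under stated assumptions: the two lookup tables are hypothetical stand-ins for a translator's lexicon (their entries are taken from the explanation of the Hindi sentence above), and the function gloss is not part of the proposed scheme. At each position it prefers the longest marked-up phrase and falls back to a word-by-word gloss.

```python
# Word-by-word glosses taken from the explanation of the Hindi sentence.
WORD_GLOSS = {
    "Sisa": "lead",
    "Rahit": "Rahit (a first name)",
    "Petrol": "petrol",
}

# Phrase-level glosses, as carried by the meaning attribute in the markup.
PHRASE_GLOSS = {
    ("Sisa", "Rahit", "Petrol"): "leadless petrol",
    ("Milta", "Hai"): "is available",
}


def gloss(words, phrase_gloss, word_gloss):
    """Gloss a sentence, preferring the longest marked-up phrase at each position."""
    result = []
    i = 0
    while i < len(words):
        for length in range(len(words) - i, 0, -1):
            chunk = tuple(words[i:i + length])
            if chunk in phrase_gloss:
                result.append(phrase_gloss[chunk])
                i += length
                break
        else:
            # No marked-up phrase starts here: fall back to the single word.
            result.append(word_gloss.get(words[i], words[i]))
            i += 1
    return result


sentence = ["Sisa", "Rahit", "Petrol", "Milta", "Hai"]
print(gloss(sentence, {}, WORD_GLOSS))            # without phrase markup
print(gloss(sentence, PHRASE_GLOSS, WORD_GLOSS))  # with phrase markup
```

Without the phrase markup, "Rahit" is glossed as a name; with it, the whole noun phrase is rendered as "leadless petrol".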
Understanding Content Domain Level Markups
The content domain of a paragraph of text is normally the most frequently occurring word (e.g. a noun) in that paragraph. For example, if the frequency of the word "football" (counted together with its synonyms) is the highest among all words in a paragraph, then the content domain is "football". Equivalently, the word with the maximum word density is often the content domain. The word density is the ratio of the number of times a word (or one of its synonyms) appears in a document to the size of the document (its total word count). It is a measure of how important a word is to the overall content of the document: a higher word density results in a higher relevance ranking. We should not count prepositions, interjections or forms of address (e.g., "Hello", "Sir") when computing word density. In many speeches and communications "Sir" has the highest word density, which would be misleading when finding the content domain. Rather, we should consider only nouns when looking for the most frequently occurring word. We may use a content domain such as music (for musical text), doctor (for prescriptions), mathematics, defence or sports, according to the content of the text.
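The word-density measure described above can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation: the IGNORED set and the SYNONYMS table are hypothetical sample data, and a small stop list stands in for real part-of-speech filtering, which would need a tagger for the language concerned.

```python
import re
from collections import Counter

# Hypothetical sample data: words to skip (forms of address, prepositions,
# other function words) and a synonym table mapping variants to one word.
IGNORED = {"sir", "hello", "in", "on", "at", "of", "to", "the", "a", "is", "and"}
SYNONYMS = {"soccer": "football"}


def word_density(text, ignored=IGNORED, synonyms=SYNONYMS):
    """Return {word: occurrences / total word count} for the words that are kept."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    total = len(words)
    if total == 0:
        return {}
    counts = Counter(synonyms.get(w, w) for w in words if w not in ignored)
    return {word: count / total for word, count in counts.items()}


def content_domain(text):
    """Pick the kept word with the highest density as the content domain."""
    density = word_density(text)
    return max(density, key=density.get) if density else None


paragraph = "Sir, football is popular. Sir, soccer fans watch football. Sir, football!"
print(content_domain(paragraph))
```

In the sample paragraph "Sir" occurs three times but is ignored, while "football" and its synonym "soccer" together account for 4 of the 11 words, so "football" is chosen.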
Challenges
The proposed 3-tier XML schema aims to mark up both syntactic and semantic metadata in the structure of an XML document, with the goal of yielding more meaningful translation. Such embedded information is important to both the internationalization and localization processes.
We follow the three basic steps of the 3-tier schema to embed linguistic-related metadata in the structure of an XML document and so improve the translation process. The 3-tier scheme is also useful for Translation Memory processes, which can keep the context markup when I18N and L10N developers use the scheme for both source and target text.
======
The Importance of the 3-tier generalized schema:
- In many languages, many common words have several meanings (word sense ambiguity) in different contexts, even though the POS category may remain the same. Only from the context of the content domain and of the sentence can we determine the appropriate meaning of such words.
For example, in English the word "bat" has multiple meanings:
(a) "a flying mammal" (in the content domain of zoology/animal) or
(b) "a playing instrument" used to hit a ball, such as a cricket bat (in the content domain of sports). In both cases, the part-of-speech of "bat" is noun (finer category: common noun).
Similarly, in an Indian language (say, Bangla) the word "Dhar" has multiple meanings:
(a) "to catch" in the content domain of, say, sports/law, (b) "to assume" in the content domain of education, (c) "to have pain" in the content domain of medical/health, (d) "to come to an end" in the content domain of weather and (e) "to begin" in the content domain of culture [a-e with POS: verb], and
(f) a family name [proper noun], and so on.
- The 1st level schema (content domain markup) is useful for marking the context information for a paragraph of translatable content.
- The 2nd level schema (sentence level markup) takes care of translatable proverbs, idioms, dialect, usages etc. for any human language.
- The 3rd level schema (word level markup) is used to obtain the most appropriate meaning of a word (one with POS ambiguity, i.e. multiple possible POS, or with word sense ambiguity) in a sentence inside a text.
- A content author should not find it difficult to use such markup, because the scheme does not restrict which appropriate markup can be added as an attribute.
- A content author need not use all three levels of markup in every part of a document (not for every word and sentence).
- Markup needs to be used only in the sensitive, difficult or ambiguous parts of a document.
- For some languages, a content author may not even need to add finer sub-category markup to a document.
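How the tiers narrow down a word sense can be sketched as a simple lookup in Python. The SENSES table below is a hypothetical lexicon built only from the "bat" and "Dhar" examples above, and resolve_sense is an assumed helper name, not part of the proposed scheme: a word-level meaning attribute wins if present, otherwise the content domain selects the sense.

```python
# Hypothetical sense table keyed by (word, content domain).
SENSES = {
    ("bat", "zoology"): "a flying mammal",
    ("bat", "sports"): "an instrument for hitting a ball",
    ("Dhar", "sports"): "to catch",
    ("Dhar", "law"): "to catch",
    ("Dhar", "education"): "to assume",
    ("Dhar", "medical"): "to have pain",
    ("Dhar", "weather"): "to come to an end",
    ("Dhar", "culture"): "to begin",
}


def resolve_sense(word, domain, explicit_meaning=None):
    """Return the sense of `word`, or None when the markup gives no answer.

    An explicit word-level meaning (3rd tier) takes priority; otherwise the
    content domain (1st tier) is used to pick among the known senses.
    """
    if explicit_meaning:
        return explicit_meaning
    return SENSES.get((word, domain))


print(resolve_sense("bat", "sports"))
print(resolve_sense("Dhar", "weather"))
print(resolve_sense("played", "literature", explicit_meaning="acted"))
```

A None result signals that neither tier resolves the word, which is the case where an author would need to add markup.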
======================
How to use the 3-tier Schema?
Here is an example using the 3-tier schema.
Consider content containing the sentence "He played in Odyssey."
-----
<content_domain name="literature" type="drama">
  .... ....
  <sentence_cat name="semantic" type="demonstrative">
    He <pos_cat name="verb" meaning="acted">played</pos_cat> in Odyssey.
    <!-- here, "played" implies the verb "acted" -->
  </sentence_cat>
  ......... ......... .........
</content_domain>
------
We can use such markup in whatever way is convenient to convey both syntactic and semantic information to a translator.
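A consumer of the 3-tier markup might collect, for every marked-up word, the context supplied by all three tiers. The following Python sketch does this with the standard xml.etree.ElementTree module; the function name word_contexts and the dictionary keys are assumptions made for illustration, and SAMPLE is the example above with the elided parts left out.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<content_domain name="literature" type="drama">
  <sentence_cat name="semantic" type="demonstrative">
    He <pos_cat name="verb" meaning="acted">played</pos_cat> in Odyssey.
  </sentence_cat>
</content_domain>
"""


def word_contexts(xml_text):
    """Return one dict per marked-up word with the context of all three tiers."""
    root = ET.fromstring(xml_text)
    contexts = []
    # Element.iter() also visits the element it is called on, so this works
    # whether content_domain is the root or nested deeper in the document.
    for domain in root.iter("content_domain"):
        for sentence in domain.iter("sentence_cat"):
            for word in sentence.iter("pos_cat"):
                contexts.append({
                    "domain": domain.get("name"),
                    "domain_type": domain.get("type"),
                    "sentence": sentence.get("name"),
                    "sentence_type": sentence.get("type"),
                    "word": (word.text or "").strip(),
                    "pos": word.get("name"),
                    "meaning": word.get("meaning"),
                })
    return contexts


for context in word_contexts(SAMPLE):
    print(context)
```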
==========================
Quick Guidelines
- Examples of content domain categories: travel, science, information technology, fashion etc.
- Examples of sentence categories: for the Formative type: simple, compound and complex; for the Semantic type: taunt, proverb, suspicion, demonstrative etc.
- Examples of word POS categories: for the Noun type: proper, abstract, compound etc., and so on.
Developers can categorize the linguistic-related metadata in the following ways.
The metadata may vary slightly from one language to another.
Content Domain
<content_domain> <it> <enggtech> <factory> <science> <medical> <mathematics> <agriculture> <environment> <economics> <weather> <astrology> <law> <businesstrade> <finance> <advertise> <occupation> <education> <geography> <history> <philosophy> <religion> <no_translation> <literature> <lyrics> <story> <poetry> <drama> <dialect> <politics> <sports> <entertainment> <news> <international> <national> <travel> <communications> <dialect> <society> <humanities> <civic> <cultural> <no_translation> <administrative> <emotion> <review> <defence> <gossip> <figure> <sex> <violence> <citation> <speech>
Sentence Categories
<sentence_categories> <formative> <simple> <svo> <sov> <ovs> <vos> <compound> <complex> <nounian> <adjective> <adverb> <noun_phrase> <verb_phrase> <semantic> <demonstrative> <affirmative> <negative> <interrogative> <direct> <indirect> <imperative> <direct> <indirect> <praying> <direct> <indirect> <exclamatory> <direct> <indirect> <causative> <suspicion> <cursed> <taunt> <ironical> <ridiculous> <proverbial> <phrases_idioms> <voice> <active> <passive> <impersonal> <link> <skip> <alt_value> <title_value> <script_document_title> <script_document_write_value> <scripttitle_value> <form_label>
Parts-of-Speech Metadata
<!-- Finer POS categories give us more detailed word-level syntactic and semantic information that might help a translation process produce more meaningful translated output. Finer categories also provide pragmatic information or deep semantics. For example, take the sentence "Ram entered the room and uttered to others, 'Don't you feel hot inside.'" Here, the pragmatic reading is "Ram wants to switch on the air conditioner or the fan in that room." -->
<pos_categories> <noun> <proper> <common> <material> <collective> <cardinals> <abstract> <abstract_concrete> <verbal> <preceeding_noun_of_title> <following_noun_of_title> <noun_location> <numeral> <unit_of_measurement> <hyphenated_numbers> <ordinal_number> <fractional_number> <compound> <echo> <negative> <numbers> <phrase> <pronoun> <personal> <demonstrative> <indefinite> <interrogative> <relative> <correlative> <honorific> <reflexive> <inclusive> <denoting_others> <adjective> <general> <abstract> <quantitative> <ordinal> <proper> <material> <compound> <interrogative> <pronounian> <pronoun's> <demonstrative> <end_inflecting> <verb_ending> <repeated_verb_ending> <onomatopoeic> <interjectory> <comparative> <superlative> <adverb> <abstract> <time> <place> <repetitive> <adjective_adjective> <adjective_adverb> <verb> <primary_root> <causative> <noun> <joining> <compound> <intransitive> <transitive> <anaphoric> <linking> <phrase> <group> <negative> <imperative> <nonfinite> <conjunction> <coordinating> <adversative> <subordinating> <disjunctive> <exclusion> <conclusive> <suspicion> <eternal_joined> <abstract> <addressing> <interrogative> <onomatopoeic> <indeclinable> <ending> <tense> <case> <personal> <imperative> <participle_principal> <post_position> <preposition> <punctuation> <comma> <sentence_final> <quote> <left_parenthesis> <right_parenthesis> <mid_sentence> <other_punctuation> <date> <dd-mm-yy> <dd/mm/yy> <dd-mm-yyyy> <dd/mm/yyyy> <mm/dd/yyyy> <mm-dd-yyyy> <yy/mm/dd> <yyyy/mm/dd> <yyyy-mm-dd> <dd mmm, yyyy> <mmm dd, yyyy> <dd She mmmm, yyyy> <dd i mmmm, yyyy> <dd mmmm, yyyy> <time> <hh:mm> <hh:mm:ss> <mm to hh> <hh Bajkar mm> <hh Ta Beje mm> <hh Ta Bajte mm> <Poune hh> <Soaa hh> <hh Kal> <Saare hh> <hh Aara> <hh Ara> <Der> <Dhai> <alt_value> <link> <skip> <form_value>
====
A schema for the content domain using element "content_domain" and three attributes namely, "name", "type" and "meaning" is shown below.
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:cont="http://www.kolkatacdac.in/w3ci18ncd" elementFormDefault="qualified">
  <xs:import namespace="http://www.kolkatacdac.in/w3ci18ncd" schemaLocation="C:\Documents and Settings\Administrator\My Documents\w3c-its-wg\cd1.xsd"/>
  <xs:element name="cd">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="cont:content_domain" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" targetNamespace="http://www.kolkatacdac.in/w3ci18ncd" elementFormDefault="qualified">
  <xs:element name="content_domain">
    <xs:complexType mixed="true">
      <xs:attribute name="name" type="xs:string" use="required"/>
      <xs:attribute name="type" type="xs:string"/>
      <xs:attribute name="meaning" type="xs:string"/>
    </xs:complexType>
  </xs:element>
</xs:schema>
A typical schema of the content_domain using elements is shown below.
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified">
  <xs:complexType name="catType">
    <xs:attribute name="name" use="required">
      <xs:simpleType>
        <xs:restriction base="xs:string">
          <xs:enumeration value="administrative"/> <xs:enumeration value="advertise"/> <xs:enumeration value="agriculture"/>
          <xs:enumeration value="astrology"/> <xs:enumeration value="businesstrade"/> <xs:enumeration value="citation"/>
          <xs:enumeration value="communications"/> <xs:enumeration value="defence"/> <xs:enumeration value="dialect"/>
          <xs:enumeration value="economics"/> <xs:enumeration value="education"/> <xs:enumeration value="emotion"/>
          <xs:enumeration value="enggtech"/> <xs:enumeration value="entertainment"/> <xs:enumeration value="environment"/>
          <xs:enumeration value="figure"/> <xs:enumeration value="finance"/> <xs:enumeration value="geography"/>
          <xs:enumeration value="gossip"/> <xs:enumeration value="history"/> <xs:enumeration value="it"/>
          <xs:enumeration value="law"/> <xs:enumeration value="literature"/> <xs:enumeration value="mathematics"/>
          <xs:enumeration value="medical"/> <xs:enumeration value="news"/> <xs:enumeration value="occupation"/>
          <xs:enumeration value="philosophy"/> <xs:enumeration value="politics"/> <xs:enumeration value="religion"/>
          <xs:enumeration value="review"/> <xs:enumeration value="science"/> <xs:enumeration value="sex"/>
          <xs:enumeration value="society"/> <xs:enumeration value="speech"/> <xs:enumeration value="sports"/>
          <xs:enumeration value="travel"/> <xs:enumeration value="violence"/> <xs:enumeration value="weather"/>
        </xs:restriction>
      </xs:simpleType>
    </xs:attribute>
    <xs:attribute name="type">
      <xs:simpleType>
        <xs:restriction base="xs:string">
          <xs:enumeration value="civic"/> <xs:enumeration value="cultural"/> <xs:enumeration value="drama"/>
          <xs:enumeration value="humanities"/> <xs:enumeration value="international"/> <xs:enumeration value="lyrics"/>
          <xs:enumeration value="national"/> <xs:enumeration value="poetry"/> <xs:enumeration value="story"/>
        </xs:restriction>
      </xs:simpleType>
    </xs:attribute>
  </xs:complexType>
  <xs:element name="content_domain">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="cat" type="catType"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:c="http://www.kolkatacdac.in/w3ci18ncd" elementFormDefault="qualified">
  <xs:import namespace="http://www.kolkatacdac.in/w3ci18ncd" schemaLocation="C:\Documents and Settings\Administrator\My Documents\auth-contdom-13091.xsd"/>
  <xs:element name="content_domain">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="c:cat" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
======
A schema for the sentence category level using element "sentence_cat" and three attributes namely, "name", "type" and "meaning" is shown below.
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<!-- Sentence Category Schema using attributes -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:scat="http://www.kolkatacdac.in/w3ci18nsc" elementFormDefault="qualified">
  <xs:import namespace="http://www.kolkatacdac.in/w3ci18nsc" schemaLocation="C:\Documents and Settings\Administrator\My Documents\w3c-its-wg\sencat1.xsd"/>
  <xs:element name="sc">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="scat:sentence_cat" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" targetNamespace="http://www.kolkatacdac.in/w3ci18nsc" elementFormDefault="qualified">
  <xs:element name="sentence_cat">
    <xs:complexType mixed="true">
      <xs:attribute name="name" type="xs:string" use="required"/>
      <xs:attribute name="type" type="xs:string"/>
      <xs:attribute name="meaning" type="xs:string"/>
    </xs:complexType>
  </xs:element>
</xs:schema>
A typical schema for Sentence_categories using elements is shown below.
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" targetNamespace="http://www.kolkatacdac.in/w3ci18ns" elementFormDefault="qualified">
  <xs:element name="cat">
    <xs:complexType mixed="true">
      <xs:attribute name="name" use="required">
        <xs:simpleType>
          <xs:restriction base="xs:string">
            <xs:enumeration value="formative"/> <xs:enumeration value="semantic"/> <xs:enumeration value="voice"/>
          </xs:restriction>
        </xs:simpleType>
      </xs:attribute>
      <xs:attribute name="type" use="required">
        <xs:simpleType>
          <xs:restriction base="xs:string">
            <xs:enumeration value="active"/> <xs:enumeration value="causative"/> <xs:enumeration value="complex"/>
            <xs:enumeration value="compound"/> <xs:enumeration value="cursed"/> <xs:enumeration value="demonstrative"/>
            <xs:enumeration value="exclamatory"/> <xs:enumeration value="imperative"/> <xs:enumeration value="impersonal"/>
            <xs:enumeration value="interrogative"/> <xs:enumeration value="passive"/> <xs:enumeration value="praying"/>
            <xs:enumeration value="proverbial"/> <xs:enumeration value="simple"/> <xs:enumeration value="suspicion"/>
            <xs:enumeration value="taunt"/>
          </xs:restriction>
        </xs:simpleType>
      </xs:attribute>
      <xs:attribute name="subtype">
        <xs:simpleType>
          <xs:restriction base="xs:string">
            <xs:enumeration value="adjective"/> <xs:enumeration value="adverb"/> <xs:enumeration value="affirmative"/>
            <xs:enumeration value="direct"/> <xs:enumeration value="indirect"/> <xs:enumeration value="ironical"/>
            <xs:enumeration value="negative"/> <xs:enumeration value="nounian"/> <xs:enumeration value="ridiculous"/>
          </xs:restriction>
        </xs:simpleType>
      </xs:attribute>
    </xs:complexType>
  </xs:element>
</xs:schema>

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:s="http://www.kolkatacdac.in/w3ci18ns" elementFormDefault="qualified">
  <xs:import namespace="http://www.kolkatacdac.in/w3ci18ns" schemaLocation="C:\Documents and Settings\Administrator\My Documents\auth-sencat-13091.xsd"/>
  <xs:element name="sentence_type">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="s:cat" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
==============
A schema for the Word Level POS Category using element "pos_cat" and three attributes namely, "name", "type" and "meaning" is shown below.
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<!-- Word Level (POS) Schema -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:pcat="http://www.kolkatacdac.in/w3ci18npc" elementFormDefault="qualified">
  <xs:import namespace="http://www.kolkatacdac.in/w3ci18npc" schemaLocation="C:\Documents and Settings\Administrator\My Documents\w3c-its-wg\poscat1.xsd"/>
  <xs:element name="pc">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="pcat:pos_cat" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" targetNamespace="http://www.kolkatacdac.in/w3ci18npc" elementFormDefault="qualified">
  <xs:element name="pos_cat">
    <xs:complexType mixed="true">
      <xs:attribute name="name" type="xs:string" use="required"/>
      <xs:attribute name="type" type="xs:string"/>
      <xs:attribute name="meaning" type="xs:string"/>
    </xs:complexType>
  </xs:element>
</xs:schema>
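The three attribute-based schemas (for content_domain, sentence_cat and pos_cat) declare the same attributes: a required "name" plus optional "type" and "meaning". As a rough sketch, and not a substitute for a real XML Schema validator, the following Python checks just these attribute rules using only the standard library. The helper names local_name and check_markup are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

MARKUP_ELEMENTS = {"content_domain", "sentence_cat", "pos_cat"}
ALLOWED_ATTRIBUTES = {"name", "type", "meaning"}


def local_name(tag):
    """Strip a '{namespace}' prefix, if any, from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def check_markup(xml_text):
    """Return a list of problems found in the three kinds of markup elements."""
    problems = []
    for element in ET.fromstring(xml_text).iter():
        tag = local_name(element.tag)
        if tag not in MARKUP_ELEMENTS:
            continue
        if "name" not in element.attrib:
            problems.append(f"<{tag}> lacks the required name attribute")
        for attribute in sorted(element.attrib):
            if attribute not in ALLOWED_ATTRIBUTES:
                problems.append(f"<{tag}> has unexpected attribute {attribute!r}")
    return problems


good = '<sentence_cat name="assertive" type="sov"><pos_cat name="noun">Paul</pos_cat></sentence_cat>'
bad = '<sentence_cat name="assertive"><pos_cat type="proper" sense="x">Paul</pos_cat></sentence_cat>'
print(check_markup(good))
print(check_markup(bad))
```

The check only covers the attribute declarations; content models, namespaces and enumerated values would still need a full schema validator.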
A typical schema for Parts-of-Speech (POS) Markups using elements is shown below.
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" targetNamespace="http://www.kolkatacdac.in/w3ci18npos" elementFormDefault="qualified">
  <xs:element name="cat">
    <xs:complexType mixed="true">
      <xs:attribute name="name" use="required">
        <xs:simpleType>
          <xs:restriction base="xs:string">
            <xs:enumeration value="adjective"/> <xs:enumeration value="adverb"/> <xs:enumeration value="conjunction"/>
            <xs:enumeration value="noun"/> <xs:enumeration value="post_position"/> <xs:enumeration value="preposition"/>
            <xs:enumeration value="pronoun"/> <xs:enumeration value="punctuation"/> <xs:enumeration value="verb"/>
          </xs:restriction>
        </xs:simpleType>
      </xs:attribute>
      <xs:attribute name="type">
        <xs:simpleType>
          <xs:restriction base="xs:string">
            <xs:enumeration value="abstract_concrete"/> <xs:enumeration value="abstract"/> <xs:enumeration value="addressing"/>
            <xs:enumeration value="adjective_adjective"/> <xs:enumeration value="adjective_adverb"/> <xs:enumeration value="adversative_coordinating"/>
            <xs:enumeration value="cardinals"/> <xs:enumeration value="case"/> <xs:enumeration value="causative"/>
            <xs:enumeration value="collective"/> <xs:enumeration value="comma"/> <xs:enumeration value="common"/>
            <xs:enumeration value="comparative"/> <xs:enumeration value="compound"/> <xs:enumeration value="conclusive"/>
            <xs:enumeration value="coordinating"/> <xs:enumeration value="correlative"/> <xs:enumeration value="demonstrative"/>
            <xs:enumeration value="denoting_others"/> <xs:enumeration value="disjunctive"/> <xs:enumeration value="end_inflecting"/>
            <xs:enumeration value="eternal_joined"/> <xs:enumeration value="exclusion"/> <xs:enumeration value="following_noun_of_title"/>
            <xs:enumeration value="fractional_number"/> <xs:enumeration value="general"/> <xs:enumeration value="group"/>
            <xs:enumeration value="hyphenated_numbers"/> <xs:enumeration value="imperative"/> <xs:enumeration value="inclusive"/>
            <xs:enumeration value="indeclinable"/> <xs:enumeration value="indefinite"/> <xs:enumeration value="interjectory"/>
            <xs:enumeration value="interrogative"/> <xs:enumeration value="intransitive"/> <xs:enumeration value="joining"/>
            <xs:enumeration value="left_parenthesis"/> <xs:enumeration value="material"/> <xs:enumeration value="mid_sentence"/>
            <xs:enumeration value="negative"/> <xs:enumeration value="non_finite"/> <xs:enumeration value="noun"/>
            <xs:enumeration value="noun_location"/> <xs:enumeration value="numbers"/> <xs:enumeration value="numeral"/>
            <xs:enumeration value="onomatopoeic"/> <xs:enumeration value="ordinal"/> <xs:enumeration value="ordinal_number"/>
            <xs:enumeration value="other"/> <xs:enumeration value="participle/principal"/> <xs:enumeration value="personal"/>
            <xs:enumeration value="place"/> <xs:enumeration value="preceeding_noun_of_title"/> <xs:enumeration value="preceeding_noun_title"/>
            <xs:enumeration value="primary/root"/> <xs:enumeration value="pronoun's"/> <xs:enumeration value="pronounian"/>
            <xs:enumeration value="proper"/> <xs:enumeration value="quantitative"/> <xs:enumeration value="quote"/>
            <xs:enumeration value="reflexive"/> <xs:enumeration value="relative"/> <xs:enumeration value="repeated_verb_ending"/>
            <xs:enumeration value="repetitive"/> <xs:enumeration value="right_parenthesis"/> <xs:enumeration value="sentence_final"/>
            <xs:enumeration value="subordinating"/> <xs:enumeration value="superlative"/> <xs:enumeration value="suspicion"/>
            <xs:enumeration value="tense"/> <xs:enumeration value="time"/> <xs:enumeration value="transitive"/>
            <xs:enumeration value="unit_of_measurement"/> <xs:enumeration value="verb_ending"/> <xs:enumeration value="verbal"/>
          </xs:restriction>
        </xs:simpleType>
      </xs:attribute>
    </xs:complexType>
  </xs:element>
</xs:schema>

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:p="http://www.kolkatacdac.in/w3ci18npos" elementFormDefault="qualified">
  <xs:import namespace="http://www.kolkatacdac.in/w3ci18npos" schemaLocation="C:\Documents and Settings\Administrator\My Documents\ITS-XML-Test\auth-postag-14091.xsd"/>
  <xs:element name="pos_tag">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="p:cat" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
XML Tagged Bengali (Bangla) Text is shown below.
<?xml version="1.0" encoding="UTF-16" standalone="yes"?>
<!-- An example of XML Tagged Bangla (or Bengali) Text -->
<text xml:lang="ben"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:noNamespaceSchemaLocation="E:\goutam-w3c-its\3lay.xsd"
      xsi:schemaLocation="http://www.kolkatacdac.in/w3ci18ncd E:\goutam-w3c-its\cd1.xsd http://www.kolkatacdac.in/w3ci18nsc E:\goutam-w3c-its\sencat1.xsd http://www.kolkatacdac.in/w3ci18npc E:\goutam-w3c-its\poscat1.xsd"
      xmlns:cont="http://www.kolkatacdac.in/w3ci18ncd"
      xmlns:scat="http://www.kolkatacdac.in/w3ci18nsc"
      xmlns:pcat="http://www.kolkatacdac.in/w3ci18npc">
  <cont:content_domain name="agriculture">
    <scat:sentence_cat name="demonstrative">
      <pcat:pos_cat name="noun" meaning="farmer">চাষি</pcat:pos_cat>
      <pcat:pos_cat name="verb" meaning="said">বললেন</pcat:pos_cat>
      <pcat:pos_cat name="pronoun" meaning="I">আমি</pcat:pos_cat>
      <pcat:pos_cat name="verb" type="missing_auxiliary" meaning="am"></pcat:pos_cat>
      <pcat:pos_cat name="adjective" meaning="a">একজন</pcat:pos_cat>
      <pcat:pos_cat name="adjective" meaning="ordinary">সামান্য</pcat:pos_cat>
      চাষি
      <pcat:pos_cat name="punctuation" type="sentence_final" meaning=".">।</pcat:pos_cat>
    </scat:sentence_cat>
    <scat:sentence_cat name="demonstrative">
      <pcat:pos_cat name="adverb" meaning="only">মাত্র</pcat:pos_cat>
      <pcat:pos_cat name="adjective" meaning="five">পাঁচ</pcat:pos_cat>
      <pcat:pos_cat name="noun" type="unit_of_measurement">বিঘে</pcat:pos_cat>
      <pcat:pos_cat name="noun" meaning="land">জমি</pcat:pos_cat>
      <pcat:pos_cat name="adverb" meaning="by my own hand">নিজের হাতে</pcat:pos_cat>
      <pcat:pos_cat name="verb" meaning="cultivate">চাষ করি</pcat:pos_cat>
      ।
    </scat:sentence_cat>
  </cont:content_domain>
</text>
XML Tagged Assamese Text is shown below.
<?xml version="1.0" encoding="UTF-16" standalone="yes"?>
<!-- XML Tagged Assamese (or Akhamiya) Text -->
<text xml:lang="asm"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:noNamespaceSchemaLocation="E:\goutam-w3c-its\3lay.xsd"
      xsi:schemaLocation="http://www.kolkatacdac.in/w3ci18ncd E:\goutam-w3c-its\cd1.xsd http://www.kolkatacdac.in/w3ci18nsc E:\goutam-w3c-its\sencat1.xsd http://www.kolkatacdac.in/w3ci18npc E:\goutam-w3c-its\poscat1.xsd"
      xmlns:cont="http://www.kolkatacdac.in/w3ci18ncd"
      xmlns:scat="http://www.kolkatacdac.in/w3ci18nsc"
      xmlns:pcat="http://www.kolkatacdac.in/w3ci18npc">
  <cont:content_domain name="family">
    <scat:sentence_cat name="demonstrative">
      <pcat:pos_cat name="pronoun" meaning="my">মোৰ</pcat:pos_cat>
      <pcat:pos_cat name="noun" meaning="family's">পৰিবাৰৰ</pcat:pos_cat>
      <pcat:pos_cat name="noun" type="common" meaning="boy">ল'ৰা</pcat:pos_cat>
      <pcat:pos_cat name="pronoun" type="demonstrative">এটা</pcat:pos_cat>
      <pcat:pos_cat name="verb" type="finite">হৈছে</pcat:pos_cat>
      <pcat:pos_cat name="punctuation" type="sentence_final" meaning=".">।</pcat:pos_cat>
    </scat:sentence_cat>
    <scat:sentence_cat name="demonstrative">
      <pcat:pos_cat name="pronoun" meaning="he">তেওঁ</pcat:pos_cat>
      <pcat:pos_cat name="noun" type="common" meaning="mother's">মাকৰ</pcat:pos_cat>
      <pcat:pos_cat name="noun" meaning="at room">ঘৰত</pcat:pos_cat>
      <pcat:pos_cat name="verb" meaning="is">আছেগৈ</pcat:pos_cat>
      ।
    </scat:sentence_cat>
    <scat:sentence_cat name="demonstrative">
      <pcat:pos_cat name="pronoun" type="demonstrative" meaning="this">এইটো</pcat:pos_cat>
      <pcat:pos_cat name="pronoun">মোৰ</pcat:pos_cat>
      <pcat:pos_cat name="adjective" meaning="first">প্রথম</pcat:pos_cat>
      <pcat:pos_cat name="noun" type="common" meaning="son">সন্তান</pcat:pos_cat>
      ৷
    </scat:sentence_cat>
  </cont:content_domain>
</text>
XML Tagged Hindi Text is shown below.
<?xml version="1.0" encoding="UTF-16" standalone="yes"?>
<!-- XML Tagged Hindi Text -->
<text xml:lang="hin"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:noNamespaceSchemaLocation="E:\goutam-w3c-its\3lay.xsd"
      xsi:schemaLocation="http://www.kolkatacdac.in/w3ci18ncd E:\goutam-w3c-its\cd1.xsd http://www.kolkatacdac.in/w3ci18nsc E:\goutam-w3c-its\sencat1.xsd http://www.kolkatacdac.in/w3ci18npc E:\goutam-w3c-its\poscat1.xsd"
      xmlns:cont="http://www.kolkatacdac.in/w3ci18ncd"
      xmlns:scat="http://www.kolkatacdac.in/w3ci18nsc"
      xmlns:pcat="http://www.kolkatacdac.in/w3ci18npc">
  <cont:content_domain name="it">
    <scat:sentence_cat name="interrogative">
      <pcat:pos_cat name="noun" meaning="unicode">यूनिकोड</pcat:pos_cat>
      <pcat:pos_cat name="pronoun" meaning="what">क्या</pcat:pos_cat>
      <pcat:pos_cat name="verb" meaning="is">है</pcat:pos_cat>
      ?
    </scat:sentence_cat>
    <scat:sentence_cat name="demonstrative">
      <pcat:pos_cat name="noun" meaning="unicode">यूनिकोड</pcat:pos_cat>
      <pcat:pos_cat name="adjective" meaning="each">प्रत्येक</pcat:pos_cat>
      <pcat:pos_cat name="noun" meaning="letter">अक्षर</pcat:pos_cat>
      <pcat:pos_cat name="post_position" meaning="for">के लिए</pcat:pos_cat>
      <pcat:pos_cat name="adjective" meaning="a special">एक विशेष</pcat:pos_cat>
      <pcat:pos_cat name="noun" meaning="number">नम्बर</pcat:pos_cat>
      <pcat:pos_cat name="verb" type="compound" meaning="give">प्रदान करता है</pcat:pos_cat>
      <pcat:pos_cat name="punctuation" type="sentence_final" meaning=".">।</pcat:pos_cat>
    </scat:sentence_cat>
  </cont:content_domain>
</text>
XML Tagged Oriya Text is shown below.
<?xml version="1.0" encoding="UTF-16" standalone="yes"?>
<!-- XML Tagged Oriya Text -->
<text xml:lang="ori"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:noNamespaceSchemaLocation="E:\goutam-w3c-its\3lay.xsd"
      xsi:schemaLocation="http://www.kolkatacdac.in/w3ci18ncd E:\goutam-w3c-its\cd1.xsd http://www.kolkatacdac.in/w3ci18nsc E:\goutam-w3c-its\sencat1.xsd http://www.kolkatacdac.in/w3ci18npc E:\goutam-w3c-its\poscat1.xsd"
      xmlns:cont="http://www.kolkatacdac.in/w3ci18ncd"
      xmlns:scat="http://www.kolkatacdac.in/w3ci18nsc"
      xmlns:pcat="http://www.kolkatacdac.in/w3ci18npc">
  <!-- font utkal.ttf -->
  <cont:content_domain name="education">
    <scat:sentence_cat name="demonstrative">
      <pcat:pos_cat name="noun" type="proper" meaning="India">ଭାରତ</pcat:pos_cat>
      <pcat:pos_cat name="verb" type="missing_auxiliary" meaning="is"></pcat:pos_cat>
      ଏକ
      <pcat:pos_cat name="adjective" meaning="great">ମହାନ</pcat:pos_cat>
      <pcat:pos_cat name="noun" meaning="country">ଦେଶ</pcat:pos_cat>
      <pcat:pos_cat name="punctuation" type="sentence_final" meaning=".">।</pcat:pos_cat>
    </scat:sentence_cat>
  </cont:content_domain>
</text>
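The tagged texts above mix marked-up words with untagged text (for example the Oriya word ଏକ), and they use namespace prefixes. As a minimal sketch of how a consumer could read them, the following Python builds a word-by-word gloss per sentence with the standard xml.etree.ElementTree module, using the meaning attribute where present and the source word otherwise. The function name sentence_glosses is hypothetical, and SAMPLE is a cut-down copy of the Oriya example without the schema-location attributes.

```python
import xml.etree.ElementTree as ET

# ElementTree spells namespaced tags as "{namespace-uri}local-name".
SCAT = "{http://www.kolkatacdac.in/w3ci18nsc}sentence_cat"
PCAT = "{http://www.kolkatacdac.in/w3ci18npc}pos_cat"

SAMPLE = """<text xml:lang="ori"
    xmlns:cont="http://www.kolkatacdac.in/w3ci18ncd"
    xmlns:scat="http://www.kolkatacdac.in/w3ci18nsc"
    xmlns:pcat="http://www.kolkatacdac.in/w3ci18npc">
  <cont:content_domain name="education">
    <scat:sentence_cat name="demonstrative">
      <pcat:pos_cat name="noun" type="proper" meaning="India">ଭାରତ</pcat:pos_cat>
      <pcat:pos_cat name="verb" type="missing_auxiliary" meaning="is"></pcat:pos_cat>
      ଏକ
      <pcat:pos_cat name="adjective" meaning="great">ମହାନ</pcat:pos_cat>
      <pcat:pos_cat name="noun" meaning="country">ଦେଶ</pcat:pos_cat>
      <pcat:pos_cat name="punctuation" type="sentence_final" meaning=".">।</pcat:pos_cat>
    </scat:sentence_cat>
  </cont:content_domain>
</text>"""


def sentence_glosses(xml_text):
    """Return, per sentence, the list of glosses in document order."""
    glosses = []
    for sentence in ET.fromstring(xml_text).iter(SCAT):
        parts = []
        # Untagged text before the first child element.
        if sentence.text and sentence.text.strip():
            parts.append(sentence.text.strip())
        for child in sentence:
            if child.tag == PCAT:
                source = (child.text or "").strip()
                gloss = child.get("meaning") or source
                if gloss:
                    parts.append(gloss)
            # Untagged text that follows this child element.
            if child.tail and child.tail.strip():
                parts.append(child.tail.strip())
        glosses.append(parts)
    return glosses


print(sentence_glosses(SAMPLE))
```

The untagged word is passed through unchanged, which makes visible where an author may still want to add word-level markup.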
========================
References:
1. (Book) R.P. Sinha, Current English Grammar and Usage, Oxford University Press, 2003.
2. (Book) P.C. Wren, H. Martin and N.D.V. Prasada Rao, English Grammar, S.Chand & Company, New Delhi.
3. (Book) Bamandeb Chakraborty, Uchchotoro Bangla Byakaran.
4. http://www.w3.org/TR/its
5. Goutam Kumar Saha, "A Novel 3-Tier XML Schematic Approach for Web Page Translation," ACM Ubiquity, Vol 6(43), November 2005, ACM Press, USA.
6. Goutam Kumar Saha, "The EB-ANUBAD Translator: A Hybrid Scheme," International Journal of Zhejiang University Science, Vol 6A(10), October 2005, ROC. http://www.zju.edu.cn/jzus/2005/A0510/A051007.pdf
7. Goutam Kumar Saha, "The E2B Machine Translation: A New Approach to HLT," ACM Ubiquity, Vol 6(32), August 2005, ACM Press, USA.
8. Goutam Kumar Saha, "English to Bangla Translator - The BANGANUBAD," International Journal of Computer Processing of Oriental Languages, Vol. 18(4), pp. 281-290, December 2005, WSPC, USA.
9. Goutam Kumar Saha, "Parsing Bengali Text - an Intelligent Approach," ACM Ubiquity, Vol. 7 Issue 13, April 2006, ACM Press, USA. URL: http://www.acm.org/ubiquity/views/v7i13_parsing.html
10. http://www.w3.org/TR/itsreq/
11. Felix Sasaki, "From Characters to Web Services .. to Internationalization is Everywhere," ACM Ubiquity, Vol. 6(47), December 2005.
12. (Book) Yves Savourel, XML Internationalization and Localization, SAMS Publishing, Indianapolis, Indiana USA, 2001.