
The Generalizability of the TExT Model to Indic Languages

Posted by Freddy Hiebert on 6 October 2010

Shailaja Menon, Jones International University
GUEST COLUMN for TextProject, Inc.

Theoretical models of reading acquisition are based largely on empirical studies of alphabetic writing systems, most notably English. The implicit assumption in the past was that findings from the acquisition of English would generalize to other languages (Vaid & Padakannaya, 2004). This assumption has been tested over the past two decades by a considerable body of cross-linguistic literature that has compared reading processes in English to those in other alphabetic and non-alphabetic systems, such as Spanish, German, Italian, Portuguese, Hebrew and Chinese (Leong & Tamaoka, 1998; Wimmer & Goswami, 1994; Zoccolotti et al., 1999). However, considerably less is known about reading processes in syllabic and semi-syllabic writing systems, such as those used by a sizeable proportion of the world’s population. In this column, we consider the generalizability of features of the TExT model to alphasyllabic languages, such as those in use in India.

Perfetti and Liu (2005) distinguish among three levels of analysis of a written language that are applicable to this analysis. The broadest level is that of the writing system, which reflects the principles of the fundamental writing-language relationships, for example, an alphabetic versus a syllabic system. The next level is that of orthography, which, by contrast, expresses differences within a writing system. Thus, even though English and German both use alphabetic writing systems, they have different orthographies. Orthographies may be more or less shallow or deep, even within a given writing system (for example, German has a more transparent orthography than does English). The third level is that of the script, a term that is sometimes used to refer to one or the other of the broader levels; however, in Perfetti and Liu’s classification, a script refers only to the graphemic aspects of the symbols used to represent the language. Each of these levels can contribute independently to the reading process by influencing different factors related to it.

There are 29 different languages that are each spoken by more than a million speakers in India, of which 22 are recognized officially by the government (Census of India, 2001). These languages belong to at least four different language families, of which the two largest groups are the Indo-Aryan languages spoken in the North of India, and the Dravidian languages of the South. These two language families are linguistically distinct; however, the writing system used is common and is descended from the Brahmi writing system.

The Indic scripts, as they are sometimes called, are alphasyllabaries or semisyllabaries that combine aspects of the syllabic and alphabetic systems. As in syllabic writing systems, the basic symbol unit, the akshara, maps onto phonology at the level of the syllable. At the same time, the akshara also has phonemic vowel markers (diacritics) that can transform the schwa vowel sound inherent in the consonant symbols, rendering it somewhat akin to alphabetic systems. Korean Hangul is another example of an alphasyllabary. There are several crucial features of the Indic alphasyllabaries that distinguish them from English. First, there is no difference between letter name and letter sound, such that akshara knowledge requires the mastery of a single akshara name-sound (Nag, 2007). Second, because there is a one-to-one correspondence between akshara and sound (at the level of the syllable), most Indian languages have symbols for approximately 35 consonants and between 12 and 16 primary vowels. Each vowel also has a secondary diacritic form that combines with consonants to produce unique sounds (e.g., /gu/, /sai/, /ko/). The orthography is very regular, highly transparent, and rule-bound. Third, the visuo-spatial arrangement of syllables in the akshara script is very complex. The secondary vowel diacritics can be placed above, below, to the left, or to the right of the base consonant, and do not always follow the left-to-right linear sequencing of English.
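The way a base consonant combines with secondary vowel diacritics can be illustrated with a minimal Python sketch. It uses the Devanagari letter GA and four vowel signs from the Unicode standard; the choice of Devanagari, the romanized labels, and the names `BASE`, `VOWEL_SIGNS`, and `compose` are assumptions made for this illustration and do not come from the studies cited here.

```python
import unicodedata

# Base consonant: DEVANAGARI LETTER GA, which carries an inherent schwa.
BASE = "\u0917"

# Dependent vowel signs (diacritics). Labels are illustrative romanizations.
# Comments note where each sign is usually drawn relative to the consonant.
VOWEL_SIGNS = {
    "gaa": "\u093E",  # VOWEL SIGN AA, drawn to the right
    "gi": "\u093F",   # VOWEL SIGN I, drawn to the left
    "gu": "\u0941",   # VOWEL SIGN U, drawn below
    "gai": "\u0948",  # VOWEL SIGN AI, drawn above
}


def compose(base: str, sign: str) -> str:
    """Build a CV akshara: the consonant is stored first, then the sign."""
    return base + sign


AKSHARAS = {label: compose(BASE, sign) for label, sign in VOWEL_SIGNS.items()}

for label, akshara in AKSHARAS.items():
    # Print the stored (logical) order of code points for each akshara.
    print(label, akshara, [unicodedata.name(ch) for ch in akshara])
```

In the stored sequence the consonant always comes first, even for the sign that is drawn to its left, which is one concrete sense in which the visual arrangement departs from a strict left-to-right ordering.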

The script might, therefore, lend itself to visual processing to a greater degree than the largely phonological processing of English, because syllabic boundaries are often visually apparent. In some of the Indo-Aryan languages of the North that use the Devanagari script, a horizontal line is placed on top of each word, so that even word boundaries are visually apparent. The final feature of Indic languages that potentially influences reading acquisition is at the morphemic level of the spoken language. Several Indian languages, especially the Dravidian languages of the South, are highly inflected and agglutinative; that is, a single word may be made up of several smaller morphemes, with each morpheme carrying its own unit of meaning. Thus, Indic languages differ from English at all three levels identified by Perfetti and Liu (2005): at the level of writing system (alphabetic versus alphasyllabic); at the level of orthography (deep and irregular versus shallow and transparent); and at the level of script (phonological versus visuo-phonological). Further, morphemic aspects of the spoken language may also influence the manner in which words are represented and acquired in the written scripts.

The TExT model identifies two dimensions of significance to the early acquisition of reading: linguistic content and cognitive load (Hiebert, 1999; Hiebert & Mesmer, 2005; Menon & Hiebert, 2005). How well do these dimensions generalize to reading acquisition in the Indic languages? There are barely a handful of studies on reading processes in these languages (see Vaid & Padakannaya, 2004), even fewer on reading acquisition (Nag, 2007), and none on features of text that could support reading acquisition. Given the paucity of empirical studies related to these topics, we will use two strategies to fill in the gaps in our knowledge base. First, we will borrow evidence from studies of the Korean language where available (since Korean Hangul is also an alphasyllabary). Second, we will make theoretical speculations where empirical evidence is not available.

Linguistic Content

The first dimension of the TExT model – linguistic content – identifies critical word-level content that texts can model to support beginning readers (Hiebert, 1999; Menon & Hiebert, 2005). Two features of significance are discussed here – rimes and high frequency words.

Rimes. Studies of reading in English have repeatedly established the utility of the rime unit (vowel plus coda of the syllable) during reading acquisition (Bowey, Vaughan & Hansen, 1998; Goswami, 1993; Goswami, 1995; Juel & Solso, 1981; Treiman, 1992). However, there is some evidence that in more transparent orthographies like Dutch (an alphabetic language), rime units are less useful to novice readers, raising the question of whether reading in a more predictable orthography might be less reliant on rime units than reading in a less predictable one, like English (Perfetti & Liu, 2005). Moving to alphasyllabaries, the evidence is even more interesting. Summarizing a line of empirical work on Korean Hangul, Perfetti and Liu (2005) report that not only did Korean children not display a rime preference while reading; they actually displayed a preference for the syllable body (onset + vowel) in tasks that involved reading. Even when reading was not involved, and Korean children were orally presented with words, they judged both words and nonwords with shared syllable bodies (e.g., koon and koop) as more similar than stimuli with shared rimes (koon and poon). While there is no direct evidence about the utility of rimes in Indic languages, we can hypothesize that similar findings would hold for them. There are a couple of possible reasons for this. First, the orthography is more transparent than that of English, so that readers are less reliant on the rime unit for recognizing the vowel sounds within syllables. Second, the onset has a primary place in the akshara-based writing system, with the vowels represented as diacritics that are visuo-spatially organized around the onset. For example, the consonant /g/ (which has an inherent schwa sound in it) would be transformed by the accompanying vowel diacritics into /ga/, /gi/, /gu/, /gai/, and so on. This would give salience to the syllable body in the reader/speaker/listener’s mind, especially if the speaker/listener has already received some instruction in the akshara-based system.
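The distinction between a shared syllable body and a shared rime can be made concrete with a small Python sketch applied to the koon, koop, and poon stimuli mentioned above. The romanized spellings, the five-letter vowel set, and the helper names `split_syllable`, `body`, and `rime` are assumptions made for illustration only; the sketch handles a single syllable with one vowel nucleus and is not a model of the tasks used in the Korean studies.

```python
VOWELS = set("aeiou")  # simplifying assumption for romanized syllables


def split_syllable(syllable: str) -> tuple[str, str, str]:
    """Split one romanized syllable into (onset, vowel nucleus, coda)."""
    i = 0
    while i < len(syllable) and syllable[i] not in VOWELS:
        i += 1  # consume leading consonants: the onset
    j = i
    while j < len(syllable) and syllable[j] in VOWELS:
        j += 1  # consume the vowel run: the nucleus
    return syllable[:i], syllable[i:j], syllable[j:]


def body(syllable: str) -> str:
    """Syllable body: onset plus vowel."""
    onset, vowel, _ = split_syllable(syllable)
    return onset + vowel


def rime(syllable: str) -> str:
    """Rime: vowel plus coda."""
    _, vowel, coda = split_syllable(syllable)
    return vowel + coda


for a, b in [("koon", "koop"), ("koon", "poon")]:
    print(a, b, "same body:", body(a) == body(b), "same rime:", rime(a) == rime(b))
```

Under these assumptions, koon and koop share a body but not a rime, while koon and poon share a rime but not a body.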

High frequency words. Rapid recognition of familiar, high frequency words is viewed as critical to reading acquisition in the English language (Ehri & Wilce, 1983; Juel & Roper/Schneider, 1985; McCormick & Samuels, 1979; Perfetti, Finger & Hogaboam, 1978). Given the emergent stage of scholarship on Indic languages, we were able to locate only a single empirical study that directly addressed the utility of word frequency for the acquisition of these languages. Karanth, Mathew and Kurien (2004) examined word frequency effects in proficient readers, aged 15 to 45, of Kannada, a Dravidian language. These researchers did not see word frequency effects for orthographically simple words; however, word frequency did matter for orthographically complex words of the CCVCC type. Would this result be generalizable to beginning readers? In the absence of reliable evidence, we hypothesize that it would: acquiring logographic representations of “whole words” might not be as critical or as efficient a way to acquire new words in a transparent and regular orthography as it is in English. Nevertheless, it would seem commonsensical to assume that orthographic representations of highly frequent words stored in the memory of proficient readers would be more stable than those of low frequency words, and hence would yield shorter reaction times. However, this study failed to find any such advantage for the more frequent words when the orthography of the words was simple. We speculate that the highly inflected and agglutinative nature of Dravidian languages might play a part in this. Kannada (the language used in this study) is not as inflected as some of the other Dravidian languages, but it is more so than the Indo-Aryan languages and English. In such languages, it might be challenging to store stable orthographic representations of whole words, given the number and variety of forms that each “root” word can acquire, depending on the context and the number of morphemic units that get attached to it. The results might be different for less inflected languages.

Cognitive Load

The second dimension of the TExT model – cognitive load – attends to text features that determine its difficulty level for the reader (Menon & Hiebert, 2005). The cognitive demands of the akshara writing system are insufficiently understood. A promising line of research that is currently underway examines a variety of phonological, visual, oral, and spelling skills in 8- to 12-year-old children learning to read and write in the Kannada language (Nag, 2007; Nag, Treiman, & Snowling, 2010). Preliminary findings from this line of work reveal that the basic challenge in acquiring literacy in Kannada lies in acquiring the extensive akshara set, which has 474-476 symbols that combine consonants with specific vowel sounds. Akshara acquisition entails learning the rules of ligaturing the vowels to the consonants, and sometimes the consonants to each other, in complex visuo-spatial arrangements. The acquisition of the writing system continues well into Grades 4 and 5, and moves from learning the CV akshara set to the CCV akshara set. The key points of difficulty for young/poor readers are: (1) acquiring a firm knowledge of the extensive set of aksharas; (2) remembering the appropriate diacritic marks for different vowel sounds; (3) assembling all phonemes in a consonant cluster into an akshara based on ligaturing rules, with CV clusters being easier to acquire than CCV clusters; and (4) reading longer words. It is likely that these difficulties are not specific to the Kannada language alone, but might generalize across Indic languages. Evidence obtained from the reading of dyslexic children in Hindi (an Indo-Aryan language) echoes some of these patterns (Vaid & Gupta, 2002; Gupta, 2004), with errors related to word length, errors related to ligaturing rules of CC and CCV clusters, and vowel substitutions and deletions occurring more frequently among dyslexic readers than among normal readers. The majority of these errors were graphemic rather than phonological in nature, and involved vowels more often than consonants.

Implications for a Language Specific TExT Model

The question addressed here is whether the features of the TExT model developed on readers of English are generalizable to the Indic languages. From this brief review of an emergent and patchy literature base, it would appear that both linguistic content and cognitive load have potential in elaborating critical scaffolds for reading acquisition in Indic languages. However, the specific features included in each of these dimensions might vary across these languages.

Linguistic Content in Indic Languages. In writing systems that use the akshara, the syllable body, rather than the rime, could be the crucial unit for repetition and instruction. Automaticity with orthographic representations is a robust predictor of reading ability across languages (Georgiou, Parrila, & Liao, 2008). Perhaps the primary unit for acquiring automaticity in the Indic languages is not the whole (high frequency) word, or even high frequency rimes, but stable, highly frequent CV and CCV akshara configurations. It would appear that young and poor readers of these languages need systematic and repeated exposure to the extensive set of aksharas, the diacritic marks, and the ligaturing rules before they can acquire them to the point of automaticity.

Cognitive Load in Indic Languages. Word length appears to be consistently related to difficulty with decoding text in both the languages examined here – Kannada and Hindi. In addition, the available literature seems to suggest that word decodability shows a progression from CV to CCV words in these languages. It is also highly probable that word decodability is influenced by visually more versus less complex configurations.
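As a rough illustration of how such text features might be counted, the Python sketch below segments Devanagari words into aksharas and reports word length in aksharas, the share of words made up only of single-consonant aksharas, and the number of distinct akshara configurations. The regular expression is a deliberately simplified assumption (it ignores the nukta and other special cases), and the function names and the two sample words are hypothetical; this is not a validated measure from the literature reviewed here.

```python
import re

# Simplified Devanagari character classes (an assumption, not a full grammar).
CONS = "[\u0915-\u0939]"         # consonant letters
VIRAMA = "\u094D"                # joins consonants into a cluster
VOWEL_SIGN = "[\u093E-\u094C]"   # dependent vowel diacritics
INDEP_VOWEL = "[\u0904-\u0914]"  # independent vowel letters
MODIFIER = "[\u0901-\u0903]"     # candrabindu, anusvara, visarga

AKSHARA = re.compile(
    f"(?:{CONS}(?:{VIRAMA}{CONS})*{VOWEL_SIGN}?{VIRAMA}?|{INDEP_VOWEL}){MODIFIER}?"
)
CONS_RE = re.compile(CONS)


def aksharas(word: str) -> list[str]:
    """Split a word into akshara units under the simplified pattern."""
    return AKSHARA.findall(word)


def is_cluster(akshara: str) -> bool:
    """True when the akshara contains two or more consonants (CCV type)."""
    return len(CONS_RE.findall(akshara)) >= 2


def text_features(text: str) -> dict[str, float]:
    """Summarize word length and akshara complexity for a short text."""
    words = [aksharas(w) for w in text.split()]
    words = [w for w in words if w]  # drop tokens with no aksharas
    all_units = [a for w in words for a in w]
    simple = [w for w in words if not any(is_cluster(a) for a in w)]
    return {
        "mean_aksharas_per_word": len(all_units) / len(words),
        "share_simple_words": len(simple) / len(words),
        "unique_aksharas": len(set(all_units)),
    }


# Hypothetical sample: "kamal" (three CV aksharas) and "skool" (CCV + CV).
SAMPLE = "\u0915\u092E\u0932 \u0938\u094D\u0915\u0942\u0932"
print(text_features(SAMPLE))
```

Counts of this kind could be compared across sets of beginning texts, although whether they track actual difficulty for young readers is an empirical question.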

In conclusion, we concur with Perfetti (2003) that all languages share certain universal rules and constraints for their acquisition, even though their specific manifestations might vary. In Indic languages, texts with shorter words, more CV words, and fewer unique akshara configurations and vowel diacritic marks may make the task of reading less challenging for young and poor readers. Further, sets of texts that model certain linguistic content consistently (such as highly frequent akshara configurations and ligaturing rules) may support the acquisition of reading in these languages, as in English. In the absence of a critical body of empirical evidence, the suggestions presented here should be interpreted cautiously as theoretical speculations that warrant empirical examination. Differences among the Indic languages also deserve further attention and study.


Bowey, J. A., Vaughan, L., & Hansen, J. (1998). Beginning readers’ use of orthographic analogies in word reading. Journal of Experimental Child Psychology, 68(2), 108-133.

Census of India (2001). Census Data Summary. Retrieved from http://censusindia.gov.in/2011-common/CensusDataSummary.html

Ehri, L. C., & Wilce, L. S. (1983). Development of word identification speed in skilled and less skilled beginning readers. Journal of Educational Psychology, 75, 3-18.

Georgiou, G., Parrila, R., & Liao, C. H. (2008). Rapid naming speed and reading across languages that vary in orthographic consistency. Reading and Writing: An Interdisciplinary Journal, 21, 885-903.

Goswami, U. (1993). Toward an interactive analogy model of reading development: decoding vowel graphemes in beginning reading. Journal of Experimental Child Psychology, 56, 443-475.

Goswami, U. (1995). Phonological development and reading by analogy: What is analogy, and what is not? Journal of Research in Reading: Special Issue: The contribution of psychological research, 18(2), 139-145.

Gupta, A. (2004). Reading difficulties of Hindi-speaking children with developmental dyslexia. Reading and Writing: An Interdisciplinary Journal, 17, 79–99.

Hiebert, E.H. (1999). Text matters in learning to read. The Reading Teacher, 52, 552-568. [Augmented with foreword in N.D. Padak et al. (Eds.), Distinguished educators on reading (pp. 453-472). Newark, DE: IRA.]

Hiebert, E.H., & Mesmer, H. (2005). Perspectives on the difficulty of beginning reading texts. In S. Neuman & D. Dickinson (Eds.), Handbook of Research on Early Literacy (Vol. 2, pp. 935-967). NY: Guilford.

Juel, C., & Roper/Schneider, D. (1985). The Influence of Basal Readers on First Grade Reading. Reading Research Quarterly, 20(2), 134-152.

Juel, C. & Solso, R. L. (1981). The role of orthographic redundancy, versatility and spelling-sound correspondences in word identification. In M. L. Kamil (Ed.) Directions in reading: Research and Instruction (30th Yearbook of the National Reading Conference, pp. 74-92). Rochester, NY: National Reading Conference.

Karanth, P., Mathew, A., & Kurien, P. (2004). Orthography and reading speed: Data from native readers of Kannada. Reading and Writing: An Interdisciplinary Journal, 17, 101–120.

Leong, C-K. & Tamaoka, K. (1998). Cognitive processing of Chinese characters, words, sentences and Japanese kanji and kana: An introduction. Reading and Writing, 10(3–5), 155–164.

McCormick, C. & Samuels, S. J. (1979). Word recognition by second graders: The unit of perception and interrelationships among accuracy, latency, and comprehension. Journal of Reading Behavior, 11, 107-118.

Menon, S., & Hiebert, E. H. (2005). A comparison of first-graders’ reading acquisition with little books or literature anthologies. Reading Research Quarterly, 40(1), 12–38.

Nag, S. (2007). Early reading in Kannada: the pace of acquisition of orthographic knowledge and phonemic awareness. Journal of Research in Reading, 30(1), 7-22.

Nag, S., Treiman, R., & Snowling, M. (2010). Learning to spell in an alphasyllabary: The case of Kannada. Writing Systems Research, 2, 41–52.

Perfetti, C. (2003). The universal grammar of reading. Scientific Studies of Reading, 7(1), 3-24.

Perfetti, C. A., Finger, E., & Hogaboam, T. (1978). Sources of vocalization latency differences between skilled and less skilled young readers. Journal of Educational Psychology, 70(5), 730-739.

Perfetti, C. A., & Liu, Y. (2005). Orthography to phonology and meaning: Comparisons across and within writing systems. Reading and Writing, 18, 193–210.

Treiman, R. (1992). The role of intrasyllabic units in learning to read and spell. In P. B. Gough, L. C. Ehri, & R. Treiman (eds.), Reading acquisition (pp. 65-106). Hillsdale, NJ: Erlbaum.

Vaid, J., & Padakannaya, P. (2004). Introduction. Reading and Writing: An Interdisciplinary Journal, 17(1-2), 1-6.

Vaid, J. & Gupta, A. (2002). Exploring word recognition in semi-alphabetic script: The case of Devanagari. Brain and Language, 81, 679–690.

Wimmer, H., & Goswami, U. (1994). The influence of orthographic consistency on reading development: Word recognition in English and German children. Cognition, 51, 91–103.

Zoccolotti, P., De Luca, M., Di Pace, E., Judica, A., Orlandi, M., & Spinelli, D. (1999). Markers of developmental surface dyslexia in a language (Italian) with high grapheme–phoneme correspondence. Applied Psycholinguistics, 20, 191–216.