Interpreting Intervention Outcomes: Lexile-Based Assessments and Norm-Referenced Assessments

    by Elfrieda H. Hiebert | April 8, 2016

    The assessments mandated by states and the assessments offered to evaluate the efficacy of most interventions often use different methods for evaluating student performance. Several interventions, including READ 180 and Achieve3000, use assessments based on the Lexile Framework. Another intervention, iLit, uses a norm-referenced assessment called the Group Reading Assessment and Diagnostic Evaluation (GRADE). Are the results from these assessments comparable? For educators, how do the results from Lexile-based and norm-referenced assessments relate to the results of summative, end-of-the-year, state-mandated assessments?

    I address the question of comparability across assessments in this paper, starting with a description of the texts and tasks of summative assessments. The portrayal of the summative assessments is followed by descriptions of a Lexile-based assessment and a norm-referenced assessment. This background leads to the final section where I summarize the comparability of the assessments to one another as well as to summative assessments.

    Summative Assessments

    I will look at the texts and tasks of three summative assessments: (a) a national assessment, the National Assessment of Educational Progress (NAEP); (b) an assessment used by a consortium of states, the Smarter Balanced Assessment Consortium (SBAC) assessment; and (c) an assessment developed specifically by a state, the Florida Standards Assessment (FSA). For comparison, I have chosen the middle of the three grades targeted by the NAEP (Grade 8) and the middle of the middle-school years targeted by the SBAC and FSA (Grade 7).

    Understanding the requirements of an assessment requires an examination of its texts and questions. Even though the Common Core State Standards (CCSS) proposed a three-part system for text complexity (quantitative, qualitative, and teacher judgment), a quantitative system, the Lexile Framework, dominates the analyses of text complexity in the marketplace. The Lexile Framework uses a unit called the Lexile (L), rather than grade equivalents, to determine growth. A Lexile unit is defined as “1000th of the difference between the comprehensibility of the primers and the comprehensibility of the encyclopedia.” 1 The two variables that contribute to a Lexile are the average sentence length and the average frequency of the words in a text.
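    To make the two input variables concrete, the following is a minimal sketch of how average sentence length and average word frequency might be computed for a short text. It is not the MetaMetrics procedure: the frequency table is invented for illustration, the sentence splitter is deliberately naive, and the use of a base-10 logarithm is an assumption standing in for whatever transformation the Lexile Framework applies.

```python
import math
import re

# Hypothetical per-million word frequencies, invented for illustration only.
# The MetaMetrics databank and its counts are not reproduced here.
TOY_FREQUENCIES = {
    "the": 60000.0, "dog": 120.0, "ran": 90.0, "home": 400.0,
    "it": 9000.0, "was": 8000.0, "tired": 60.0,
}

def split_sentences(text):
    """Split on sentence-ending punctuation (a deliberately naive rule)."""
    return [s for s in re.split(r"[.!?]+", text) if s.strip()]

def tokenize(text):
    """Return lowercase alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())

def mean_sentence_length(text):
    """Average number of words per sentence."""
    return len(tokenize(text)) / len(split_sentences(text))

def mean_log_word_frequency(text, frequencies, floor=1.0):
    """Average log10 frequency; words missing from the table get `floor`."""
    logs = [math.log10(frequencies.get(w, floor)) for w in tokenize(text)]
    return sum(logs) / len(logs)

sample = "The dog ran home. It was tired."
print(mean_sentence_length(sample))  # 7 words / 2 sentences = 3.5
print(round(mean_log_word_frequency(sample, TOY_FREQUENCIES), 2))
```

    In a sketch like this, longer sentences raise the first number and rarer words lower the second, which is the direction of difficulty that the discussion of vocabulary below relies on.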

    Additionally, the length of the texts and the number of tasks on assessments can influence students’ performances. For highly proficient readers, length of text may not be a factor, but less proficient readers may find it increasingly challenging to sustain attention and comprehension as texts become longer. Students in the lower quarter of a school distribution have been found to perform reasonably well on shorter passages but more poorly as texts get longer. 2

    Complexity of Texts in Summative Assessments

    The range of Lexiles expected for the middle-school period within the staircase of text complexity of the CCSS 3 is 925L to 1185L. Table 1 provides the overall Lexiles, mean sentence length, and mean vocabulary frequency of the texts in the three summative assessments.

    All three summative assessments fall solidly in the middle-school range of the CCSS staircase of text complexity. Of the three summative assessments, the FSA has the highest Lexile and the NAEP the lowest. However, a Lexile represents sentence length more than vocabulary. 4 Yet it is vocabulary, not sentence length, that most accurately predicts challenges in comprehension for students. 5

    The vocabulary measure of the Lexile Framework is difficult to interpret because it uses a mathematical transformation. To provide a point of comparison, I have established the average vocabulary and average sentence length of texts at target Lexile levels in a database of almost 6,000 school passages (see Table 2). This information suggests that the vocabulary for all three summative assessments falls on the harder end of the scale (i.e., texts at the high school to college-career levels). From the vantage point of vocabulary, the SBAC is the most challenging of the three summative assessments.  

    Table 1. Averages of Features of Texts from Three Types of Assessments Intended for Grade 7

    Columns: Assessment (n of passages); Passage Length
    Rows: NAEP (Grade 8) (n = 9); FSA (Grade 7) (n = 3); SBAC (n = 4); SRI (n = 198); GRADE (n = 6)

    Table 2. Sentence Length and Vocabulary Levels Associated with Lexile Ranges 1, 2

    Columns: Lexile Range; Sentence Length (mean); Vocabulary (mean)

    1 Texts come from the TextProject TextBase (n = 5,877 texts).

    2 Data are provided for Lexile bands that represent the range of school texts.


    Table 3. Questions: Stems, Response Choices, and Correct Answers

    Columns: Question Stems; Responses: Correct and Foils; Sentence Length

    *could not be computed

    Complexity of Questions and Responses

    In addition to the complexity of texts, the complexity of the questions used in the assessment must also be considered. Table 3 provides information on the features of questions. Since sentence length influences the Lexile to a greater degree than vocabulary, 6 the overall Lexile is meaningless for both the question stems and the response options. Consequently, Table 3 does not provide the overall Lexile for questions or responses.

    All three of the summative assessments are within the same range of average sentence length and are very similar in the average vocabulary. The vocabulary of the questions is less challenging than in the texts themselves.

    The information on the vocabulary of the responses should be viewed with caution. Vocabulary is computed within the Lexile Framework by giving every word in a text a score based on its frequency in the MetaMetrics databank. When responses consist of single words or phrases, the function words (i.e., determiners, conjunctions) have been eliminated. The vocabulary average is based only on content words. An example of this comes from the responses for one question on the NAEP: (a) take notes, (b) behave himself, (c) watch closely, and (d) sit quietly. Without the connective words that typically appear in text, the data on the responses cannot be compared to the benchmark information in Table 2.

    The response formats on the summative assessments are also difficult to evaluate because approximately 50–60% of the responses on the NAEP and SBAC are open ended. Students are evaluated on their ability to write responses, typically of several sentences in length, to questions.

    Length of Texts

    The NAEP and SBAC have passages of similar length—around 800 words. Individual passages on the FSA are shorter, but most passages are part of integrated tasks. In these integrated tasks, the questions require students to use information from two or three passages. Thus, the typical number of words that students need to read to answer a set of questions is 700–1050 words.

    Lexile-Based Assessments

    Two prominent interventions use Lexile-based assessments: Achieve3000 uses LevelSet 7 and READ 180 uses the Scholastic Reading Inventory (SRI). The present analysis is based on the SRI only.

    These Lexile-based assessments are developed by MetaMetrics (the company that sells text leveling according to the Lexile Framework). They are described as criterion-referenced assessments, which means that students’ performance is compared to predetermined criteria, often called learning standards. In interventions such as Achieve3000 and READ 180, the typical criterion for growth on the Lexile-based assessment as a result of the intervention is a grade level beyond expected growth. A grade level is identified as 50L. Thus, the criterion for growth over a school year of the intervention is a 100L gain: 50L of expected growth plus 50L for the additional grade level.

    The Nature of the Task

    Students choose a response from four choices to a missing word within an excerpt from a longer text. For example, a seventh grader might start with a passage at the 1000L level. If the student responds correctly, a passage at the 1050L level is given. If the student’s response at that level is incorrect, another 1000L passage is given. The assessment continues until a student has responded correctly to a designated number of passages from a particular level.
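    The passage sequence just described can be sketched as a simple loop. This is an illustration of the description above, not the SRI's actual algorithm: the 50L step matches the example, but the stopping count (`required_correct`), the rule of stepping down one level after any incorrect response, and the function name are assumptions made for the sketch.

```python
def run_adaptive_session(answers_correct, start_level=1000, step=50,
                         required_correct=3):
    """
    Walk through a sequence of passages as described in the text.

    answers_correct: iterable of booleans, one per passage, in order.
    start_level, step, required_correct: illustrative assumptions.
    Returns (levels_presented, final_level). final_level is None when the
    stopping rule is never met.
    """
    level = start_level
    correct_at_level = {}   # Lexile level -> number of correct responses
    presented = []          # Lexile level of each passage shown
    for is_correct in answers_correct:
        presented.append(level)
        if is_correct:
            correct_at_level[level] = correct_at_level.get(level, 0) + 1
            # Stop once enough passages at one level were answered correctly.
            if correct_at_level[level] >= required_correct:
                return presented, level
            level += step   # correct response: move up one level
        else:
            level -= step   # incorrect response: move back down one level
    return presented, None

levels, final = run_adaptive_session([True, False, True, False, True])
print(levels)  # [1000, 1050, 1000, 1050, 1000]
print(final)   # 1000
```

    In this run the student is correct at 1000L three times and incorrect at 1050L twice, so the sketch settles on 1000L. Every response is a single multiple-choice selection, which is the feature of the task that matters for the comparison that follows.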

    All of the items on the SRI are similar, although the length of the text excerpts varies within and across Lexile levels. The items on the SRI cannot be reproduced, but Table 4 provides an item that mirrors a typical text excerpt with its accompanying question.

    Table 4. Illustrations of Texts and Question Types on Lexile-Based and Norm-Referenced Assessments 9

    Lexile-Based Assessments 10

        Farther along the wall, before another gate, Hector’s company, who should have been foremost in the attack, were hesitating. An eagle, the bird of Zeus, flying over, had dropped a live snake, red as blood, into their midst; this they took for an evil omen.

    Hector’s men were _____________ by the live snake.

    a. awakened

    b. frightened

    c. alerted

    d. guarded

    Norm-Referenced Assessments 11

       With the boats all gone, a curious calm came over the Titanic. The excitement and confusion were over, and the hundreds left behind stood quietly on the upper decks. They seemed to cluster inboard, trying to keep as far away from the rail as possible.

       Jack Thayer stayed with Milton Long on the starboard side of the Boat Deck. They studied an empty davit, using it as a yardstick against the sky to gauge how fast she was sinking. They watched the hopeless efforts to clear two collapsibles lashed to the roof of the officers’ quarters. They exchanged messages for each other’s families. Sometimes they were just silent.

       Thayer thought of all the good times he had had and of all the future pleasures he would never enjoy. He thought of his father and his mother, of his sisters and brother. He felt far away, as though he were looking on from some distant place. He felt very, very sorry for himself.

       Colonel Gracie, standing a little way off, felt curiously breathless. Later he rather stuffily explained it was the feeling when “vox faucibus haesit, as frequently happened to the old Trojan hero of our schooldays.” At the time he merely said to himself, “Good-by to all at home.”

       In the wireless shack there was no time for either self-pity or vox faucibus haesit. Phillips was still working the set, but the power was very low. Bride stood by, watching people rummage the officers’ quarters and the gym, looking for extra lifebelts.


    What is the story mostly about?

    a. People relaxing after a hard experience?

    b. People preparing for tragedy?

    c. People watching a sports event?

    d. People wondering what just happened?

    Comparison to Summative Assessment Tasks

    The following analysis focuses on the texts in the SRI for the three Lexile levels that are at the middle of the Grade 6–8 band on the staircase of text complexity: 1000L, 1050L, and 1100L. Information on the texts and on the questions and responses appears in Tables 1 and 3, respectively.

    There are three primary differences between the summative and Lexile-based assessments:

    • The length of passages. The passages used in the SRI to assess the mid-point of the middle-school band on the staircase of text complexity average 75 words. The typical passages of the summative assessments are approximately 8.6 times as long.
    • The nature of the vocabulary in the passages. The vocabulary in the passages of the SRI is less challenging than the typical level on all of the summative assessments. The vocabulary in the SRI passages for Grade 7 has a word frequency level that is equivalent to that of passages for Grades 2–3, rather than that of middle-school passages (see Table 2).
    • The nature of questions and responses. The summative assessments have question stems that are almost four times the length of those on the Lexile-based assessment. All of the response formats on the Lexile-based assessment are similar: four words with one correct choice.

    The system for analyzing Lexiles does not permit analyses of lists of words, which is the case with the responses on the SRI. Consequently, the complexity of the correct choices and foils (i.e., the alternatives to the correct choice) cannot be reported in this analysis.

    Norm-Referenced Assessments

    Norm-referenced assessments have traditionally been referred to as norm-referenced tests (NRTs). These assessments estimate the position of the tested individual relative to a predefined population on a particular proficiency. To create an NRT, a large group of students at the target grade levels take the assessment. Items that do not have strong reliability are eliminated; when the assessment is deemed to give valid and reliable scores, a set of norms is established. A score at the 40th percentile means that the individual’s proficiency is higher than 39% of the population and lower than 60% of the population. When a score is presented as a percentile or a grade equivalent, this score represents the student’s performance in relation to the norming sample, which has been chosen to be representative of the students who will take the assessment.
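    As a rough illustration of how a score is located within a norming sample, the sketch below computes a percentile rank as the percentage of norming scores that fall strictly below a given score. The norming sample is invented, and the "strictly below" convention is only one of several in use; published tests such as the GRADE may handle ties and boundaries differently.

```python
from bisect import bisect_left

def percentile_rank(score, norming_scores):
    """Percent of the norming sample scoring strictly below `score`."""
    ordered = sorted(norming_scores)
    below = bisect_left(ordered, score)  # count of scores less than `score`
    return 100.0 * below / len(ordered)

# Hypothetical norming sample: raw scores 1 through 100, one student each.
norming_sample = list(range(1, 101))
print(percentile_rank(41, norming_sample))  # 40 of 100 scores are below 41 -> 40.0
```

    The point of the sketch is that the result depends entirely on who is in the norming sample, which is why the representativeness of that sample matters.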

    A recent addition to the field of NRTs is the GRADE. Its concurrent and predictive validity were assessed using a variety of other standardized reading assessments (e.g., TerraNova, Iowa Test of Basic Skills, California Achievement Test).

    The Nature of the Task

    The GRADE consists of several sub-tests (including comprehension while listening and vocabulary), but the sub-test of focus here is passage comprehension. As described in Table 1, individual passages on the middle-grade form of the GRADE average 241 words, but a student’s standing in relation to peers is established by their performances on six passages (totaling approximately 1,500 words).

    Comparison to Summative Assessments

    • Length of texts. Because a student’s performance on the GRADE reflects their performances on all six passages on the assessment, the total number of words on the middle-school GRADE is a better reflection of the demands of the task than the length of individual passages. Students’ performances on the GRADE reflect their ability to process approximately 1,500 words. The total number of words on the GRADE exceeds the average length of passages on all three summative assessments.
    • The nature of the vocabulary in the passages. The average vocabulary level of the GRADE passages (3.40) is in the range of the summative assessments. It also falls in the high school to college-career range in Table 2.
    • The nature of the questions and responses. Question stems on the GRADE are moderate in length: 1.6 times shorter than the typical questions on the summative assessments but 2.35 times longer than the questions on the SRI. The vocabulary of GRADE questions is more challenging than that of the SRI but less challenging than that of the summative assessments.

    The vocabulary of the correct responses and of the foils (i.e., wrong answers) is comparable to the vocabulary of the summative assessments.
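    The two question-stem ratios reported for the GRADE can be checked against the earlier statement that summative stems are almost four times the length of SRI stems. In the worked step below, S, G, and R are hypothetical symbols for the mean stem length on the summative assessments, the GRADE, and the SRI, and "1.6 times shorter" is read as S/G = 1.6; both are assumptions made for this illustration.

```latex
\[
\frac{S}{G} = 1.6, \qquad \frac{G}{R} = 2.35
\quad\Longrightarrow\quad
\frac{S}{R} = \frac{S}{G}\cdot\frac{G}{R} = 1.6 \times 2.35 = 3.76
\]
```

    Under that reading, the three comparisons are mutually consistent: a ratio of 3.76 is the "almost four times" difference between summative and SRI question stems.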

    Predicting Student Performances on Summative Assessment from Lexile-Based and NRT Assessments

    Comparing students’ gains over a school year as a result of an intervention with data from a Lexile-based assessment or an NRT is truly a case of the proverbial comparison of apples and oranges. But the critical question to ask has to do with the degree to which an assessment is valid and reliable. Can we predict that the findings from one assessment will generalize to other assessments? I ask this question about both Lexile-based and norm-referenced assessments.

    If a student is assigned a Lexile level of 1,500 on a Lexile-based assessment, does that mean that the student will do well on a summative assessment with a Lexile level of 1,500?

    No. There is no evidence that students’ performances on Lexile-based assessments transfer to comprehension on summative assessments with similar Lexile levels.

    Evidence for this statement comes from a study commissioned by the CCSS developers. 8 In that study, the correlation between Lexiles and the grade band at which a text was assigned on a state assessment was in the vicinity of 0.26 for Grades 6–8 and 0.22 for Grades 9–11. On a norm-referenced assessment (SAT-9), the correlation of Lexiles to grade-level text assignments was in a similar range for Grades 6–8 (r = 0.24) but fell to 0.078 for Grades 9–11.

    Note that these correlations represent the ability of the Lexile Framework to predict the assignment of a text to a relatively large grade band—not an individual grade. These correlations are consistently low, indicating that the ability of the Lexile Framework to predict the levels of state and norm-referenced assessments is not high.  
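    One way to see how low these correlations are is to square them. The squared correlation is the proportion of variance in one measure that is linearly accounted for by the other; applying it to the four values quoted above is an illustration added here, not an analysis reported in the cited study.

```latex
\[
r = 0.26 \;\Rightarrow\; r^{2} \approx 0.068, \qquad
r = 0.24 \;\Rightarrow\; r^{2} \approx 0.058, \qquad
r = 0.22 \;\Rightarrow\; r^{2} \approx 0.048, \qquad
r = 0.078 \;\Rightarrow\; r^{2} \approx 0.006
\]
```

    On this reading, Lexiles account for roughly 7% of the variance in grade-band assignment at best, and for less than 1% in the weakest case.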

    It should also be remembered that Lexiles can be obtained on any assessment or set of texts, including NRTs (as shown in Table 1 for the GRADE). That students typically read texts at a particular Lexile level is no guarantee that they will comprehend a passage with a similar Lexile level. Likewise, the assignment of a Lexile level on a Lexile-based assessment is no guarantee that students will do well on passages with similar Lexile levels.

    There are several additional reasons why a Lexile assigned to students on a Lexile-based assessment cannot be assumed to generalize to their performances on summative assessments like the FSA or SBAC. First, the passages on which the Lexile is based do not mirror the passages on summative assessments in vocabulary complexity or length. Second, the Lexile level is based on a single task that is limited in scope and complexity—filling in a missing word in a short passage from four choices. All of the choices (and correct answers) are a single word. None of the summative assessments had tasks that were as limited in sophistication of question and response length. Third, a Lexile level is assigned with 75% comprehension. Choosing a single word from a set of four words is a very different task than writing a response to a question or selecting the correct sentence from a set of four sentences.

    In summary, a gain of 100L on a Lexile-based assessment is no guarantee that students will perform well on a state assessment. Summative assessments require students to sustain attention over extended tasks with long texts and rare vocabulary. These conditions are quite different from those of the Lexile-based assessment, in which passages are short and vocabulary relatively easy. Further, students will need to read substantially longer questions on the state assessments than on the Lexile-based assessment. Finally, the ways of responding to questions on summative assessments are substantially harder than the single task on Lexile-based assessments: picking one of four words.

    Is there evidence that students’ performance on an NRT will predict their performance on summative assessments?

    Yes. An NRT, such as the GRADE, has numerous features that align closely with the summative assessments, including but not limited to reflecting students’ stamina across extended passages, the nature of questions, and the types of responses.

    The GRADE is not perfectly aligned to the summative assessments, especially the new-generation assessments such as the FSA and SBAC, but it aligns closely with the tasks and the expectations of the summative assessments. The reliability and validity of the GRADE have been assiduously established, indicating that it captures the kind of reading that students at particular grade levels can do better than the new assessments do. In saying that the GRADE represents students’ “real” reading levels, I am referring to the escalation of text complexity recommended in the Common Core State Standards (CCSS) and evident in the new assessments. Assigning higher levels of text complexity, as the CCSS has done, does not mean that the majority of the students have the capacity to read at these levels. The validation of the GRADE, which used the performances of a large sample of students to establish what students can and cannot read, better captures the “real” reading levels of current students.

    Lexile-based assessments such as the SRI fail to assess two aspects of summative assessments that cause challenges for many American students: stamina in reading extended texts and complex tasks in response to reading. By contrast, norm-referenced assessments better capture stamina for reading and responses to complex tasks. Until independent research studies show that performances on Lexile-based assessments predict performances on summative assessments at high levels, educators should be cautious about assuming that performances on Lexile-based assessments transfer to current summative assessments.

    1 Stenner, A. J., Burdick, H., Sanford, E. E., & Burdick, D. S. (2007). The Lexile framework for reading (Technical Report). Durham, NC: MetaMetrics, p. 6.

    2 Mesmer, H. A. E., & Hiebert, E. H. (in press). Third graders’ reading proficiency with texts varying in complexity and length: Responses of students in an urban, high-needs school. Journal of Literacy Research.

    3 Coleman, D., & Pimentel, S. (2012). Revised publishers’ criteria for the common core state standards in English language arts and literacy, grades 3–12. Retrieved from

    4 Hiebert, E.H. (November 30, 2012). Readability formulas and text complexity. Paper presented at the annual meeting of the Literacy Research Association, San Diego, CA. Retrieved from

    5 Davis, F. B. (1942). Two new measures of reading ability. Journal of Educational Psychology, 33(5), 365.

    6 Deane, P., Sheehan, K. M., Sabatini, J., Futagi, Y., & Kostin, I. (2006). Differences in text structure and its implications for assessment of struggling readers. Scientific Studies of Reading, 10(3), 257-275.

    7 MetaMetrics Corporation worked with Achieve3000 in creating this assessment. Based on information provided on the website, it appears to be similar in structure and format to the SRI.

    8 Nelson, J., Perfetti, C., Liben, D., & Liben, M. (2012). Measures of text difficulty: Testing their predictive value of grade levels and student performance. Report to the Gates Foundation. Retrieved from

    9 Simulations of the typical passages and questions for the two assessments are presented here. Passages and questions on both the SRI and GRADE are confidential and cannot be reproduced here.

    10 Sutcliff, R. (2005). Black ships before Troy. New York, NY: Laurel Leaf Publishers.

    11 Lord, W. (2005). A night to remember. New York, NY: Holt Paperbacks.