Measurement should be considered a key aspect of research method in education. As Towne and Shavelson (2002) noted, when variables or research concepts are poorly measured, even the best research methods cannot provide support for scientific inferences. Education, in particular, offers multiple concepts and variables that serve as objects of research, as well as multiple methods that serve as tools to measure them. However, measurement validity and reliability present a challenge in education, and there are multiple explanations for this. It may often be the case that there is no clear definition of a particular variable, which impairs the choice of appropriate measurement tools. Sometimes the chosen measurement instrument fails to capture the concept or variable as it is defined. Moreover, measurement error often severely impedes the accuracy of the research and hence limits the validity of the resulting analysis. In educational research especially, the use of a specific measurement may have unexpected social consequences (for example, the introduction of tests evaluating teachers’ performance may lead to an improvement in the quality of education). Finally, in the social sciences, measurement error resulting from human error is nearly inevitable since, in most cases, measurement calls for human interaction. This leads to the widely recognized need to remain objective towards the object of analysis (Verma, Mallick, & Neasham, 1998).
This section of the study is dedicated to devising an operational definition of measurement that is based on existing definitions of the concept. It will also offer an insight into the function, matter, and scope of measurement.
Measurement is a complex concept, and an adequate definition of it is of great significance to this study. Broadly, measurement should be understood as the process of “classification of the objects and events in which a certain sign (numeral, letter, or word) is assigned to each defined class” (Berka, 1983). Measurement is often misinterpreted as mere quantification; the latter, however, is merely a process of assigning numbers to objects or to gradations within a certain concept. Misunderstanding of measurement may lead to such common misconceptions as, for example, the assumption that IQ tests can be employed to measure intelligence and hence that IQ tests can establish a quantity that is measured (Gorard, Roberts, & Taylor, 2004). There should be a clear relationship between the variable or concept that is measured and the properties of the numbers that are used for measurement purposes.
The process of measurement should, thus, be characterized not only through the object that is measured but also through the results of measurement and the empirical operations that mediate the process. These empirical operations concern derived measurement, or the type of measurement that relies on fundamental measures to obtain measurement results. Examples of fundamental measures include mass and length, while derived measures use, for example, mass and volume to obtain density. The object that is measured and the results of measurement are, by contrast, in no way controversial.
The most commonly cited definition of measurement belongs to Campbell, who defines measurement as “the process of assigning numbers to represent qualities” (Campbell, 2013). Bertrand Russell stated that “measurement of magnitudes is in its most general sense any method by which a unique and reciprocal correspondence is established between all or some of the numbers, integral, rational, or real as the case may be” (as cited in Turner & Risjord, 2007). Stevens stated that measurement involves the quantification of phenomena in accordance with specific rules (as cited in Polansky, 1975).
Hence, it is possible to conclude this subsection by suggesting that measurement has the following properties. First, the process of measurement calls for assigning numbers to variables or concepts in accordance with previously established rules. These rules should be explicit and consistent throughout the process of measurement. Second, measurement is not concerned with measuring the object itself; rather, it is focused on evaluating specific properties and attributes of this object, its features. Third, quantification is an attribute and component of measurement; however, it should not be envisioned as its sole purpose and essence. Thus, the operational definition of measurement that will be used in this study can be formulated as follows: measurement is a process that involves assigning symbols to the properties of specific variables or phenomena that are observed either directly or indirectly, in accordance with previously determined consistent rules and operations. Thus, if there is no criterion that determines whether a given numeral should or should not be assigned, then the process is not measurement.
One of the obvious difficulties discussed in the previous subsection lies in establishing a relational system between the numbers used to quantify the properties of objects and the properties of the measured objects. These difficulties lead to a discussion of the matter of measurement, which lies within the scope of the complex connections between the objects that are used to conduct the process of measurement and the features or properties of the objects that are measured. In other words, the subject matter of measurement lies within the clear distinction between the measured object and the object of measurement, which are not to be identified as mutually exclusive factors (Berka, 1983).
The primary function of measurement is to serve as a mediator between theory and practice or “between empirical knowledge and its mathematical expression” (Berka, 1983). Hence, measurement should be envisioned as an empirico-mathematical research method since it is impossible to justify measurement only through its empirical function just as it is impossible to assume that measurement serves solely mathematical purposes.
The scope of a single measurement should be defined as “a ratio of its magnitude to its precision” (Schneider, 2009). Scope is the concept that reveals the differences between the four measurement scales, namely, the nominal, ordinal, ratio, and interval scales. Every instrument of measurement has a scope as well, which should be understood as “the ratio of its maximum related to its resolution” (Schneider, 2009). It is important to state that scope is the result of measurement activities, and it should not be considered a property of the measured object itself. In the natural sciences, however, it is possible to suggest that measured objects have their own scope. In case an object has geographical boundaries, its scope can be defined as the ratio of the “diameters of the largest and smallest cases” of this object (Schneider, 2009). Sometimes, certain social phenomena (such as labor migration) may also have a geographical scope. However, in the social sciences objects are rarely characterized by their own scope, and hence such convenient analytical tools as scope diagrams are rarely employed.
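The two ratios above can be stated directly in code. The following sketch is illustrative only; the instruments and numerical values are invented assumptions, not examples taken from Schneider (2009).

```python
def measurement_scope(magnitude, precision):
    """Scope of a single measurement: the ratio of its magnitude to its precision."""
    return magnitude / precision

def instrument_scope(maximum, resolution):
    """Scope of an instrument: its maximum reading relative to its resolution."""
    return maximum / resolution

# A hypothetical ruler reading of 250 mm taken with 1 mm precision:
print(measurement_scope(250, 1))    # 250.0
# A hypothetical 300 mm ruler with 0.5 mm resolution:
print(instrument_scope(300, 0.5))   # 600.0
```

On this view, a larger scope corresponds to a more informative measurement, which connects directly to the comparison of scales below.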
To continue the discussion of the relation between scope and measurement scales, it is useful to point out that the four types of measurement scales have different scope and hence different information content since scope is an indicator of the information content of a single measurement. The nominal scale that has only one step is the least informative. The ordinal scale that is restricted to the number of objects that are measured and, thus, compared is more informative. However, it has a restricted number of “steps” while interval and ratio measurement scales are to be considered more informative than the ordinal and nominal scales (Schneider, 2009). The following subsection will continue the discussion on measurement and scaling.
This subsection will discuss the concept of a scale, as well as provide a classification of scales in measurement along with their analysis.
Scale is one of the concepts of measurement that has a lengthy history. Kariya defines scale as “a part of indicating device consisting of an ordered series of marks and located at some of them scale numberings and numerals, which correspond to values of a quantity” (Kariya, 2000). In order to understand the concept of scale, it is possible to employ mathematical notations. Assume that there is a series of object property manifestations that form a series of values and, thus, are the observed empirical elements that serve as the measured object, and let us denote these property manifestations with a. Assume that there exists a series of numerical objects or scale numberings that shall be denoted as b. Then, according to the previously derived definition of measurement, there should exist a function f such that b=f(a). This function serves as a set of rules composed of certain conventions that are used to map the relation between a and b or between empirical observations and numerical notations on a scale.
Furthermore, there exists a subset of reference elements such that for these elements the value of a and b is known before the set of rules was derived for the rest of the elements that comprise the set of empirical observations or manifestations of properties of the object under analysis. These elements are, therefore, used for the purpose of generalization. The following diagram by Kariya represents the explanation above (Kariya, 2000).
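Under the operational definition derived earlier, the mapping b = f(a) can be sketched in code. The property manifestations, numerals, and rules below are hypothetical assumptions used purely for illustration.

```python
# Reference elements: pairs (a, b) for which the scale value is known
# in advance; the rule f is generalized from them. The categories and
# numerals here are invented for illustration.
reference = {"cold": 1, "warm": 2, "hot": 3}

def f(manifestation):
    """Assign a scale numeral to an observed manifestation by explicit rules."""
    if manifestation not in reference:
        # Per the operational definition: no criterion, no measurement.
        raise ValueError("no rule assigns a numeral to this observation")
    return reference[manifestation]

observations = ["warm", "hot", "cold"]   # empirical elements a
scale_values = [f(a) for a in observations]  # numerical objects b
print(scale_values)  # [2, 3, 1]
```

The explicit rule table makes the defining criterion visible: an observation outside the rule set cannot be assigned a numeral, so the procedure refuses rather than guesses.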
In educational measurement, validity is the concept that “refers to the degree to which a test, tool or technique measures what it is supposed to measure” (Verma, Mallick, & Neasham, 1998). It can readily be inferred that there is no universally valid test. In other words, any test may be valid in one situation and invalid in another. Each test has its own connotation of validity. In terms of educational measurement, validity refers to the extent to which measurement results can be used as initially intended (Mcnamara, Erlandson, & Mcnamara, 1999). In terms of educational measures, it is also inappropriate to state that a measure can be either valid or not valid; rather, a measure is valid for a particular intent and for a particular group. Therefore, the primary question is not whether a measure is valid but rather whether it is valid for the chosen properties of the measured object, as well as for the theoretical constructs that it is intended to reflect.
There are three types of measurement validity that are commonly distinguished in educational research: content validity, criterion-related validity, and construct validity (Zeller & Garmines, 2009). The primary role in developing tests in educational measurement has traditionally been given to content validity. The concept of content validity refers to “the extent to which a set of items taps the content of some domain of interest” (Zeller & Garmines, 2009). The set of items is said to be content-valid to a certain degree to which this set reflects a full domain of interest. There are two steps that are identified as necessary in obtaining content validity. The first step involves “specifying the domain of content” for the test that would be conducted (Zeller & Garmines, 2009). For example, a test in elementary arithmetical operations has addition, subtraction, division, and multiplication as its domain. The second step involves selecting items that are associated with the domain of content.
The primary problem with content validity is that there is no universal agreement on the criteria that should be checked to demonstrate that a specific measure has attained content validity. Therefore, content validity is by and large an appeal to the reasoned judgment of the researcher, who must make his or her own decisions regarding sampling the content of interest and reflecting the domain of this content in tests. While it may be easy to construct tests that measure reading aptitude or arithmetical skills, more complex social constructs are harder to measure, and hence the measurement instruments employed to measure such constructs would have only vague content validity.
The criterion-related validity “concerns the correlation between a measure and some criterion variable of interest” (Zeller & Garmines, 2009). For example, a language skills test is, thus, considered valid provided that there is a high correlation between one’s ability to speak, read, and write in the language that was tested and his or her test score. The criterion-related validity is determined only through this correlation between the chosen measure and its criterion. Whenever this correlation is high, the given measure is considered to be valid for the chosen criterion. Since there are many criteria for each measure, there is no single criterion-related coefficient of validity. According to Zeller & Garmines (2009), there are two types of criterion-related validity, namely, concurrent validity and predictive validity. The former is assessed through correlating a measure and the criterion for the same point in time while the latter is assessed by correlating a future criterion with a relevant measure (Zeller & Garmines, 2009). Job-screening tests or university entrance exams are primarily concerned with predictive validity since they are designed to predict future performance of an applicant.
While it may seem that the logic underlying criterion-related validity is clear and appealing, there are limitations that prevent using this type of validity in educational research. The primary problem is that in educational research there exist few – if any – reliable criterion variables against which tested variables could be evaluated. It can also be suggested that the more abstract the concept that is tested (for example, self-esteem), the harder it is to find an appropriate criterion variable for assessing its measure (Zeller & Garmines, 2009). Therefore, criterion-related validity is simply inapplicable for many abstract concepts that are employed in the social sciences.
Construct validity should be defined as “the degree to which an instrument successfully measures a theoretical construct or an unobservable characteristic or trait” (Grinnell & Unrau, 2010). According to Schutt (2006), construct validity is used in social research when there is no clear criterion to establish the criterion-related validity. Construct validity may be established by using results of previously conducted research that will allow showing that a chosen measure has an established connection to other theoretically-specified measures. Schutt (2006) stated that two other approaches to construct validation exist: convergent validity and discriminant validity. The former is established through associating the chosen measure of a concept with different existing types of measures of the same concept while the latter involves comparing measurement results to measures of related concepts. Discriminant validity is, thus, achieved when the chosen measure is not related to measures of those other concepts.
According to Meyer (2010), measurement reliability should be defined as “the extent to which test scores are consistent with another set of test scores produced from a similar process”. This definition could be extended by suggesting that similarity between two processes should be established through similarity in selecting items for the test, as well as similarity in the data collection processes. It should be noted that complete understanding of measurement reliability is possible only through the complete understanding of the process by which measurement results were obtained.
According to Mcnamara, Erlandson and Mcnamara (1999), a clear distinction should be made between measurement validity and measurement reliability. Reliability, as was already stated, is concerned with measurement consistency, while validity concerns itself with the intent or purpose of the chosen educational measure. Consistency of an educational measure is necessary to obtain valid results. However, it is important to note that it is possible to design a consistent measure that provides wrong results; thus, it is possible to have reliability without validity. In other words, reliability is a necessary but not a sufficient condition for validity. Furthermore, validity is specified by the intent of measurement. A measurement can have high validity only if it is consistent or, in other words, when errors of measurement are minimized. Thus, it is possible to obtain valid results of measurement only if reliability is high.
Hence, measurement reliability is closely connected to the concept of repeated measurement, which is also referred to as a form of reliability. This concept can be most simply defined as the ability to measure properties of the same variable or concept at different points in time using the same measurement instrument, which should provide the researcher with the same measurement results when it is used with the same measured object (Muijs, 2011). In this respect, measurement reliability is established through the test-retest method, which involves asking the same respondents the same set of questions after a period of time elapses. In this case, reliability is established by estimating the correlation coefficient between measurement results at two different time points. According to Muijs (2011), a correlation coefficient of 0.7 is typically sufficient for research purposes, while a coefficient higher than 0.8 is typically required when measurement results produce a significant impact on decisions being made. Inter-rater reliability is considered another form of repeated measurement; it refers to the consistency between ratings of a situation given by several observers. Another form of reliability is internal consistency. This form applies to instruments that are complex and have more than one item (such as language aptitude tests that are composed of reading, writing, and speaking tests); it refers to how homogeneous those items are in terms of measuring properties of the same variable. A common method of establishing this form of reliability is split-half reliability, which requires estimating a correlation coefficient between the measures provided by two equal halves of the items that comprise a single instrument.
In order to increase measurement reliability in educational research, it is possible to ensure that tests that are constructed for measurement purposes have unambiguous and clear questions. Another way to improve measurement reliability is to use complex instruments to measure a single variable since an increase in the number of items that comprise the given instrument suggests obtaining more data from a given respondent and thus error minimization. Finally, to increase reliability of a measure, it is possible to measure a construct that is clearly and narrowly defined.
Measurement reliability is concerned with consistency of measurement outcomes and the effect of error on measurement results. In research, measurement involves error. This subsection will discuss two types of measurement errors: random error and systematic error. Random errors are those that arise from chance, and their effect on measurement outcomes is unpredictable. Systematic errors affect measurement outcomes in a foreseeable way. Random errors lead to problems with measurement reliability while systematic errors lead to complications with measurement validity.
According to Ary et al. (2010), three sources of random error can be identified. First, the individual whose properties are measured may be a source of error. This may result from fluctuations in the individual’s fatigue, physical health, motivation, or even self-esteem. These factors may change randomly and may lead to unpredictable outcomes. For example, hungry students may score better on tests if they know they will be fed afterwards. The second source of random error is the researcher or the individual who administers the measurement process. An individual who lacks proper experience or instruction in the measurement procedure may fail to ensure that the previously defined rules guiding the measurement process are being followed. Vague testing instructions may severely impair test results. Finally, the instrument that is used for measurement purposes may be a source of error. A major source of unreliability is a limited number of questions on the test, which may fail to provide the necessary information concerning the variables that are measured.
When a researcher administers a test to a student, he or she obtains an observed score. However, when a retest is administered to the same student, a new score is obtained. The difference between the two scores can be attributed to random errors. However, if a hypothetical situation of error-free measurement is assumed possible, then it is also possible to suggest that there is a true score of any individual on a given test. According to Ary et al. (2010), the true score is conceptualized as “the hypothetical average score resulting from many repetitions of the test or alternate forms of the instrument”. Hence, each test score has two components: an error component and a true score component. The error component may result from random errors, systematic errors, or a combination of the two. It is possible to use the following notation to denote any test score: S = T + E, where S is the obtained test score (or measurement outcome), T is the true score, and E is the error component. The true score component of the formula is the score that an individual would have obtained had measurement conditions been perfect. The error component may be either positive or negative; it may, therefore, increase or decrease the true score of the individual and, thus, lead to different measurement outcomes. In statistics, it is generally assumed that if the test were repeated an infinite number of times, the sum of all errors would be zero. Hence, the true score of an individual on a given test would be the mean of all test scores that would have been obtained if the test were repeated an infinite number of times. In the usual research situation, however, this is not possible. Hence, it is possible to assume that in a regular research environment, the variance in the observed scores of a given large group of research subjects can be expressed as the sum of the true-score variance and the variance of the measurement errors.
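The true-score model S = T + E can be illustrated with a short simulation. The true score and the error distribution below are arbitrary assumptions; the point is only that, over many repetitions, random errors average out and the mean observed score approaches the true score.

```python
import random

random.seed(0)  # fixed seed so the simulation is reproducible

TRUE_SCORE = 75  # T: the hypothetical error-free score (assumed value)

def observed_score():
    """One test administration: S = T + E, with E drawn as random error
    (here, normally distributed with mean 0 and standard deviation 5)."""
    return TRUE_SCORE + random.gauss(0, 5)

# With many repetitions the random errors sum towards zero, so the
# mean observed score converges on the true score.
scores = [observed_score() for _ in range(100_000)]
mean_score = sum(scores) / len(scores)
print(round(mean_score, 1))  # close to 75
```

A systematic error would instead appear as a constant added inside `observed_score()`, shifting the mean away from T no matter how many repetitions are made, which is why systematic error threatens validity rather than reliability.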
Systematic measurement error may cause bias in research outcomes. There are different sources of systematic errors in different types of research. In both qualitative and quantitative research, systematic error may result from using the data that reflects prior expectations formed by the researcher. However, in quantitative research this type of measurement error may arise simply because the researcher does not have unbiased data readily available (King et al., 1994). In qualitative research, the primary source of measurement error is a subjective evaluation that was made by the researcher. This type of bias is referred to as “confirmation bias” (Johnson & Christensen, 2011).
According to King et al. (1994), it is possible to avoid systematic errors by trying to use judgments that were formulated by other researchers for different purposes. By doing so, it is possible to depart from the initial research hypothesis and ensure that the measurement that is conducted is not affected by one’s own research assumptions. This is a common approach in quantitative research; however, it is also applicable in qualitative studies. The following subsection will discuss the differences between qualitative, quantitative, and mixed research in greater detail.
Quantitative research is primarily employed for analyzing and evaluating theories that have previously been formulated regarding the occurrence of a specific phenomenon. Typically, hypotheses in quantitative research are formulated before data is gathered. Findings in quantitative research are generalizable (meaning that they can be extended to describe the whole population, not only to the sample that was analyzed) when the analyzed sample was large enough and data was collected randomly. Quantitative research is used to determine causal relationships and scientific laws. One of the primary advantages of quantitative research is ease of data collection and analysis since the latter can be conducted using statistical software (Johnson & Christensen, 2011). Furthermore, it is easier to avoid bias in measurement outcomes by ensuring that analysis results are statistically significant (thus, using criterion-based validity).
However, the primary weakness of quantitative analysis is that quite often, it is not possible to obtain quantitative data that would precisely reflect certain phenomena. Variables that are missing have to be proxied. For example, in the economic growth models, population intelligence is often proxied using an average duration of scholarly education or IQ test results. Furthermore, since quantitative studies deliver more robust results when they are conducted on large samples, it may be hard to apply them to specific local situations and contexts (Johnson & Christensen, 2011).
According to Mertens (2009), prior to launching the measurement process in quantitative research, it is necessary to describe all variables that will be measured and how they will be operationalized. Operationalization requires “specification of the procedures that will be used to classify or measure the phenomenon that will be analyzed” (Blaikie, 2003). In quantitative research, the way a specific concept or property is defined and, thus, measured may seriously affect research outcomes. Hence, it is of major importance to address measurement validity and reliability issues.
Unlike quantitative research, which employs data that is considered meaningful by the researcher (and is thus prone to confirmation bias), qualitative research uses data that is based on research participants’ own understanding. Qualitative research is useful when it is necessary to conduct an in-depth study of a limited number of cases: it can be useful in describing complex phenomena and allows case-by-case comparison. Another benefit of qualitative research is that it provides an opportunity to analyze phenomena in local contexts since data collection is normally conducted in natural settings (Johnson & Christensen, 2011). However, qualitative research results may be affected by changes in the conditions of data collection (such as room temperature).
One of the primary disadvantages of qualitative studies is that research results may not be generalizable to other situations or settings. This means that results of the study may be unique to people participating in it. Furthermore, it is difficult to use qualitative studies to generate reliable predictions. Finally, qualitative studies are more time-consuming when it comes to the data collection and analysis.
Measurement in qualitative studies uses naturalistic methods. Measurement tools, such as tests, are developed by the researcher to fit the goals of the study (Lodico et al., 2010). Measurement results are typically presented in the form of a narrative or graphics. Different types of data are recorded using different tools; furthermore, subjective experiences of the researcher are also recorded.
Mixed research methods allow employing the strengths of both qualitative and quantitative studies. In mixed research, it is possible to address a single research purpose by using different methods of data collection and analysis. This suggests that in mixed research it is possible to conduct a deeper and fuller analysis of a single research problem and to cover a broader range of research questions. Furthermore, when qualitative and quantitative research is used in a two-phase sequence, it is possible to overcome the weaknesses of one method by using the strengths of the other. In mixed research, it is possible to arrive at a stronger conclusion through convergence and corroboration of findings (Johnson & Christensen, 2011). Generally, a mixed research design adds useful insights into and understanding of the problem at hand compared to a situation when a single method is used.
However, mixed research has its own weaknesses. For instance, it may be difficult for a single researcher to conduct both qualitative and quantitative analyses of a problem; thus, a mixed research design may call for a research team. In order to conduct mixed research, the researcher has to be aware of the strengths and weaknesses of the research methods that are used in the chosen research design. Thus, preparation of the research design, data collection, and analysis is more time-consuming compared to a situation when a single research method is used. Measurement procedures in mixed research studies use methods that are applicable to both quantitative and qualitative analysis.
This section of the study is dedicated to analyzing theoretical foundations of measurement, precisely, the representation theory of measurement, and the representation problem and representation theorem.
According to Domotor and Batitsky (2008), the representational theory of measurement (RTM) specifies and justifies “the basic procedures for assigning numbers to objects or events on the basis of qualitative observations”. Hence, RTM assumes that the world itself is non-quantitative: researchers observe qualities of the world and assign numerical values to them. The primary question of measurement can be formulated as follows: “What is it about natural systems and/or observational interactions therewith that justifies our correlating the system’s attributes with such-and-such numbers?” (Domotor & Batitsky, 2008). While the mathematical foundations of RTM will be provided in the next subsection, it is important to focus on the answer that RTM gives to this fundamental measurement question. RTM attempts to develop representation theorems for various measurement instruments and asserts that for every observable phenomenon or variable or, in a sense, for every qualitative relation found in the real world, there is an equivalent mathematical notation or operation that assigns numerical values to these relations. Hence, RTM links numerical structures directly to the qualitative attributes of real-world variables or objects. By doing so, this theory proposes that there is a certain fundamental attribute of the qualitative relations of the measured objects and, hence, that there is no need to derive theoretical assumptions for measurements. It should be stated that such an approach to measurement seems very appealing in sciences such as physics. However, the usage of this approach in the social sciences is severely limited by the unobservable nature of the qualities of the objects that are measured. Nevertheless, RTM, which was developed in response to the major problem of measurement, i.e., how we correlate various properties of the world with numbers, is currently the primary theoretical framework of measurement science.
According to Zuse (1998), the primary problem of measurement is the representation problem. This problem can be formulated as follows. Suppose there exists a numerical relational system B and an observed empirical relational system A. The problem is to find necessary and sufficient conditions for a homomorphism from A into B. These conditions have to be testable (or verifiable empirically). They are normally called axioms for representation (rules by which measurement occurs; they give the conditions under which measurement can be conducted theoretically), and the theorem that establishes their sufficiency for a given measurement instrument and a given set of empirical qualities is called a representation theorem. These axioms can be thought of “as conditions that must be satisfied in order for us to organize data in a certain way” (Zuse, 1998).
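A minimal sketch of testing such a homomorphism for an ordinal case: the empirical relation (here, a hypothetical “at least as heavy as” ordering over invented objects) must be mirrored exactly by the >= relation on the assigned numbers. The objects, relation, and numerals are all illustrative assumptions.

```python
# Empirical relational system A: observed pairwise comparisons,
# where (a, b) means "a is at least as heavy as b".
heavier_or_equal = {
    ("stone", "book"), ("book", "pen"), ("stone", "pen"),
    ("stone", "stone"), ("book", "book"), ("pen", "pen"),
}

# Candidate mapping into the numerical relational system B.
assignment = {"stone": 3, "book": 2, "pen": 1}

def is_homomorphism(relation, phi):
    """Check that the empirical relation holds between a and b exactly
    when phi(a) >= phi(b), for every pair of measured objects."""
    objects = phi.keys()
    return all(((a, b) in relation) == (phi[a] >= phi[b])
               for a in objects for b in objects)

print(is_homomorphism(heavier_or_equal, assignment))  # True
```

If any empirical comparison disagreed with the numerical ordering, the check would fail, meaning that this particular assignment does not satisfy the representation axioms and cannot serve as a measurement on an ordinal scale.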
It can readily be inferred that a representation theorem is necessary in order to determine the scale of measurement. A representation theorem thus contains the conditions for the homomorphism discussed above. The simplest examples of representation theorems are the algebraic-difference structure of a ratio scale and the ranking homomorphism of an ordinal scale (Zuse, 1998). Scales richer than the ordinal scale require functions that are more complex than simple ranking.
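The ordinal case can be made concrete with a small illustration. The following sketch (with entirely hypothetical students and proficiency judgments) checks the representation condition for an ordinal scale: a mapping f from an empirical relational system into the numbers is an order homomorphism when a ≽ b holds empirically if and only if f(a) ≥ f(b).

```python
# Observed qualitative relation "at least as proficient as", recorded as
# ordered pairs (a, b) meaning "a is at least as proficient as b".
# Students and judgments are hypothetical stand-ins for direct observation.
students = ["Ann", "Ben", "Cara", "Dev"]
at_least = {
    ("Cara", "Cara"), ("Cara", "Ann"), ("Cara", "Dev"), ("Cara", "Ben"),
    ("Ann", "Ann"), ("Ann", "Dev"), ("Ann", "Ben"),
    ("Dev", "Dev"), ("Dev", "Ben"),
    ("Ben", "Ben"),
}

# Candidate homomorphism f: assign each student a number.
f = {"Cara": 3, "Ann": 2, "Dev": 1, "Ben": 0}

# Representation condition for an ordinal scale:
#   (a, b) is in the empirical relation  iff  f[a] >= f[b].
holds = all(((a, b) in at_least) == (f[a] >= f[b])
            for a in students for b in students)
print(holds)  # True: f is an order homomorphism into the numbers
```

Any strictly increasing transformation of f (for example, doubling every value) satisfies the same condition, which is exactly why ordinal-scale numbers carry rank information only.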
This section of the study is focused on the role of measurement in educational research. It presents basic principles for objective test construction and outlines primary considerations in the test construction process. It also analyzes the major sources of bias in test construction and administration, as well as outlines potential ways to avoid such biases.
It is hard to overestimate the importance of testing in educational research. Tests provide evidence that can be used to make educational decisions, such as changes to teaching programs or judgments about students’ aptitude. At present, educational evaluation has shifted from essay-based questions to objective tests, which offer several advantages. Firstly, they can be scored by different teachers without impairing the accuracy of the grade. Secondly, objective tests produce less response bias since they call for short answers. Thirdly, if the items included in an objective test reflect the properties of the variable or concept being measured, the test has high reliability and validity compared to open-ended or essay-type questions. Lastly, an objective test takes less time since it uses short answers or multiple-choice formats. However, an objective test is complex to construct: for it to have high validity and reliability, the researcher has to follow specific principles that will be discussed in detail. Furthermore, in educational measurement, objective tests may fail to reflect the properties of the variable being measured if they allow students to cheat or guess answers; in that case, the measurement results would not be credible.
In order for an objective test to be reliable and valid, it is important to follow specific steps in its design. Pathak (2008) outlined the following steps of objective test construction: (1) planning the test; (2) administering the test in a preliminary try-out (this requires a relatively large sample of students or potential research subjects to take the test); and (3) item analysis, which includes instructing the test-takers, setting adequate time limits, scoring, selecting items for the final draft of the test, determining test validity and reliability, and forming the final draft of the test.
The ultimate goal of test construction for research purposes is validity. According to Fishman and Galguera (2003), the researcher’s first consideration in the process of test construction is whether “the particular research question is about something that exists in nature”. Furthermore, even if a particular test item is designed to cover a real thing or phenomenon, another consideration is whether this phenomenon is measurable. A test’s validity can be established only if the construct the test is designed to measure is itself measurable. The third consideration prior to test construction is thorough familiarity with the prior research literature on the subject of analysis. Specific attention should be paid to the construction of measurement instruments, in other words, to “the mechanics” of prior research. This should be done because, as Fishman and Galguera (2003) pointed out, “research as any form of inquiry ought to be a cumulative enterprise in which past findings suggest future areas of research by bringing to the surface lacunae in our understanding of the phenomena in question”.
The construction of objective test items also deserves specific attention. Items placed in the final version of an objective test should be clearly written and hence easily understood. Test items should also be precise, inoffensive, and “not conducive to answering behaviors that are more related to total score than they are to the ultimate criteria that we are trying to predict” (Fishman & Galguera, 2003). The latter behaviors are, for example, those that favor the subjects’ self-image or self-concept; such responses also fall into the category of “socially desirable” responding.
Bias is the primary concern in test item construction. According to Lodico et al. (2010), biases in test administration first received close attention in the 1970s and 1980s, when results of standardized tests revealed gaps in performance between different racial, ethnic, and socioeconomic groups. These gaps were eventually attributed to biases in test construction, and two ways of identifying bias in tests emerged. The first calls for examining test items in order to answer the following questions:
The second approach lies in examining test items to determine whether they produce differences in the performance of certain subgroups (Lodico et al., 2010). It is insufficient simply to identify items on which different subgroups perform differently; rather, it is necessary to compare each subgroup’s performance on these items with its performance on the test as a whole, because subgroups may perform differently on specific items due to differences in prior instruction. An item is eliminated from the test if a subgroup’s performance on it is either markedly better or worse than that subgroup’s performance on the test as a whole (Lodico et al., 2010).
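This item-versus-whole-test comparison can be sketched in code. The example below uses a small, entirely fabricated item-response matrix and a crude screening rule; operational analyses of differential item functioning use more formal statistical procedures (e.g., the Mantel-Haenszel method), so this is only an illustration of the logic described by Lodico et al. (2010).

```python
# Hypothetical item-response matrix: each row is one test-taker,
# each column one item; 1 = correct, 0 = incorrect.
responses = {
    "group_a": [[1, 1, 0, 1], [1, 0, 1, 1], [1, 1, 1, 0]],
    "group_b": [[0, 1, 0, 1], [1, 1, 1, 0], [0, 1, 1, 1]],
}

def item_vs_test_gaps(rows):
    """For one subgroup: each item's pass rate minus the subgroup's
    overall proportion correct on the whole test."""
    n_items = len(rows[0])
    overall = sum(sum(r) for r in rows) / (len(rows) * n_items)
    return [sum(r[i] for r in rows) / len(rows) - overall
            for i in range(n_items)]

gaps = {g: item_vs_test_gaps(rows) for g, rows in responses.items()}

# Flag items where the two subgroups' item-vs-test gaps diverge sharply,
# i.e., an item on which one subgroup does unexpectedly better or worse
# relative to its own whole-test performance.
threshold = 0.25  # arbitrary cutoff for this illustration
flagged = [i for i in range(4)
           if abs(gaps["group_a"][i] - gaps["group_b"][i]) > threshold]
print("items flagged for review:", flagged)
```

Note that the comparison is within-subgroup first (item rate versus that subgroup’s total-test rate), which is what distinguishes this screening from merely spotting raw score differences between groups.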
Apart from the test itself, researchers can become a source of bias through test administration. Administration can introduce bias if, for example, the test is administered by individuals who differ in race, ethnicity, or gender from the test-takers in a situation where the test is designed to reveal peculiarities in the educational performance of specific racial or ethnic groups. Matching these characteristics is not crucial, although it can improve the comfort level of those taking the test. This condition is more important in qualitative research, especially when data are gathered through personal interviews and the interview questions are race- or gender-sensitive. Furthermore, it is essential that those who administer the test do not convey prejudice towards test-takers in any form.
Lodico et al. (2010) provide the following recommendations to researchers whose background differs from that of the test-takers:
This study provided an analysis of measurement in educational research. Measurement was defined as a process that involves assigning symbols to the properties of specific variables or phenomena that are observed either directly or indirectly, in accordance with previously determined consistent rules and operations. The primary function of measurement is to serve as a mediator between empirical observations of concepts or phenomena and their mathematical expression. The scope of measurement is understood as the ratio of its magnitude to the precision of measurement.
This analysis also discussed the relation of measurement to scaling, statistics, and measurement theory. The concept of a scale is derived from the definition of measurement and is formulated as a set of rules, composed of certain conventions, that map the relation between empirical observations and numerical notations. Hence, measurement should not be reduced to mere quantification of empirical observations or real-world objects. Any instrument used for measurement purposes should satisfy two criteria: firstly, that it measures well what it is supposed to measure (measurement validity), and secondly, that measurement results obtained with the instrument are consistent with results obtained using a similar measurement process (measurement reliability).
Tests are among the many measurement instruments employed in educational research and can be used in quantitative, qualitative, and mixed research designs. The primary objective of test construction in educational research is validity. However, a test may not be valid if bias is present. This research identified two primary sources of bias that the researcher can control for: the test itself, which may contain items that discriminate against certain social groups, and the administration of the test.