

Reliability and Validity: Psychological Assessment and Measurement

Psychological measures have many desirable properties, such as standardization and population norms, to name a few.

Reliability and validity are considered the linchpins of any psychological measurement, since they make individual and group scores interpretable. Apart from this purpose, they are also prerequisites for clinical assessment and diagnosis.


Reliability refers to the consistency of scores obtained by the same persons when they are re-examined with the same test on different occasions, with different sets of equivalent items, or under other variable examining conditions. It is the ability to yield consistent results from one set of measures to another; it is the extent to which the obtained test scores are free from such internal defects as will produce errors of measurement inherent in the items and their standardization.

These errors do not come only from the instability of the subject’s performance or from chance factors. Because individuals do not perform with complete consistency on all occasions, and chance factors cannot be completely eliminated, the reliability of a psychological test is the result of an interaction among individual differences, defects of the instrument and chance determinants.

Reliability has two important aspects. The first is internal consistency, i.e. consistency of results throughout the test when it is administered once. The second is consistency of results upon re-testing.

Whenever anything is measured, whether in the physical, biological or behavioral sciences, there is some possibility of chance error. This is true of psychological tests as well. Variations in results obtained with the same test administered more than once to the same persons, or within parts of a test given only once, are due to actual differences among the individuals taking the test and to whatever defects may be inherent in the instrument itself.

Types of Reliability

The main types of reliability are as follows:

Absolute Reliability

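It is stated in terms of the standard error of measurement (SEM), an estimate of how far a set of observed scores deviates from the corresponding true scores.

For a single score, the error of measurement is the difference between the observed score and the true score. The SEM summarizes these errors for the test as a whole:

SEM = SD × √(1 − r)

where SD is the standard deviation of the observed scores and r is the reliability coefficient of the test.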
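As a rough illustration of the standard error of measurement, the sketch below uses the usual classical test theory formula, SEM = SD × sqrt(1 − reliability). The function name and the example figures (an SD of 15 and a reliability of 0.91) are assumptions chosen for illustration, not values from any particular test.

```python
import math


def standard_error_of_measurement(sd, reliability):
    """Return SEM = SD * sqrt(1 - r) for an observed-score SD and reliability r."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return sd * math.sqrt(1.0 - reliability)


# Hypothetical test: observed-score SD of 15, reliability coefficient of 0.91.
sem = standard_error_of_measurement(15.0, 0.91)
print(round(sem, 2))  # 4.5

# A band of one SEM on either side of a hypothetical observed score of 110.
observed = 110
print(round(observed - sem, 1), round(observed + sem, 1))  # 105.5 114.5
```

The lower the reliability, the wider this band becomes; with a perfectly reliable test the SEM would be zero.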

Relative Reliability

It is expressed in terms of a coefficient of correlation. It indicates the extent to which individuals in a group maintain relatively consistent scores when two sets of measures are correlated using the same test or two equivalent forms.

Test – Retest Reliability

It is the extent to which the scores a group of individuals obtain on a test correlate with their scores on the same test upon retesting. Variations in the results may arise in part from uncontrolled conditions such as the weather, and in some cases from changes in the condition of the subjects themselves, such as illness or strain. This type of reliability is affected by practice and memory.
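A test-retest coefficient is simply the correlation between the two sets of scores. The sketch below computes a Pearson correlation by hand; the helper name `pearson_r` and the six pairs of scores are made up for illustration.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical scores of the same six persons on two occasions.
first_testing = [12, 15, 9, 20, 17, 11]
retest = [13, 14, 10, 19, 18, 10]

test_retest_reliability = pearson_r(first_testing, retest)
print(round(test_retest_reliability, 2))  # 0.96
```

The same calculation gives an alternate-form coefficient if the second list holds scores on an equivalent form rather than on a retest.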

Alternate Form Reliability

In this, the subject can be tested with one form on the first occasion and with another equivalent form on the second occasion. The correlation will represent the reliability coefficient of the test. The use of alternate forms provides a means of reducing the possibility of cheating and faking.

Split Half Reliability

Two scores are obtained for each person by dividing the test into comparable halves. The coefficient of correlation between the two equivalent halves of the test for a sample group of individuals gives the split-half reliability. A prerequisite for using the split-half technique is that the items be arranged in order of increasing difficulty, as determined from the performance of individuals in the standardization group.

The Spearman–Brown formula, also known as the Spearman–Brown prophecy formula, is a formula relating psychometric reliability to test length and used by psychometricians to predict the reliability of a test after changing the test length. The method was published independently by Spearman (1910) and Brown (1910).
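A minimal sketch of how a split-half coefficient can be stepped up with the Spearman–Brown formula, r_k = k·r / (1 + (k − 1)·r), where k is the factor by which test length changes (k = 2 when going from a half to the full test). The odd-even split, the helper names and the item responses are assumptions for illustration only.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman_brown(r, length_factor=2.0):
    """Predicted reliability when test length is multiplied by length_factor."""
    return (length_factor * r) / (1.0 + (length_factor - 1.0) * r)


# Hypothetical responses (1 = correct, 0 = wrong); rows are persons, columns
# are items arranged in order of increasing difficulty.
item_scores = [
    [1, 1, 1, 1, 1, 1, 1, 0],
    [1, 1, 1, 1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 0, 0],
    [1, 1, 0, 1, 0, 0, 0, 0],
]

# Odd-even split: items 1, 3, 5, 7 form one half and items 2, 4, 6, 8 the other.
odd_half = [sum(row[0::2]) for row in item_scores]
even_half = [sum(row[1::2]) for row in item_scores]

r_half = pearson_r(odd_half, even_half)   # reliability of a half-length test
r_full = spearman_brown(r_half)           # estimate for the full-length test
print(round(r_half, 2), round(r_full, 2))
```

Because each half is only half as long as the full test, the raw correlation between halves understates the reliability of the whole test; the Spearman–Brown step corrects for that.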

Reliability in Quantitative Research

Both qualitative and quantitative researchers want reliable measurement. However, each style sees reliability in the research process differently. Reliability in quantitative research is known as measurement reliability. It means that the numerical results produced by an indicator do not vary because of characteristics of the measurement process or of the measurement instrument itself.

Example: I get on my weighing scale and read my weight. I get off and get on again and again. I have a reliable scale if it gives me the same weight each time, assuming of course that I am not eating, drinking, changing clothes and so forth. An unreliable scale will register different weights each time even though my “TRUE” weight does not change.

Reliability in Qualitative Research

Reliability means dependability or consistency. Qualitative researchers use a variety of techniques (such as interviews, participation and photographs) to record their observations consistently. Researchers want their observations to be consistent over time; however, the process that they study is often not stable over time. They emphasize the value of a changing or developing interaction between the researcher and what he or she studies. Qualitative researchers believe that the subject matter and the researcher’s relationship to it should be an evolving process. Most qualitative researchers see the quantitative approach to reliability as a cold, fixed, mechanical instrument that one repeatedly applies to some static or lifeless material. They feel it neglects key aspects of the diversity that exists in the social world. Diverse measures and interactions with different researchers are beneficial because they can illuminate different facets or dimensions of a subject matter.

How to improve reliability?

Four ways to increase the reliability of measures are:

  1. Clearly conceptualized constructs: constructs should be specified to eliminate distracting or interfering information from other constructs. Reliability increases when each measure indicates one and only one concept based on unambiguous and clear theoretical definitions.
  2. Use a precise level of measurement: indicators at a higher or more precise level of measurement are more likely to be reliable than less precise measures, because the latter pick up less detailed information. If more specific information is measured, then it is less likely that anything other than the construct will be captured.
  3. Use multiple indicators: a third way to increase reliability is to use multiple indicators because two or more indicators of the same construct are better than one. Multiple indicators let a researcher take measurements from a wider range of the content of a conceptual definition. Different aspects of the construct can be measured each with its own indicator.
  4. Use pilot tests: reliability can be improved by using a pre-test or pilot version of a measure first.


Range of Ages

A correlation coefficient reflects the group trends of the measures. As persons increase in age, mental capacity increases until maximum development is reached. If a reliability coefficient is found with a group that has a relatively small variation of the trait or ability being measured, the coefficient will be relatively low. If the group has a wider range, the coefficient will be higher. In interpreting a reliability coefficient of a test, it is important to know the range of ages upon which the test was standardized.

Range of Scores

Just as with a narrow range of ages, when the variation in scores among the subjects is narrow, the correlation between two sets of scores is lowered, because chance and minor psychological factors then account for a larger share of the differences between subjects. Reliability coefficients of a given test may vary as the composition of the tested group changes, even though the performances of the subjects themselves are unchanged.
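The effect of range on a reliability coefficient can be seen in a small simulation. The sketch below is only an illustration under assumed figures: each simulated person has a true score (mean 100, SD 15) and two testings with independent error (SD 5). The same data yield a much lower coefficient when only a narrow band of persons is kept.

```python
import math
import random


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(0)  # fixed seed so the run is repeatable

# Simulated persons: a true score plus independent error on each testing.
true_scores = [rng.gauss(100, 15) for _ in range(5000)]
first = [t + rng.gauss(0, 5) for t in true_scores]
second = [t + rng.gauss(0, 5) for t in true_scores]

# Coefficient for the whole, wide-ranging group.
r_wide = pearson_r(first, second)

# Coefficient for a homogeneous subgroup: true scores between 95 and 105 only.
narrow = [(a, b) for t, a, b in zip(true_scores, first, second) if 95 <= t <= 105]
r_narrow = pearson_r([a for a, _ in narrow], [b for _, b in narrow])

print(round(r_wide, 2), round(r_narrow, 2))
```

The error of measurement is identical in both groups; only the spread of the trait differs, which is why a coefficient should always be read together with the range of the group it was computed on.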

Time Interval

When there is a time interval in administering two forms, the retest results will be affected by the normally expected fluctuations in individual performances and by changes in environmental conditions.

Thus, while reliability coefficients obtained at a single sitting, or in a single day, are most likely to give the best estimate of the consistency of the instrument itself, they do not indicate stability of performance over a period of time as well as coefficients obtained by the test-retest method with a time interval.

Effects of Practice and Learning

Such effects will depend upon the content of the test, length of the interval, and upon the examinee’s experiences during the interval. For example, if some months have elapsed between two administrations of an educational achievement test, different people may have had different amounts and qualities of instruction during the period.

Consistency of Scorers

Some tests are not entirely objective in scoring as the examiner would want to judge the quality of responses. For such tests, it is important to know the extent of agreement in scoring among competent psychologists who have scored the same sets of responses. Lack of agreement among scorers will adversely affect the reliability findings.



The merit of a psychological test is determined first by its reliability but ultimately by its validity. Put simply, the validity of a test is the extent to which it measures what it claims to measure. Psychometricians have long acknowledged that validity is the most fundamental and important characteristic of a test. The definition of validity, paraphrased from the influential Standards for Educational and Psychological Testing, is: “A test is valid to the extent that inferences made from it are appropriate, meaningful and useful.”


FACE VALIDITY: A type of measurement validity in which an indicator “makes sense” as a measure of a construct in the judgment of others, especially in the scientific community. This is a term used to characterize test materials that appear to measure what the test’s author desires to measure.

CONTENT VALIDITY: Measurement validity that requires a measure to represent all the aspects of the conceptual definitions of a construct. It is estimated by evaluating the relevance of test items, where each item must be a sampling of information the test purports to measure.

CRITERION VALIDITY: Measurement validity that relies on some independent, outside verification.

CONCURRENT VALIDITY: Measurement validity that relies on a preexisting and already accepted measure to verify the indicator of a construct. It refers to the process of validating a new test by correlating it with some present source of information.

PREDICTIVE VALIDITY: Measurement validity that relies on the occurrence of a future event or behavior that is logically consistent to verify the indicator of a construct. It is the extent to which the test is efficient in forecasting and differentiating behaviour in a specified area under actual living conditions.

CONSTRUCT VALIDITY: It uses multiple indicators to verify the authenticity of a measure and has two subtypes which explain how well the indicators of a construct converge or diverge. It indicates the extent to which a test measures the psychological processes as defined and analyzed by the author of the test.

FACTORIAL VALIDITY: This method utilizes factor analysis techniques. A test has high factorial validity if it is a measure of one functional unity to the exclusion of other elements.

CONVERGENT VALIDITY: A type of measurement validity for multiple indicators based on the idea that the indicators of the construct will act alike or converge.

DISCRIMINANT VALIDITY: A type of measurement validity for multiple indicators based on the idea that the indicators of different constructs diverge.
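Convergent and discriminant validity are usually examined by comparing correlations: indicators of the same construct should correlate highly, while indicators of different constructs should not. The sketch below uses invented scores for eight persons on two hypothetical anxiety indicators and one hypothetical vocabulary test; the names and numbers are assumptions for illustration.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Two hypothetical indicators of the same construct (anxiety).
anxiety_self_report = [10, 14, 8, 20, 16, 12, 18, 6]
anxiety_clinician_rating = [11, 13, 9, 19, 17, 11, 17, 7]

# One hypothetical indicator of a different construct (vocabulary).
vocabulary_test = [30, 25, 28, 27, 31, 26, 29, 28]

# Convergent evidence: indicators of the same construct act alike.
convergent_r = pearson_r(anxiety_self_report, anxiety_clinician_rating)

# Discriminant evidence: indicators of different constructs diverge.
discriminant_r = pearson_r(anxiety_self_report, vocabulary_test)

print(round(convergent_r, 2), round(discriminant_r, 2))
```

A high first value together with a near-zero second value is the pattern that supports both kinds of validity at once.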


‘VALIDITY’ refers to the ‘truth fidelity’ or authenticity of any psychological instrument. In psychological research, it is referred to as ‘MEASUREMENT VALIDITY’.

MEASUREMENT VALIDITY: It explains how well an empirical indicator and the conceptual definition of the construct that the indicator is supposed to measure “fit together”. It refers to how well the conceptual and operational definitions mesh with each other. The better the fit, the greater the measurement validity.

Validity is part of a dynamic process that grows by accumulating evidence over time, and without it, all measures become meaningless.


  • Qualitative research is more interested in the idea of validity referring to authenticity rather than truth fidelity alone.
  • AUTHENTICITY means giving a fair, honest and balanced account of social life from the viewpoint of someone who lives it every day.
  • Most qualitative researchers concentrate on capturing the inside view and providing a detailed account of how those being studied understand events.
  • Qualitative researchers try to create a tight fit between ideas and statements about the social world and what is actually occurring in it.


A validity coefficient is a correlation between test score and criterion measure. Because it provides a single numerical index of test validity, it is commonly used in test manuals to report the validity of a test against each criterion for which data are available.

The data used in computing validity coefficients are usually tabulated in expectancy tables or expectancy charts. Such tables provide a convenient way to show what the validity coefficient means for the person tested. It is called expectancy since it gives a probability that an individual who obtains a certain score on the test will attain a specified level of criterion performance.
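A rough sketch of both ideas: a validity coefficient as the correlation between test score and criterion measure, and an expectancy table giving, for each band of test scores, the proportion of persons who reached a chosen criterion level. The records, the band width of 10 and the criterion cut-off of 3.0 are all hypothetical.

```python
import math
from collections import defaultdict


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def expectancy_table(records, criterion_cutoff, band_width=10):
    """Map each score band (by lower bound) to the proportion meeting the cutoff."""
    counts = defaultdict(lambda: [0, 0])  # band -> [number successful, total]
    for score, criterion in records:
        band = (score // band_width) * band_width
        counts[band][1] += 1
        if criterion >= criterion_cutoff:
            counts[band][0] += 1
    return {band: ok / total for band, (ok, total) in sorted(counts.items())}


# Hypothetical (test score, criterion rating) pairs for twelve persons.
records = [
    (35, 2.1), (38, 2.4), (42, 2.8), (44, 2.5), (48, 3.2), (51, 2.9),
    (55, 3.4), (58, 3.6), (61, 3.3), (63, 3.8), (67, 3.9), (72, 4.2),
]

validity_coefficient = pearson_r([s for s, _ in records], [c for _, c in records])
table = expectancy_table(records, criterion_cutoff=3.0)

print(round(validity_coefficient, 2))
for band, proportion in table.items():
    print(f"{band}-{band + 9}: {proportion:.2f}")
```

Read row by row, the table gives the chance that a person with a score in that band attains the criterion level, which is the sense in which it expresses an expectancy.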


Following are the factors affecting validity:

NATURE OF THE GROUP: The same test may measure different functions when given to individuals who differ in age, sex, educational qualification, or any other relevant characteristic. This affects the validity of a test.

SAMPLE HETEROGENEITY: Since validity is reported in terms of a correlation coefficient, the heterogeneity of the sample needs to be checked in every context. The wider the range of scores, the higher the validity coefficient will be. Therefore, it is important that the sample selected is sufficiently heterogeneous.

PRE-SELECTION OF THE SAMPLE: Pre-selection is likely to cause fluctuation in validity measures, since selecting a sample beforehand is likely to restrict the scope of validation to a select few.

FORM OF THE RELATIONSHIP: The form of the relationship between test scores and the criterion, as seen in a bivariate distribution or scatter diagram, is likely to give different values for the validity coefficient, because the correlation coefficient assumes that the relationship is linear.


Measurement reliability and validity both need to be present and consistent with each other for research to be accurate, authentic and empirical. It is important to note that a reliable measure may or may not be valid, but a valid measure is necessarily reliable. Reliability is therefore necessary for validity, because an unreliable measure will certainly be invalid: if people receive different scores on the same test every time they take it, the test is not likely to predict anything. Reliability is not sufficient, however; even if a test is a reliable measure, it need not be a valid one.
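One way to make the dependence of validity on reliability concrete is the classical test theory result that a validity coefficient cannot exceed the square root of the product of the reliabilities of the test and of the criterion. This bound is brought in here only as a supporting illustration; the function name and the reliabilities used are hypothetical.

```python
import math


def max_validity(test_reliability, criterion_reliability=1.0):
    """Upper limit on a validity coefficient given the two reliabilities."""
    return math.sqrt(test_reliability * criterion_reliability)


# The ceiling on validity falls as the reliability of the test falls.
for reliability in (0.9, 0.5, 0.1, 0.0):
    print(reliability, round(max_validity(reliability), 2))
```

A test with zero reliability has a ceiling of zero, so it cannot be valid. A high ceiling, on the other hand, guarantees nothing: the actual validity may still be low.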


  • An individual raw score on a test has no meaning by itself. It is evaluated by comparing it with the scores obtained by other people.
  • Norms are indicative of normal or average performance, and they indicate how far a score is superior or inferior to that average.
  • During standardization, the test is administered to a large representative sample of the type of people for whom it is designed.
  • Norms are therefore the basis for comparison. They help in comparing an individual’s performance with that of other people of the same group.



Percentile scores are expressed in terms of the percentage of persons in a standardization sample who fall below a given score. Percentiles should not be confused with percentage scores: percentiles are derived scores expressed in terms of the percentage of persons, whereas a percentage score is a raw score.

A related but distinct concept is that of the Percentile Rank (PR). A percentile is the score in a standardization sample below which a specific percentage of cases fall; it can take any value on the score scale, since it is a score. A percentile rank, on the other hand, is the percentage of cases falling below a given score, so it cannot exceed 100.
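The sketch below follows the common convention in which the percentile rank of a score is the percentage of persons in the standardization sample who score below it, and contrasts it with a raw percentage score. The sample of ten scores and the 80-item test length are invented for illustration.

```python
def percentile_rank(score, sample):
    """Percentage of persons in the sample scoring below the given score."""
    below = sum(1 for s in sample if s < score)
    return 100.0 * below / len(sample)


def percentage_score(raw_score, number_of_items):
    """Raw score expressed as a percentage of items answered correctly."""
    return 100.0 * raw_score / number_of_items


# Hypothetical standardization sample of ten persons on an 80-item test.
standardization_sample = [42, 47, 51, 53, 55, 58, 60, 62, 66, 71]

print(percentile_rank(58, standardization_sample))  # 50.0
print(percentage_score(58, 80))                     # 72.5
```

The same raw score of 58 thus corresponds to 72.5 percent of items correct but to a percentile rank of only 50, because the first figure refers to items and the second to persons.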


Decile scores rest on the same principle as percentiles, but instead of marking off one-hundredth parts of a distribution, they mark off one-tenth parts of the group (N/10). A “Decile Rank” signifies a range of scores between two dividing points. Decile scores are generally used where the number of scores in a distribution is small and percentiles cannot be used.


Standard scores express the individual’s distance from the mean in terms of the standard deviation of the distribution. There are three subtypes of standard scores. These are:


Z Scores are a variant of standard scores with mean holding a value of 0 and standard deviation equivalent to 1.


The term T score was suggested by McCall. Here, the mean is set at 50 and the standard deviation at 10.


Under stanine scores, the standard population is divided into nine groups. The underlying basis for obtaining stanines is that the normal distribution is divided into nine intervals, each 0.5 standard deviation wide except the two end intervals, which are open-ended. The scale is therefore known as stanine, or Standard Nine.
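The three standard scores described above can be sketched as simple conversions from a raw score. The function names and the example norms (mean 50, SD 10) are assumptions for illustration; the stanine rule treats the middle seven stanines as half-SD bands centred on the mean and assigns everything beyond them to stanines 1 and 9.

```python
import math


def z_score(raw, mean, sd):
    """Distance of a raw score from the mean in standard deviation units."""
    return (raw - mean) / sd


def t_score(z):
    """T score: standard score with mean 50 and standard deviation 10."""
    return 50.0 + 10.0 * z


def stanine(z):
    """Stanine from 1 to 9: half-SD bands around the mean, open at both ends."""
    return max(1, min(9, math.floor(2.0 * z + 5.5)))


# Hypothetical norms: mean 50 and SD 10 in the standardization sample.
z = z_score(65, mean=50, sd=10)
print(z, t_score(z), stanine(z))  # 1.5 65.0 8
```

A person at the mean receives z = 0, T = 50 and stanine 5 under all three conventions.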


The concept of mental age was first given by Binet. He defined mental age as a quantitative unit that could be computed by summing an individual’s basal years and the additional credits obtained. The basal year is the age level at which the subject passes all the items. The concept of Deviation IQ was given by Wechsler; it expresses how far the subject’s score deviates from the mean. It has a mean of 100 and a standard deviation of 15, so that IQ scores are expressed as deviations from the average.


For any psychological instrument (especially tests) or piece of research, the psychometric properties are extremely important. Reliability, validity and norms act as guides that make the data and information obtained by a researcher objective. Reliability and validity give an index of an instrument’s consistency and authenticity. Any psychological instrument or research that is not reliable or valid therefore holds little importance in the discipline; both are needed for an instrument or a study to be acknowledged as empirical and authentic. Norms help in comparative analysis. They facilitate comparison of an individual with his or her group on common grounds. Without norms, raw scores alone have little meaning. Without a holistic view, comparisons cannot be made, and misconstrued or skewed data may be obtained. Thus, for any research or psychological instrument to be recognized as scientific and empirical, it is important that these psychometric properties are well defined and certify the instrument as acceptable.


