This article originally appeared in The Bar Examiner print edition, Spring 2019 (Vol. 88, No. 1), pp. 46–49.
By Joanne Kane, Ph.D., and April Southwick
Test development for the Multistate Bar Examination (MBE) is both a science and an art. The full process relies on the effort of multiple teams of people with different backgrounds and areas of expertise. These teams balance an array of considerations in drafting new questions, in selecting questions for future examinations, and even in making decisions regarding the precise placement of the questions onto test forms.1 This article presents a behind-the-scenes look at some of the psychometric, drafting, and editorial processes driving the construction of each new MBE.
Drafting and Reviewing Items: A Coordinated Effort by Content Experts and NCBE Staff
Each new MBE question, or “item,” is initially drafted by an expert from one of NCBE’s seven MBE drafting committees (one committee for each of the seven MBE subjects). Experts are law professors, judges, or practicing attorneys with specialized experience in the subject area covered by the drafting committee.
Each new item then receives multiple rounds of review by content experts, both within and beyond NCBE.2 All new items are reviewed by a team of two external reviewers who were not involved in the drafting process. One reviewer is generally a judge or practicing attorney who may not have any particular expertise in the subject area being reviewed, while the other is a law professor who teaches in that subject area. Care is taken to ensure that a range of jurisdictions, law schools, and demographic characteristics are represented in the drafting and review processes.3
The various reviewers are not only evaluating the accuracy of each item (i.e., whether the question presents a coherent set of facts and has one clearly best answer) but also making judgments about whether the item is realistic and appropriate for a newly licensed lawyer. Reviewers also evaluate whether the item seems potentially biased in favor of or against a group of test takers. If there is any concern that a subset of examinees (based on race, ethnicity, age, or socioeconomic background, for example) might be at a disadvantage due to unfamiliarity with a construct-irrelevant4 word or concept (meaning that the word or concept is unrelated to the lawyering skills the test is designed to measure), the item is archived or revised and resubmitted for a new review.
In addition to relying on expert knowledge in writing items and judging their appropriateness for assessment of entry-level lawyers, NCBE staff members monitor developments in testing best practices and high-stakes testing specifically. We use findings from the measurement field to shape the drafting policies and guidelines used to write items. It might surprise some people who work in fields other than measurement to learn that there are (many!) academic textbooks dedicated to research and best practices in test development.5 These texts, written by academics and practitioners with expertise in testing and measurement, summarize research conducted across testing contexts and in neighboring fields such as linguistics and cognitive psychology as well as in other professions such as medicine, accounting, and engineering.
One handbook on item development, Constructing Written Test Questions for the Basic and Clinical Sciences,6 has played a particularly large role in developing item-writing guidelines at NCBE. One of the authors of the handbook, Dr. Susan M. Case, served as Director of Testing at NCBE from 2001 through 2013. While at NCBE, Dr. Case led an initiative to formalize and propagate evidence-based best practices in item writing, and these efforts continue.
Selecting Items: A Collaborative Effort Involving Multiple Stages of Decision Making and Review
After an item is pretested on an exam,7 its statistical and psychometric properties are measured and evaluated. Among the properties reviewed are the percentage of all examinees who answered the item correctly; the percentage of top-performing examinees (as identified by their scores on the rest of the scored exam) who answered the item correctly, along with the correlation between success on the item and overall performance; and the performance of the distractor8 options, both individually and as a set. There are multiple indices for these basic properties.9 Any items with poor statistics (items that are too difficult, for example) are either archived or returned to drafting committees to be reviewed and revised. Items that pass this initial check enter the pool of “test form ready” items. But tests are not created by randomly selecting from the full pool; care must be taken to ensure that any prospective set of items satisfies multiple sets of criteria.
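To make these properties concrete, here is a minimal sketch of the classical item statistics described above: the proportion correct as a difficulty index, a point-biserial correlation as a discrimination index, and distractor choice counts. The examinee data, option labels, and thresholds are invented for illustration; this is not NCBE's actual code or procedure.

```python
# Illustrative classical item analysis; the examinee data below are
# invented and do not reflect NCBE's actual procedures or thresholds.
from collections import Counter
from statistics import mean, pstdev

# Each row: one examinee's chosen option on a pretest item, paired with
# that examinee's score on the rest of the scored exam (hypothetical).
responses = [("A", 130), ("B", 162), ("B", 158), ("C", 121),
             ("B", 170), ("D", 135), ("B", 155), ("A", 128)]
correct_option = "B"

chose = [1 if opt == correct_option else 0 for opt, _ in responses]
scores = [s for _, s in responses]

# Difficulty: proportion answering correctly (the CTT "p-value").
p_value = mean(chose)

# Discrimination: point-biserial correlation between getting the item
# right and score on the rest of the exam.
def point_biserial(item, total):
    mu, sd = mean(total), pstdev(total)
    mu1 = mean(t for i, t in zip(item, total) if i == 1)
    p = mean(item)
    return (mu1 - mu) / sd * (p / (1 - p)) ** 0.5

r_pb = point_biserial(chose, scores)

# Distractor analysis: how often each option (correct and incorrect)
# was selected.
option_counts = Counter(opt for opt, _ in responses)

print(f"p-value: {p_value:.2f}, point-biserial: {r_pb:.2f}")
print("option counts:", dict(option_counts))
```

In practice an item flagged by such statistics (for example, a very low proportion correct or a distractor chosen more often than the key) would be archived or sent back for revision, as described above.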
Selecting scored items for use on an upcoming MBE involves a collaborative effort between content experts and NCBE staff. The goal is to select a set of items that conform (individually and as a set) to both content specifications and statistical specifications. Selected items must cover a broad base of knowledge, as set out in the MBE Subject Matter Outline.10 There are multiple stages of decision making and review; after items are initially selected by NCBE’s attorney test editors for each drafting committee, the items must be reviewed and approved by the MBE program director, Testing and Research Department psychometric staff members, and drafting committee chairs. Finally, item sets are reviewed by the full drafting committee twice.
Procedures governing the selection of item sets are designed to hold psychometric characteristics, such as difficulty, constant over time. In addition to optimizing the psychometric properties of the exam, care is taken to maximize diversity in terms of content and the scenarios described. For instance, NCBE staff members work to ensure that there aren’t too many questions about car crashes—a common and salient scenario—in the interest of reducing potential confusion for examinees stemming from conflation across similar items and fact patterns.
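The idea of checking a candidate item set against both content and statistical specifications can be sketched as follows. The subject counts, difficulty values, and target range here are hypothetical placeholders; the real MBE specifications are far more detailed.

```python
# Hypothetical blueprint check; subject counts and statistical targets
# are invented, not the actual MBE specifications.
from collections import Counter

def meets_specs(items, blueprint, p_range):
    """Check a candidate item set against content and statistical specs.

    items: list of (subject, p_value) tuples for scored items.
    blueprint: dict mapping subject -> required number of items.
    p_range: (low, high) bounds on the set's mean difficulty.
    """
    counts = Counter(subject for subject, _ in items)
    if counts != Counter(blueprint):
        return False  # content coverage does not match the blueprint
    mean_p = sum(p for _, p in items) / len(items)
    return p_range[0] <= mean_p <= p_range[1]

candidate = [("contracts", 0.62), ("contracts", 0.71),
             ("torts", 0.55), ("evidence", 0.68)]
blueprint = {"contracts": 2, "torts": 1, "evidence": 1}
print(meets_specs(candidate, blueprint, (0.55, 0.70)))  # mean p = 0.64
```

A set that matched the blueprint but drifted too far in average difficulty would fail the check, mirroring the goal of holding difficulty constant across administrations.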
Placing Items: Solving an “MBE Sudoku”
Once the set of items has been selected and approved, the individual items within the set may be placed onto an MBE test template. For approximately one week each winter and summer, NCBE Testing and Research Department staff members work together to solve an “MBE Sudoku” of sorts. The task is to achieve balance in features like overall difficulty, length, number of pretest items, and content areas (including sub-specifications) across the morning and afternoon halves of the exam. Optimization procedures are used to maximize the distance between items from the same content area and sub-content area, again partially in an effort to reduce or eliminate any potential confusion that could arise from items with similar scenarios or fact patterns appearing in close proximity. Staff members make heavy use of software to assist them in this optimization process, but human judgment comes into play as well.
As with the item-selection step, one situation requiring human judgment is deciding which scenarios or fact patterns might have surface-level similarities and then maximizing the distance between such items. Going back to the car crash example, efforts would be made at the item-selection step to minimize the number of scenarios involving car crashes on a given MBE in the first place. But if multiple car crash items were selected, care would be taken to place them as far apart as possible or practicable given other considerations. The optimization tools go a long way in reaching a solution, but occasionally tradeoffs are required.
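The spacing goal can be illustrated with a toy greedy heuristic: at each position, place the item whose content area appeared least recently. The item pool and content labels are invented, and NCBE's actual optimization software is far more sophisticated than this sketch.

```python
# Toy greedy placement heuristic illustrating the spacing idea; the item
# pool is invented and this is not NCBE's actual optimization method.
def place_items(items):
    """Order items so that those sharing a content area are spread apart.

    items: list of (item_id, content_area) tuples.
    Greedy rule: at each position, pick the item whose content area
    appeared least recently (or never).
    """
    remaining = list(items)
    placed = []
    last_seen = {}  # content area -> position where it last appeared
    for pos in range(len(items)):
        # Prefer the candidate whose area was seen longest ago.
        best = max(remaining,
                   key=lambda it: pos - last_seen.get(it[1], -len(items)))
        remaining.remove(best)
        last_seen[best[1]] = pos
        placed.append(best)
    return placed

pool = [("q1", "torts"), ("q2", "torts"), ("q3", "contracts"),
        ("q4", "evidence"), ("q5", "torts"), ("q6", "contracts")]
form = place_items(pool)
print([item_id for item_id, _ in form])
```

Even this simple rule keeps same-subject items from landing next to each other; the real template-building process must additionally balance difficulty, length, pretest positions, and surface-level fact-pattern similarity, which is where the tradeoffs and human judgment described above come in.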
After items are placed on test forms, editorial staff members carefully review every item on every form, word by word, for purposes of proofreading and content review. Testing staff members perform final checks that the morning and afternoon MBE halves are complete, correct, and consistent with previous administrations.
This article has offered a behind-the-scenes look at MBE test development through both editorial and psychometric lenses. At the heart of the processes outlined here are central and overarching values, chief among them fairness. Staff members work hard to ensure that the psychometric attributes of individual items and of full tests are held steady over time in the interest of fairness. Multiple layers of expert review by law professors, judges, practicing attorneys, editors, proofreaders, and psychometricians combine to help ensure that the individual elements of the MBE, and the MBE as a whole, conform not only to NCBE policy but also to testing best practices.
1. The term test form refers to the collection of scored and unscored (pretest) items in the order in which they are presented to examinees.
2. See C. Beth Hill, “MBE Test Development: How Questions Are Written, Reviewed, and Selected for Test Administrations,” 84(3) The Bar Examiner (September 2015) 23–28.
3. Ibid.
4. To illustrate the term “construct-irrelevant,” objects like chandeliers might be more or less familiar to examinees depending on their individual backgrounds, including socioeconomic status. Familiarity with such objects is not related to measuring lawyering skills per se and would be considered construct-irrelevant in the context of the bar exam. In contrast, whether an examinee understands what a contract is and can understand a question involving contracts is directly relevant to measuring lawyering skills. Testing whether examinees know about contracts is clearly construct-relevant. A question about a contract for chandeliers might seem like fair game as long as it was not necessary to have any special knowledge of what a chandelier is or does to be able to correctly answer the question; however, the goal in creating the MBE is to make the questions as clear, straightforward, and fair as possible for all takers. The goal is to avoid the potential for a construct-irrelevant concept to decrease the likelihood that an examinee could answer a given question correctly and/or to increase the time spent on that question.
5. Representative titles include Thomas M. Haladyna and Michael C. Rodriguez, Developing and Validating Test Items, Routledge (2013); Steven M. Downing and Thomas M. Haladyna, Handbook of Test Development, Routledge (2006); and Thomas M. Haladyna, Writing Test Items to Evaluate Higher Order Thinking, Pearson (1996).
6. Susan M. Case and David B. Swanson, Constructing Written Test Questions for the Basic and Clinical Sciences, National Board of Medical Examiners (3rd ed., 2002).
7. All new items are pretested to evaluate their performance before being used as live scored items. Of the 200 questions on each MBE, 25 questions are unscored pretest items. Only the 175 live items contribute to an examinee’s score.
8. The term “distractor” refers to each of the incorrect response options on a multiple-choice item. Each four-option MBE item has three distractor options and one correct option.
9. Indices include measures based on both Classical Test Theory (CTT) and Item Response Theory (IRT).
10. The 2019 MBE Subject Matter Outline is available on the NCBE website at http://www.ncbex.org/pdfviewer/?file=%2Fdmsdocument%2F226.
Joanne Kane, Ph.D., is the Associate Director of Testing for the National Conference of Bar Examiners.
April Southwick is the Multistate Bar Examination Program Director for the National Conference of Bar Examiners.