
The Evolution of the NIH Toolbox with Richard Gershon, PhD

As the principal investigator of The NIH Toolbox®, Richard Gershon, PhD, has led a team of hundreds of scientists to develop and evolve state-of-the-art digital assessments of cognition, motor function, sensation and emotion, which have been used by clinicians, investigators and academics for nearly two decades.

In this episode, Gershon explains how the third version of The NIH Toolbox® app, developed for the iPad, provides test batteries for cognition, motor, emotional and sensory functioning in individuals aged three to 85 and may soon be used in infants as young as one month old.

“It’s exciting seeing that these tools will have a life that will…outlive even my academic career, that will have a real use and really benefit kids and adults.” - Richard Gershon, PhD 

  • Chief of Outcome and Measurement Science in the Department of Medical Social Sciences 
  • Professor of Medical Social Sciences in the Division of Outcome and Measurement Science 
  • Professor of Preventive Medicine in the Division of Health and Biomedical Informatics 
  • Member of Northwestern University Clinical and Translational Sciences Institute (NUCATS) 

Episode Notes 

The NIH Toolbox®, initially designed for research but validated for clinical use, enhances efficiency, reliability and accessibility in assessment testing. Its applications include cognitive assessments in Long COVID patients, early detection of cognitive decline in Alzheimer's disease, assessing attention and memory in infants and large-scale studies on environmental impacts on child cognitive development.  

  • The NIH Toolbox® for Assessment of Neurological Behavior and Function, a project funded by multiple institutes within NIH, was designed to provide a common measurement system that could be used across a variety of studies. It led to the creation of fast, separate test batteries for cognition, motor, emotional and sensory functioning, originally intended for research purposes but eventually validated for clinical use due to their reliability. 
  • The advanced technology used by the NIH Toolbox® enhances the efficiency and reliability of testing. It also facilitates the application of computer adaptive testing, which zeroes in on an individual's ability level, significantly reducing the testing time across numerous areas. 
  • Gershon, along with a large group of global experts, developed the NIH Toolbox® over five or six years, initially as a web-based product but later transitioned it to an iPad version for portability. The NIH Toolbox®, now in its version 3.0, is used in over 1,100 institutions. 
  • The use of the NIH Toolbox® varies depending on purpose, with most users opting for age-related cognitive assessments under the supervision of a test administrator. This digital assessment platform automates most of the process, reducing the impact of the test administrator, with features like an automatic generation of reports, which can then be uploaded directly into the test-taker's electronic health record system. 
  • While the primary users of the NIH Toolbox® are currently investigators, clinicians are expected to become the median users in the near future. The Toolbox® is being used in research to track the health and cognitive impacts of chemotherapy on children and the effect of environmental toxins on child cognitive development.
  • Igor Koralnik, MD, at Northwestern Medicine, is using the NIH Toolbox® to assess cognitive function in patients with Long COVID, adjusting results for age, education and other demographics. Depending on the extent of cognitive deficits, patients may be sent for cognitive rehabilitation, with retests conducted annually to track changes in cognitive functioning. 
  • The NIH Toolbox® is also being used to assess Alzheimer’s disease. Gershon and his collaborator, Sandra Weintraub, PhD, have validated the NIH Toolbox® for use in detecting cognitive decline and dementia. A few tests from the Toolbox® can reliably indicate if a person's cognitive functioning is deteriorating more rapidly than normal aging, potentially flagging the need for further diagnostic procedures. 
  • The NIH Toolbox® is leveraging technologies like LiDAR to assess attention, memory, and motor skills in babies, helping detect early need for interventions, with the ongoing project also undertaking a norming process to establish typical functioning ranges. 
  • Gershon highlights the tool's efficiency, self-scoring capability, and affordability, with minimal training required for administrators, making it an economical choice for large-scale studies. 
  • The NIH Toolbox®, currently translated into about 10 languages and designed to be largely language-agnostic, is already being utilized globally, including in low and middle-income countries, with potential for further expansion to help children and adults worldwide through native language adaptations. 

Recorded on May 24, 2023.

[00:00:00] Erin Spain, MS: This is Breakthroughs, a podcast from Northwestern University Feinberg School of Medicine. I'm Erin Spain, host of the show. Assessing someone's behavioral and neurological function is faster and more convenient than ever before with technology developed by Northwestern Medicine investigators and funded by the National Institutes of Health. Here to discuss the NIH Toolbox®, which is in its third version on an iPad app, and how it's being used in people ages 3 to 85, is Dr. Richard Gershon, the Vice Chair of Research and Professor in the Department of Medical Social Sciences, and a Professor of Preventive Medicine in the Division of Health and Biomedical Informatics at Northwestern University Feinberg School of Medicine. He is the longtime principal investigator of the NIH Toolbox® Project and joins me today. Welcome to the show. 

[00:01:05] Richard Gershon, PhD: Thanks, Erin, for having me. 

[00:01:06] Erin Spain, MS: Tell me about your background and how your career has really increasingly focused on the development of these modern assessment tools such as the NIH Toolbox®. 

[00:01:16] Richard Gershon, PhD: Since graduate school, I was fascinated with how we could modernize assessment and apply technology. And I really was a little ahead of my time because we didn't have laptop computers, or the first laptop computer I used was $12,000 and the screen color was pink and red. And it seemed to me that we could improve assessment by using a computer. Originally I did it actually for what's now the Shirley Ryan Institute back when I was a graduate student working with Dr. Allen Heinemann there. And they needed a way to assess patients who are bedbound. And unfortunately, for some period of time, they may not have use of anything but a finger or other ways of doing that. So I put surveys on a laptop, which at the time was unheard of because a laptop you could only run a spreadsheet or a word processor. So we faked a spreadsheet to look like a survey tool and that led to a very long... what's been my entire career of working at the intersection of technology and assessment across a really broad swath of areas.  You know, my work at Feinberg for over 10 years at this point, has been very much based on either the NIH Toolbox® in various increments from mobile phones to using it with babies we're coming out with next year, and then with the Patient Reported Outcome Measurement Information System, which is a project that's headed up by Dr. David Cella. I've worked with him and that's to assess outcomes in healthcare. So I've always been playing in that space. We don't want the technology to be an impediment. We want it to be a tool. 

[00:02:42] Erin Spain, MS: This was something you always thought that could be improved on or could be made better, assessing folks. Explain to the audience what exactly is the NIH Toolbox®.  

[00:02:52] Richard Gershon, PhD: So the NIH Toolbox® for Assessment of Neurological Behavior and Function, is the official long name. It's a contract issued by the NIH Neuroscience Blueprint, and those are 13 institutes at the NIH that all do neuroscience research, and at the time they tasked each other to come up with a fund of projects that they could all benefit from. So this was funded by the National Institute of Neurological Disorders and Stroke, as well as the National Institute on Aging, as well as the National Institute of Child Health and Development. They all got together and said, we need a common measurement system that all studies could use. They called it a common currency, because people were doing studies and showed that a person has better executive functioning here or worse there. But there were two different tools being used by different researchers. And they're not fully comparable. So, ultimately we created separate test batteries for cognition, motor, emotional, and sensory functioning, all under 30 minutes for each of those, and individual tests range between one and four minutes. So very, very fast. We originally created it for research purposes, because how could you test something in one to four minutes and use it clinically? But what we found over the course of the last decade is people were using it clinically and have validated it for clinical use because the reliability we get is the same as tests that are much, much longer. I think computers are tools, and any process that repeats a computer can do it faster, and for things like the NIH Toolbox®, what it can do better is it can automate timing. It can do millisecond or multi millisecond level timing. So when we're assessing executive functioning skills and how fast does the brain respond to stimuli, it used to be use a stopwatch and you watch a person, but actually the person watching the person, they hit the stopwatch more slowly. So we have this unreliability. 
The computer can literally record exactly what's happening. And we can use it to do something that I really, where I came out of my specialty as a graduate student at Northwestern in the psychology department is computer adaptive testing. We literally zero in on a person's ability level. So we've applied that in the NIH Toolbox® for things like PROMIS. Our vocabulary test actually, it's a computer adaptive test and in 21 items gives you a clinically valid assessment. That's three or four minutes of vocabulary, and we don't know where you started. You could be a three-year-old, you could be a PhD in English. In three or four minutes we're there. On a paper test that corresponds to someone picking the right test and then it taking half an hour to an hour. So we do it in four minutes, and we apply that to lots of areas to make the testing much faster.  
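The adaptive "zeroing in" described above can be illustrated with a minimal sketch. It is not the NIH Toolbox® scoring engine: the Rasch (one-parameter) response model, the standard normal prior, the posterior-mean ability estimate and the evenly spaced item bank are all assumptions chosen for illustration, and the names `run_adaptive_test`, `eap_estimate` and `p_correct` are hypothetical. Only the 21-item test length comes from the interview.

```python
import math


def p_correct(theta, b):
    """Rasch model: chance that ability `theta` answers an item of difficulty `b` correctly."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def eap_estimate(responses, grid):
    """Posterior-mean ability from (difficulty, correct) pairs, assuming a N(0, 1) prior."""
    weights = []
    for theta in grid:
        w = math.exp(-0.5 * theta * theta)  # unnormalized standard normal prior
        for b, correct in responses:
            p = p_correct(theta, b)
            w *= p if correct else (1.0 - p)
        weights.append(w)
    total = sum(weights)
    return sum(t * w for t, w in zip(grid, weights)) / total


def run_adaptive_test(item_bank, answer_fn, max_items=21):
    """Give up to `max_items` items, always choosing the unused item nearest the current estimate."""
    grid = [x / 10.0 for x in range(-40, 41)]  # candidate ability values from -4.0 to 4.0
    theta = 0.0  # start in the middle: nothing is known about the test-taker yet
    remaining = list(item_bank)
    responses = []
    for _ in range(min(max_items, len(remaining))):
        # Under the Rasch model the most informative item has difficulty closest to theta.
        item = min(remaining, key=lambda b: abs(b - theta))
        remaining.remove(item)
        responses.append((item, answer_fn(item)))
        theta = eap_estimate(responses, grid)
    return theta, responses


if __name__ == "__main__":
    # Hypothetical bank of 33 item difficulties and a simulated test-taker
    # who gets every item easier than 1.5 right and every harder item wrong.
    bank = [x / 4.0 for x in range(-16, 17)]
    estimate, given = run_adaptive_test(bank, lambda b: b < 1.5)
    print(f"items given: {len(given)}, estimated ability: {estimate:.2f}")
```

Each answer moves the estimate, and the next item is drawn from near the new estimate, so the test spends its items where they are most informative instead of walking through a fixed form from easiest to hardest.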

[00:05:27] Erin Spain, MS: So this was about 17 years ago that you were awarded this contract, became the PI of this project. Tell me how it's developed over the years to what it is today with this version 3, it's on an iPad app, and there's even more coming. 

[00:05:41] Richard Gershon, PhD: So we originally put it together. I was gonna say I got 254 of my closest friends from around the world, and I didn't know most of these people at the time. We wanted the experts for each test. So an expert in olfaction, smell test, which actually has become relevant later on for COVID, an expert in vocabulary. We got all these people together, developed this over the course of five or six years, put it together. Originally was a web-based product. We learned later on during the course of the National Children's study, they wanted to assess a lot of children. They really needed something that was portable, so we migrated it to the iPad. Don't like necessarily doing something that's dedicated to a particular vendor, but the iPad itself is really a commercial level device. It's very consistent when it comes out. So I don't have to check that the timing works on this version versus that version. It's highly durable. And so we came out in the iPad version. It was very much based on the original web version. This version 3.0, which we just released, was in an attempt, let's fully optimize the NIH Toolbox® for use on an iPad. So we fully take advantage of the timing resolution of the iPad, the features in the iPad. So it's quick, it's easy, we operate on the least expensive iPad you can for the NIH Toolbox®. People are using it now. We're in 1100 institutions. And those are primarily research NIH funded, but we also validated the version 3, which means we confirmed that was useful in clinical populations, so clinicians could use it and also in school populations. So like for ADHD and concussion, because people were using it clinically, even without our saying it should be used that way. But now we've confirmed that it works that way as well. 

[00:07:19] Erin Spain, MS: So describe what an assessment might look like. Someone has the iPad, they bring someone into a room and sit down with them. Walk me through that. 

[00:07:27] Richard Gershon, PhD: So usually it depends on what the use is. We have very few people who give all 50 tests because you don't always need sensory tests and emotion tests, etc. Our primary usage is in cognition, and most people give pretty much a full battery that's age-related. But so a person, let's say, will sit down first, you'll take that vocabulary test I mentioned before. They'll take it under the supervision of a test administrator. This is designed for use with an administrator there. Therefore, we confirm it's the right person. And they can answer questions. We have versions of the NIH Toolbox® that are for self-assessment, but the version we just released and it's most popular is this iPad with a test administrator sitting next to you, although you do most of the work. One other feature of the Toolbox® is, and it's a problem when you give assessments, is training the people who give the tests. Very often the test results they would get on the same person would be different. And what we did with the Toolbox® is we minimized the impact of the test administrator. So while we want them there to answer questions, the iPad generally gives the instructions. I'm sitting down, see the vocabulary test. It says you're about to see some words. A little video snip how you're gonna answer them. It has a practice item, with four pictures on the screen. I think one of the first ones is "spoon." "Press the picture of spoon." There are four choices. They press it correctly, it says correct. Incorrectly, it gives them some more practice, and then it gives this computer adaptive test. We almost always know a person's age level, so we don't give you a kindergarten item if you're a 12th grader, but we zero in and then it immediately goes the next test. We have a reading test. The person gets a word on the screen. They see the word, the test administrator interprets it, did they get it right or wrong? They've actually got a wireless keyboard. 
They're keying in right or wrong. It's taken into consideration by the iPad. Okay, if they know that word. Let's give a more difficult or easier one. Things like emotional health are really self-assessed. Depending on the purpose, you could have a person taking two tests. You could have them taking 40 tests. It's all automated, it's all set up. It'll give you a printed report when it's over. It normed, how do you do relative to other 17 year olds? Or also un-normed for research purposes. And then the results can be automatically uploaded to someone's electronic health record system. 

[00:09:33] Erin Spain, MS: Tell me what happens from that point when the assessment is over, the report is generated. What happens then and how does it impact patients and kids and people of all ages? 

[00:09:43] Richard Gershon, PhD: I think I mentioned earlier it's in use in 1100 institutions, and to some degree we actually don't fully know who's using it. Part of our security scheme is it's distributed by Apple, and Apple doesn't tell someone who distributes an app who's using it. We get a lot of feedback from people. So the primary user is a researcher today. The median user next year or two years from now will be a clinician. So let's talk about both uses. Let's take an example right now: the Toolbox® is being used to track the health of people with Long COVID, and what is the impact of Long COVID on cognitive functioning. So, Dr. Igor Koralnik, who's the chief of Neuro-infectious Disease and Global Neurology here at Feinberg, has a group of people when they come into his clinic, take the cognition battery, they look at the scores. At that point, they could compare them to where they are versus others, but they then measure them a year later. Have these people improved? Have their scores begun to look like they're more normal than they were before? Because Long COVID, unfortunately, is impacting people's cognitive functioning. So there we're talking about group level effects and also looking at individuals. They wanna do both. Other things that are being used, for instance, like St. Jude's Hospital, is tracking... They give to kids when they are having chemo and then they track them every year. There's at least one study doing this there. They see, is chemo impacting their cognitive functioning? And so they're doing that to track it. And that's actually there for individuals. On a group level, other studies are like environmental child health outcomes tracking now 70,000 kids, birth through adulthood. And it turns out that kids who grew up in areas, you know, heavily exposed to toxins have on average lower cognitive functioning than people who grew up in areas without that. And lots and lots of hypotheses. 
But the trick is, old school of doing this is you had someone take a $3,000 neuropsych eval, which is too expensive, too time consuming. And with NIH Toolbox®, we can train a college graduate four to eight hours of training and four to eight hours of practice. And they can give this and get clinical level results. So again, it's a tool. Very rapid. It's self-scoring. Almost any other test system you have to score. We don't charge per subject. We charge a single fee once a year. At this point, it's based on the number of iPads that it's used on. It's about 20, probably 25 cents on the dollar. And that's just enough money for us to provide tech support. So it's very economical, particularly in research settings. 

[00:12:07] Erin Spain, MS: I want to go back to how this is being used in Long COVID at Northwestern Medicine and Dr. Koralnik. Is he actually using the NIH Toolbox® to diagnose patients with Long COVID? 

[00:12:19] Richard Gershon, PhD: Yeah, he is. A patient again comes into his clinic, they take the NIH Toolbox® cognition measures, they adjust them for age, education and any other demographics, so they get a sense of where this person's cognitive functioning should have been. Because right, they only see somebody after they've been exposed to COVID. Don't really know what their functioning was before COVID, except in rare instances. And then, if they have extreme deficits, they can immediately be sent for cognitive rehabilitation. And then as they see them annually, they go back, retest these people, and hopefully they're seeing improvements. They could also see people getting worse. And most of us are going to have pretty stable cognitive functioning from early adulthood until our sixties. And then unfortunately, some of us will all coast a little bit. And if you're coasting a lot, that's a problem. And unfortunately, some people with Long COVID are coasting down much more rapidly than we would hope. 

[00:13:13] Erin Spain, MS: Is this also being used in Alzheimer's disease assessments? 

[00:13:17] Richard Gershon, PhD: Very much so. The interesting thing with being funded by all these different institutes at the NIH is each of them has said, "Wait, wait, wait, what about using it here?" Now the majority of validation studies being done and the uses of the NIH Toolbox® are not Northwestern, which is good. I say we are making the instruments for the orchestra. We're not writing the music, we're not conducting the orchestra. And then having said all that, of course, I'm working with my collaborator, Sandy Weintraub, who's a researcher at the Mesulam Center for Cognitive Neurology and Alzheimer's Disease. And she and I have partnered on a series of projects. I should first point out, she was the head of the cognition development team for the original NIH Toolbox®. And since then we have validated the NIH Toolbox® for use in aging, super aging, Alzheimer's disease, to detect dementia, to detect cognitive decline. But in Alzheimer's disease in particular, we were able to determine that all the tests are sensitive to somebody with cognitive decline, but with giving just a couple of tests, we can, with a high level of reliability, detect that someone's cognitive functioning is deteriorating much more rapidly than they should if they're undergoing normal cognitive aging. And right now the only way to confirm some of Alzheimer's disease, I believe, is an autopsy of their brain. So we don't wanna wait for that one, or to do a PET scan, which again are thousands of dollars. But if we can do this with a couple of tests in the clinic? I'm working with several studies with Mike Wolf in preventive medicine and some clinical trials at this point, to have people take a couple of the tests from the NIH Toolbox®, literally in the waiting room and communicate to the doctor if there's a suspicion of cognitive decline or dementia. Now, we can't diagnose with this, but we can say, there's a good reason to do more exploration here. 
Versus a person saying, you know, my cognitive functioning doesn't feel well. Turns out people are not really good judges of whether or not they have dementia.  

[00:15:08] Erin Spain, MS: On the opposite end of the spectrum in young children this test can be used for screening at schools to determine if a child has special needs, if a child needs accommodations, things like that. Tell me about how this is being used in children and how you plan to even use this in babies in the future. 

[00:15:25] Richard Gershon, PhD: So the current NIH Toolbox® was originally created for 3 to 85 year olds. We have recently done work for 85 to 95, so we're sure it works up there. For younger children, it's being used to detect intellectual disabilities at early ages. I think there's some early research being done by others, can we detect kids who need early intervention? Which is a popular thing, and how do you do that inexpensively, right? Because we wanna do that for every kid, every child, before they go to school. And so the NIH Toolbox® is used there. Again, with the Environmental Child Health outcome study, a lot of research will be published in the coming years of impacts of food deserts or types of education. Frankly, the impact of COVID on child development. I mean, a lot of children, their reading levels are off, things like that. The NIH Toolbox® allows people to assess thousands of children to do that other than a study of a hundred. So it's really interesting, at the time the NIH Toolbox® came out, National Institute of Child and Health Development said, we want this to go down to age one month. And so now we're sitting out here, I think about a dozen years later, and they issued a contract request for proposals, and we responded, and we are now working on that, we're in year four. And so they want to do the types of testing for the NIH Toolbox® down to one month of age and overlapping up to four years of age. Which is good, so that it'll be a handoff, so to say. And so now we're actually really taking advantage of the iPad. I'm actually glad we waited because the latest iPads we're using, for instance, we're checking children's attention and executive functioning, even memory, for babies by having, it's called a LiDAR camera. So LiDAR is a kind of laser beam. It's in our smartphones. If you have a smartphone that opens up the camera based on your face, it's actually sending 15,000 laser beams at your face. They're invisible. 
They're not ones that are gonna hurt you, and measure your face. So we're using that technology to see where the baby is looking on the screen. So if we put two boxes on the screen and show them we're putting a toy in one box, and not in the other box. Then we put the two boxes back. The child should be looking at the box where we put the toy. Now we can't have them touch anything, they can't do it, but we're using this LiDAR and examining their pupils and the shape of their face and is it looking at the box or looking somewhere else? And we were applying that type of technology to quality of walking, quality of standing. Things that are much more of interest and of concern for babies, social functioning interactions, how does a child grasp things? And using it again, automated scoring. We sent out a survey to people who do research in the 0-3 space, 450 researchers responded. 450 people said, I need this. I want this. I need it to be inexpensive. It can't take four weeks of full-time training, which many of these batteries do. Let it score it automatically for me. Let it show me how to assess a person. And we're actually, right now, we started norming that. So norming is a process of figuring what is the normal range of functioning? Because functioning ranges a lot, and in babies, ranges even wider. So we're out there testing 3,000 one-month to 48 month olds. We're out there testing at 12 centers around the country, rural, urban, children of all socioeconomic levels, balance for race, ethnicity, and level of parental education, and giving all the tests in the baby Toolbox®. And we'll be able to tell somebody giving the test: this child's above the normal range, below the normal range, and therefore it could be used clinically again. And then for studies comparing exposures to environmental toxins, does this group do better or that group do better. And those kinda studies, we don't wanna find differences, but we need something to detect them. 
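The norming step described above can be sketched in a few lines. This is an illustration under stated assumptions, not the project's actual procedure: the age-band labels, the standard-score scale (mean 100, SD 15), the 1.5 SD cutoff and the function names `build_norms`, `standard_score` and `classify` are all hypothetical, and the sample scores are made-up numbers used only to show the arithmetic.

```python
from statistics import mean, stdev


def build_norms(sample):
    """Turn (age_band, raw_score) pairs from a normative sample into {age_band: (mean, sd)}."""
    by_band = {}
    for band, score in sample:
        by_band.setdefault(band, []).append(score)
    return {band: (mean(scores), stdev(scores)) for band, scores in by_band.items()}


def standard_score(raw, band, norms, scale_mean=100.0, scale_sd=15.0):
    """Express a raw score relative to same-age peers on an assumed 100/15 scale."""
    band_mean, band_sd = norms[band]
    return scale_mean + scale_sd * (raw - band_mean) / band_sd


def classify(score, cutoff_sd=1.5, scale_mean=100.0, scale_sd=15.0):
    """Label a standard score against a hypothetical cutoff of 1.5 SD from the mean."""
    if score < scale_mean - cutoff_sd * scale_sd:
        return "below the normal range"
    if score > scale_mean + cutoff_sd * scale_sd:
        return "above the normal range"
    return "within the normal range"


if __name__ == "__main__":
    # Made-up normative data for one hypothetical age band.
    norms = build_norms([("12-17 months", s) for s in (8, 10, 12)])
    child = standard_score(6, "12-17 months", norms)
    print(child, classify(child))
```

Because every raw score is compared only with scores from children in the same age band, the same raw result can be typical at one age and a flag for follow-up at another, which is why the normative sample has to cover the full age range.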

[00:19:01] Erin Spain, MS: Do you think there could be a global application for this in low and middle income countries someday? 

[00:19:07] Richard Gershon, PhD: Yes, it's already being realized. The NIH Toolbox® so far has been translated into about 10 languages. It's being used in research studies. Prior to our translating it even, people were using it in Africa and research studies, turning off the volume and pretending they were the voice in the computer and doing this. It's being used by pharmaceutical companies to validate the use for clinical trials for new drug studies in a lot of countries. I think we're up to three variants of Arabic, Hebrew is coming out, Mandarin. Most of our tests are really language agnostic. We did that on purpose. The tests are very narrow, and with exception, the spelling and reading tests, which are dependent on language, particularly something like reading. Most countries take US-based tests and they hand translate them. That is not a good model. So I am very hopeful that we'll have additional translations of the NIH Toolbox® out there so people can use it natively. The advantage for testing kids, we're talking about countries that have no tests for kids. We've been approached by some countries. They literally have no tests to give. And with a one-time translation effort, this iPad app can be there. I love that it could be used to help people across the world.  

[00:20:11] Erin Spain, MS: It must be exciting for you to have worked on this project for these years, to see it through to what it is today and the impact that it's having. Can you just sum that up for me? What's it been like to lead this project? 

[00:20:23] Richard Gershon, PhD: I like large projects, and I enjoy bringing together people who know more than I do about different subjects and putting them in a room. But what's really fascinating about this NIH Toolbox®, about four years ago, I was asked to come and speak in China on the clinical use of the NIH Toolbox®. Then I turned to a student and said, do we have any clinical users of the NIH Toolbox®? And she went online and found 200 research groups had published the use of the NIH Toolbox®. I didn't know. It gets back to that analogy of we build the instruments, but we don't conduct it. And so it's been very exciting to me to see it's really helping, knowing if a child with cancer their cognition's being impacted and therefore it's being fixed. I envy clinicians who have one-on-one exposure to a patient and can see that they help them. And so my intrinsic payback is years later waiting for a research study to happen when I see that this is the case. And now moving these things onto a new generation of both researchers and I have a whole new team that did all the norming and all the analysis. That's exciting too, seeing that these tools will have a life that will, I'm not worried about outliving me, but outlive even my academic career so that they have a real use and really benefit kids and adults.  

[00:21:33] Erin Spain, MS: Well, thank you so much for coming on the show and telling us about the app and how it's being used in all of its different ways, especially here at Northwestern. We're excited to see what happens now with version 3. 

[00:21:45] Richard Gershon, PhD: Thank you so much, Erin, for having me on the show. 

[00:21:47] Erin Spain, MS: Thanks for listening, and be sure to subscribe to this show on Apple Podcasts or wherever you listen to podcasts and rate and review us. Also for medical professionals, this episode of Breakthroughs is available for CME credit. Go to our website and search CME. 

Continuing Medical Education Credit

Physicians who listen to this podcast may claim continuing medical education credit after listening to an episode of this program.

Target Audience

Academic/Research, Multiple specialties

Learning Objectives

At the conclusion of this activity, participants will be able to:

  1. Identify the research interests and initiatives of Feinberg faculty.
  2. Discuss new updates in clinical and translational research.

Accreditation Statement

The Northwestern University Feinberg School of Medicine is accredited by the Accreditation Council for Continuing Medical Education (ACCME) to provide continuing medical education for physicians.

Credit Designation Statement

The Northwestern University Feinberg School of Medicine designates this Enduring Material for a maximum of 0.25 AMA PRA Category 1 Credit(s)™. Physicians should claim only the credit commensurate with the extent of their participation in the activity.

American Board of Surgery Continuous Certification Program

Successful completion of this CME activity enables the learner to earn credit toward the CME requirement(s) of the American Board of Surgery’s Continuous Certification program. It is the CME activity provider's responsibility to submit learner completion information to ACCME for the purpose of granting ABS credit.

All the relevant financial relationships for these individuals have been mitigated.

Disclosure Statement

Richard Gershon, PhD, has nothing to disclose. Course director, Robert Rosa, MD, has nothing to disclose. Planning committee member, Erin Spain, has nothing to disclose. Feinberg School of Medicine's CME Leadership and Staff have nothing to disclose: Clara J. Schroedl, MD, Medical Director of CME, Sheryl Corey, Manager of CME, Allison McCollum, Senior Program Coordinator, Katie Daley, Senior Program Coordinator, and Rhea Alexis Banks, Administrative Assistant 2.
