1. ## Need a statistics question answered

For all you stats lovers, I'm trying to decide whether the mode or the mean is the more appropriate measure for an assessment.

Here's the situation: students are evaluated by a number of methods and I'm trying to compare the data for an overall assessment. One method gives scores from multiple sources; for example, Question A is evaluated by 5 different sources and the student receives one score of 3 and four scores of 4. The mode is obviously 4 and the mean is 3.8. The benchmark for the trait is 4 (they are seniors, after all). The mean clearly skews below the benchmark, yet the student did very well. I'm not interested in individual students' outcomes for the program assessment and have to look at the trait for all the students in the cohort. Some students have four evaluators for this trait, some five, some six and so forth, so I think the mean is skewed and the mode would be better. The mean puts us below the benchmark on every trait, but the mode means we meet the benchmark (99% of students will meet the benchmark). I can do this and make it look good, but I'm not certain of the validity of the data (especially since some evaluators will give a student all 4s regardless of performance).

So, should I calculate a mode for each student for each trait and then a mode for the cohort, or calculate a mode for each student for each trait and take the mean for the cohort, or take a mean for each student for each trait and then a mean for the cohort?

All opinions will be given due consideration.
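The three aggregation options above can be sketched on made-up data. A minimal sketch, assuming a hypothetical cohort (student names, scores, and evaluator counts are invented for illustration); note how the non-uniqueness of the mode surfaces immediately:

```python
from statistics import mean, multimode

# Hypothetical cohort: each student has a list of evaluator scores (1-4),
# with varying numbers of evaluators, as described in the post.
cohort = {
    "student_a": [3, 4, 4, 4, 4],   # one 3, four 4s -> mode 4, mean 3.8
    "student_b": [3, 3, 4, 4],      # tie: both 3 and 4 are modes
    "student_c": [2, 3, 4, 4, 4, 4],
}

# Option 1: mode per student, then a mode over the cohort.
# multimode() can return several values -- the mode is not unique.
student_modes = [multimode(scores) for scores in cohort.values()]

# Option 3: mean per student, then a mean over the cohort.
# (Option 2, a mean of the per-student modes, is only well-defined
# when every student has a unique mode.)
student_means = [mean(scores) for scores in cohort.values()]
cohort_mean = mean(student_means)

print(student_modes)            # student_b's mode is ambiguous
print(round(cohort_mean, 2))
```

The ambiguous mode for student_b is exactly the "mode isn't unique" headache raised in the next post.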

2. If the max score is a 4 and the benchmark is also a 4, using the mean doesn't seem like a good idea because, as you note, you won't meet the benchmark very often. The mode seems reasonable, but it is less commonly used, so it may cause some confusion among evaluators. The median might be a better alternative, just in terms of not causing confusion (depending on your audience). Also, the mode isn't unique, so that may cause some headaches when crunching the numbers.

Rather than using the mode, have you considered just calculating the % who meet the benchmark?
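The percent-meeting-benchmark idea is straightforward to compute. A sketch on hypothetical data (the score lists and the 3.5 cutoff are invented; the thread later settles on a per-student mean compared to a cutoff, which is what this shows):

```python
from statistics import mean

# Hypothetical per-student evaluator scores on the 4-point scale.
students = {
    "s1": [3, 4, 4, 4, 4],
    "s2": [4, 4, 4, 4],
    "s3": [2, 3, 3, 4, 4],
}
cutoff = 3.5  # a hypothetical per-student benchmark below the scale max

# Count students whose mean score meets or exceeds the cutoff.
met = sum(mean(scores) >= cutoff for scores in students.values())
pct_met = 100 * met / len(students)
print(f"{pct_met:.0f}% of students met the benchmark")  # s1 and s2 do
```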

3. Originally Posted by jeffisjeff
If the max score is a 4 and the benchmark is also a 4, using the mean doesn't seem like a good idea because, as you note, you won't meet the benchmark very often. The mode seems reasonable, but it is less commonly used, so it may cause some confusion among evaluators. The median might be a better alternative, just in terms of not causing confusion (depending on your audience). Also, the mode isn't unique, so that may cause some headaches when crunching the numbers.

Rather than using the mode, have you considered just calculating the % who meet the benchmark?
That's what I did first, but the % was really low for the seniors. The evaluators just score the students on a 4-point Likert scale so they don't have to worry about the results. I then looked at the individual evaluations and averaged the scores. If a student received one 3 and four 4s, the average is 3.8, which is less than 4 even though 80% of the evaluators thought the student was a 4 for the trait. If I use the averages, which I did, the % of students who met the benchmark is significantly less than the benchmark of 99%. Obviously, I can lower the benchmark %, but I'll have to justify why we are willing to settle for less.

If I used the mode for that student, their value would be 4, which would raise the overall % of students who meet the benchmark. That looks good, but I'm not certain it's valid. I thought everything was working well when I was entering data into the outcome assessment grid until I got to the seniors. We were great for the sophomores and juniors and totally failed with the seniors, even though most got more 4s for a trait than 3s or the rare 2. At that point, I went straight for alcohol and watched American Ninja Warriors.

My advisory board meeting is today and I'm going to have to tell them the annual report will be a bit late. Like a month. Or more.

4. I think you must be very smart, because I don't have the faintest idea what you are talking about...

5. If you are trying to make it look like the benchmarks are met, then use the statistic that shows that.

But if you are trying to have an accurate picture, and multiple other statistics are showing something different, then the mode probably isn't the best measure.

6. Originally Posted by rfisher
If a student received 1 3 and 4 4s,
It's always tough when you have to objectify something subjective. Breaking it down to binary might help. I'm inclined to think that if the majority of evaluators give a 4, that student should be classified as meeting the standard. If that condition isn't met, they don't.

The other option would be to toss the outliers. The lone evaluator who gave the student a 3 is dragging the mean below the benchmark in your example.
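Both ideas in this post, a binary majority rule and trimming outlying scores, can be sketched. The function names and the trim-one-from-each-end choice are my own assumptions, not anything prescribed in the thread:

```python
from statistics import mean

def meets_standard(scores, passing=4):
    """Binary rule: the student meets the standard if a strict
    majority of evaluators assigned the passing score."""
    return sum(s >= passing for s in scores) > len(scores) / 2

def trimmed_mean(scores):
    """Drop the single lowest and single highest score before
    averaging -- one simple way to blunt an outlier evaluator."""
    if len(scores) < 3:
        return mean(scores)
    return mean(sorted(scores)[1:-1])

print(meets_standard([3, 4, 4, 4, 4]))  # four of five gave a 4
print(trimmed_mean([3, 4, 4, 4, 4]))    # drops the 3 and one 4, averages [4, 4, 4]
```

Trimming is standard practice in judged scoring (e.g. dropping high and low marks), but with only 4-6 evaluators it discards a large share of the data, which is worth weighing.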

7. Originally Posted by rfisher
That's what I did first, but the % was really low for the seniors. The evaluators just score the students on a 4-point Likert scale so they don't have to worry about the results. I then looked at the individual evaluations and averaged the scores. If a student received one 3 and four 4s, the average is 3.8, which is less than 4 even though 80% of the evaluators thought the student was a 4 for the trait. If I use the averages, which I did, the % of students who met the benchmark is significantly less than the benchmark of 99%. Obviously, I can lower the benchmark %, but I'll have to justify why we are willing to settle for less.
I wouldn't say that reducing the benchmark is settling for less. I would say that your previous benchmark (which was simply a 4) was not well-defined. Here is what I understand: You have a 4 point scale and believe senior students should meet a 4 in order to be qualified (or whatever). You have five evaluators for each student. What if all but one gives the student a 4? Is that student considered qualified? If the answer is yes, then you could argue that your benchmark for assessing the students shouldn't just be a single number since you have multiple evaluators. But first you need to decide whether the answer is yes.

If the rating scale is 1 to 4 and there are multiple evaluators, maybe it doesn't make sense to set the benchmark to 4. Rather, you could set the benchmark as: X% of evaluators rate the student a 4. Or, if you want to stick with just one number for the benchmark, perhaps set the benchmark so that, in order to achieve the benchmark, a majority of evaluators need to rate the student as a 4, with the rest assigning a 3. For example, if there are five evaluators, then set the benchmark equal to 3.6, which implies that three out of the five evaluators gave the student a 4, while the other two gave the student a 3.
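That 3.6 figure generalizes to other evaluator counts. A small sketch (the function name is mine) computing the smallest mean consistent with "a bare majority give 4s and the rest give 3s":

```python
from statistics import mean

def majority_4_benchmark(n_evaluators):
    """Smallest per-student mean achievable when a bare majority of
    evaluators give a 4 and the remaining evaluators give a 3."""
    majority = n_evaluators // 2 + 1
    scores = [4] * majority + [3] * (n_evaluators - majority)
    return mean(scores)

for n in (4, 5, 6):
    print(n, "evaluators ->", round(majority_4_benchmark(n), 2))
# With five evaluators this works out to 3.6, matching the example above.
```

One wrinkle the thread raises: evaluator counts vary by student (4-6), so a single numeric benchmark like 3.6 implies slightly different vote splits for different students.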

8. The evaluators just score the students on a 4-point Likert scale
I just noticed you said this. Having your benchmark also be the maximum score doesn't really give much valid feedback. It basically leaves no room for anything but perfect, which is unrealistic.

9. Can a four-point scale even be treated as quasi-interval so that means can be calculated? I'm not 100% sure, since most of the studies I've been involved in used either 5 or 7-point Likert scales.

10. Originally Posted by Zemgirl
Can a four-point scale even be treated as quasi-interval so that means can be calculated? I'm not 100% sure, since most of the studies I've been involved in used either 5 or 7-point Likert scales.
I believe the Likert scale is considered (by some, anyway) to be interval, rather than ordinal, under the assumption that the gaps between the categories are equal. I think that is usually pretty hard to guarantee, but it doesn't stop people from calculating means, etc., using Likert scale data.

11. My Dean just said one workaround was to change the names of the rubric levels from capstone, milestone, introductory (used by the University) to unacceptable, novice, acceptable and expert, and we could use acceptable for the benchmarks for the juniors and seniors, thereby avoiding the expert dilemma. Which will work, except it's hard to show progression between juniors and seniors if they have the same benchmark.

I hate assessments. Particularly when some of the evaluators give the same scores to all students regardless of actual performance, but I have to try to identify how students are really doing, analyze the data, develop an action plan to address the issues and demonstrate next year that we did so. We just keep going in the same circle.

I just saw jeffisjeff's post. That could work. I need to ponder the data.

12. Originally Posted by jeffisjeff
I believe the Likert scale is considered (by some, anyway) to be interval, rather than ordinal, under the assumption that the gaps between the categories are equal. I think that is usually pretty hard to guarantee, but it doesn't stop people from calculating means, etc., using Likert scale data.
I don't mean Likert scales in general; they are considered quasi-interval, though some people do indeed consider them ordinal. I meant specifically a situation like this, with four scale points. I am not sure you can consider a four-point scale quasi-interval.

13. Originally Posted by jeffisjeff
I believe the Likert scale is considered (by some, anyway) to be interval, rather than ordinal, under the assumption that the gaps between the categories are equal. I think that is usually pretty hard to guarantee, but it doesn't stop people from calculating means, etc., using Likert scale data.
I think there is a pretty persistent argument about whether Likert scales are ordinal or interval.

Originally Posted by rfisher
I hate assessments.
Not to be rude, but this one kinda seems poorly designed.

Or the problem is that it isn't showing what you want it to show. In which case, that isn't the assessment's fault.

14. Originally Posted by Zemgirl
I don't mean Likert scales in general; they are considered quasi-interval, though some people do indeed consider them ordinal. I meant specifically a situation like this, with four scale points. I am not sure you can consider a four-point scale quasi-interval.
Oh. I am not sure, but I don't think a Likert scale has to have 5 or 7 (or any odd number of) levels; that is just the most common approach (so as to have a neutral middle option). I don't think the number of points (4 vs 5) matters in determining whether one can take means or not.

15. I thought you could only take a mean of the combined data (the Likert scale), but median/mode of the individual questions (Likert-type items)?

Odd or even doesn't make a difference there. Many researchers will purposefully choose an even scale to force a decision.

16. I agree it's not a matter of odd/even. But my understanding is that you need some minimal number of scale points for a scale to be considered quasi-interval. I'm just not 100% certain what that number is.

17. or calculate a mode for each student for each trait and take the mean for the cohort
I think this would give the illusion of being more precise than it is. I'm not crazy about a mean of modes.

Or, if you want to stick with just one number for the benchmark, perhaps set the benchmark so that, in order to achieve the benchmark, a majority of evaluators need to rate the student as a 4, with the rest assigning a 3. For example, if there are five evaluators, then set the benchmark equal to 3.6, which implies that three out of the five evaluators gave the student a 4, while the other two gave the student a 3.
This is very similar to what I was going to suggest.

One problem with using the mode is that it does not distinguish the student who has a mode of 4 with three 4s and two 1s from another student with a mode of 4 but with three 4s and two 3s. Using a mean as your benchmark for each student, as jeffisjeff suggested, seems to be the best solution. So if, for example, the benchmark is 3.6, the only way a student could get (exactly) that would be with either three 4s and two 3s, or four 4s and one 2. And if you want a tougher benchmark, such as 3.8, the only way the student can get (exactly) that is with four 4s and a 3.

Then, at the summary level, you can report the % of students who met the benchmark of 3.6 or 3.8, or whatever seems to be the most appropriate number.

ETA: One thing I think you should consider is what truly is most important: the average or the mode. This might sound obvious, but think about some of the combinations of scores. For example, a student with a mode of 3 could actually have a higher mean than a student with a mode of 4. That is, a student with scores of 33344 has a mode of 3 but a mean of 3.4. Another student with scores of 11444 has a mode of 4 but a mean of only 2.8. Which one is closer to meeting the intent of the benchmark? That might clue you in to the best method. (Based on what you've said so far, though, I do think a mean benchmark for each student, then a summary score of the % of students who hit the mean, will work best, especially if the benchmark mean is quite high, since the higher the mean benchmark, the less you have to worry about the mean differing from the mode.)

18. Originally Posted by Skittl1321
I think there is a pretty persistent argument about whether Likert scales are ordinal or interval.

Not to be rude, but this one kinda seems poorly designed.

Or the problem is that it isn't showing what you want it to show. In which case, that isn't the assessment's fault.
Oh, no offence. This is a program-wide problem in medical imaging education. Someone is even doing a master's thesis on designing a quality clinical instructor evaluation in order to collect data on student learning outcomes. We have program effectiveness data, but that's not the same as student learning outcomes. The clinical instructors tend to just check the same box on the form, which doesn't tell you anything. We've figured out what we're going to attempt to do, but consider that there are 14-24 students in a cohort, they get evals from 1-5 different instructors, and results are easily impacted by 1 student or 1 instructor. I don't use this as the only assessment outcome tool. The biggest problem is I have to satisfy both accreditation agency outcomes and university outcomes, which can be very different even though the data set is the same and can't be changed. So, I'm trying to do a statistical analysis on a very limited data set with the inherent problems of trying to get objective results from subjective data.

We're sticking with the average I did of each student's scores and then doing a straight percentage of students who met or exceeded the benchmark, which we're setting at 3.5 for a senior student. I'll be adjusting the benchmarks for the sophomores and juniors as well. Of course, the bigger problem in the assessment is what we intend to do with the students who don't meet the benchmark. I have to give a detailed report on how we intend to address that because ideally they should all score 4, which is impossible. But my accreditation agency wants to know why they didn't and what we plan to do about it before we release them on the public as radiographers. No idea what I'm going to say there, but I'll think of something this week.

But my Dean loved my new quantitative literacy, problem solving and intercultural competency rubrics that will be used next year. I'm still trying to explain the difference between quantitative literacy and problem solving to my faculty. It would be easiest to just use my courses for assessments, but I teach senior level classes and only one junior level, and it's physics. I'm making everybody use the same rubric for oral and written communication and critical analysis as well. The Dean and I spent 15 minutes explaining that the literature review being used by the research methods class was not critical thinking, in spite of the instructor's stubborn insistence that it was, because she doesn't want to have to change anything.
