Peer Assessment for Testing Classroom Chinese Speaking in a Japanese University: Correlations and Attitudes

By Ming Qu and Margit Krause-Ono


According to Falchikov (1995), peer assessment is a process in which a group of individuals grades their peers, and which may or may not involve a set of criteria agreed upon by teachers and students. In recent years, peer assessment has been increasingly used as an alternative method of assessment in language learning classrooms. Most researchers believe that it is not only an effective tool for encouraging learner independence and autonomy, but also allows teachers to shift their teaching methodology to more student-centered activities. Numerous studies have been conducted on peer assessment, but only a few have focused on Japanese language learners. The question therefore remains as to whether peer assessment is really an effective tool for language learning in Japan, particularly in the field of teaching Chinese as a second foreign language. In order to answer this question, this paper focuses on two points associated with peer assessment: first, how strongly peer ratings correlate with teacher ratings, and second, Japanese students’ attitudes towards peer assessment.

Problems with previous studies

Issues of correlation between peer and teacher ratings
A number of studies have examined the correlation between peer and teacher ratings. Some indicate a high correlation between the two (e.g., Hughes & Large, 1993; Brammer & Taylor, 2001; ALfallay, 2004; Shimura, 2006; Fukazawa, 2009), whereas others found no strong correlation (e.g., Jafapur, 1991; Freeman, 1995).

Hirai (2011) indicated that the reasons for the conflicting results involve differences in assessment conditions and students’ characteristics among these studies. She compared the results of nine previous studies on the correlation between peer and teacher ratings, and identified three factors that appear to be related to the correlation. The first is that correlations tend to be higher when students rate a peer’s performance after discussion than when they assign a rating without prior discussion. The second is that the correlation is higher when using the mean score (averaging the scores of all the participants) than when using a single score. The third is that the correlation appears to be related to anonymity: under anonymous conditions, the correlation between peer and teacher ratings is higher, because anonymity helps reduce raters’ anxiety about being accused of excessive severity by their peers. With regard to prior discussion, Hirai (2011) used a rating scale developed by teachers, and a detailed explanation of it was given to the students. In this study, by contrast, the students collaborated with the teacher in making the rating scale so that they would understand it more thoroughly.

Besides the above three points, the kind of rating scale used is also expected to be related to the correlation between teacher and peer ratings. Shimura (2006) used a rating scale with eight categories: good posture, clear voice, good eye contact, good gestures, clear explanation, good visuals, good analysis, and good organization. She focused on the content of the presentation and on body language. This differs from Fukazawa (2009) and Hirai (2011), who focused more on linguistic aspects such as grammar, pronunciation, vocabulary, and fluency. Because these studies used different rating categories, their results on the correlation between student and teacher ratings also differed. Moreover, the previous studies did not examine each category separately, and calculated only the total score for the rating scale. When considering the correlation between student and teacher ratings, we need to clarify which categories have a high correlation and which have a low correlation. Therefore, in this study, the rating scale was divided into three sections: body language, presentation content, and linguistic aspects. The correlation between student and teacher ratings was calculated for each category.

Issues of Japanese students’ attitudes towards peer assessment
Peer assessment has received much attention in the field of language teaching in recent years, but in Japan the idea is still novel, especially in the field of teaching Chinese as a second foreign language. Traditional testing, such as paper tests, is still dominant. Alternative assessment methods, such as portfolios, peer/group oral tests, and peer assessment, are not widely used in language teaching classrooms in Japan. Some previous studies addressed the attitudes of students to peer assessment (e.g., Azarnoosh, 2013; Wen, 2006; Peng, 2010; White, 2009), but they did not focus on Japanese students. Simon (2014) focused on the attitudes of Japanese students, and conducted an online survey of first-year students enrolled in oral communication classes at a private Japanese university. Students were asked to answer 10 questions related to peer assessment. The results revealed that Japanese students were broadly accepting of peer assessment, which was perceived as being a valuable language-learning tool. However, Simon (2014) performed the survey only once, after students had experienced peer assessment activities, and used only quantitative data. In order to gain a more complete picture of the attitudes of Japanese students, an understanding of their attitudes both before and after experiencing peer assessment is needed. Furthermore, in order to clarify what the students think about peer assessment, qualitative data should be used. Therefore, in this study, a survey of student attitudes was performed twice, before and after the students experienced the peer assessment activities in their language class, and the significance of any change in attitude between the pre-survey and post-survey was analyzed. Furthermore, a semi-structured group interview was conducted to collect qualitative data on the students’ attitudes towards the peer assessment activities.

The purpose of this study
This study aims to broaden the knowledge of peer assessment by exploring which categories in the rating scale have high correlation coefficients and which have low correlation coefficients between the student and teacher ratings. By performing the survey twice, before and after the students experienced peer assessment, it was hoped that the students’ attitudes towards peer assessment would be clarified and that changes in the students’ perceptions would be revealed. This study addresses the following questions and sub-questions:

RQ1: To what degree does peer assessment correlate with the teacher’s assessment? Which categories have high correlation coefficients and which categories have low correlation coefficients?
RQ2: To what extent do students change their perceptions after experiencing peer assessment? What are the reasons for their changes in attitude?


The university
This study was conducted at a university in Hokkaido, Japan, which consists of only one faculty – the Faculty of Technology. Every year there are over 600 freshmen, 90% of whom are male. The study of a second foreign language, chosen from Chinese, German, or Russian, is compulsory for first-year students. There are around 25 students in each class, with 12 classes each for Chinese and German, and 2 classes for Russian. All foreign languages must be taught according to the CEFR A1 level. For second-year students, the study of a second foreign language is an optional subject, and it is taught according to the CEFR A2 level. This study was conducted among second-year students.

Participants – Students
Eighty-two Japanese students participated in this study. They belonged to three classes taught by the researcher. The participants’ majors included information technology, engineering, and science. Each class met once a week for 90 minutes.

Participants – Teacher
The teacher, M, is female and had about 15 years of experience teaching Chinese as a foreign language at the time of the study.

The presentation
The students were asked to give a PowerPoint presentation introducing their hometown, their family members, and themselves. Each presentation was assessed by the teacher and the presenter’s peers at the same time in class. Students were asked to fill out a score sheet that included seven rating categories, each scored from 1 (poor) to 5 (excellent), by circling the appropriate number for each category. A sample of the score sheet is shown in Table 1.


In the next stage, the students were asked to discuss the points to be rated within each category. For example, in order to clarify the rating points for pronunciation, they were asked to discuss what constitutes good and bad pronunciation of Chinese, particularly for Japanese students, and whether these points would be workable when the assessment is conducted. The final rating points for each category are shown in Table 2.


Instruments and procedures – Five-point Likert scale survey
A five-point Likert scale survey was used to investigate the attitudes of Japanese students to peer assessment. The survey was created by the researcher based on Wen, Tsai & Chang (2006) and Peng (2010). It contains six statements about peer assessment. Students were given five choices for each statement: 1) strongly agree, 2) agree, 3) neutral, 4) disagree, 5) strongly disagree, and were asked to choose one. The participants were asked to fill out the five-point Likert scale twice to allow comparisons to be made. The pre-survey was conducted two weeks before the speaking test. After experiencing the peer assessment activities, the students were asked to complete the same survey again; this is referred to as the post-survey. The five-point Likert scale used in this study is shown in Table 3.


Instruments and procedures – Semi-structured group interview
Semi-structured group interviews were used to explore the reasons for the changes, or lack thereof, in student attitudes to peer assessment and other points related to this assessment form. Students were divided into five groups, with each group consisting of 4 to 5 students. They were asked to discuss the positive and negative points of peer assessment first, and then to answer the questions from the teacher. There were two questions. The first was, “What do you think of peer assessment? What are the good points and bad points?” The second was, “Did you change your attitude before and after experiencing peer assessment? If you changed your attitude, what is the reason?”

Instruments and procedures – Analysis
Microsoft Excel (2000) was used to analyze the data. Descriptive statistics were calculated first, and then a Spearman’s correlation analysis was conducted to explore the correlation between the peer ratings and teacher ratings. Finally, a t-test was conducted to explore the significance of changes in attitude between the pre-survey and post-survey.


The correlation between students’ and teacher’s ratings
Table 4 presents the descriptive statistics for the peer and teacher assessments. Except for pronunciation and body language, the mean peer rating for each category was slightly higher than the teacher’s. This indicates that, compared with the teacher, the students tended to be lenient in those categories, whereas for pronunciation and body language they tended to be strict. The standard deviations (SD) of the peer ratings for vocabulary, grammar, and fluency were slightly lower than those of the teacher’s ratings, indicating that the teacher rated the presentations across a wider range, while the students rated their peers within a narrower range in these categories.
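The comparison of means and standard deviations described above can be illustrated with a minimal Python sketch. The scores below are invented for illustration and are not the study's data. The sketch also assumes that each presenter's peer score is the mean over all peer raters, which the text does not state explicitly; the names `peer_scores`, `teacher_scores`, and `describe` are hypothetical.

```python
from statistics import mean, stdev

# Invented example for ONE rating category (not the study's data).
# Rows are presenters; columns are scores (1-5) given by four peer raters.
peer_scores = [
    [4, 5, 4, 4],
    [3, 4, 3, 4],
    [5, 4, 4, 5],
    [3, 3, 4, 3],
    [4, 4, 5, 4],
]
teacher_scores = [4, 3, 5, 2, 4]  # one teacher score per presenter

# Assumption: a presenter's peer score is the mean over all peer raters.
peer_means = [mean(row) for row in peer_scores]


def describe(scores):
    """Return (mean, sample standard deviation) for a list of scores."""
    return mean(scores), stdev(scores)


peer_m, peer_sd = describe(peer_means)
teacher_m, teacher_sd = describe(teacher_scores)

print(f"peer:    M = {peer_m:.2f}, SD = {peer_sd:.2f}")
print(f"teacher: M = {teacher_m:.2f}, SD = {teacher_sd:.2f}")
```

In this invented example the peer mean is higher than the teacher's (a lenient tendency) and the peer SD is lower (a narrower range), which is the pattern the paragraph describes for vocabulary, grammar, and fluency.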


In order to investigate to what degree peer assessment correlated with the teacher’s assessment, a Spearman’s correlation analysis between the peer and teacher assessments was conducted. The results are shown in Table 5. They revealed high correlation coefficients for pronunciation, presentation content, design of the PPT file, and body language, while the correlation coefficients for vocabulary, grammar, and fluency were low. Body language had the highest correlation coefficient (r = .42), while vocabulary had the lowest (r = .17).
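As an illustration of a per-category Spearman analysis, the following Python sketch ranks both sets of scores (tied values share the average rank) and then computes a Pearson correlation on the ranks. This is one standard way to obtain Spearman's coefficient, not necessarily the exact procedure the authors followed in Excel. All scores are invented, and the names `average_ranks`, `pearson`, `spearman`, and `ratings` are hypothetical.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of the 1-based positions in the run
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns NaN if either input has zero variance."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's coefficient: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented data: category -> (mean peer score per presenter, teacher score).
ratings = {
    "body language": ([3.8, 4.1, 3.2, 4.5, 3.9, 2.9], [4, 4, 3, 5, 3, 3]),
    "vocabulary": ([3.5, 3.6, 3.4, 3.7, 3.5, 3.6], [2, 4, 3, 3, 5, 2]),
}
results = {cat: spearman(peer, teacher) for cat, (peer, teacher) in ratings.items()}

for category, rho in results.items():
    print(f"{category}: {rho:.2f}")
```

Computing the coefficient separately for each category, rather than once on the total score, is what allows categories with high and low agreement to be told apart.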



Creating the rating scale
In this study, the students were involved in developing the rating scale together with the teacher. The students were asked to imagine that they were teachers and to consider what type of analytic rating scale they would use to evaluate the speaking ability of their students. The students suggested more than ten rating categories, including facial expression, voice quality, interesting content, fluency, design of the PowerPoint file, organization of the content, pronunciation, accuracy of the grammar and vocabulary, natural expression, posture, and so on. Students were told that too many rating categories would be burdensome for the raters, so they needed to choose five or six categories. However, as this study sought to focus not only on linguistic aspects but also on body language and presentation content, seven categories were eventually decided upon: pronunciation, grammar, vocabulary, fluency, presentation content, design of PPT file, and body language.

Japanese students’ attitudes towards peer assessment – To what extent did students change their perceptions after experiencing peer assessment activities?
In order to investigate the Japanese students’ attitudes towards peer assessment, especially the extent to which students change their perceptions after experiencing peer assessment activities, a t-test analysis was conducted. Table 6 gives the descriptive information and scale score differences between the students’ ratings pre-survey and post-survey.


The results of the t-test revealed that students’ responses were higher than the neutral score (3.00) in both the pre-survey and the post-survey. The mean score was 3.41 in the pre-survey and 4.06 in the post-survey. Thus, it can be said that the students reacted positively both before and after experiencing the peer assessment activities. The standard deviation for the pre-survey was much higher than that for the post-survey, indicating that the range in student attitudes was wider before experiencing peer assessment than after. The increase in the mean score from 3.41 to 4.06 was significant (t(81) = 6.41, p < .01), showing that the Japanese students’ attitudes towards peer assessment became significantly more positive after experiencing the assessment activities. The effect size was calculated to examine the magnitude of the score difference between the pre- and post-surveys. Its value was 0.96, which is considered large according to Cohen’s (1988) definition.
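The pre/post comparison can be sketched in Python as a paired t-test followed by an effect size. The scores are invented and are assumed to be coded so that higher values mean a more positive attitude. The text does not state which effect-size formula was used, so the pooled-SD form of Cohen's d below is only one common choice; `paired_t` and `cohens_d_pooled` are hypothetical helper names.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(pre, post):
    """Return (t statistic, degrees of freedom) for a paired-samples t-test."""
    diffs = [b - a for a, b in zip(pre, post)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t, n - 1


def cohens_d_pooled(pre, post):
    """Cohen's d using the average of the two sample variances (an assumption)."""
    pooled_sd = sqrt((stdev(pre) ** 2 + stdev(post) ** 2) / 2)
    return (mean(post) - mean(pre)) / pooled_sd


# Invented survey means for eight students (not the study's data).
# Assumed coding: higher score = more positive attitude.
pre = [3, 2, 4, 3, 5, 3, 2, 4]
post = [4, 4, 4, 4, 5, 4, 3, 5]

t_stat, df = paired_t(pre, post)
d = cohens_d_pooled(pre, post)

print(f"t({df}) = {t_stat:.2f}, d = {d:.2f}")
```

A paired test is used in this sketch because the same students answered both surveys; the resulting t statistic would then be compared against the t distribution with n − 1 degrees of freedom to obtain a p value.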

Japanese students’ attitudes towards peer assessment – Reasons for changes in students’ attitudes
In order to explore why students changed their attitudes to peer assessment and other points related to this assessment form, semi-structured group interviews were conducted. The students were asked two questions. The first was, “What do you think of peer assessment?” The second was, “Did you change your attitude before and after experiencing peer assessment? If your attitude changed, what is the reason?” Students were divided into small groups, each consisting of 4 to 5 students.

The students’ answers were divided into two categories: positive responses and negative responses. Responses were given in Japanese and translated into English by the author. The content of responses for each group is shown below. More than 80% of the students gave positive responses, with less than 20% of students giving negative responses.

Positive responses
Peer assessment is helpful
・Peer assessment is helpful for learning about speaking ability. Before experiencing peer assessment in the speaking test, I never considered what speaking ability is or how we should assess speaking. It is very helpful for learning to speak.
・Peer assessment helps me understand the criteria more fully.

Peer assessment is motivating
・When I give another student a high score, for example because he has good posture, good eye contact, or a good design for the PPT file, I get the feeling that I should learn from this student and do as well as him.
・Peer assessment encouraged my autonomy. I know what step I should take next to improve my Chinese speaking.
・I practiced the presentation many times before the test, because I wanted to get good scores from my peers.

Peer assessment is useful
・When assessing other students’ presentations, we could identify our own weaknesses.
・Through assessing other students’ presentations, I could identify their strengths and weaknesses, then I could improve myself.

Peer assessment is interesting
・This was the first time for me to do a peer assessment. It was interesting; I felt that I was acting like a teacher, and I had a strong sense of participation.
・Peer assessment helps me understand what teachers think of us. It is a very interesting experience. I started to consider teachers’ (or other people’s) feelings even outside of class.

There were also some negative answers about peer assessment.
Negative responses
Peer assessment is difficult
・Peer assessment is difficult. Sometimes, I really don’t know how to assess the other students, especially when assessing vocabulary and grammar, as I couldn’t, in fact, catch everything said by the presenters.
・It is difficult to assess the other students, so I think the students’ ratings are not reliable. I also don’t like arbitrary grading.

Peer assessment is boring
・At first, it was interesting, but there are too many students in one class, and I soon felt bored.
・I just wrote 3 for all the categories, because it was boring.

Peer assessment is troublesome
・I think peer assessment is troublesome and a waste of time. I’d rather learn something from the textbook than assess other students.


Correlation between the students’ and teacher ratings
In this study, seven rating categories were included in the rating scale, and the correlation coefficient between the students’ and teacher’s ratings was calculated for every category. The results revealed that four rating categories (pronunciation, presentation content, design of the PPT file, and body language) had high correlation coefficients, while three rating categories (vocabulary, grammar, and fluency) had low correlation coefficients.

Except for pronunciation, the rating categories with high correlation coefficients were all related to the content of the presentation and body language. The rating points for these categories were very clear. For example, there were three rating points for body language (eye contact, posture, and gestures), so a decision was easily made. There was only one rating point each for presentation content and the design of the PPT file, so these were also clear enough to judge. It is assumed that the content and design of the presentation and body language were easy for the students to judge, and hence these categories had high correlation coefficients. Pronunciation is a rating category that focuses on linguistic elements of the presentation, with the rating points involving the use of Japanese-influenced sounds and ease of understanding. It appears that the students were able to distinguish between good and bad pronunciation very well, even though their own pronunciation of Chinese may not have been adequate. When the rating scale was created together with the students, many examples of good Chinese pronunciation were given, for example the blade-palatal sounds “zh”, “ch”, “sh”, and “r”; bilabial sounds such as “b” and “p”; and compound finals such as “ang”, “eng”, “ing”, and “ong”. It is difficult for Japanese learners to pronounce these sounds, but they can recognize good pronunciation when listening. Therefore, the pronunciation category also had a high correlation coefficient.

In contrast, the data indicate that it was difficult to judge the rating categories related to vocabulary, grammar, and fluency. These three categories had low correlation coefficients. Vocabulary and grammar focus on linguistic elements, and it is possible that the students were unable to identify errors because they lacked the necessary language knowledge. Vocabulary had the lowest correlation coefficient (r = .17). The results of the semi-structured group interviews also revealed that some students felt it was very difficult to make judgments regarding vocabulary and grammar. In this study, the participants were in their second year of Chinese classes and were taught according to the CEFR A2 level, so it is possible that their language proficiency was not high enough to assess the range of vocabulary and structures, the accuracy of the grammar, or word choice. Nelson & Carson (1998) conducted a study on peer assessment of English writing, and pointed out that a lack of proficiency in a second language affects peer review, as learners cannot review their peers’ writing appropriately because of their low proficiency. The same problem appears to exist in peer assessment of second language speaking. Language proficiency level is an important factor that influences the correlation between student and teacher ratings, particularly in those rating categories which are related to linguistic elements. In Japan, there are a limited number of advanced students in second foreign language classes, and low-level students cannot give presentations in a second foreign language. Therefore, in this study, only the data for intermediate-level students were analyzed, which is a limitation of the study that needs to be addressed in the future.

Japanese students’ attitudes towards peer assessment
This study clearly showed that Japanese students held a positive attitude towards peer assessment both before and after experiencing peer assessment activities, with the mean score increasing significantly after the activities. Furthermore, the standard deviation for the pre-survey was much higher than that for the post-survey, indicating that the students’ attitudes ranged widely before experiencing peer assessment, but narrowed afterwards. The results of the semi-structured group interviews showed that more than 80% of the students responded positively regarding peer assessment. The positive answers included comments that peer assessment is helpful, motivating, useful, and interesting, while the negative comments suggested that peer assessment is difficult, boring, and troublesome. Both the quantitative and the qualitative data showed that the Japanese students had generally positive attitudes towards peer assessment.

With regard to the negative comments, the students possibly felt that peer assessment was difficult because they lacked sufficient language knowledge to make judgments using the rating scale. Some students felt it was boring and troublesome, which may have resulted from the procedure used in this study. The students were asked to assess more than 20 peers in one class, and this may have left them feeling bored and burdened. Hirai (2011) asked students to record their speaking on tape in a language lab and to assess only the student next to them. Such a procedure may result in the students having fewer negative feelings towards the assessment task.


The results of peer assessment are often influenced by the contexts and circumstances in which it is administered. Therefore, it is necessary to pay great attention to explaining the procedure. Language proficiency level is an important factor that influences the correlation between the students’ and teacher’s ratings, particularly in those rating categories which are related to linguistic elements, so for students whose language proficiency is low, peer assessment of linguistic elements should perhaps be avoided. Teachers should be aware that not every student likes peer assessment, so peer assessment activities should always be short and easy. If students demonstrate enjoyment and a willingness to observe and assess their peers, peer assessment can be a useful tool in students’ language learning.


Ming Qu is an Associate Professor at Muroran Institute of Technology, Japan. Her interests include language testing, CEFR-based language teaching, and Sinology (China’s cultural diplomacy).

Margit Krause-Ono is Professor of German, European Culture, and Intercultural Communication at Muroran Institute of Technology, Japan. She holds degrees from France and Australia, and a Certificate as intercultural trainer/coach from Friedrich Schiller University, Germany.