Table of Contents

1. Introduction

1.1 Defining the problem

In early 2024, Chalkbeat reported that The Urban Assembly school network would invest half a million dollars to develop a set of AI tools that could analyze teacher sentiment during instruction. The rollout sparked debate around the use of AI in teacher performance evaluations. A primary concern is the lack of transparency regarding the data AI uses for sentiment analysis, which makes it difficult to identify potential biases, especially those related to accent, dialect, register, and non-verbal communication. Despite these concerns, a study by professors at the Indiana University Kelley School of Business found that many employees perceive AI as less biased than human evaluators in performance reviews.

1.2 What will you learn?

In this tutorial, you'll try your hand at building your own AI tools that analyze sentiment in student feedback with the Google Gemini LLM and Python. Through experience using the technology, you'll better understand the benefits, biases, and best practices of this approach to teacher performance evaluations. You'll learn how to:

  1. Configure the Gemini API, integrate it into Python functions, and analyze a set of student feedback with these functions.
  2. Refine function output by way of prompt engineering.
  3. Critically analyze the use of AI in teacher performance reviews.

This is just an introduction. By the end, you'll have the essential skills needed to tackle more complex AI projects.

Complete Python scripts and related files for each exercise are provided in Chapter 2.5.

1.3 Who is this tutorial for?

This tutorial is designed for educators and school administrators who are interested in exploring the potential of AI in teacher evaluation. It's also a great starting point for anyone in the general public who wants to go from being a consumer to creator by building their own custom AI tools.

A basic understanding of Python is required before starting the tutorial.

2. Practical Exercises

2.1 Setting up Gemini 1.5 Flash

Gemini 1.5 Flash is Google's lightweight LLM, available through a free API tier, with a context window large enough to handle prompts of 10,000+ words. Before writing any code, we'll need to register for a Gemini API key and install Google's Generative AI Python library.

  1. Follow the directions to generate your API key. Once generated, copy and save the key in a new file on your computer called gemini-api-key.txt. Do not share this key with anyone.

  2. Use pip to install Google's Generative AI Python library:

pip install google-generativeai

2.2 Exercise 1: Your first function with Gemini

With everything installed, we’re ready to take Gemini for a test drive. Our first project will be an AI tool that creates an anagram of a person's name.

Let's get our files organized. Create a new folder called Exercise_01 and put the gemini-api-key.txt file inside the folder. Create a new Python file in the folder and call it anagram.py. Your file structure should look like this:

    —— Exercise_01
            — gemini-api-key.txt
            — anagram.py

Inside the IDE of your choice, open anagram.py. We'll test that everything is in order by printing the contents of gemini-api-key.txt to the terminal:

with open('gemini-api-key.txt') as f:
    genai_api_key = f.read().strip()  # .strip() removes any trailing newline from the key

print(genai_api_key)

Once you've confirmed the key prints correctly, remove the print() call. Next we'll import Google's Generative AI library and configure the language model. In just two short lines of code, you'll establish a connection to the Gemini API and create an object called "model" that can generate text with the gemini-1.5-flash LLM:

import google.generativeai as genai

#Read Gemini API key into memory.
with open('gemini-api-key.txt') as f:
    genai_api_key = f.read().strip()
    
#Configure API and select the Gemini 1.5 Flash language model.
genai.configure(api_key=genai_api_key)
model = genai.GenerativeModel('gemini-1.5-flash')

With Gemini configured, let's write a function that generates anagrams using Gemini. We'll start with a simple function called create_anagram(name) that takes a person's name as input. Inside the function, we'll use model.generate_content() to send a prompt to Gemini, asking it to generate an anagram based on the given name. I'm using a Python feature called an f-string (formatted string literal) to add our function's name argument at the end of the prompt: Create one anagram with the letters in this person's name: {name}.
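
As a quick sanity check of how the f-string works, you can build the prompt without calling the API at all; the {name} placeholder is replaced by the argument's value:

```python
name = "albert einstein"
# The f-string substitutes the variable's value into the prompt text.
prompt = f"Create one anagram with the letters in this person's name: {name}"
print(prompt)
# → Create one anagram with the letters in this person's name: albert einstein
```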

To retrieve the generated anagram as text, we'll use response.text. If we omit the .text attribute, the function will return metadata about the generated text.

Let's try running the function with and without .text to see the difference. What anagrams can we create for Albert Einstein?

def create_anagram(name):
    """Create one anagram of the input name"""
    response = model.generate_content(f"Create one anagram with the letters in this person's name: {name}")
    #Experiment running the function with .text and without .text.
    return response.text

print(create_anagram("albert einstein"))

2.3 Exercise 2: Rating student reviews and prompt engineering

Now that we understand the basics of the Gemini API, let's build a new AI tool that analyzes student reviews for their sentiment.

Create a second folder called Exercise_02. Inside the folder, create a new Python file called rate-sentiment.py, and copy your gemini-api-key.txt file into the folder as well. Then download student-review.txt and add it to the folder. Your file structure should look like this:

    —— Exercise_02
            — gemini-api-key.txt
            — rate-sentiment.py
            — student-review.txt        

Read this student review into memory and print it to the terminal, just as we did earlier with the API key:

#Read student review into memory.
with open('student-review.txt') as f:
    student_review = f.read()

print(student_review)

When you run the above code, you should see this student review in the terminal:

Stats class was definitely challenging, but I learned 
a lot! The teacher was helpful, and class discussions 
were interesting. However, I struggled with the
assignments. The instructions were often confusing,
and I spent way too much time trying to figure out
what was expected. I wish the assignments were more
clear-cut.

We can use Gemini to rate (quantify) the sentiment of this student review. A simple way to rate sentiment is with a Likert scale of 1-5, with 1 representing Very Negative and 5 representing Very Positive sentiment.
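
Our prompt will only define the two endpoints, but for reference, the full scale can be represented as a small mapping (the middle labels below are a common Likert convention, not something the prompt specifies):

```python
# Likert scale for sentiment; labels 2-4 are assumed conventions.
likert_labels = {
    1: "Very Negative",
    2: "Negative",
    3: "Neutral",
    4: "Positive",
    5: "Very Positive",
}
print(likert_labels[5])
# → Very Positive
```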

Directing Gemini to effectively rate sentiment takes us into an area called prompt engineering, which is the process of crafting instructions to guide AI models towards desired outputs. Prompt engineering is an iterative process, so by refining our instructions we'll discover how to write a prompt that produces the most desirable output.

Let's start by creating a function that will accept student feedback as a paragraph. This should look familiar from earlier:

import google.generativeai as genai

#Read Gemini API key into memory.
with open('gemini-api-key.txt') as f:
    genai_api_key = f.read().strip()

#Read student review into memory.
with open('student-review.txt') as f:
    student_review = f.read()

#Configure API and select the Gemini 1.5 Flash language model.
genai.configure(api_key=genai_api_key)
model = genai.GenerativeModel('gemini-1.5-flash')

def rate_sentiment(text):
    """Rates the positivity of a text on scale of 1-5: 
    5 = Very Positive. 1 = Very Negative.
    """    
    #We'll craft the prompt in the next step; for now, pass the review text directly.
    response = model.generate_content(text) 
    return response.text

#Apply the function to our text file and see Gemini's response.
print(rate_sentiment(student_review))

We can use this prompt: Rate the positivity of this student's review of a class on a scale of 1-5, with 1 representing Very Negative and 5 representing Very Positive. We can then use an f-string again like this:

def rate_sentiment(text):
    """Rates the positivity of a text on scale of 1-5: 
    5 = Very Positive. 1 = Very Negative.
    """    
    response = model.generate_content(f"Rate the positivity of this student's review of a class on a scale of 1-5, with 1 representing Very Negative and 5 representing Very Positive: {text}") 
    return response.text

print(rate_sentiment(student_review))

Run the code and you should get a response similar to this in the terminal:

I would rate this review a **3/5**. 

Here's why:

* **Positive:** The student acknowledges learning a lot, appreciates the professor's helpfulness, and finds the class discussions engaging.
* **Neutral:**  The student's experience with the assignments is mixed. While they struggle, the student doesn't express outright dissatisfaction with the professor or the class.
* **Negative:**  The student explicitly expresses frustration with the assignment instructions and wishes they were clearer. 

Overall, the review is balanced, highlighting both positive and negative aspects. It's not overly enthusiastic but also not overtly critical.

The good news is that Gemini understands the prompt and is giving us a meaningful response. However, for our purposes, we just need the single-digit Likert score, not the full justification. What we learn from Gemini's justified response is that we need to refine our prompt to be more specific about our desired output. This is prompt engineering. And since we want just a single digit, we should also apply Python's .strip() method to remove any incidental whitespace from the output. Let's try this:

def rate_sentiment(text):
    """Rates the positivity of a text on scale of 1-5: 
    5 = Very Positive. 1 = Very Negative.
    """    
    response = model.generate_content(f"Rate the positivity of this student's review of a class on a scale of 1-5, with 1 representing Very Negative and 5 representing Very Positive. Your output should be in the form of a single digit representing your positivity score: {text}") 
    # .text returns just text from Gemini and .strip() removes whitespace.
    return response.text.strip()

print(rate_sentiment(student_review))
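
Even with the refined prompt, an LLM's output format isn't guaranteed. If you'd rather fail loudly when the reply isn't a clean score, a small validation helper can check it first. Here's a hypothetical sketch (parse_score is our own name, not part of the Gemini library):

```python
def parse_score(raw):
    """Return the model's reply as an int, but only if it is a single digit 1-5."""
    score = raw.strip()
    if score.isdigit() and 1 <= int(score) <= 5:
        return int(score)
    raise ValueError(f"Unexpected model output: {raw!r}")

print(parse_score(" 3\n"))
# → 3
```

You could then call parse_score(response.text) inside rate_sentiment() instead of returning the raw string.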

Let's recap: we’ve successfully passed our data to Gemini, had it analyze that data for sentiment positivity, and refined the format of its output through prompt engineering. Next we’ll apply our rate_sentiment() function to an entire set of student feedback.

2.4 Exercise 3: Iterating over a set of student reviews

In this section we will learn how to apply our Gemini-infused function to analyze a large dataset of student reviews.

Create a third folder called Exercise_03, copy gemini-api-key.txt into it, and create a Python file inside it called rate_student_reviews.py. I've prepared a collection of 10 student reviews that we'll use to practice iterating over a dataset. Download student-reviews-dataset.csv and add it to Exercise_03. Your file structure should look like this:

    —— Exercise_03
            — gemini-api-key.txt
            — rate_student_reviews.py
            — student-reviews-dataset.csv

To read the CSV dataset into memory and apply our rate_sentiment() function from Exercise 2 to an entire set of reviews, we'll use pandas, a very popular Python library for data manipulation and analysis. If you're new to pandas, it's well worth your time to work through the official pandas getting-started tutorials. You can use pip to install pandas:

pip install pandas

Once installed, create a dataframe from student-reviews-dataset.csv and print the resulting dataframe to take a look at our data:

import google.generativeai as genai
import pandas as pd

# Read student review data into memory.
df = pd.read_csv('student-reviews-dataset.csv')
print(df)

You should see that the dataframe has two columns, student_name and review, and that it contains reviews from 10 students.

One powerful method in the pandas library is .apply(), which runs a function on every value in a column. We'll use it to create a new column in our dataset by applying our rate_sentiment() function from Exercise 2 to each entry in the review column. Copy the rate_sentiment() function into rate_student_reviews.py:

import google.generativeai as genai
import pandas as pd

#Read Gemini API key into memory.
with open('gemini-api-key.txt') as f:
    genai_api_key = f.read().strip()

#Configure API and select the Gemini 1.5 Flash language model.
genai.configure(api_key=genai_api_key)
model = genai.GenerativeModel('gemini-1.5-flash')

def rate_sentiment(text):
    """Rates the positivity of a text on scale of 1-5: 
    5 = Very Positive. 1 = Very Negative.
    """    
    response = model.generate_content(f"Rate the positivity of this student's review of a class on a scale of 1-5, with 1 representing Very Negative and 5 representing Very Positive. Your output should be in the form of a single digit representing your positivity score: {text}") 
    # .text returns just text from Gemini and .strip() removes whitespace.
    return response.text.strip()

# Read student review data into memory.
df = pd.read_csv('student-reviews-dataset.csv')

#Iterate over student reviews with Gemini
df['sentiment_rating'] = df['review'].apply(rate_sentiment)
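
To see what .apply() does without spending API calls, here's a toy example with a hypothetical stand-in for rate_sentiment() (a made-up scorer that just counts exclamation marks):

```python
import pandas as pd

def excitement(text):
    # Hypothetical stand-in scorer: counts exclamation marks.
    return text.count("!")

toy_df = pd.DataFrame({
    "student_name": ["Ana", "Ben"],
    "review": ["Great class!!", "It was fine."],
})

# .apply() calls the function once per value in the review column.
toy_df["excitement"] = toy_df["review"].apply(excitement)
print(toy_df)
```

Our real script works the same way, except each call to rate_sentiment() makes a round trip to the Gemini API, so expect the .apply() line to take several seconds for 10 reviews.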

To do any kind of statistical analysis of our rated reviews, we need to be sure that they are integers. Use pandas' .dtypes attribute to see what kind of data we have in each column of our dataframe:

output = df.dtypes
print(output)

In your terminal you'll see that the data in our sentiment_rating column are in fact not integers:

student_name        object
review              object
sentiment_rating    object
dtype: object

We can easily rectify this by using pandas to convert sentiment_rating data to integers (int):

# Read student review data into memory.
df = pd.read_csv('student-reviews-dataset.csv')

#Iterate over student reviews with Gemini
df['sentiment_rating'] = df['review'].apply(rate_sentiment)

# Convert sentiment_rating to integers
df['sentiment_rating'] = df['sentiment_rating'].astype(int)

result = df.dtypes
print(result)

Run this code again, and you'll see that sentiment_rating data are now integers (int64). At this point, we can use pandas to get a statistical summary of our data using the .describe() function. Here is what your full code should look like:

import google.generativeai as genai
import pandas as pd

#Read Gemini API key into memory.
with open('gemini-api-key.txt') as f:
    genai_api_key = f.read().strip()

#Configure API and select the Gemini 1.5 Flash language model.
genai.configure(api_key=genai_api_key)
model = genai.GenerativeModel('gemini-1.5-flash')

def rate_sentiment(text):
    """Rates the positivity of a text on scale of 1-5: 
    5 = Very Positive. 1 = Very Negative.
    """    
    response = model.generate_content(f"Rate the positivity of this student's review of a class on a scale of 1-5, with 1 representing Very Negative and 5 representing Very Positive. Your output should be in the form of a single digit representing your positivity score: {text}") 
    # .text returns just text from Gemini and .strip() removes whitespace.
    return response.text.strip()

# Read student review data into memory.
df = pd.read_csv('student-reviews-dataset.csv')

#Iterate over student reviews with Gemini
df['sentiment_rating'] = df['review'].apply(rate_sentiment)

# Convert sentiment_rating to integer
df['sentiment_rating'] = df['sentiment_rating'].astype(int)

# Use .dtypes to get a summary of data types
result = df.dtypes
print(result)

# Use describe() to get summary statistics of sentiment_rating
summary_statistics = df['sentiment_rating'].describe()
print(summary_statistics)

When you run this code, you'll have some useful information for assessing students' sentiment about a class, such as the average (mean) and the quartiles. You'll also get the highest rating (max) and the lowest rating (min):

count    10.000000
mean      2.800000
std       1.549193
min       1.000000
25%       1.250000
50%       3.000000
75%       3.750000
max       5.000000
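
One caveat before moving on: .astype(int) raises a ValueError if Gemini ever returns anything other than a bare digit, stopping the whole script. A more defensive sketch uses pandas' to_numeric() with errors="coerce", which turns unparseable values into NaN so they can be dropped (the ratings below are simulated, not real model output):

```python
import pandas as pd

# Simulated sentiment_rating values, including one malformed reply.
ratings = pd.Series(["3", " 4 ", "I'd rate it 5/5"])

# errors="coerce" turns unparseable values into NaN instead of raising.
clean = pd.to_numeric(ratings.str.strip(), errors="coerce")
print(clean.dropna().astype(int).tolist())
# → [3, 4]
```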

2.5 Scripts and files

3. Evaluation

Now that you have firsthand experience using Gemini to rate the sentiment of student feedback, let's return to the problem we started with: bringing this technology to bear on teacher performance evaluations.

  1. What are the potential benefits, limitations, and/or drawbacks of using AI sentiment analysis in teacher evaluations?
  2. Considering the limitations you may have encountered with the AI tool, propose one best practice for ensuring fairer and more reliable results when using AI for teacher performance evaluation.
  3. Based on your experience, do you think AI sentiment analysis could be a valuable tool in teacher evaluation, even with its limitations? Why or why not?