INDEX
Explanations
phrases expressing understanding or comprehension
expressions of understanding or confusion
Negative Logits
Token          Logit
velt           -0.65
worthy         -0.63
foss           -0.59
erness         -0.59
haus           -0.59
agen           -0.59
feasibility    -0.58
ante           -0.57
yet            -0.57
Britann        -0.57
Positive Logits
Token          Logit
bored          0.98
tired          0.95
annoyed        0.93
rid            0.93
distracted     0.90
punished       0.86
yelled         0.84
irritated      0.83
sucked         0.82
confused       0.82
Activations Density 0.080%
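
For readers unfamiliar with the density figure, here is a minimal sketch of how such a number could be computed. It assumes that "density" means the fraction of sampled tokens on which the feature's activation exceeds a threshold; the function name `activation_density`, the `threshold` parameter, and the sample data are hypothetical and do not come from the dashboard itself.

```python
from typing import Sequence


def activation_density(activations: Sequence[float], threshold: float = 0.0) -> float:
    """Return the fraction of tokens whose activation is above `threshold`.

    Assumption: a token counts as "active" when its activation is strictly
    greater than the threshold (0.0 by default).
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: 10,000 tokens, of which 8 have a nonzero activation.
sample = [0.0] * 9992 + [1.5] * 8
density = activation_density(sample)
print(f"Activations Density {density:.3%}")
```

Under this assumed definition, a density of 0.080% corresponds to the feature firing on roughly 8 out of every 10,000 tokens.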