Explanations
names of people or characters
Negative Logits
Token       Logit
".          -0.65
inh         -0.65
appre       -0.63
!".         -0.63
elig        -0.62
Learns      -0.62
").         -0.62
indo        -0.61
whereas     -0.60
".          -0.57
Positive Logits
Token       Logit
alike       1.03
axter       0.88
oliath      0.80
cohorts     0.72
colleagues  0.67
ossal       0.66
mates       0.61
others      0.61
are         0.61
teammate    0.60
Activations Density 0.380%