INDEX
Explanations
references to individuals and their emotional states or actions
Negative Logits
Token       Logit
fool        -0.16
Stre        -0.14
Fool        -0.13
atto        -0.13
874         -0.12
weise       -0.12
bout        -0.12
ange        -0.12
ly          -0.12
itself      -0.12
Positive Logits
Token       Logit
-même       0.17
alic        0.16
'll         0.16
pherd       0.15
’ll         0.15
/she        0.14
/us         0.14
strcasecmp  0.14
inous       0.13
ela         0.13
Activations Density 0.963%