INDEX
Explanations
pronouns referring to individuals and their actions or knowledge
phrases related to awareness and knowledge of situations
Negative Logits
Token     Logit
Fal       -0.71
wik       -0.67
gard      -0.65
Vaughn    -0.63
Howell    -0.63
Yel       -0.62
dexter    -0.62
Eps       -0.61
Conquest  -0.61
Torrent   -0.60
Positive Logits
Token     Logit
self      0.92
selves    0.91
LO        0.80
agos      0.80
rogen     0.72
could     0.70
gotta     0.69
OLD       0.68
OTA       0.68
own       0.68
Activations Density 0.226%