INDEX
Explanations
expressions of self-awareness and criticism
Negative Logits
ius          -0.16
ãĥ³ãĥĩãĤ£    -0.15
avra         -0.15
fingert      -0.14
Recorder     -0.14
дап          -0.13
naments      -0.13
pike         -0.13
133          -0.13
onta         -0.13
Positive Logits
actor        0.19
makers       0.19
actors       0.18
Leone        0.18
fans         0.18
Actor        0.17
villa        0.17
actress      0.17
Vir          0.17
Pri          0.17
Activations Density 0.080%