Explanations
references to monsters and themes of control or dominance
Negative Logits
Logit   Token
-0.21
-0.19   latter
-0.17   elman
-0.16   lessly
-0.15   Latter
-0.15   ulo
-0.15   cred
-0.14   ITED
-0.14   brick
-0.14   538
Positive Logits
Logit   Token
0.20    oton
0.18    ingly
0.17    ously
0.17    .Mon
0.16    itored
0.16    ous
0.16    aco
0.16    (mon
0.16    ращ
0.15    odies
Activations Density 0.046%