Explanations
mentions of the name "Arnold" or variations thereof
Negative Logits
ilig        -0.19
omba        -0.17
ubu         -0.16
insanity    -0.14
desn        -0.14
uzu         -0.14
REP         -0.14
wards       -0.14
Playback    -0.14
anship      -0.14
Positive Logits
ussen       0.28
Schwar      0.27
schwar      0.18
Bened       0.17
aldo        0.17
ould        0.17
PRI         0.16
bled        0.16
none        0.15
ceph        0.15
Activations Density: 0.020%