INDEX
Explanations
mentions of potential threats or dangers in various contexts
Negative Logits
Token     Logit
iao       -0.22
oya       -0.19
ocker     -0.16
icum      -0.15
haul      -0.15
estroy    -0.15
ple       -0.15
Gover     -0.14
ureka     -0.14
ocket     -0.14
Positive Logits
Token         Logit
posed         0.28
Pos           0.23
ening         0.22
ened          0.22
posed         0.19
hung          0.19
hanging       0.19
威            0.18
rical         0.18
perception    0.18
Activations Density 0.036%