Explanations
references to potential dangers or threats
NEGATIVE LOGITS
Token       Logit
.modules    -0.16
ãģ³         -0.14
ocker       -0.13
961         -0.13
stood       -0.13
apparent    -0.13
ÏĦαÏĤ       -0.13
dr          -0.13
nearly      -0.13
POSITIVE LOGITS
Token         Logit
kdyby         0.19
potentially   0.18
might         0.17
possibly      0.17
Might         0.16
æĪĸèĢħ        0.16
Harm          0.16
possibly      0.16
might         0.16
ogle          0.15
Activations Density: 0.218%