INDEX
Explanations
references to grenades or explosive devices
Negative Logits
Token          Logit
Infinity       -0.16
curacy         -0.14
redi           -0.14
enko           -0.14
untu           -0.14
ua             -0.14
chap           -0.14
Attribution    -0.13
anning         -0.13
ieg            -0.13
Positive Logits
Token          Logit
125            0.16
iaux           0.15
shal           0.15
ninger         0.14
ubber          0.14
126            0.14
.grp           0.14
.lesson        0.13
izona          0.13
iami           0.13
Activations Density: 0.002%