INDEX
Explanations
statements expressing moral or ethical dilemmas in combat scenarios
Negative Logits
sorts     -0.21
sort      -0.19
LOTS      -0.16
SORT      -0.16
sort      -0.16
kinds     -0.15
ujet      -0.15
ğinden    -0.14
lots      -0.14
Trivia    -0.14
Positive Logits
bulls     0.16
f         0.16
isset     0.15
none      0.15
-,        0.15
ain       0.15
y         0.14
none      0.14
me        0.14
uhan      0.14
Activations Density 0.058%