Explanations
high-stakes situations
Sentences describing imminent physical harm or violent scenarios, and moral dilemmas about killing (e.g., trolley-problem-style situations).
Negative Logits
observations  -0.08
stom          -0.08
eyle          -0.07
pest          -0.07
veil          -0.07
infectious    -0.06
.refresh      -0.06
``            -0.06
knives        -0.06
hik           -0.06
Positive Logits
.arr          0.07
郎            0.07
>e            0.07
))+           0.06
,无           0.06
ทอง           0.06
signed        0.06
ี.            0.06
J             0.06
indicated     0.06
Activations Density: 0.141%