INDEX
Explanations
phrases indicating moral judgments or ethical considerations
Negative Logits
ief             -0.16
åĮĸ             -0.14
_VERIFY         -0.14
elm             -0.14
.scalablytyped  -0.14
ney             -0.14
lever           -0.14
099             -0.13
actionTypes     -0.13
λÏī             -0.13
Positive Logits
others          0.18
xes             0.16
vice            0.15
weather         0.15
rottle          0.15
likewise        0.15
other           0.14
olis            0.14
similarly       0.14
others          0.14
Activations Density 0.160%