INDEX
Explanations
statements reflecting personal beliefs about morality and justice in society
Negative Logits
ancies      -0.74
pri         -0.65
departing   -0.64
poaching    -0.62
ancy        -0.62
senal       -0.60
originally  -0.60
nen         -0.59
popping     -0.59
ongo        -0.58
Positive Logits
Lastly      1.47
Finally     1.41
And         1.19
Lastly      1.09
Finally     1.08
etc         1.07
Whatever    1.01
etc         1.00
Or          0.95
Likewise    0.95
Activations Density 0.230%