Explanations
references to moral concepts and ethical considerations
Negative Logits
Token      Logit
xual       -0.79
gow        -0.74
rooms      -0.73
rams       -0.72
lers       -0.70
upon       -0.70
minster    -0.69
Lup        -0.68
abee       -0.67
hips       -0.67
Positive Logits
Token          Logit
istic          1.15
izing          1.10
hazard         1.06
ising          1.05
compass        1.04
indignation    1.01
obligation     0.98
istically      0.97
ised           0.96
dile           0.96
Activations Density: 0.032%