INDEX
Explanations
statements about morality and ethics
Negative Logits
Nightmares   -0.67
obbies       -0.65
weights      -0.65
aneers       -0.63
Via          -0.62
Messenger    -0.60
Coach        -0.60
Audit        -0.60
Cors         -0.60
Dreams       -0.59
Positive Logits
omorphic     1.17
rael         1.09
olated       1.05
olation      1.03
nt           0.94
olate        0.94
senal        0.90
othermal     0.89
gur          0.88
omorph       0.85
Activations Density 0.110%