Explanations
concepts related to goodness and morality
good and goodness
Negative Logits
strikingly        -0.49
colari            -0.48
IntoConstraints   -0.48
closely           -0.47
دقی               -0.46
precisely         -0.46
contentLoaded     -0.46
comparatively     -0.45
ctically          -0.45
xrTableCell       -0.45
Positive Logits
Good      0.78
Good      0.73
good      0.65
GOOD      0.64
GOOD      0.63
good      0.56
estekak   0.54
Evil      0.52
""],      0.51
好人      0.50
Activations Density: 0.033%