INDEX
Explanations
values or qualities such as kindness, fear, love, and cooperation
concepts related to morality and societal values
Negative Logits
senal      -0.67
ERSON      -0.65
CFR        -0.64
pestic     -0.63
referen    -0.63
everal     -0.59
amins      -0.58
ahon       -0.58
ancest     -0.58
           -0.57
Positive Logits
fulness    0.95
lessness   0.87
cellence   0.84
ism        0.83
itself     0.74
equals     0.73
alone      0.73
less       0.71
beg        0.71
ful        0.70
Activations Density 0.413%