Explanations
concepts related to societal and personal standards of morality and integrity
Negative Logits
Token      Logit
iná        -0.16
aight      -0.16
oley       -0.16
ìĬµ        -0.15
ikip       -0.15
KNOWN      -0.14
ÐĴики      -0.14
@Spring    -0.14
галÑĸ      -0.14
ichtig     -0.14
Positive Logits
Token        Logit
er           0.30
challenged   0.29
threatened   0.28
endangered   0.28
questioned   0.27
-threat      0.25
compromised  0.24
tested       0.23
shaken       0.23
dashed       0.21
Activations Density: 0.208%
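
As a rough illustration of what a density figure like this usually means, the sketch below assumes that activation density is the fraction of token positions on which the feature fires (activation above zero). The function name `activation_density`, the threshold, and the sample data are all hypothetical and chosen only to reproduce the 0.208% figure; the dashboard's actual computation may differ.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds threshold.

    Assumption: "density" means the share of tokens on which the feature is active.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: 100,000 token positions, 208 of which activate the feature.
sample = [0.0] * 99_792 + [1.3] * 208
density = activation_density(sample)
print(f"{density:.3%}")
```

Under this reading, a density of 0.208% means the feature is active on roughly 2 of every 1,000 tokens in the sampled text.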