INDEX
Explanations
phrases related to beliefs and judgments
concepts related to perception and societal beliefs
Negative Logits
Attempts    -0.68
fax         -0.63
sever       -0.63
ioch        -0.62
swick       -0.62
legram      -0.62
Details     -0.61
sidx        -0.60
Multiple    -0.60
.<          -0.60
Positive Logits
somehow       1.02
infall        0.96
superiority   0.84
invincible    0.84
inherently    0.84
magically     0.82
invented      0.73
inferior      0.72
paradise      0.72
immoral       0.70
Activations Density: 0.656%