INDEX
Explanations
words related to distortion or misrepresentation
Negative Logits
ciation          -0.80
ept              -0.77
autions          -0.72
fighters         -0.72
chal             -0.72
occup            -0.71
lication         -0.70
ailability       -0.69
oother           -0.69
alled            -0.68
Positive Logits
distorted        0.85
perceptions      0.85
adoes            0.81
distort          0.75
senses           0.75
interpretations  0.74
minds            0.73
disproportion    0.72
mund             0.71
versions         0.70
Activations Density: 0.111%