INDEX
Explanations
words related to negative attributes or consequences
instances of the word "ill" and its variations, indicating a focus on concepts relating to negative health or logical fallacies
Negative Logits
ļéĨĴ          -0.75
*/(           -0.75
EStream       -0.74
uyomi         -0.73
kefeller      -0.73
âĹ¼           -0.67
compr         -0.67
©¶æ¥µ         -0.66
derog         -0.65
EStreamFrame  -0.64
Positive Logits
uminati       1.30
inois         1.12
ogical        1.11
umin          1.08
awar          1.08
iberal        1.06
igan          0.99
nesses        0.98
ison          0.98
usive         0.96
Activations Density 0.009%
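
To give a sense of scale for the density figure, the sketch below computes an activation density under the assumption that it means the fraction of token positions on which the feature fires (activation above zero). The source does not define the term, so that definition, the `activation_density` helper, and the sample data are all hypothetical and serve only as illustration.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of positions whose activation exceeds `threshold`.

    Assumption: "density" means the share of tokens on which the feature
    is active; the dashboard itself does not spell out its definition.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: the feature fires on 9 of 100,000 token positions.
sample = [0.0] * 99_991 + [1.5] * 9
density = activation_density(sample)
print(f"{density:.3%}")  # prints "0.009%"
```

Under this reading, a density of 0.009% corresponds to roughly 9 active tokens per 100,000, which would make this a very sparse feature.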