Explanations
themes related to deception, crime, and subversion
Negative Logits
edBy        -0.20
itics       -0.18
اÙĪØ±ÛĮ      -0.16
iatrics     -0.16
iation      -0.16
lessness    -0.16
isation     -0.15
itu         -0.15
izons       -0.15
iliation    -0.14
Positive Logits
ulous       0.25
eous        0.24
ous         0.23
ive         0.23
orous       0.22
ful         0.20
ulent       0.20
inous       0.20
ocratic     0.20
itious      0.20
Activations Density 0.140%