INDEX
Explanations
statements that emphasize the importance of facts or factual evidence
Negative Logits
افت        -0.19
stras      -0.17
lops       -0.15
ście       -0.15
dings      -0.15
mts        -0.14
mtree      -0.14
holm       -0.14
indirect   -0.14
śmy        -0.14
Positive Logits
ually      0.30
itious     0.27
fact       0.26
oring      0.26
oid        0.25
ored       0.24
uality     0.24
um         0.23
Fact       0.22
oids       0.20
Activations Density 0.027%