INDEX
Explanations
instances of false information and declarations
references to falsehoods and deception
Negative Logits
Token            Logit
largeDownload    -0.77
zens             -0.74
rared            -0.71
mopolitan        -0.71
leground         -0.69
oval             -0.68
anian            -0.68
ificent          -0.68
idential         -0.67
igree            -0.67
Positive Logits
Token            Logit
ãĤ¹ãĥĪ           0.78
attribut         0.76
omission         0.75
Prometheus       0.74
miscar           0.72
mistaken         0.71
assumptions      0.70
Ö¼               0.68
excuse           0.65
Loki             0.64
Activations Density 0.412%
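
The page does not say how the density figure is computed. A minimal sketch, assuming it is the fraction of sampled token positions on which the feature's activation is above zero (a common convention, but an assumption here). The function name `activation_density`, the `threshold` parameter, and the sample data are all hypothetical and chosen only for illustration.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of positions whose activation exceeds `threshold`.

    Assumption: "density" means the share of token positions where the
    feature fires at all; the source page does not define it.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    fired = sum(1 for a in activations if a > threshold)
    return fired / len(activations)


# Hypothetical sample: 1000 token positions, 4 of which activate the feature.
sample = [0.0] * 996 + [0.3, 1.2, 0.7, 2.5]
density = activation_density(sample)
print(f"{density:.3%}")  # prints "0.400%"
```

Under this reading, a density of 0.412% would mean the feature fires on roughly 4 of every 1000 sampled tokens.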