INDEX
Explanations
references to fake or fraudulent concepts and entities
Negative Logits
ils       -0.16
shire     -0.16
ally      -0.16
ณ         -0.15
erable    -0.15
yle       -0.14
орг       -0.14
Naz       -0.14
iw        -0.14
Nz        -0.14
Positive Logits
/false    0.24
.fake     0.19
stin      0.18
fak       0.18
(fake     0.17
busters   0.17
pret      0.16
/mock     0.16
ulence    0.15
eries     0.15
Activations Density 0.027%