INDEX
Explanations
ill-formed words
words related to slurs or derogatory terms
Negative Logits
ICLE        -0.78
ãģ®éŃĶ      -0.76
IFIC        -0.74
enegger     -0.73
Cause       -0.68
éļ          -0.65
BLIC        -0.65
Realms      -0.64
Trust       -0.63
terms       -0.63
Positive Logits
anted       1.19
udge        1.16
ugg         1.16
otted       1.14
ights       1.13
asher       1.12
ashes       1.12
ither       1.12
inging      1.11
ipp         1.11
Activations Density: 0.010%