Explanations
profane or offensive language
instances of expletives or strong language
Negative Logits
irms         -0.69
esthesia     -0.66
eting        -0.66
breeze       -0.65
onymous      -0.64
intrusion    -0.64
eca          -0.64
xual         -0.62
spelling     -0.62
acles        -0.62
Positive Logits
**                                  0.99
***                                 0.90
ãĥĥãĥī                              0.79
cause                               0.79
@@@@@@@@                            0.74
URI                                 0.73
adr                                 0.73
********************************    0.73
****                                0.72
keeper                              0.72
Activations Density 0.050%