Explanations
warnings and disclaimers in texts
warnings about explicit or graphic content in media
Negative Logits
inguished    -0.71
luaj         -0.69
Redditor     -0.68
sonian       -0.67
elligent     -0.66
Hon          -0.66
annie        -0.65
srfAttach    -0.65
SPONSORED    -0.64
itage        -0.63
Positive Logits
*)           0.94
spoilers     0.91
assumes      0.84
!]           0.84
OIL          0.83
ALWAYS       0.83
.)           0.79
formatting   0.76
)*           0.76
RAW          0.76
Activations Density: 0.346%