INDEX
Explanations
phrases indicating absence or removal
Negative Logits
Offensive       -0.17
TestingModule   -0.15
oter            -0.15
;br             -0.15
eldon           -0.15
reverse         -0.14
imin            -0.14
viện            -0.14
reverse         -0.14
offense         -0.14
Positive Logits
beaten          0.30
grid            0.27
beat            0.25
bat             0.25
cuff            0.24
Grid            0.22
grid            0.21
ensively        0.21
beat            0.21
hook            0.21
Activations Density: 0.020%