Explanations
offensive language and insults
derogatory terms and insults directed at individuals
Negative Logits
Token       Logit
HCR         -0.79
srf         -0.74
rall        -0.71
BLIC        -0.71
ITED        -0.71
ONT         -0.70
ctica       -0.70
clerosis    -0.70
Pradesh     -0.70
isman       -0.69
Positive Logits
Token       Logit
bitch       0.83
buster      0.82
posts       0.81
iness       0.78
post        0.77
fest        0.76
ings        0.75
dump        0.74
umin        0.73
enger       0.72
Activations Density: 0.014%