INDEX
Explanations
language reflecting strong negative emotions, particularly hate, as well as references to specific segments or categories
Negative Logits
Token         Logit
seen          -0.39
EDEFAULT      -0.36
Crock         -0.35
commission    -0.35
crock         -0.35
opposition    -0.34
UserScript    -0.34
don           -0.34
doz           -0.33
sesama        -0.33
Positive Logits
Token              Logit
Segment            0.73
Segment            0.69
threshold          0.66
HtmlAttribute      0.65
hate               0.65
CreateTagHelper    0.62
seuil              0.61
HATE               0.60
Threshold          0.60
segment            0.59
Activations Density 0.074%