INDEX
Explanations
phrases related to investigations or allegations
repeated phrases that indicate allegations of wrongdoing
Negative Logits
uristic     -0.90
ieties      -0.76
ertodd      -0.74
hap         -0.71
folds       -0.70
heses       -0.68
nets        -0.68
Tokens      -0.68
partName    -0.68
keys        -0.68
Positive Logits
inacc             0.95
wrongdoing        0.88
harassment        0.80
misconduct        0.75
misinformation    0.74
discrimination    0.71
criminality       0.71
violence          0.71
vandalism         0.70
foul              0.70
Activations Density 0.158%
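
The density figure above is usually read as the fraction of sampled tokens on which the feature fires. The sketch below shows one way such a number could be computed. It assumes density means "share of tokens with activation above a threshold", which the page itself does not state; the function name activation_density, the threshold parameter, and the sample data are hypothetical and chosen only for illustration.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of tokens whose activation exceeds `threshold`.

    Assumption: "density" is the share of tokens on which the feature
    is active. The dashboard does not define it; this is one common reading.
    """
    if len(activations) == 0:
        raise ValueError("activations must not be empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: 100,000 tokens, of which 158 have a non-zero activation.
sample = [1.0] * 158 + [0.0] * (100_000 - 158)
density = activation_density(sample)
print(f"{density:.3%}")  # prints "0.158%"
```

On this made-up sample the result matches the 0.158% shown above, meaning roughly 158 of every 100,000 tokens would activate the feature under this reading.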