INDEX
Explanations
hateful ideologies and extremism
Negative Logits
Peso           0.75
నొప్పి          0.73
equilibration  0.73
센             0.72
sensores       0.72
humbling       0.71
nhẹ            0.71
Rubens         0.69
acup           0.69
éné            0.68
Positive Logits
ideology       1.03
ideologies     0.98
extremist      0.95
Hate           0.90
Ide            0.90
ide            0.90
hate           0.89
maniac         0.84
hateful        0.84
virulent       0.83
Activations Density 0.738%