INDEX
Explanations
references to violence, conflict, and geopolitical events
Negative Logits
ĺħ       -0.93
xtap     -0.81
imaru    -0.68
ovie     -0.67
pton     -0.63
pper     -0.63
ube      -0.63
Gra      -0.63
ļé       -0.62
ppe      -0.61
Positive Logits
than             1.31
stringent        0.94
than             0.89
importantly      0.89
Than             0.84
sophisticated    0.83
frequent         0.81
rigorous         0.76
ado              0.76
broadly          0.75
Activations Density: 0.095%