Explanations
phrases or terms related to geopolitical conflicts or controversies
demonstrative or relative pronouns indicating specific entities or groups
Negative Logits
Token        Logit
Returns      -0.73
Untitled     -0.72
laughs       -0.68
uces         -0.62
increments   -0.62
prints       -0.61
stays        -0.61
Alright      -0.61
wheels       -0.60
lyss         -0.60
Positive Logits
Token        Logit
were         1.15
include      1.04
have         1.00
constitute   0.99
are          0.98
comprise     0.96
weren        0.94
violate      0.94
allege       0.93
dominate     0.89
Activation Density: 0.200%