Explanations
references to pointing or attribution of blame
phrases or expressions pointing to specific subjects or claims
Negative Logits
lance       -0.80
soever      -0.74
hop         -0.74
stress      -0.73
proxy       -0.73
unte        -0.72
itialized   -0.70
mind        -0.67
DH          -0.67
operates    -0.66
Positive Logits
WARD        0.68
TextColor   0.68
odan        0.67
othy        0.67
finger      0.67
Genocide    0.64
obin        0.64
victory     0.62
evidence    0.61
Victory     0.61
Activations Density 0.075%