INDEX
Explanations
names or labels
phrases that indicate listing or naming items or examples
Negative Logits
ysc           -0.69
entimes       -0.68
loo           -0.67
issance       -0.67
childbirth    -0.63
deterior      -0.63
tail          -0.60
propelled     -0.60
Returns       -0.60
depth         -0.60
Positive Logits
culprit        0.86
names          0.85
specific       0.81
perpetrators   0.76
NCT            0.74
blame          0.74
Names          0.73
GROUP          0.72
perpetrator    0.71
particular     0.71
Activations Density 0.317%