Explanations
references to evidence or proof
Negative Logits
ategory    -0.81
Hop        -0.73
ttle       -0.73
aeper      -0.71
ernel      -0.70
scill      -0.68
iery       -0.68
otom       -0.65
Chop       -0.65
throats    -0.64
Positive Logits
evidence         1.06
evidence         1.02
tampering        0.96
corrobor         0.93
linking          0.91
indicating       0.90
Evidence         0.90
evid             0.87
demonstrating    0.86
suggesting       0.86
Activations Density: 0.537%