INDEX
Explanations
quotations with attributions
binary responses or indicators of a conclusion
Negative Logits
anwhile   -0.64
avorite   -0.62
jri       -0.62
destro    -0.58
lished    -0.57
withd     -0.56
emale     -0.54
rall      -0.53
etheless  -0.52
essage    -0.52
Positive Logits
")        1.03
"]        1.02
"—        1.01
,"        0.97
,'"       0.95
"),       0.95
%"        0.95
.")       0.94
":        0.93
"?        0.93
Activations Density 0.463%