INDEX
Explanations
examples or instances mentioned in a text
references to examples and instances in explanations
Negative Logits
ibly        -0.75
resy        -0.72
shaw        -0.69
orship      -0.69
rity        -0.67
cedented    -0.66
lied        -0.65
ses         -0.64
eers        -0.63
Pg          -0.63
Positive Logits
suppose     1.44
Suppose     1.24
imagine     1.20
consider    1.11
if          0.98
Imagine     0.89
Consider    0.85
compare     0.80
let         0.79
whereas     0.78
Activations Density 0.135%