INDEX
Explanations
informative phrases containing instructions or explanations
phrases that indicate instructional content
Negative Logits
Token     Logit
enance    -0.76
oppable   -0.67
enegger   -0.62
threat    -0.62
volent    -0.61
Politics  -0.60
orical    -0.60
Rum       -0.59
AIN       -0.59
orter     -0.58
Positive Logits
Token     Logit
to        0.92
toget     0.79
--------------------------------------------------------  0.78
semble    0.78
to        0.75
easy      0.74
you       0.73
much      0.72
To        0.71
TO        0.70
Activations Density: 0.069%