INDEX
Explanations
phrases and questions related to thoughts or considerations
references to personal experiences and inquiries about actions
Negative Logits
Brush    -0.66
tops     -0.66
ãĥĥ      -0.62
oubt     -0.62
amy      -0.61
Bliss    -0.61
itude    -0.61
AIR      -0.60
ACY      -0.59
Frazier  -0.59
Positive Logits
ribed        0.82
fared        0.78
unfolded     0.77
structured   0.77
behave       0.75
interpreted  0.74
stacked      0.73
manifests    0.70
behaved      0.69
differs      0.68
Activations Density 0.177%