INDEX
Explanations
words related to action or behavior
words that indicate judgment or decision-making processes
Negative Logits
includ    -0.76
poem      -0.68
suffix    -0.67
feat      -0.65
RB        -0.65
tune      -0.64
fit       -0.61
mat       -0.59
vis       -0.58
liberate  -0.57
Positive Logits
nces      0.86
ered      0.85
igion     0.81
ased      0.79
ragon     0.77
rals      0.75
wolves    0.75
oll       0.74
uled      0.73
emption   0.73
Activations Density: 0.009%