INDEX
Explanations
mentions of something being helpful
instances of the word "helpful."
Negative Logits
thur       -0.80
jong       -0.71
buck       -0.70
inction    -0.70
agate      -0.70
Dare       -0.69
BU         -0.68
Hop        -0.68
Rush       -0.68
metal      -0.67
Positive Logits
helpful        0.86
aide           0.81
aids           0.80
undermin       0.79
guiActiveUn    0.78
introdu        0.75
tip            0.75
glers          0.74
aid            0.74
ãĤĭ            0.73
Activations Density 0.012%
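
As a rough illustration of how a density figure like this could arise, the sketch below assumes that "Activations Density" means the fraction of sampled tokens on which the feature's activation is nonzero. That definition, the function name `activation_density`, and the toy values are assumptions for illustration, not details taken from this page.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of tokens whose activation exceeds `threshold`.

    Assumed definition: a token counts as "active" when its activation
    is strictly greater than the threshold (0.0 by default).
    """
    if not activations:
        raise ValueError("need at least one activation value")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical toy data: 3 active tokens out of 25,000 sampled tokens.
toy = [0.0] * 24_997 + [0.4, 1.2, 0.9]
density = activation_density(toy)
print(f"{density:.3%}")  # prints 0.012%
```

Under this reading, a density of 0.012% corresponds to roughly 12 active tokens per 100,000 sampled.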