Explanations
words that start with the letter 'w'
the letter "w"
Negative Logits
Token        Logit
depri        -0.75
deprived     -0.74
phy          -0.71
succeeding   -0.71
egal         -0.69
HAEL         -0.68
oresc        -0.67
culp         -0.66
displayText  -0.66
pora         -0.66
Positive Logits
Token      Logit
ither      1.05
avy        0.98
pn         0.96
atson      0.96
atts       0.94
wn         0.92
ithering   0.90
itty       0.88
irts       0.88
avering    0.88
Activations Density: 0.010%