INDEX
Explanations
phrases indicating the impact or influence of actions or events on individuals
phrases indicating the effects or consequences of actions
Negative Logits
bow          -0.70
cele         -0.64
majority     -0.61
ilings       -0.61
mage         -0.58
ban          -0.58
Ce           -0.58
bill         -0.57
forthcoming  -0.57
guide        -0.57
Positive Logits
raining      0.86
tremend      0.77
bnb          0.76
alot         0.73
easier       0.69
CTR          0.68
chwitz       0.66
doub         0.65
ometimes     0.63
interesting  0.63
Activations Density: 0.222%