Explanations
variations of the pronoun "it."
Negative Logits
noticed     -0.66
ichever     -0.65
herent      -0.64
mma         -0.61
Friendly    -0.61
dding       -0.60
cknow       -0.59
eligible    -0.58
ighton      -0.55
tted        -0.55
Positive Logits
entails     1.01
boils       0.99
hurts       0.96
alian       0.92
happens     0.91
happened    0.90
transpired  0.89
feels       0.88
takes       0.84
mattered    0.83
Activations Density 0.048%