INDEX
Explanations
phrases that indicate possession or existence
Negative Logits
themſelves        -0.59
myſelf            -0.59
ſelf              -0.58
pleaſure          -0.57
Monfieur          -0.56
houſe             -0.53
Reſ               -0.52
Jefus             -0.51
RegressionTest    -0.51
reaſon            -0.49
Positive Logits
stood             0.69
lots              0.65
fewest            0.64
a                 0.64
an                0.61
such              0.59
ligiloj           0.59
fewer             0.58
features          0.56
its               0.56
Activations Density: 0.476%