INDEX
Explanations
phrases including the word "as"
Negative Logits
Token     Logit
itſelf    -0.81
Jefus     -0.74
pleaſure  -0.71
juſt      -0.64
ſever     -0.63
becauſe   -0.63
こと      -0.63
Anſ       -0.62
Conſ      -0.62
ſelf      -0.61
Positive Logits
Token     Logit
follows   1.11
well      1.08
opposed   1.08
part      1.02
soon      1.01
a         0.98
follows   0.95
pires     0.93
far       0.90
much      0.86
Activations Density 0.329%