Explanations
phrases introducing new information or topics
Negative Logits
Token        Logit
otherwise    -0.76
isation      -0.71
od           -0.64
morale       -0.64
unit         -0.62
attempts     -0.61
drain        -0.60
lling        -0.60
tyres        -0.60
spiral       -0.59
Positive Logits
Token        Logit
Here         3.04
Here         2.19
Below        2.09
Let          1.60
Below        1.55
Now          1.43
There        1.41
Again        1.41
here         1.37
Above        1.36
Activations Density: 0.009%