INDEX
Explanations
phrases indicating progress or advancements towards goals
Negative Logits
urban        -0.67
Males        -0.62
Directions   -0.61
followed     -0.59
thia         -0.59
trailing     -0.59
akia         -0.59
Cf           -0.58
commemor     -0.57
escapes      -0.57
Positive Logits
notch        1.00
levels       0.98
level        0.98
heights      0.96
brink        0.95
extremes     0.83
absurdity    0.82
depths       0.81
point        0.80
perfection   0.79
Activations Density 0.179%