Explanations
occurrences of the word "one"
Negative Logits
Token      Logit
zos        -0.74
keepers    -0.72
LESS       -0.68
ocracy     -0.65
cats       -0.62
letters    -0.61
ursed      -0.61
keeper     -0.60
brates     -0.59
IVES       -0.58
Positive Logits
Token      Logit
glance     1.15
behest     0.89
moment     0.88
apiece     0.86
point      0.85
point      0.83
expense    0.81
end        0.78
instance   0.78
level      0.77
Activations Density 0.009%