Explanations
instances of colons indicating explanations or lists
phrases related to reasoning and justification
Negative Logits
ascus         -0.76
thur          -0.74
apsed         -0.72
agraph        -0.67
rez           -0.67
é¾            -0.66
Fuck          -0.66
emort         -0.66
ocide         -0.66
cember        -0.65
Positive Logits
reducing       1.00
it             0.97
reduces        0.95
facilitating   0.92
lowering       0.89
lowers         0.88
increased      0.88
increases      0.86
they           0.86
eliminating    0.86
Activations Density 0.337%