INDEX
Explanations
phrases indicating expectations or predictions regarding outcomes
Negative Logits
-FIRST       -0.14
wat          -0.12
urette       -0.12
нит          -0.12
analyze      -0.12
vise         -0.12
疲           -0.12
trys         -0.12
ynchronize   -0.12
atta         -0.12
Positive Logits
grace        0.26
headline     0.23
top          0.20
duke         0.19
land         0.19
rank         0.19
rival        0.19
rule         0.18
outs         0.18
command      0.18
Activations Density 0.207%
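
The density figure can be read as the share of token positions on which the feature fires at all. The sketch below assumes that definition (activation strictly above zero counts as active); the function name `activation_density`, the `threshold` parameter, and the toy data are hypothetical and are not taken from the dashboard.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of positions whose activation exceeds `threshold`.

    Assumption: "density" means the share of token positions where the
    feature is active, i.e. its activation is strictly above the threshold.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical toy data: 3 active positions out of 1000.
toy = [0.0] * 997 + [1.5, 0.2, 3.1]
density = activation_density(toy)
print(f"{density:.3%}")  # prints 0.300%
```

Under this reading, a density of 0.207% would correspond to the feature being active on roughly 2 of every 1000 token positions.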