INDEX
Explanations
phrases starting with "What" followed by a question
the word "What" as a questioning prompt
Negative Logits
heter       -0.64
atory       -0.55
stretched   -0.55
mun         -0.54
lique       -0.54
MER         -0.53
println     -0.53
general     -0.52
udi         -0.52
lined       -0.52
Positive Logits
soever       1.31
happens      1.23
Happ         1.14
Makes        1.12
constitutes  1.05
happened     1.03
Causes       1.01
Exactly      0.99
Does         0.98
Lies         0.98
Activations Density 0.051%