Explanations
words related to explanations or justifications
phrases that involve explanations or justifications for beliefs or actions
Negative Logits
Token       Logit
lator       -0.90
ymph        -0.77
Roller      -0.75
iece        -0.66
robe        -0.64
ograph      -0.63
aughed      -0.61
wana        -0.60
Juda        -0.60
ãĤ¤ãĥĪ      -0.60
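The last token in the list above, ãĤ¤ãĥĪ, looks like encoding debris but may simply be how a byte-level BPE vocabulary prints non-ASCII bytes. The following is a minimal sketch, assuming (the page does not say so) that the model uses a GPT-2 style byte-level tokenizer. The helper names `gpt2_byte_to_char` and `decode_bpe_token` are hypothetical and exist only for this illustration.

```python
def gpt2_byte_to_char():
    """Byte -> printable character table used by GPT-2 style byte-level BPE.

    Assumption: the tokenizer behind this page follows the GPT-2 convention.
    """
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    offset = 0
    for b in range(256):
        if b not in printable:
            # Non-printable bytes are shifted to code points starting at 256.
            byte_values.append(b)
            code_points.append(256 + offset)
            offset += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


def decode_bpe_token(token):
    """Map a vocabulary string back to raw bytes, then decode as UTF-8."""
    char_to_byte = {c: b for b, c in gpt2_byte_to_char().items()}
    raw = bytes(char_to_byte[ch] for ch in token)
    return raw.decode("utf-8", errors="replace")


if __name__ == "__main__":
    print(decode_bpe_token("ãĤ¤ãĥĪ"))
```

Under that assumption the token corresponds to the katakana string イト. The other tokens in the list are plain ASCII and decode to themselves.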
Positive Logits
Token       Logit
soever      1.01
why         0.80
exactly     0.79
WHY         0.77
why         0.75
abouts      0.73
bother      0.69
they        0.68
eve         0.67
iterranean  0.65
Activation Density: 0.039%
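The page does not define the density figure. A common reading, assumed here, is the fraction of token positions on which the feature's activation is above zero. The sketch below computes that quantity on made-up data; `activation_density` and the sample activations are hypothetical and are not taken from the page.

```python
def activation_density(activations, threshold=0.0):
    """Fraction of token positions whose activation exceeds `threshold`.

    Assumption: "density" means the share of tokens on which the feature fires.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Made-up example: the feature fires on 39 of 100,000 token positions.
acts = [0.0] * 99_961 + [1.5] * 39
density = activation_density(acts)

if __name__ == "__main__":
    print(f"{density:.3%}")
```

On this reading, a density of 0.039% means the feature is active on roughly 4 tokens in every 10,000, which fits a narrow feature tied to words like "why" and "exactly".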