Explanations
thoughts, feelings, fantasies, impulses
Negative Logits
sometimes    0.66
spesso       0.63
גם           0.62
parfois      0.62
*,           0.61
then         0.61
סט           0.60
sometimes    0.60
גם           0.59
ε            0.59
Positive Logits
would        0.98
wouldn       0.94
receiving    0.90
would        0.89
receiving    0.88
couldn       0.83
find         0.82
have         0.81
Would        0.80
finden       0.79
Activations Density: 0.113%