INDEX
Explanations
checking for "hello" or "hi"
NEGATIVE LOGITS
Token         Value
n             1.00
d             0.89
o             0.88
              0.84
attractions   0.82
g             0.78
v             0.77
cooked        0.75
g             0.74
wine          0.74
POSITIVE LOGITS
Token         Value
vaient        0.96
estratégica   0.89
személy       0.86
decía         0.84
númer         0.82
Thats         0.80
setPreferred  0.79
promot        0.79
<unused331>   0.79
étais         0.78
Activations Density 0.002%