Explanations
information related to literature or written content
Negative Logits
hatch          -0.65
adventure      -0.65
tyr            -0.61
usable         -0.60
sunny          -0.60
irresistible   -0.58
pir            -0.57
veland         -0.57
lizard         -0.57
emn            -0.56
Positive Logits
nor            1.60
yet            1.25
Instead        1.21
nor            1.06
anymore        1.03
Rather         0.99
Nonetheless    0.98
Nor            0.97
Nevertheless   0.94
unless         0.93
Activations Density: 4.915%