INDEX
Explanations
the letter 't', with a high activation value
negations and the word "not."
Negative Logits
Reloaded    -0.72
Passenger   -0.65
ħĭ          -0.65
descent     -0.64
Penguin     -0.61
Pike        -0.60
Seah        -0.59
behavi      -0.59
Palestin    -0.59
çĦ          -0.59
Positive Logits
ween        1.01
reprene     0.93
unes        0.92
une         0.91
urb         0.82
urtles      0.82
weet        0.82
aper        0.81
ruly        0.80
UNE         0.78
Activations Density 0.113%