INDEX
Explanations
questions starting with "How did" or "Why did"
questions starting with "How" or "Did."
Negative Logits
houses      -0.74
heter       -0.72
washer      -0.71
Methods     -0.71
thur        -0.70
rooms       -0.70
arters      -0.69
room        -0.68
atten       -0.68
north       -0.68
Positive Logits
actic       1.01
iosyncr     0.82
netflix     0.77
nt          0.68
IER         0.68
originate   0.68
ĸļ          0.67
not         0.66
riks        0.65
undergo     0.65
Activations Density: 0.042%