INDEX
Explanations
room
Negative Logits
Token    Logit
c        1.09
۔        0.97
'        0.96
ם        0.95
ق        0.88
ف        0.87
га       0.82
’        0.82
cích     0.82
the      0.81
Positive Logits
Token    Logit
다       1.23
rooms    1.21
ROOM     1.19
el       1.16
ला       1.15
room     1.12
Room     1.05
ет       1.01
ed       0.99
ло       0.97
Activations Density 0.019%