INDEX
Explanations
phrases indicating references or attributions to specific concepts or topics
Head Attr Weights
0:0.04
1:0.03
2:0.13
3:0.07
4:0.09
5:0.05
6:0.12
7:0.17
8:0.05
9:0.05
10:0.08
11:0.08
Negative Logits
uniforms: -1.59
cabinets: -1.56
accompanying: -1.45
luggage: -1.43
salads: -1.42
appointments: -1.40
flyer: -1.40
VERTISEMENT: -1.39
robes: -1.39
]}: -1.36
Positive Logits
ゴ: 1.83
learn: 1.70
ブ: 1.52
anta: 1.49
Questions: 1.47
hi: 1.46
QUEST: 1.46
terness: 1.46
raq: 1.42
ouple: 1.41
Activations Density: 0.000%