INDEX
Explanations
phrases related to technology, commentary on social issues, and personal narratives involving conversation and experience
phrases expressing uncertainty or questioning societal norms
Head Attr Weights
0:0.06
1:0.05
2:0.06
3:0.07
4:0.03
5:0.13
6:0.05
7:0.09
8:0.05
9:0.10
10:0.18
11:0.09
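As a minimal sketch of how the head attribution weights above might be read, the snippet below ranks the 12 heads by weight and sums them. The dictionary values are copied from the listing; the helper name `rank_heads` and the choice of `top_k=3` are illustrative assumptions, not part of the dashboard.

```python
# Head attribution weights copied from the listing (head index -> weight).
head_weights = {
    0: 0.06, 1: 0.05, 2: 0.06, 3: 0.07,
    4: 0.03, 5: 0.13, 6: 0.05, 7: 0.09,
    8: 0.05, 9: 0.10, 10: 0.18, 11: 0.09,
}


def rank_heads(weights, top_k=3):
    """Return the top_k (head, weight) pairs, largest weight first.

    Hypothetical helper for illustration only.
    """
    return sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


top = rank_heads(head_weights)
total = sum(head_weights.values())

for head, weight in top:
    print(f"head {head}: {weight:.2f}")
print(f"sum of displayed weights: {total:.2f}")
```

On these values, heads 10, 5, and 9 carry the largest weights. The displayed weights sum to 0.96 rather than 1.00, which may simply reflect rounding to two decimals.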
Negative Logits
osa       -0.93
装        -0.92
iggins    -0.90
taboola   -0.89
acas      -0.88
acket     -0.87
gain      -0.87
iage      -0.87
knife     -0.85
bryce     -0.85
Positive Logits
spoilers  1.11
goddamn   1.03
ain       1.01
selves    1.00
Elven     1.00
apologies 0.98
maybe     0.94
!--       0.92
Allaah    0.90
🙂        0.89
Activations Density 0.474%