Explanations
phrases expressing strong emotions or reactions
Head Attribution Weights
Head  Weight
0     0.02
1     0.02
2     0.09
3     0.04
4     0.11
5     0.15
6     0.25
7     0.02
8     0.12
9     0.04
10    0.05
11    0.04
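The per-head weights above show how strongly each attention head contributes to this feature. The dashboard does not say how they are computed. A common approach for SAEs trained on concatenated per-head attention outputs is to split the feature's decoder row into one slice per head, take each slice's L2 norm, and normalise the norms so they sum to 1. The sketch below assumes that method. `head_attribution_weights` is a hypothetical helper, and the contiguous `n_heads * d_head` layout is also an assumption.

```python
import math

def head_attribution_weights(decoder_row, n_heads):
    """Hypothetical helper: fraction of a feature's decoder norm owned by each head.

    Assumes decoder_row has length n_heads * d_head and is laid out as
    n_heads contiguous slices of size d_head (concatenated per-head outputs).
    """
    if n_heads <= 0 or len(decoder_row) % n_heads != 0:
        raise ValueError("decoder_row length must be a multiple of n_heads")
    d_head = len(decoder_row) // n_heads
    norms = []
    for h in range(n_heads):
        chunk = decoder_row[h * d_head:(h + 1) * d_head]
        norms.append(math.sqrt(sum(x * x for x in chunk)))  # L2 norm of head h's slice
    total = sum(norms)
    if total == 0:
        raise ValueError("decoder_row is all zeros")
    return [n / total for n in norms]  # normalised so the weights sum to 1


# Toy example: 4 heads, d_head = 2; slice norms are 5, 0, 0, 10.
weights = head_attribution_weights([3.0, 4.0, 0.0, 0.0, 0.0, 0.0, 6.0, 8.0], n_heads=4)
print(" ".join(f"{h}:{w:.2f}" for h, w in enumerate(weights)))  # 0:0.33 1:0.00 2:0.00 3:0.67
```

Because the weights are normalised, the printed values are shares of the feature's total decoder norm, which matches the `head:weight` form the dashboard uses. Rounding each share to two decimals means a displayed column need not sum to exactly 1.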
Negative Logits
Token         Logit
affinity      -1.45
corpus        -1.45
geography     -1.43
jurisdiction  -1.41
racuse        -1.35
offence       -1.35
enrol         -1.34
assetsadobe   -1.33
displayText   -1.33
geographic    -1.33
Positive Logits
Token         Logit
!!!!          1.66
Respect       1.56
laughs        1.56
!!            1.50
Mods          1.47
!!!!!         1.46
神            1.45
Laugh         1.43
!!!           1.42
Laughs        1.41
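The negative and positive logit lists rank vocabulary tokens by how much this feature directly lowers or raises their output logits. A standard way to get such lists is to project the feature's write direction in the residual stream onto every column of the unembedding matrix and keep the lowest and highest scores. The sketch below assumes that method and ignores the final LayerNorm. `top_logit_effects`, its arguments, and the toy numbers are all hypothetical.

```python
def top_logit_effects(direction, W_U, vocab, k=10):
    """Hypothetical helper: rank tokens by a feature's direct logit effect.

    direction: the feature's write direction in the residual stream (length d_model).
    W_U: unembedding as d_model rows, each of length len(vocab).
    The final LayerNorm is ignored here (an assumption, not the dashboard's method).
    """
    n_vocab = len(vocab)
    d_model = len(direction)
    # Score of token t = dot product of the direction with column t of W_U.
    scores = [sum(direction[i] * W_U[i][t] for i in range(d_model)) for t in range(n_vocab)]
    ranked = sorted(range(n_vocab), key=lambda t: scores[t])  # ascending by score
    negative = [(vocab[t], scores[t]) for t in ranked[:k]]
    positive = [(vocab[t], scores[t]) for t in ranked[::-1][:k]]
    return negative, positive


# Toy example: d_model = 2, a four-token vocabulary, made-up weights.
neg, pos = top_logit_effects(
    direction=[1.0, 0.0],
    W_U=[[2.0, 1.5, -1.0, -2.0], [0.3, 0.3, 0.3, 0.3]],
    vocab=["tok_a", "tok_b", "tok_c", "tok_d"],
    k=2,
)
print("negative:", neg)
print("positive:", pos)
```

Both lists come from a single sort, so each is ordered from the most extreme score inward, the same way the dashboard lists them.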
Activation Density: 0.006%
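An activation density of 0.006% means the feature fires on about 6 of every 100,000 tokens, so it is very sparse. The sketch below assumes density is the fraction of sampled token positions where the feature's activation is strictly above a threshold of 0. The dashboard does not state its text sample or threshold, and `activation_density` is a hypothetical helper.

```python
def activation_density(activations, threshold=0.0):
    """Hypothetical helper: fraction of token positions where the feature is active.

    A position counts as active when its activation is strictly above threshold.
    """
    if not activations:
        raise ValueError("need at least one activation")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Toy example: 6 active positions out of 100,000.
acts = [0.0] * 99_994 + [1.2] * 6
print(f"{activation_density(acts):.3%}")  # 0.006%
```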