INDEX
Explanations
multiple references to actions related to engagement or involvement in activities or discussions
Negative Logits
Token   Logit
2       -0.21
3       -0.20
4       -0.20
5       -0.20
1       -0.20
8       -0.19
6       -0.18
7       -0.18
9       -0.18
10      -0.17
Positive Logits
Token   Logit
thi     0.58
this    0.56
this    0.42
his     0.40
th      0.38
tb      0.35
this    0.34
his     0.34
THIS    0.33
-this   0.32
Activations Density 0.103%