INDEX
Explanations
phrases related to negative actions and behaviors towards others
conjunctions and phrases indicating an ongoing relationship or connection between ideas
Negative Logits
Yao          -0.71
Shrine       -0.64
Animation    -0.63
Romans       -0.62
Goo          -0.62
Colts        -0.62
NRL          -0.61
Sox          -0.60
Dungeons     -0.59
Shots        -0.59
Positive Logits
rogen        1.06
rogens       1.04
distribute   0.85
punish       0.83
rehabilit    0.83
humili       0.80
analyse      0.79
rew          0.79
manipulate   0.78
dispose      0.77
Activations Density: 0.131%