INDEX
Explanations
references to freedom and patriotism
Negative Logits
Token             Logit
ãĥ¼ãĥĨ            -0.75
ħĭ                -0.73
Magikarp          -0.69
organizational    -0.69
ãĥ¼ãĥĨãĤ£         -0.64
informational     -0.64
ij士              -0.63
ãĤ¼ãĤ¦ãĤ¹         -0.62
workplaces        -0.62
undermin          -0.60
Positive Logits
Token             Logit
↵                 1.11
..                0.91
..                0.84
..."              0.83
...               0.80
-"                0.78
↵↵                0.76
//                0.75
*/                0.74
...)              0.74
Activations Density 0.097%