Security Audit
evaluating-code-models
github.com/davila7/claude-code-templates

Trust Assessment
evaluating-code-models received a trust score of 59/100, placing it in the Caution category. This skill has some security considerations that users should review before deployment.
SkillShield's automated analysis identified 4 findings: 1 critical, 0 high, 1 medium, and 2 low severity. Key findings include direct execution of untrusted code on the host system via `--allow_code_execution` and `--trust_remote_code`, network egress to untrusted endpoints, and covert behavior / concealment directives.
The analysis covered 4 layers: Manifest Analysis, Static Code Analysis, Dependency Graph, and LLM Behavioral Safety. The LLM Behavioral Safety layer scored lowest, at 68/100.
Last analyzed on February 12, 2026 (commit 458b1186). SkillShield performs automated 4-layer security analysis on AI skills and MCP servers.
Security Findings (4)
| Severity | Finding | Layer | Location |
|---|---|---|---|
| CRITICAL | **Direct execution of untrusted code on the host system via `--allow_code_execution` and `--trust_remote_code`.** The skill instructs the user to run `accelerate launch main.py` with the `--allow_code_execution` flag, which executes code generated by the evaluated model directly on the host. Because the generated code is untrusted, a malicious model could generate and run arbitrary commands. The `--trust_remote_code` flag, used for custom/private models, additionally allows loading and executing arbitrary Python code from a remote Hugging Face model repository or local path; a malicious or compromised model would run that code on the host. Docker is suggested as a mitigation for some workflows, but it is not enforced everywhere these flags are used. *Remediation:* always execute untrusted code (model generations or remote model code) in an isolated, sandboxed environment (e.g., a dedicated Docker container, VM, or secure sandbox service) with minimal permissions; avoid `--trust_remote_code` unless the source is verified and trusted, and fully sandbox the environment when it is necessary. | LLM | SKILL.md:60 |
| MEDIUM | **Network egress to untrusted endpoints.** HTTP request to a raw IP address. *Remediation:* review all outbound network calls; remove connections to webhook collectors, paste sites, and raw IP addresses. Legitimate API calls should use well-known service domains. | Manifest | cli-tool/components/mcps/devtools/figma-dev-mode.json:4 |
| LOW | **Covert behavior / concealment directives.** Multiple zero-width characters (stealth text). *Remediation:* remove hidden instructions, zero-width characters, and bidirectional overrides; skill instructions should be fully visible and transparent to users. | Manifest | cli-tool/components/mcps/devtools/jfrog.json:4 |
| LOW | **Unpinned `bigcode-evaluation-harness` dependency.** The dependency is not pinned to a specific version in the manifest, so `pip install` fetches the latest release, which could introduce unexpected behavior, breaking changes, or malicious code if the upstream repository is compromised. Other dependencies use `>=` version bounds, but the primary package being unpinned is the higher risk. *Remediation:* pin `bigcode-evaluation-harness` to a specific, known-good version (e.g., `bigcode-evaluation-harness==X.Y.Z`) to ensure deterministic builds and reduce supply-chain risk. | LLM | SKILL.md:1 |
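The sandboxing recommended for the critical finding can be sketched as a small wrapper that refuses to pass the dangerous flags outside a container. This is a minimal illustration, not part of the harness or of SkillShield; the Docker image tag and model/task names are placeholder assumptions.

```python
import shlex
import subprocess  # would be used to actually run the command


def build_harness_command(model: str, task: str, sandboxed: bool) -> list[str]:
    """Build an `accelerate launch main.py` invocation for the
    bigcode-evaluation-harness, enabling code execution only in Docker."""
    cmd = ["accelerate", "launch", "main.py",
           "--model", model, "--tasks", task]
    if sandboxed:
        # Execution of model-generated code is only allowed inside the
        # container, with networking disabled to limit egress.
        cmd.append("--allow_code_execution")
        # "evaluation-harness:latest" is a hypothetical image tag; build it
        # from the harness's own Dockerfile.
        return ["docker", "run", "--rm", "--network", "none",
                "evaluation-harness:latest"] + cmd
    # Without a sandbox, the dangerous flag is simply never added.
    return cmd


if __name__ == "__main__":
    cmd = build_harness_command("bigcode/santacoder", "humaneval",
                                sandboxed=True)
    print(shlex.join(cmd))
```

The key design point is that `--allow_code_execution` is appended only on the sandboxed path, so a caller cannot accidentally enable host-side execution of model-generated code.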
Embed Code
[View the full report](https://skillshield.io/report/b5626b3cf1a9aff2)
Powered by SkillShield