feat: expand tool catalog and improve 'what' search recall
- Add 32 new tool and dockerfile entries to README.md catalog.
- Increase 'what' shortlist limit to 100 for better search recall.
- Update 'what' default model to gemma4 and make JSON parsing more robust.
@@ -92,6 +92,11 @@ Format: `path | goal | usage`. This section is intentionally compact so `what` c
 - `tools/data/between` | goal: print text between delimiters | usage: `tools/data/between START END < file.txt`
 - `tools/data/csv_get` | goal: extract selected CSV fields quickly | usage: `tools/data/csv_get file.csv column`
 - `tools/data/csv2dot` | goal: turn CSV relationships into Graphviz dot edges | usage: `tools/data/csv2dot`
+- `config/visidata/plugins/ioc.py` | goal: VisiData plugin for IOC types (domains, URLs, hashes) with VT/MB integration | usage: `vd --plugin config/visidata/plugins/ioc.py ...`
+- `config/visidata/plugins/iptype.py` | goal: VisiData plugin for IP and CIDR types with enrichment | usage: `vd --plugin config/visidata/plugins/iptype.py ...`
+- `config/visidata/plugins/lookupcore.py` | goal: Core lookup and caching logic for VisiData plugins | usage: `(internal use)`
+- `config/visidata/scripts/validate_ioclib.py` | goal: Offline validation for IOC parsing logic | usage: `python3 config/visidata/scripts/validate_ioclib.py`
+- `config/visidata/scripts/validate_ip_lookups.py` | goal: Offline validation for IP lookup logic | usage: `python3 config/visidata/scripts/validate_ip_lookups.py`
 
 ### Hashing And Archives
 
@@ -115,6 +120,10 @@ Format: `path | goal | usage`. This section is intentionally compact so `what` c
 - `tools/cloud/speech.py` | goal: run cloud-backed speech or transcription tasks | usage: `python3 tools/cloud/speech.py input`
 - `tools/cloud/vqa3.py` | goal: classify images with a local or model-backed VQA workflow | usage: `python3 tools/cloud/vqa3.py image.jpg`
 - `tools/cloud/youtube_resolve.sh` | goal: resolve direct media URLs from YouTube-like inputs | usage: `tools/cloud/youtube_resolve.sh URL`
+- `tools/dockerpullsave.py` | goal: download Docker images as tarballs without requiring a Docker daemon | usage: `python3 tools/dockerpullsave.py image:tag`
+- `scripts/proxy/install_proxy.sh` | goal: installer for the Dumb Pipe Proxy Bridge service | usage: `scripts/proxy/install_proxy.sh`
+- `scripts/proxy/bridge.js` | goal: Node.js proxy bridge with keyring authentication support | usage: `node scripts/proxy/bridge.js`
+- `scripts/proxy/setup.js` | goal: Interactive setup for storing proxy credentials in the system keyring | usage: `node scripts/proxy/setup.js`
 
 ### Formats, System, And Text Experiments
 
@@ -128,9 +137,28 @@ Format: `path | goal | usage`. This section is intentionally compact so `what` c
 - `tools/system/ltop.py` | goal: show the most frequent lines from a stream like `top` | usage: `tail -f log | python3 tools/system/ltop.py`
 - `tools/system/noerr` | goal: run a command with stderr suppressed | usage: `tools/system/noerr some command`
 - `tools/system/wipe.sh` | goal: perform destructive wipe or cleanup steps | usage: `tools/system/wipe.sh target`
+- `tools/system/copy_firefox_extension.sh` | goal: sync Firefox extensions between profile (e.g. internet to intranet) | usage: `tools/system/copy_firefox_extension.sh [ext_name]`
 - `tools/text/probability.py` | goal: run a small text probability experiment | usage: `python3 tools/text/probability.py`
 - `tools/text/depth` | goal: inspect text depth or nesting characteristics | usage: `tools/text/depth input.txt`
 
+### Container Recipes
+
+- `dockerfiles/firefox.dockerfile` | goal: Docker recipe for a containerized Firefox with VNC access | usage: `docker build -f dockerfiles/firefox.dockerfile .`
+- `dockerfiles/kali.dockerfile` | goal: Docker recipe for a Kali Linux base image | usage: `docker build -f dockerfiles/kali.dockerfile .`
+- `dockerfiles/plaso.dockerfile` | goal: Docker recipe for the Plaso (log2timeline) forensic tool | usage: `docker build -f dockerfiles/plaso.dockerfile .`
+- `dockerfiles/volatility/Dockerfile` | goal: Docker recipe for Volatility memory forensics | usage: `docker build -f dockerfiles/volatility/Dockerfile .`
+- `dockerfiles/regripper/Dockerfile` | goal: Docker recipe for RegRipper (Windows registry analysis) | usage: `docker build -f dockerfiles/regripper/Dockerfile .`
+- `dockerfiles/pdf-analysis/pdf-analysis.dockerfile` | goal: Docker recipe for a PDF analysis environment with peepdf and DidierStevensSuite | usage: `docker build -f dockerfiles/pdf-analysis/pdf-analysis.dockerfile .`
+- `dockerfiles/flatpdf/Dockerfile` | goal: Docker recipe for a PDF flattening environment using pdftk | usage: `docker build -f dockerfiles/flatpdf/Dockerfile .`
+- `dockerfiles/tools/clamav.dockerfile` | goal: Docker recipe for a ClamAV scanner | usage: `docker build -f dockerfiles/tools/clamav.dockerfile .`
+- `dockerfiles/tools/john.dockerfile` | goal: Docker recipe for John the Ripper (Kali-based) | usage: `docker build -f dockerfiles/tools/john.dockerfile .`
+- `dockerfiles/tools/nmap.dockerfile` | goal: Docker recipe for an Nmap scanner | usage: `docker build -f dockerfiles/tools/nmap.dockerfile .`
+- `dockerfiles/tools/tcpdump.dockerfile` | goal: Docker recipe for tcpdump packet capture | usage: `docker build -f dockerfiles/tools/tcpdump.dockerfile .`
+- `dockerfiles/cherokee/cherokee.dockerfile` | goal: Docker recipe for the Cherokee web server | usage: `docker build -f dockerfiles/cherokee/cherokee.dockerfile .`
+- `dockerfiles/logstash/logstash.conf` | goal: Sample Logstash configuration for various ingestion cases | usage: `docker run -v $(pwd)/logstash.conf:/usr/share/logstash/pipeline/logstash.conf logstash`
+- `dockerfiles/build_firefox.sh` | goal: Build and run the Firefox VNC container | usage: `dockerfiles/build_firefox.sh`
+- `dockerfiles/build_kali.sh` | goal: Build the Kali Linux Docker image | usage: `dockerfiles/build_kali.sh`
+
 ### CTF Helpers
 
 - `tools/ctf/filtertext.py` | goal: filter challenge text to useful fragments | usage: `python3 tools/ctf/filtertext.py input.txt`
@@ -21,7 +21,7 @@ from pathlib import Path
 
 REPO_ROOT = Path(__file__).parent.resolve()
 README_PATH = REPO_ROOT / "README.md"
-DEFAULT_MODEL = os.environ.get("WHAT_OLLAMA_MODEL", "ministral-3:3b")
+DEFAULT_MODEL = os.environ.get("WHAT_OLLAMA_MODEL", "gemma4")
 CATALOG_HEADING = "## Tool Catalog"
 ENTRY_RE = re.compile(
     r"^- `([^`]+)` \| goal: (.*?) \| usage: (.*)$"
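As a rough illustration of how the `path | goal | usage` catalog lines line up with `ENTRY_RE`, here is a minimal sketch. The pattern is copied from the hunk above; `parse_entry` is a hypothetical helper that does not appear in the commit, and any further arguments to `re.compile()` outside the visible context are not reproduced.

```python
import re

# Pattern copied from the hunk above. Anything else passed to re.compile()
# in the real script is outside the visible diff context.
ENTRY_RE = re.compile(r"^- `([^`]+)` \| goal: (.*?) \| usage: (.*)$")


def parse_entry(line):
    """Hypothetical helper: split one catalog line into path, goal, and usage."""
    match = ENTRY_RE.match(line.strip())
    if match is None:
        return None  # headings and blank lines are not catalog entries
    path, goal, usage = match.groups()
    return {"path": path, "goal": goal, "usage": usage}


line = (
    "- `tools/system/ltop.py` | goal: show the most frequent lines from a "
    "stream like `top` | usage: `tail -f log | python3 tools/system/ltop.py`"
)
entry = parse_entry(line)
print(entry)
```

The lazy `(.*?)` for the goal stops at the first ` | usage: `, while the greedy `(.*)` lets the usage field itself contain a shell pipe, as in the `ltop.py` entry.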
@@ -123,13 +123,20 @@ def build_prompt(query: str, entries: list[dict[str, str]]) -> str:
     return f"""You are selecting tools from a repository catalog.
 Use only the catalog below. Prefer direct matches. Use archived tools only if they clearly fit the request.
 
-Return strict JSON only. The response must be a JSON array with up to 8 objects.
-Each object must contain:
-- "path": exact catalog path
-- "reason": one short sentence
-
-Do not invent paths. Do not include markdown.
-Prefer the entry whose action best matches the query: compare beats hash for comparison queries, open beats convert for opening queries, and mount beats inspect for mount queries.
+Return strict JSON matching this schema exactly:
+{{
+"results": [
+{{
+"path": "exact catalog path",
+"reason": "one short sentence explaining why this tool matches"
+}}
+]
+}}
+
+Constraints:
+- The "results" array must contain up to 8 objects.
+- Do not invent paths.
+- Prefer the entry whose action best matches the query: compare beats hash for comparison queries, open beats convert for opening queries, and mount beats inspect for mount queries.
 
 Query: {query}
 
@@ -142,7 +149,7 @@ def tokenize(text: str) -> set[str]:
     return set(TOKEN_RE.findall(text.lower()))
 
 
-def shortlist_entries(query: str, entries: list[dict[str, str]], limit: int = 28) -> list[dict[str, str]]:
+def shortlist_entries(query: str, entries: list[dict[str, str]], limit: int = 100) -> list[dict[str, str]]:
     query_tokens = tokenize(query)
     if not query_tokens:
         return entries[:limit]
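The scoring part of `shortlist_entries` is outside the visible context, so the following is only a sketch of how a token-overlap shortlist with a `limit` might behave. The `TOKEN_RE` pattern, the overlap-count scoring, and the `path`/`goal`/`usage` dictionary keys are all assumptions rather than the script's actual logic.

```python
import re

# Assumption: the real TOKEN_RE is not shown in the diff.
TOKEN_RE = re.compile(r"[a-z0-9]+")


def tokenize(text):
    return set(TOKEN_RE.findall(text.lower()))


def shortlist_entries(query, entries, limit=100):
    query_tokens = tokenize(query)
    if not query_tokens:
        return entries[:limit]

    def score(entry):
        # Assumed scoring: number of query tokens found anywhere in the entry.
        haystack = " ".join((entry["path"], entry["goal"], entry["usage"]))
        return len(query_tokens & tokenize(haystack))

    # sorted() is stable, so ties keep their catalog order.
    return sorted(entries, key=score, reverse=True)[:limit]


catalog = [
    {"path": "tools/data/between", "goal": "print text between delimiters",
     "usage": "tools/data/between START END < file.txt"},
    {"path": "tools/data/csv_get", "goal": "extract selected CSV fields quickly",
     "usage": "tools/data/csv_get file.csv column"},
    {"path": "tools/data/csv2dot", "goal": "turn CSV relationships into Graphviz dot edges",
     "usage": "tools/data/csv2dot"},
]
top = shortlist_entries("csv fields", catalog, limit=2)
print([entry["path"] for entry in top])
```

Under this kind of scheme, a larger `limit` lets weakly matching entries still reach the model, which is the recall trade-off the commit message describes.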
@@ -163,28 +170,45 @@ def shortlist_entries(query: str, entries: list[dict[str, str]], limit: int = 28
 
 
 def extract_json_array(output: str) -> list[dict[str, str]]:
-    match = re.search(r"\[\s*\{.*\}\s*\]", output, re.DOTALL)
+    # Step 1: Clean and find the root object boundary if Ollama prefixes anything
+    match = re.search(r"\{\s*.*\}\s*", output, re.DOTALL)
     payload = match.group(0) if match else output
 
-    data = json.loads(payload)
-    if not isinstance(data, list):
-        raise WhatError("Model output must be a JSON array.")
+    try:
+        # ALLOW literal newlines/control characters inside string properties
+        data = json.loads(payload, strict=False)
+    except json.JSONDecodeError as exc:
+        raise WhatError(f"Failed to parse model output as JSON: {exc}")
+
+    if not isinstance(data, dict):
+        raise WhatError("Model output must be a root JSON object.")
+
+    # Step 2: Safe navigation into the expected schema array
+    results_list = data.get("results")
+    if results_list is None:
+        raise WhatError("Missing 'results' key in model JSON response.")
+
+    if not isinstance(results_list, list):
+        raise WhatError("The 'results' property must be a JSON array.")
+
+    # Step 3: Extract and normalize items
     normalized: list[dict[str, str]] = []
-    for item in data:
+    for item in results_list:
         if not isinstance(item, dict):
             continue
         path = str(item.get("path", "")).strip()
-        reason = str(item.get("reason", "")).strip()
+        # Clean up any literal newlines the model injected into the text
+        reason = str(item.get("reason", "")).replace("\n", " ").strip()
         if path:
             normalized.append({"path": path, "reason": reason})
 
     return normalized
 
 
 def run_ollama_once(prompt: str, model: str) -> str:
     try:
         result = subprocess.run(
-            ["ollama", "run", model, prompt],
+            ["ollama", "run", "--format", "json", "--hidethinking", model, prompt],
             capture_output=True,
             text=True,
             timeout=60,
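To see why the new parser passes `strict=False`, the sketch below feeds a response whose `reason` string contains a raw newline through both modes of `json.loads`. The sample payload and the `parse_results` helper are hypothetical; the helper only mirrors the three steps in the hunk above in condensed form and raises `ValueError` where the script raises `WhatError`.

```python
import json

# Hypothetical model output: the "\n" here is a real newline character
# sitting inside a JSON string, which strict JSON does not allow.
raw = '{"results": [{"path": "tools/data/between", "reason": "prints text\nbetween delimiters"}]}'


def parse_results(text):
    """Condensed stand-in for the parsing steps shown in the diff."""
    data = json.loads(text, strict=False)  # tolerate control characters in strings
    if not isinstance(data, dict) or not isinstance(data.get("results"), list):
        raise ValueError("expected an object with a 'results' array")
    normalized = []
    for item in data["results"]:
        if not isinstance(item, dict):
            continue
        path = str(item.get("path", "")).strip()
        reason = str(item.get("reason", "")).replace("\n", " ").strip()
        if path:
            normalized.append({"path": path, "reason": reason})
    return normalized


try:
    json.loads(raw)  # default strict=True rejects the raw newline
    strict_failed = False
except json.JSONDecodeError:
    strict_failed = True

results = parse_results(raw)
print(strict_failed, results)
```

With the default strict mode the payload is rejected outright, whereas the lenient parse succeeds and the later `replace("\n", " ")` flattens the reason onto one line.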
@@ -206,8 +230,9 @@ def run_ollama(prompt: str, model: str) -> list[dict[str, str]]:
         return extract_json_array(first_output)
     except (json.JSONDecodeError, WhatError):
         repair_prompt = (
-            "Rewrite the following response as strict JSON only.\n"
-            'Target format: [{"path":"exact catalog path","reason":"short reason"}]\n'
+            "Rewrite the following response as strict JSON matching the target schema.\n"
+            "Target format:\n"
+            '{\n "results": [{"path":"exact catalog path","reason":"short reason"}]\n}\n'
             "Do not add markdown or commentary.\n\n"
             f"Response to repair:\n{first_output}\n"
         )
@@ -300,4 +325,4 @@ def main() -> int:
 
 
 if __name__ == "__main__":
     raise SystemExit(main())