Tools

SubCrawl – A Modular Framework For Discovering Open Directories, Identifying Unique Content Through Signatures And Organizing The Data With Optional Output Modules, Such As MISP

October 23, 2021

SubCrawl is a framework developed by

However, if this UI is not sufficient for the subsequent evaluation of the data, the MISP storage module can be activated alternatively or additionally. The corresponding settings must be made in config.yml under the MISP section.

The following two commands are enough to clone the GIT repository, create the Docker container and start it directly. Afterwards the web UI can be reached at the address https://localhost:8000/. Please note, once the

To add additional YARA rules, you can add .YAR files to the yara-rules folder, and then include the rule file by adding an include statement to combined-rules.yar.

ClamAV

The ClamAV processing module is used to scan HTTP response content during scanning with ClamAV. If a match is found, it is provided to the various output modules. To invoke this processing module, provide the value ClamAVProcessing as a processing module argument. For example, the following command will load the ClamAV processing module and produce output to the console via the ConsoleStorage storage module.

python3 subcrawl.py -p ClamAVProcessing -s ConsoleStorage

Sample output:

To utilize this module, ClamAV must be installed. From a terminal, install ClamAV using the APT package manager:

$ sudo apt-get install clamav-daemon clamav-freshclam clamav-unofficial-sigs

Once installed, the ClamAV update service should already be running. However, if you want to manually update using freshclam, ensure that the service is stopped:

sudo systemctl stop clamav-freshclam.service

And then run freshclam manually:

$ sudo freshclam

Finally, check the status of the ClamAV service:

$ sudo systemctl status clamav-daemon.service

If the service is not running, you can use systemctl to start it:

$ sudo systemctl start clamav-daemon.service

Payload

The Payload processing module is used to identify HTTP response content using the libmagic library. Additionally, SubCrawl can be configured to save content of interest, such as PE files or archives. To invoke this processing module, provide the value PayloadProcessing as a processing module argument. For example, the following command will load the Payload processing module and produce output to the console:

python3 subcrawl.py -p PayloadProcessing -s ConsoleStorage

There are no additional dependencies for this module.

Sample output:

Storage Modules

Storage modules are called by the SubCrawl engine after all URLs from the queue have been scanned. They were designed with two objectives in mind. First, to obtain the results from scanning immediately after finishing the scan queue and secondly to enable long-term storage and analysis. Therefore we not only implemented a ConsoleStorage module but also an integration for MISP and an SQLite storage module.

Console

To quickly analyse results directly after scanning URLs, a well-formatted output is printed to the console. This output is best suited for when SubCrawl is used in run-once mode. While this approach worked well for scanning single domains or generating quick output, it is unwieldy for long-term research and analysis.

SQLite

Since the installation and configuration of MISP can be time-consuming, we implemented another module which stores the data in an SQLite database. To present the data to the user as simply and clearly as possible, we also developed a simple web GUI. Using this web application, the scanned domains and URLs can be viewed and searched with all their attributes. Since this is only an early version, no complex comparison features have been implemented yet.

MISP

Building your own Modules

Templates for processing and storage modules are provided as part of the framework.

Processing Modules

Processing modules can be found under crawler->processing and a sample module file example_processing.py found in this directory. The template provides the necessary inheritance and imports to ensure execution by the framework. The init function provides for module initialization and receives an instance of the logger and the global configuration. The logger is used to provide logging information from the processing modules, as well as throughout the framework.

The process function is implemented to process each HTTP response. To this end, it receives the URL and the raw response content. This is where the work of the module is implemented. This function should return a dictionary with the following fields:

hash: the sha256 of the content
url: the URL the content was retrieved from
matches: any matching results in the module, For example, libmagic or YARA results.

A unique class name must be defined and is used to define this module when including it via the -p argument or as a default processing module in the configuration file.

Finally, add an import statement in __init__.py, using your class name:

from .<REPLACE>_processing import <REPLACE>Processing

Storage Modules

Storage modules can be found under crawler->storage and a sample module file example_storage.py found in this directory. Similar to the processing modules, init function provides for module initialization and receives an instance of the logger and the global configuration. The store_results function receives structured data from the engine at intervals defined by the batch size in the configuration file.

A unique class name must be defined and is used to load the module when including it via the -s argument or as a default processing module in the configuration file.

Presentations and Other Resources

2021:

BlackHat Arsenal USA
VirusBulletin Localhost – Upcoming

License

SubCrawl is licensed under the MIT license

Download Subcrawl

If you like the site, please consider joining the telegram channel or supporting us on Patreon using the button below.

SubCrawl – A Modular Framework For Discovering Open Directories, Identifying Unique Content Through Signatures And Organizing The Data With Optional Output Modules, Such As MISP

CVE Alert: CVE-2025-47479

CVE Alert: CVE-2025-39487

CVE Alert: CVE-2025-47565

CVE Alert: CVE-2025-49247

CVE Alert: CVE-2025-32311

Original Source

You may have missed

CVE Alert: CVE-2025-47479

CVE Alert: CVE-2025-39487

CVE Alert: CVE-2025-47565

CVE Alert: CVE-2025-49247

CVE Alert: CVE-2025-32311