Retrieve file clusters
Private API
This endpoint is available in the Private API only.
VirusTotal has built its own in-house file similarity clustering functionality. At present, this clustering works only on PE, PDF, DOC and RTF files and is based on a very simple structural feature hash. Certain compression and packing strategies can often confuse this hash. In other words, the clustering logic is no holy grail, yet it has proven very useful in the past.
All API responses are JSON objects. If no clusters were identified for the given time frame, the response_code property is 0. If there was an error with your query, it is -1. If your query succeeded and file similarity clusters were found, it is 1 and the remaining JSON properties contain the clustering information.
{
  "response_code": 1,
  "verbose_msg": "Clustering information enclosed",
  "num_candidates": 184701,
  "num_clusters": 44269,
  "size_top200": 85160,
  "clusters": [
    {
      "label": "vilsel [0001]",
      "avg_positives": 43,
      "id": "vhash 0740361d051)z1e3z 2013-09-10",
      "size": 5712
    },
    {
      "label": "installer [0032]",
      "avg_positives": 7,
      "id": "vhash 02503e0f7d5019z6hz13z1fz 2013-09-10",
      "size": 4710
    },
    {
      "label": "megasearch [0000]",
      "avg_positives": 12,
      "id": "vhash 075056651d1d155az639z25z12z14fz 2013-09-10",
      "size": 3651
    }
  ]
}
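The three response_code values can be handled with a small helper once the response has been parsed into a dictionary. The following is a minimal sketch, not an official client: the function name extract_clusters is hypothetical, and it assumes that error responses may also carry a verbose_msg property, which the documentation only shows for successful responses.

```python
from typing import Any, Dict, List


def extract_clusters(response: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return the cluster list from a parsed response (hypothetical helper).

    response_code 1  -> clusters were found, return them
    response_code 0  -> no clusters for the time frame, return []
    response_code -1 -> the query failed, raise ValueError
    """
    code = response.get("response_code")
    if code == 1:
        return response.get("clusters", [])
    if code == 0:
        return []
    if code == -1:
        # Assumption: verbose_msg may be present on errors; fall back if not.
        raise ValueError(response.get("verbose_msg", "query error"))
    raise ValueError(f"unexpected response_code: {code!r}")


# Usage with a trimmed version of the sample response.
sample = {
    "response_code": 1,
    "clusters": [{"label": "vilsel [0001]", "size": 5712}],
}
for cluster in extract_clusters(sample):
    print(cluster["label"], cluster["size"])
```

Raising on -1 while returning an empty list on 0 keeps "nothing found" distinct from "the query was wrong", which mirrors the distinction the endpoint itself makes.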
The JSON object returned by this endpoint contains several properties, some of which may not be self-explanatory. The following table provides further insight:
num_candidates | Total number of files submitted during the given time frame for which a feature hash could be calculated. |
num_clusters | Total number of clusters generated for the given time frame. A cluster can be as small as a single file, meaning that no other file with similar features was found. |
size_top200 | The sum of the number of files in the 200 largest clusters identified. |
clusters | List of JSON objects that contain details about the 200 largest clusters identified. Each object has four properties: id, label, size and avg_positives. The id field can be passed to the search API call to retrieve the files contained in the given cluster. The label property is a human-readable name for the cluster. The size field is the number of files that make up the cluster. Finally, avg_positives is the average number of antivirus detections across the files in the cluster. |
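The properties above lend themselves to a few simple derived figures. The sketch below is illustrative only and works on an already parsed response. The helper names top200_coverage and rank_by_detections are hypothetical, and treating the ratio size_top200 / num_candidates as the share of hashed files covered by the 200 largest clusters is an interpretation of the table, not something the endpoint reports itself.

```python
from typing import Any, Dict, List


def top200_coverage(response: Dict[str, Any]) -> float:
    """Fraction of candidate files that fall in the 200 largest clusters."""
    candidates = response["num_candidates"]
    if candidates == 0:
        return 0.0
    return response["size_top200"] / candidates


def rank_by_detections(clusters: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return clusters sorted by avg_positives, highest first."""
    return sorted(clusters, key=lambda c: c["avg_positives"], reverse=True)


# Usage with the three clusters from the sample response.
response = {
    "num_candidates": 184701,
    "size_top200": 85160,
    "clusters": [
        {"label": "vilsel [0001]", "avg_positives": 43,
         "id": "vhash 0740361d051)z1e3z 2013-09-10", "size": 5712},
        {"label": "installer [0032]", "avg_positives": 7,
         "id": "vhash 02503e0f7d5019z6hz13z1fz 2013-09-10", "size": 4710},
        {"label": "megasearch [0000]", "avg_positives": 12,
         "id": "vhash 075056651d1d155az639z25z12z14fz 2013-09-10", "size": 3651},
    ],
}
print(f"top-200 coverage: {top200_coverage(response):.1%}")
for cluster in rank_by_detections(response["clusters"]):
    # The id string is what would be handed to the search API call.
    print(cluster["avg_positives"], cluster["label"], cluster["id"])
```

Sorting by avg_positives puts the clusters most widely detected by antivirus engines first. Sorting in ascending order instead would surface large clusters with few detections.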