Web (and other) code cross-pollination

I alluded to this in a previous post.

It’s trivial to spider a site, find all the image files (.jpg, .gif, .bmp, whatever) and then, if a file name is sufficiently random, Google for other sites which may be using the same graphics file. Now, with the release of Google’s codesearch, I can take my searches to a new level. It is my opinion that webserver content has become quite cross-pollinated over the years. And it’s not just limited to web content…
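A rough sketch of the image-name idea, in Python. It assumes the page HTML has already been fetched, and the "sufficiently random" test (a long stem that mixes letters and digits and has high character entropy) is a made-up heuristic with guessed thresholds, not anything measured. The names ImageCollector, looks_random and searchable_images are hypothetical.

```python
import math
from collections import Counter
from html.parser import HTMLParser
from posixpath import basename, splitext
from urllib.parse import urlparse

IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".gif", ".bmp", ".png"}


class ImageCollector(HTMLParser):
    """Collect the file names of images referenced by <img src=...> tags."""

    def __init__(self):
        super().__init__()
        self.names = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = dict(attrs).get("src")
        if not src:
            return
        name = basename(urlparse(src).path)
        if splitext(name)[1].lower() in IMAGE_EXTENSIONS:
            self.names.append(name)


def entropy(text):
    """Shannon entropy of the characters in text, in bits per character."""
    counts = Counter(text)
    total = len(text)
    return -sum(n / total * math.log2(n / total) for n in counts.values())


def looks_random(name, min_length=8, min_entropy=3.0):
    """Crude heuristic (assumed thresholds): long, mixed, high-entropy stems."""
    stem = splitext(name)[0]
    return (
        len(stem) >= min_length
        and any(ch.isdigit() for ch in stem)
        and any(ch.isalpha() for ch in stem)
        and entropy(stem) >= min_entropy
    )


def searchable_images(html):
    """Return the image file names distinctive enough to be worth searching for."""
    collector = ImageCollector()
    collector.feed(html)
    return [name for name in collector.names if looks_random(name)]
```

Names like logo.gif get dropped because they would match half the internet; whatever survives is a candidate for a search engine query.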

Lately, I’ve been looking at things like small blocks of code, variable names, and comments within source code. I then use Google’s codesearch to find other apps which use the same strings. Here is an example:

route (Mike S.) includes some example C code with his libnet libraries. In libnet-example-1.c, we have:

int network, packet_size, c;
u_long src_ip, dst_ip;

Now, look at this query

And you see what I’m talking about. Now, that’s marginally interesting. Questions like “who is using Nmap source code in their proprietary scanner?” or “who stole a Nessus plugin for their Python scanner?” can make for interesting water-cooler discussions…

*More* interesting is when you are doing a code audit for a company with their own home-grown apps…and, you find large chunks of open source code within these apps. It’s interesting because:

1) it’s just fun to watch people steal stuff and claim it as their own and

2) when they steal from an open source product that later had bugs, they are in a bit of a quandary: if they lifted the code piecemeal from open source, they aren’t in a position where they can easily patch or upgrade. They have to fix it themselves. For the auditor, finding a closed-source app which contains large amounts of open-source code means that there is a chance that someone has already looked for flaws in that code. Looking for past bugs in the open-source app might lead to finding bugs in the closed-source app. And that feeds my laziness gene.

I’m gonna automate this process, methinks.

1) go through the source and grab hunks of code which are of a suitable length (we don’t want ‘int i;’, for instance).

2) Look up that hunk of code on codesearch

3) Report
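The three steps above could be sketched roughly like this in Python. Only step 1 is real here: the window size and minimum length are guesses, and lookup is a hypothetical callable that takes a hunk and returns a list of matching sources, since no actual code search API is assumed.

```python
import re


def normalize(line):
    """Collapse whitespace so re-indented copies still match."""
    return re.sub(r"\s+", " ", line).strip()


def candidate_hunks(source, window=2, min_chars=30):
    """Step 1: yield (line_number, hunk) for runs of `window` non-blank lines.

    Hunks shorter than `min_chars` (think 'int i;') are skipped because
    they would match everything. Both defaults are assumptions.
    """
    lines = [(number, normalize(text))
             for number, text in enumerate(source.splitlines(), 1)]
    lines = [(number, text) for number, text in lines if text]
    for start in range(len(lines) - window + 1):
        chunk = lines[start:start + window]
        hunk = " ".join(text for _, text in chunk)
        if len(hunk) >= min_chars:
            yield chunk[0][0], hunk


def report(source, lookup, **kwargs):
    """Steps 2 and 3: run every hunk through `lookup` and keep the hits.

    `lookup` is a hypothetical callable(hunk) -> list of matching sources.
    """
    findings = []
    for line_number, hunk in candidate_hunks(source, **kwargs):
        matches = lookup(hunk)
        if matches:
            findings.append((line_number, hunk, matches))
    return findings
```

Fed the two libnet declarations from earlier, candidate_hunks joins them into the single string "int network, packet_size, c; u_long src_ip, dst_ip;", which is the kind of thing worth pasting into a search box.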

Other interesting stuff to look for:

1) Order of include files
- did they start with string.h, math.h, stdio.h, and arpa/inet.h?
- do they exactly match order of imports? Do they use separate lines for importing?
- from sys import argv, stdin, exit
- from sys import exit, stdin, argv
- import re,sys,socket,string,time
- Heh. Here’s a funny one. People who steal includes or imports and end up pulling in files which aren’t ever even needed by the compiler or interpreter :-)

2) order of macros (since we probably wouldn’t evaluate each separate line individually)

3) order of variable initialization

4) binary code converted to another format
- “\x94\x3D\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00\x20\x43\x4B\x41”
- 0x94,0x3D,0x00,0x00,0x00,0x01,…
- |943D00000001…

5) comments converted to another format (why steal code and bother re-formatting the stolen comments?)
- //Ensure that recv buffer is not NULL
- #Ensure that recv buffer is not NULL
- /*Ensure that recv buffer is not NULL

6) Upper vs lower case and use of spaces within variable init
- File *myINPUT =
- FILE * MyInput =

7) Copyright (or left) information which is commented out.
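Several of the items above (1, 4, 5 and 6) come down to normalizing two strings before comparing them. Here is a minimal Python sketch of that idea; every function name is hypothetical, and the regexes only cover the formats shown in the list, nothing more.

```python
import re


def imported_names(line):
    """Item 1: names from a Python import line, in the order written."""
    match = re.match(r"\s*(?:from\s+\S+\s+)?import\s+(.+)", line)
    if not match:
        return []
    return [name.strip() for name in match.group(1).split(",") if name.strip()]


def byte_fingerprint(text):
    """Item 4: reduce \\x94\\x3D..., 0x94,0x3D,... or bare 943D... to one hex string."""
    prefixed = re.findall(r"(?:\\x|0x)([0-9A-Fa-f]{2})", text)
    if prefixed:
        return "".join(prefixed).upper()
    return "".join(re.findall(r"[0-9A-Fa-f]{2}", text)).upper()


def comment_fingerprint(line):
    """Item 5: strip the //, # or /* */ markers, then fold case and spacing."""
    text = re.sub(r"^\s*(//+|#+|/\*+)\s*", "", line)
    text = re.sub(r"\s*\*+/\s*$", "", text)
    return " ".join(text.lower().split())


def declaration_fingerprint(line):
    """Item 6: ignore case and spacing in a declaration."""
    return re.sub(r"\s+", "", line).lower()


# Same imports in the same order is a strong hint; same set in a
# different order is a weaker one that still deserves a look.
first = imported_names("from sys import argv, stdin, exit")
second = imported_names("from sys import exit, stdin, argv")
same_order = first == second
same_set = sorted(first) == sorted(second)
```

With these, the three byte strings from item 4 all collapse to a string starting with 943D00000001, the three comment variants from item 5 collapse to "ensure that recv buffer is not null", and the two declarations from item 6 both become "file*myinput=".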

I’m sure there are many more ways of finding source horkage. This will be fun.

Unrelated. This is a good month for new security tools.

Pantera. Still barebones, but I like their passive features, and when they get around to borrowing Dave’s SPIKE fuzzing routines, this will be nice. Interface is a nice improvement.

New release of SinFP. I love this tool.

Metasploit. This project rocks.