10-23-2007, 12:01 PM
duane
Proxy scraper PHP script

Hi all, I found a script that, when it works, scrapes proxies from various proxy list sites, but I can't get it to work. It's probably something simple, but I'm no PHP programmer by far. I'm sure this script would be useful to you guys if you can get it to work.

Code:
<?PHP
  
  //      Use my good old friend the HttpClient class
  include("HttpClient.class.php");
  //      Here is an array of public proxy lists
  $links = array (
    "http://www.publicproxyservers.com/page1.html",
    "http://www.proxy4free.com/page1.html",
    "http://www.anonymitychecker.com/page1.html",
    "http://www.samair.ru/proxy/index.htm",
    "http://www.samair.ru/proxy/proxy-20.htm");
  //      A very simple regular expression to pull the domain out of url
  $pattern3='/http\:\/\/([a-z-\.0-9]+)\//U';
  foreach ($links as $id1 => $link) {
    preg_match_all($pattern3,$link,$domain);
    //      Make sure each link is going to a new domain
    $client = new HttpClient($domain[1][0]);
    $client->setDebug(false);
    $client->setPersistCookies(true);
    $client->timeout = 20;
    $client->max_redirects = 50;
    //      Each domain needs its own cookie host
    $client->cookie_host = $domain[1][0];
    $client->setUserAgent('Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.3a) Gecko/20021207');
    $client->get($link);
    $proxypage = $client->getContent();
    //      Here is the first regex to pull info from a proxy site.  This pulls in the IP, PORT etc.
    $pattern='/<td.*>([0-9]+\.[0-9]+\.[0-9]+\.[0-9]+)<\/td>\n[ ]+<td.*>([0-9]+)<\/td>\n[ ]+
          <td.*>([a-z \+]+)<\/td>\n[ ]+<td.*>([a-zA-Z0-9\(\) ]+)<\/td>/U';
    //      Here is the second regex to pull info from a proxy site.  This pulls in the IP, PORT etc.
    $pattern2='/([0-9]+\.[0-9]+\.[0-9]+\.[0-9]+):([0-9]+)[ ]+([a-z \+]+)([A-Z][A-Za-z]+)/ms';
    preg_match_all($pattern,$proxypage,$proxylist);
    //      The [1] record of proxylist contains results, if it is 0 then first regex failed, go to second one.
    if (count($proxylist[1])==0) {
        unset($proxylist);
        preg_match_all($pattern2,$proxypage,$proxylist);
    }
    //      Run through all the results found from the raw html.
    for($i=0;$i<count($proxylist[1]);$i++) {
        //      Only keep results whose type mentions anonymous or elite
        if ((!(strpos($proxylist[3][$i],"anony")===FALSE))||(!(strpos($proxylist[3][$i],"elite")===FALSE))) {
            if (isset($prox)) {
                //      The in_array function of PHP can save time searching for duplicates in an array
                if (!(in_array($proxylist[1][$i], $prox[1]))) {
                    $prox[1][]=$proxylist[1][$i];
                    $prox[2][]=$proxylist[2][$i];
                    $prox[3][]=$proxylist[3][$i];
                    $prox[4][]=$proxylist[4][$i];
                }
            } else {
                $prox[1][]=$proxylist[1][$i];
                $prox[2][]=$proxylist[2][$i];
                $prox[3][]=$proxylist[3][$i];
                $prox[4][]=$proxylist[4][$i];
            }
        }
    }
  }
  //      After all links have been processed, count the final array, and display results.
  $max=count($prox[1]);
  for($i=0;$i<$max;$i++) {
      echo "ip: ".$prox[1][$i]." ";
      echo "port: ".$prox[2][$i]." ";
      echo "type: ".$prox[3][$i]." ";
      echo "country: ".$prox[4][$i]."\n";
  }
  echo count($prox[1])."\n";
?>
The errors I get are:

Code:
PHP Notice: Undefined variable:  prox in httpdocs/scp.php on line 60
PHP Notice:  Undefined variable:  prox in httpdocs/scp.php on line 67
Anyone got any clues!?!?

Thanks,

Duane.
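
A possible explanation, offered as a sketch rather than a confirmed fix: both notices seem to point at the count($prox[1]) calls after the loop, which would mean $prox was never created, i.e. neither regex matched anything on any page (or the pages did not download). Creating $prox before the loop should silence the notices, and running the pattern against a known string helps tell a regex problem from a download problem. The function name collectProxies and the sample text below are made up for illustration; only $pattern2 is taken from the script, and HttpClient is left out so the snippet runs on its own.

```php
<?php
// Hypothetical helper, not part of the original script: collects anonymous/elite
// proxies from already-downloaded page text, so it can be tested without HttpClient.
function collectProxies(array $pages)
{
    // Second pattern from the posted script (ip:port  type  Country).
    $pattern2 = '/([0-9]+\.[0-9]+\.[0-9]+\.[0-9]+):([0-9]+)[ ]+([a-z \+]+)([A-Z][A-Za-z]+)/ms';

    // Create the result array up front so count($prox[1]) is valid
    // even when nothing matches on any page.
    $prox = array(1 => array(), 2 => array(), 3 => array(), 4 => array());

    foreach ($pages as $page) {
        preg_match_all($pattern2, $page, $found);
        for ($i = 0; $i < count($found[1]); $i++) {
            $type = $found[3][$i];
            // Keep only entries whose type mentions "anony" or "elite".
            $wanted = strpos($type, 'anony') !== false || strpos($type, 'elite') !== false;
            // Skip IPs that were already collected.
            if ($wanted && !in_array($found[1][$i], $prox[1])) {
                for ($col = 1; $col <= 4; $col++) {
                    $prox[$col][] = trim($found[$col][$i]);
                }
            }
        }
    }
    return $prox;
}

// Made-up sample text in the "ip:port type Country" layout the pattern expects.
$sample = "10.0.0.1:8080 anonymous Germany\n"
        . "10.0.0.2:3128 transparent France\n"
        . "10.0.0.1:8080 anonymous Germany\n";

// The empty string simulates a page that failed to download.
$prox = collectProxies(array($sample, ''));
for ($i = 0; $i < count($prox[1]); $i++) {
    echo "ip: {$prox[1][$i]} port: {$prox[2][$i]} type: {$prox[3][$i]} country: {$prox[4][$i]}\n";
}
echo count($prox[1]) . "\n";
```

If the sample matches but the live pages do not, it may help to echo strlen($proxypage) right after getContent() to see whether anything was downloaded at all. It may also be worth checking $pattern: as posted, it runs onto a second line inside its quotes, so the line break and the leading spaces become part of the expression.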