Alternatives to Nutch and Crawler4J

by JC, published: 2013-03-16 11:08 viewed: 2037 times
Want more tips on life in the US? Subscribe to the JiansNet North America Life Newsletter, written by JC.
After I built a crawler that could crawl certain sites, I wanted to give it a name. Initially I called it SimpleCrawler, but I found out that a SimpleCrawler already exists on the web. That one is written in Ruby.

So I renamed my crawler SimpleJCrawler, as it is written in Java. I don't plan to make it open source yet, but I will release a free version for people to use for specialized crawls of websites. It is cookie based, so it should be able to crawl sites that depend on cookies better than Nutch or Crawler4j.

I will upload the necessary files and then provide a download link and documentation. Stay tuned ;-)

Update:

Here are two competing products and my thoughts on them.

Apache Nutch Crawler
I've played with Nutch 0.7.2 and looked at its source code before. However, since Nutch 0.8 it has become significantly more complicated, as Nutch now caters more to Internet-scale crawling than to intranet or small-scale website crawling. The installation is up to 100MB in size, and it is difficult for newcomers to learn.

Crawler4j
It is an open source, Java-based crawler that is small and, being multi-threaded, efficient. However, it doesn't support cookie-based crawling, whereas SimpleJCrawler does.
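
To illustrate what "cookie based" means in practice, here is a minimal sketch that uses only the JDK's java.net.CookieManager. It is not SimpleJCrawler's code. It stores the Set-Cookie headers from one response and reports which Cookie values a later request would carry. The class name CookieSketch, the method cookiesForNextRequest, and the example.com URLs are all made up for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.URI;
import java.util.Collections;
import java.util.List;
import java.util.Map;

public class CookieSketch {
    // Hypothetical helper: remembers the Set-Cookie headers received from
    // firstUrl, then returns the Cookie header values that a follow-up
    // request to nextUrl would send.
    public static List<String> cookiesForNextRequest(String firstUrl,
                                                     List<String> setCookieHeaders,
                                                     String nextUrl) {
        // ACCEPT_ALL keeps the sketch simple; a real crawler may want a stricter policy.
        CookieManager manager = new CookieManager(null, CookiePolicy.ACCEPT_ALL);
        try {
            // Store the cookies exactly as they arrived in the response headers.
            manager.put(URI.create(firstUrl), Map.of("Set-Cookie", setCookieHeaders));
            // Ask which cookies apply to the next URL.
            Map<String, List<String>> headers =
                    manager.get(URI.create(nextUrl), Collections.emptyMap());
            return headers.getOrDefault("Cookie", List.of());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Simulate a login response that sets a session cookie.
        System.out.println(cookiesForNextRequest(
                "http://www.example.com/login",
                List.of("SESSIONID=abc123; Path=/"),
                "http://www.example.com/account"));
    }
}
```

A crawler that replays these values on each request to the same site can stay inside a session across pages; a request to a different host gets no cookies back from the manager.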
Comments (21)
1. JC 2012-04-04 22:32
Here is the download link for the jar files of this Java crawler library: SimpleJCrawler Version 1 Download. Please note that Version 1 is free for both personal and commercial use.

SimpleJCrawler is a cookie based crawler that can be used to screen-scrape websites and extract sections of text from html pages.
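
As a rough idea of what "extracting a section of text" can look like once you have a page's HTML as a string, here is a small sketch that pulls out the page title with a regular expression. The class TitleExtractor and the sample HTML are hypothetical and not part of the library. For anything beyond simple cases, a real HTML parser is usually the more robust choice.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TitleExtractor {
    // Matches the text between <title> and </title>, ignoring case and
    // allowing the title to span several lines.
    private static final Pattern TITLE = Pattern.compile(
            "<title[^>]*>(.*?)</title>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the trimmed title text, or an empty Optional if there is no title tag.
    public static Optional<String> extractTitle(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? Optional.of(m.group(1).trim()) : Optional.empty();
    }

    public static void main(String[] args) {
        // Made-up page content for illustration.
        String html = "<html><head><title> Shoes on Sale </title></head>"
                + "<body>...</body></html>";
        System.out.println(extractTitle(html).orElse("(no title)"));
    }
}
```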
2. JC 2012-04-04 23:15
SimpleJCrawler is a free Java crawler that provides a simple interface for crawling websites. You can set up a cookie-enabled crawler in under 5 minutes!

Sample Usage
You need to create two classes:
1) a crawler class that drives the site crawl.
2) a collector class that decides which URLs to crawl and handles each downloaded page.

The following is a sample implementation:

TestCrawler class:

import crawl.Crawler;

public class TestCrawler
{
    public static void main(String[] args) throws Exception
    {
        Crawler crawler = new Crawler();
        crawler.addSeed("http://www.6pm.com"); // the URL where the crawl starts
        TestCollector c = new TestCollector();
        crawler.startCrawl(c, 2); // with a max depth of 2
        c.close();
    }
}

TestCollector class:

import java.util.regex.Pattern;
import crawl.UrlCollector;

public class TestCollector implements UrlCollector
{
    // URLs ending in these extensions (stylesheets, scripts, media, archives) are skipped.
    private final static Pattern FILTERS = Pattern.compile(".*(\\.(css|js|bmp|gif|jpe?g" + "|png|tiff?|mid|mp2|mp3|mp4"
            + "|wav|avi|mov|mpeg|ram|m4v|pdf" + "|rm|smil|wmv|swf|wma|zip|rar|gz))$");

    public TestCollector() throws Exception
    {
    }

    // Decides whether the given URL should be crawled
    public boolean shouldVisit(String url)
    {
        String href = url.toLowerCase();
        if (FILTERS.matcher(href).matches())
        {
            return false;
        }

        // Only follow links that stay within www.6pm.com
        return href.startsWith("http://www.6pm.com/");
    }

    // This function is called when a page is fetched and ready to be processed by your program
    public void visit(String url, String html)
    {
        System.out.println(url);
    }

    // Release any resources held by the collector (nothing to do in this sample)
    public void close()
    {
    }
}

Note:

shouldVisit: This method decides whether the given URL should be crawled. The example above rejects .css, .js, and media files, and only allows pages within the www.6pm.com domain.

visit: This method is called after the content of a URL is downloaded successfully. You can easily get the url and html of the downloaded page.
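
To make the division of labor between the two callbacks more concrete, here is a sketch of how a crawler might drive them. It is an illustration only, not SimpleJCrawler's actual implementation. It crawls an in-memory map of pages instead of the network, and everything in it (CrawlLoopSketch, the Collector interface, sameSite, demoSite, the example.com URLs) is hypothetical. It also assumes that the seed counts as depth 0 and that the seed itself is not passed through shouldVisit.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CrawlLoopSketch {
    // Hypothetical stand-in for the two collector callbacks described above.
    public interface Collector {
        boolean shouldVisit(String url);
        void visit(String url, String html);
    }

    private static final Pattern LINK =
            Pattern.compile("href=\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);
    private static final Pattern FILTERS =
            Pattern.compile(".*\\.(css|js|gif|jpe?g|png)$");

    // Breadth-first crawl over an in-memory site (URL -> HTML).
    // Returns the URLs in the order they were visited.
    public static List<String> crawl(Map<String, String> site, String seed,
                                     int maxDepth, Collector collector) {
        List<String> visited = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        seen.add(seed);
        Queue<String> current = new ArrayDeque<>();
        current.add(seed);
        for (int depth = 0; depth <= maxDepth && !current.isEmpty(); depth++) {
            Queue<String> next = new ArrayDeque<>();
            for (String url : current) {
                String html = site.get(url);
                if (html == null) {
                    continue; // treat a missing page as a failed download
                }
                collector.visit(url, html); // page "downloaded": hand it over
                visited.add(url);
                Matcher m = LINK.matcher(html);
                while (m.find()) {
                    String link = m.group(1);
                    // Ask the collector about each newly discovered link once.
                    if (seen.add(link) && collector.shouldVisit(link)) {
                        next.add(link);
                    }
                }
            }
            current = next;
        }
        return visited;
    }

    // Collector that stays under one URL prefix and skips static assets.
    public static Collector sameSite(String prefix) {
        return new Collector() {
            public boolean shouldVisit(String url) {
                String href = url.toLowerCase();
                return href.startsWith(prefix) && !FILTERS.matcher(href).matches();
            }

            public void visit(String url, String html) {
                System.out.println("visited " + url);
            }
        };
    }

    // A tiny made-up site: / links to /a and a stylesheet, /a to /b and
    // another host, /b to /c.
    public static Map<String, String> demoSite() {
        return Map.of(
                "http://www.example.com/",
                "<a href=\"http://www.example.com/a\">a</a>"
                        + "<a href=\"http://www.example.com/style.css\">css</a>",
                "http://www.example.com/a",
                "<a href=\"http://www.example.com/b\">b</a>"
                        + "<a href=\"http://other.example.org/\">other</a>",
                "http://www.example.com/b",
                "<a href=\"http://www.example.com/c\">c</a>",
                "http://www.example.com/c",
                "no links here");
    }

    public static void main(String[] args) {
        crawl(demoSite(), "http://www.example.com/", 2,
                sameSite("http://www.example.com/"));
    }
}
```

With a max depth of 2 under these assumptions, the stylesheet and the off-site link are rejected by shouldVisit, and /c is discovered but never fetched because it sits one level too deep.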

That's it!
3. JC 2012-04-05 10:04
I have to add that the beauty of this simple crawler is that the configuration is super easy. You can run it in the Eclipse IDE too.
4. JC 2012-04-18 12:13
The crawler is built on the HttpClient 3.0 package. Now that HttpClient 4.0 is out, I am changing it to use 4.0 instead.

However, version 4.0 differs drastically from 3.0, so I will test it first before releasing it.
5. JC 2012-05-31 15:05
I will soon release the second version of this crawler, which will probably consist of just ONE Java library (.jar) file. Yes, you heard that right, only 1 jar file.

That will make it super easy for you to use it to crawl sites.
6. JC 2012-07-16 10:17
Here is my reply regarding some discussion on the Nutch 2.0 release:

I haven't actively followed Nutch development for a couple of years now, so I may not have the correct insight into it. My understanding is that Nutch wants to steer away from using only Lucene as the underlying data storage engine and to allow other implementations as well.

As for me, I am not a big fan of a flexible storage engine. I prefer to give users/customers one and only one choice, so that the engine can be very fast. But again, that's just me.

I used to toy with Nutch for crawler projects, but I feel that ever since it went from 0.7.2 to 0.8, it has become far more complicated. I am not sure the added benefits make it a better product. So, for hobby projects, I've been using crawler4j or my own home-grown SimpleJCrawler.
7. JC 2012-08-01 00:47
It's been a while since I last updated the crawler :-( But it is time to catch up...

I will work on it in the next week or so and hope to push out the second version...

Thanks for your patience, to anyone interested in this little program...
8. ningtao 2012-10-11 22:03
Hi JC,
I want to download your project crawler4j from the URL 'http://www.box.com/s/dcfaee3ee495af6ddbe8', but the URL doesn't work for me. Can you send the project to my email or upload the source code to Google Code?
9. JC 2012-10-11 22:38
@ningtao,
Sorry the link didn't work for you. I think it should work, so I'm not sure why. In any case, that link only gives you the jar file, not the source code.

Although I made it free to use, I didn't intend to release the source code at that time. But since you asked, I think it is time to release it as free AND with source code. It will take me a couple of days to dust off the source code and post it here.

As for Google Code, I can use that, but I am not very familiar with it yet. I will see what I can do. Stay tuned!
10. ningtao 2012-10-11 23:34
@jc
JC, thank you for replying so quickly!
You said you will not give out the source code. OK! In that case, I would really like you to send me the jar file. Thanks!
11. ningtao 2012-10-11 23:44
@jc
JC, when you decide to release the source code here, could you also email it to me? I think I still won't be able to download it.
12. JC 2012-10-12 00:24
@ningtao,
Not a problem, I will upload the source code here in the next two days...
13. ningtao 2012-10-12 02:01
@jc
Oh! That's great! Thank you!
14. ningtao 2012-10-12 18:53
Hi JC,
I checked my email yesterday but didn't see a message from you with the jar and source code files. Can you send them today?
15. JC 2012-10-12 21:33
@ningtao,
Sorry not ready yet, check back tomorrow please.
16. JC 2012-10-13 21:06
@ningtao,
Okay, I was finally able to find the source code, so you can download it now. Let me know if you have any questions about usage. Thanks,
17. ningtao 2012-10-14 19:10
@jc
I've got it. Thank you, JC!
18. JC 2012-10-14 22:05
@ningtao,
You are welcome. I will take the source code offline again to make some changes, as I haven't looked at it for a long time and may need to tweak it.

People interested in getting the source code, please email jiansnet@gmail.com and indicate that you would like a copy of the code.
19. Hafilah 2013-03-05 00:48
Hi,
I've tested the code, but the output is still 6pm.com instead of the URL I specified in both the TestCollector and TestCrawler classes. I would appreciate your help in solving this. Thank you, I look forward to your reply.
20. JC 2013-03-05 09:59
@Hafilah,
Did you get the code? I will look into it when I have time.
21. JC 2013-03-16 11:08
As there are too many people asking me for the source code, I don't think I have time to handle this anymore.

So, if you would like to have the newer version of the source code as well as tech support, contact me via jiansnet@gmail.com. I may need to charge a fee for the code and tech support.
The copyright of this article belongs to JiansNet (美国剑知信息网). To reprint it, please contact us first.
Copyright © 2007-2016, All Rights Reserved.