A Quantum Immortal: Dotnet Web Crawler Speedup

09 August 2007

I'm writing a web crawler in C#, and getting it to perform well was really annoying.
I tried simply using ThreadPool.QueueUserWorkItem() to queue up my requests to multiple threads. Each thread just ran WebClient.DownloadString().

While the threads did run in parallel, the downloads themselves did not: WebClient behaved as if it had an inherent lock.
I tried messing with the ConnectionManagementSection, but that turned out to be read-only.
After some Googling, I found that the configuration can only be changed by modifying the machine.config or user.config files! Seems pretty stupid to me.

When that didn't work either, I found this code that helped me through. I still don't know exactly why WebClient.DownloadString() doesn't work, but after some tweaking I got to about 2.5 pages per second. Still not top speed, but way better than the 0.5 pages/second I started with.
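
For anyone hitting the same wall, here is a minimal sketch of one thing worth trying. It rests on an assumption I have not confirmed for my own case: that the bottleneck is the per-host connection limit, which .NET Framework sets to 2 for client applications and which can be raised in code through ServicePointManager.DefaultConnectionLimit. The class name, the method names and the limit of 20 are made up for illustration, and this is not the code I ended up using.

```csharp
using System;
using System.Net;
using System.Threading;

public static class CrawlerSketch
{
    // Raises the cap on simultaneous connections per host.
    // Assumption: this cap is what serializes the downloads.
    public static void ConfigureConnections(int limit)
    {
        ServicePointManager.DefaultConnectionLimit = limit;
    }

    // Downloads every URL on thread-pool threads and returns the page bodies
    // in input order. A failed download leaves null in its slot.
    public static string[] DownloadAll(string[] urls)
    {
        var pages = new string[urls.Length];
        if (urls.Length == 0) return pages;

        using (var done = new CountdownEvent(urls.Length))
        {
            for (int i = 0; i < urls.Length; i++)
            {
                int index = i; // copy the loop variable for the closure
                ThreadPool.QueueUserWorkItem(_ =>
                {
                    try
                    {
                        // WebClient is not safe to share between concurrent
                        // calls, so each work item creates its own.
                        using (var client = new WebClient())
                        {
                            pages[index] = client.DownloadString(urls[index]);
                        }
                    }
                    catch (Exception)
                    {
                        // Leave pages[index] as null; a real crawler would log this.
                    }
                    finally
                    {
                        done.Signal(); // always count this item as finished
                    }
                });
            }
            done.Wait(); // block until every work item has signalled
        }
        return pages;
    }

    public static void Main(string[] args)
    {
        ConfigureConnections(20); // arbitrary illustrative value
        string[] pages = DownloadAll(args); // URLs come from the command line
        for (int i = 0; i < pages.Length; i++)
        {
            int length = pages[i] == null ? 0 : pages[i].Length;
            Console.WriteLine(args[i] + ": " + length + " characters");
        }
    }
}
```

The limit has to be set before the first request goes out, since connections to a host that has already been contacted keep the limit they started with. Whether this alone explains the jump from 0.5 to 2.5 pages per second is something I can't say.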
