Craigslist Scraper in Python

Craigslist Scraper

The 'Why'

I wanted to be a freelance proofreader, and I realized I could find leads in the writing gigs section of Craigslist. Most of these jobs don't require being local to the client, so I could find clients on any Craigslist subdomain. Checking every local Craigslist writing gigs section by hand is tedious, which makes it a perfect task to automate.

The 'How'

I originally created this as a PHP script, using Simple HTML DOM to help with parsing the HTML. Then I lost some data, including that script. So when I went to rebuild it, I did it in Python to "expand my horizons" and used BeautifulSoup4 for HTML parsing. It is written to be run daily as a cron job, and it only looks for relevant postings from the day prior.
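To illustrate the "day prior" rule, here is a minimal sketch of the date filter on its own. It is not the script's actual code: the function names and the shape of each posting (a dict with 'title', 'url', and 'posted_at' keys) are assumptions made for the example, and the step that extracts those fields from the page with BeautifulSoup4 is left out.

```python
from datetime import date, timedelta


def posted_yesterday(posted_at, today=None):
    """Return True if posted_at (a datetime) falls on the day before today."""
    today = today or date.today()
    return posted_at.date() == today - timedelta(days=1)


def filter_day_prior(postings, today=None):
    """Keep only the postings dated the day before today.

    Each posting is assumed to be a dict with 'title', 'url', and
    'posted_at' keys. That shape is hypothetical, not taken from the script.
    """
    return [p for p in postings if posted_yesterday(p["posted_at"], today)]
```

Passing today explicitly makes the filter easy to check without waiting for a real cron run. A daily crontab entry could look something like `0 6 * * * python3 /path/to/scraper.py`, where both the time and the path are placeholders.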

The Output

Click here to view an example HTML file output by the script. Please note that after enough time has passed, none of the links on that page will still point to valid Craigslist postings.

The Code

Click here to view the scraper code.

Assumptions & Known Shortcomings

The script assumes that there is a file 'subdomains.txt' in the same directory as the script, listing the Craigslist subdomains to check. It was built under the assumption that every writing gigs section has so few postings that none will be missed by not navigating to the next page of results. In the Python version of this script, Craigslist subdomains that haven't been encountered before are added to the ./subdomains.txt listing. They are not, however, checked during that same run of the script. I could fix this by playing with the for loop controlling subdomain iteration, but I've got so many other things I want to program instead.
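For the curious, here is a minimal sketch of one way that fix could look: swap the plain for loop for a queue, so that subdomains discovered mid-run get visited in the same run. This is not the script's code. The names load_subdomains and crawl_all are made up for the example, and discover stands in for whatever scraping step returns the subdomains seen on a page.

```python
from collections import deque
from pathlib import Path


def load_subdomains(path):
    """Read one subdomain per line, skipping blank lines."""
    p = Path(path)
    if not p.exists():
        return []
    return [line.strip() for line in p.read_text().splitlines() if line.strip()]


def crawl_all(path, discover):
    """Visit every listed subdomain plus any discovered during the run.

    discover(subdomain) is a hypothetical callback that scrapes one
    subdomain and returns the subdomain names it came across.
    """
    known = load_subdomains(path)
    seen = set(known)
    queue = deque(known)
    visited = []
    while queue:
        sub = queue.popleft()
        visited.append(sub)
        for found in discover(sub):
            if found not in seen:
                seen.add(found)
                known.append(found)
                queue.append(found)  # newly found: still checked in this run
    # Rewrite the listing once at the end so new entries persist for next time.
    Path(path).write_text("\n".join(known) + "\n")
    return visited
```

Because new entries go onto the same queue the loop is draining, nothing has to wait for the next day's run. The seen set keeps a subdomain from being queued twice.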