GeekNights Monday - Web Scraping

Tonight on GeekNights, after a surprisingly long hiatus between Rym going to Europe for work and Rym & Emily biking 480 miles across the state of New York, we talk a bunch about IPv6 woes with Intel NICs and happenings in the street before we get to the news that companies shouldn't spy on their employees and that the RTX 4090 is coming. Eventually, we talk a bit about web scraping and why it is noble and just.

I once wrote a web scraper for this forum to suck in all the movie reviews I’d posted in the movie thread.

There’s an API you can use that would be much easier.

I think it was the previous forum.

Direct access to a db - this is exactly what Datasette is for. Stuff any SQLite db into it, and you get a slick web UI for querying/sharing.
One of the demos linked at that website is US legislators:
https://congress-legislators.datasettes.com/legislators

You can download the sqlite db itself, but it’s really designed for sharing: straight-up run SQL queries against the db, and copy/paste the link. For example, the number of legislators from each party a state has elected, since 1956:

https://congress-legislators.datasettes.com/legislators?sql=select+count(*)%2C+state%2C+party+from+legislator_terms+where+start+%3E%3D+1956+group+by+state%2C+party
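For anyone who'd rather read the SQL than the URL-encoded link, here's a rough Python sketch of the same query run locally with the standard-library sqlite3 module. The table and column names come straight from the link above; the in-memory rows are made up purely so the snippet runs on its own, and `datasette_link`, `party_counts`, and `demo_connection` are hypothetical helper names, not anything Datasette ships.

```python
import sqlite3
from urllib.parse import urlencode

# The SQL from the shared link, decoded.
QUERY = (
    "select count(*), state, party from legislator_terms "
    "where start >= 1956 group by state, party"
)


def datasette_link(base_url, sql):
    """Build a shareable link by putting the SQL in a ?sql= parameter."""
    return base_url + "?" + urlencode({"sql": sql})


def party_counts(conn):
    """Run the query and return (count, state, party) rows."""
    return conn.execute(QUERY).fetchall()


def demo_connection():
    """In-memory stand-in for the downloaded db. The rows are invented."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table legislator_terms (state text, party text, start text)"
    )
    conn.executemany(
        "insert into legislator_terms values (?, ?, ?)",
        [
            ("NY", "Democrat", "1957-01-03"),
            ("NY", "Democrat", "1999-01-06"),
            ("NY", "Republican", "1981-01-05"),
            ("NY", "Republican", "1947-01-03"),  # starts before 1956, filtered out
        ],
    )
    return conn


if __name__ == "__main__":
    for count, state, party in sorted(party_counts(demo_connection())):
        print(state, party, count)
    print(datasette_link(
        "https://congress-legislators.datasettes.com/legislators", QUERY
    ))
```

With the real downloaded file you'd swap `demo_connection()` for `sqlite3.connect()` on wherever you saved it. The generated link percent-encodes a few more characters than the one pasted above (the parentheses and `*`), but it decodes to the same query.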

There’s a plugin ecosystem too; I especially like the mapping and graphing ones.

Sure, but it’s not like major web sites are just going to give you an SQLite download of their database.

The new frontrowcrew.com will though, eventually.

Some do! Stack Overflow, Wikipedia, PyPI.