Web Scraping with Python

Prerequisites:

Recommended for those who have completed the 2-part Python Fundamentals series or who have basic Python skills and familiarity with data structures like dictionaries and lists. Prior experience with HTML may be helpful but is not required.

Overview:

A vast amount of data for research and analysis is stored in unstructured formats across the web. This workshop introduces participants to the essential techniques of web scraping: the process of programmatically extracting data from websites.

Using two Python libraries, Requests and Beautiful Soup, we will walk through a scraping project: from sending HTTP requests to a server to parsing HTML and saving the final output into a structured format like a CSV file.
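As a rough illustration of that workflow, here is a minimal sketch, not the workshop's actual code. The function names (fetch_html, extract_links, write_csv) and the sample markup are hypothetical. The sketch runs the parsing and CSV steps on an inline HTML string so that it works offline; fetch_html shows how the Requests step would look but is not called.

```python
import csv
import io

import requests
from bs4 import BeautifulSoup


def fetch_html(url):
    """Send an HTTP GET request and return the response body as text."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # raise an exception on 4xx/5xx responses
    return response.text


def extract_links(html):
    """Return one dict per <a href=...> element: its visible text and its target."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for anchor in soup.find_all("a", href=True):
        rows.append({"text": anchor.get_text(strip=True), "href": anchor["href"]})
    return rows


def write_csv(rows, handle):
    """Write the extracted rows to an open text handle in CSV format."""
    writer = csv.DictWriter(handle, fieldnames=["text", "href"])
    writer.writeheader()
    writer.writerows(rows)


# Hypothetical sample markup standing in for a downloaded page.
SAMPLE_HTML = """
<html><body>
  <h1>Reading list</h1>
  <ul>
    <li><a href="/books/1">First title</a></li>
    <li><a href="/books/2">Second title</a></li>
  </ul>
</body></html>
"""

rows = extract_links(SAMPLE_HTML)
buffer = io.StringIO()  # in-memory stand-in for a CSV file
write_csv(rows, buffer)
print(buffer.getvalue())
```

To scrape a live page, one would pass the result of fetch_html(url) to extract_links, and write to a real file opened with open("links.csv", "w", newline="") instead of the in-memory buffer (the filename is an assumption).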

Topics Include:

  • The Web Workflow: Understanding how the Requests library fetches webpage content.
  • HTML Fundamentals: A brief introduction to HTML tags, attributes, and the Document Object Model (DOM) structure.
  • Parsing Content: Navigating the HTML tree and searching for specific elements using Beautiful Soup.
  • Data Extraction: Techniques for isolating text, links, and tables from messy webpage data.
  • Ethics and Best Practices: Discussion on robots.txt files, scraping etiquette, and legal considerations.
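
As one example of the etiquette topic above, a scraper can consult a site's robots.txt rules before requesting a page. The sketch below uses Python's standard-library urllib.robotparser; the rules, the user-agent name, and the example.com URLs are made up for illustration and are parsed from an inline string so that no network access is needed.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real site publishes its own at /robots.txt.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS_TXT.splitlines())

USER_AGENT = "workshop-scraper"  # assumed name for this illustration

# Ask whether this user agent may fetch each URL under the parsed rules.
allowed_public = parser.can_fetch(USER_AGENT, "https://example.com/catalog/page1.html")
allowed_private = parser.can_fetch(USER_AGENT, "https://example.com/private/data.html")

# Seconds the site asks crawlers to wait between requests (None if unspecified).
delay = parser.crawl_delay(USER_AGENT)

print(allowed_public, allowed_private, delay)
```

Against a live site, the same parser can load the rules itself with parser.set_url(...) followed by parser.read(). Passing robots.txt checks is only one part of scraping responsibly and does not settle the legal considerations.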

Contact information:

Please email mrsmith@rice.edu if you have questions about the Data@Rice workshop series.

For more information about Python and other data courses at Rice, visit https://library.rice.edu/services/data-workshops or contact researchdata@rice.edu.

Course Materials:

titles_short.csv

Location
Fondren B43A (Collaboration Space)