Prerequisites:
Recommended for those who have completed the 2-part Python Fundamentals series or who have basic Python skills and familiarity with data structures like dictionaries and lists. Prior experience with HTML may be helpful but is not required.
Overview:
A vast amount of data for research and analysis is stored in unstructured formats across the web. This workshop introduces participants to the essential techniques of "web scraping," the process of programmatically extracting data from websites.
Using two Python libraries, Requests and Beautiful Soup, we will walk through a scraping project: from sending HTTP requests to a server to parsing HTML and saving the final output into a structured format like a CSV file.
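As a rough preview of that workflow, the sketch below fetches, parses, and saves in three small steps. It is only an illustration, not the workshop's actual exercise: the sample HTML, the `workshops` table id, and the helper names `fetch_html`, `extract_rows`, and `rows_to_csv` are all hypothetical. The parsing step runs on an inline HTML string so the example works without network access.

```python
import csv
import io

import requests
from bs4 import BeautifulSoup

# Hypothetical sample page standing in for a real response body.
SAMPLE_HTML = """
<html><body>
  <table id="workshops">
    <tr><th>Title</th><th>Link</th></tr>
    <tr><td>Python Fundamentals</td><td><a href="/python-1">Details</a></td></tr>
    <tr><td>Web Scraping</td><td><a href="/scraping">Details</a></td></tr>
  </table>
</body></html>
"""


def fetch_html(url, timeout=10):
    """Send an HTTP GET request and return the page's HTML text."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raise an exception on 4xx/5xx responses
    return response.text


def extract_rows(html):
    """Parse the HTML and pull one dict per data row of the sample table."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id="workshops")
    rows = []
    for tr in table.find_all("tr"):
        cells = tr.find_all("td")
        if len(cells) != 2:
            continue  # the header row has <th> cells, so it is skipped
        link = cells[1].find("a")
        rows.append({
            "title": cells[0].get_text(strip=True),
            "href": link["href"] if link else "",
        })
    return rows


def rows_to_csv(rows):
    """Serialize the extracted rows as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["title", "href"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# In a real project the HTML would come from fetch_html(some_url).
rows = extract_rows(SAMPLE_HTML)
csv_text = rows_to_csv(rows)
print(csv_text)
```

To save the result to disk instead of a string, the same `csv.DictWriter` can be pointed at a file opened with `open("output.csv", "w", newline="")`.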
Topics Include:
- The Web Workflow: Understanding how the Requests library fetches webpage content.
- HTML Fundamentals: A brief introduction to HTML tags, attributes, and the Document Object Model (DOM) structure.
- Parsing Content: Navigating the HTML tree and searching for specific elements using Beautiful Soup.
- Data Extraction: Techniques for isolating text, links, and tables from messy webpage data.
- Ethics and Best Practices: Discussion on robots.txt files, scraping etiquette, and legal considerations.
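To make the robots.txt topic a little more concrete, here is a minimal sketch of checking a site's rules before scraping, using the standard library's `urllib.robotparser`. The robots.txt content, the `workshop-bot` user agent, and the example.org URLs are made up for illustration; a real script would more likely load the live file with `set_url()` and `read()`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real file lives at <site>/robots.txt.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_parser(robots_text):
    """Build a parser from robots.txt text without touching the network."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS)

# Ask whether our (hypothetical) user agent may fetch each URL.
allowed = parser.can_fetch("workshop-bot", "https://example.org/data/page.html")
blocked = parser.can_fetch("workshop-bot", "https://example.org/private/page.html")

# Seconds the site asks crawlers to wait between requests, or None if unset.
delay = parser.crawl_delay("workshop-bot")

print(allowed, blocked, delay)
```

A polite scraper would skip any URL where `can_fetch` returns `False` and pause between requests, for example with `time.sleep(delay)`. Following robots.txt is a courtesy convention and does not by itself settle the legal questions mentioned above.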
Contact information:
Please email mrsmith@rice.edu if you have questions about the Data@Rice workshop series.
For more information about Python and other data courses at Rice, visit https://library.rice.edu/services/data-workshops, or contact researchdata@rice.edu.
Course Materials: