From WikiEducator

Revision as of 21:14, 5 Nov 2011

Screen Scraping :: Increasing the power of the web

Luis Miguel Morillas <lmorillas at xml3k.org>

identi.ca: lmorillas

1. Intro

Why scrape?

There is a lot of information on the web
It is not always structured (opendata)
Web of data
It is fun

Why Python?

Dynamic languages are growing on the web.
Many modules, tools, examples, and documentation.
Open source

"Brute-force" search

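# Python 3: the old urllib2 module is now urllib.request
from urllib.request import urlopen

URL = 'http://www.libresoftwareworldconference.com/'
# Download the raw bytes of the page
source = urlopen(URL).read()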
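The download above returns raw bytes and waits indefinitely if the server stalls. A minimal sketch of a slightly more defensive fetch follows; the helper name `fetch_text`, the User-Agent string, and the 10-second timeout are assumptions made for illustration, not part of the original talk.

```python
from urllib.request import Request, urlopen


def fetch_text(url, timeout=10.0, default_charset="utf-8"):
    """Download `url` and return its body as text (hypothetical helper)."""
    # Some servers reject requests without a User-Agent; this value is made up.
    request = Request(url, headers={"User-Agent": "lswc-scraping-demo/0.1"})
    with urlopen(request, timeout=timeout) as response:
        raw = response.read()
        # Use the charset declared in Content-Type, if the server sent one.
        charset = response.headers.get_content_charset() or default_charset
    # Replace undecodable bytes instead of failing on a badly encoded page.
    return raw.decode(charset, errors="replace")
```

Calling `fetch_text(URL)` would then give a `str` that is ready for text processing.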

Text processing
Regular expressions
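As a rough illustration of the regular-expression approach, the sketch below pulls `href` values out of anchor tags. The sample HTML, the pattern, and the name `extract_links` are all made up for this example; a pattern like this is fragile and only handles quoted attributes.

```python
import re

# Sample page standing in for downloaded HTML (made up for this sketch).
SAMPLE = """
<html><body>
<a href="http://example.org/program">Program</a>
<a class="x" href='/speakers'>Speakers</a>
<p>No link here</p>
</body></html>
"""

# Matches href="..." or href='...' inside an <a ...> tag. It misses unquoted
# attribute values and can be fooled by comments or scripts.
LINK_RE = re.compile(
    r'<a\b[^>]*?\bhref\s*=\s*(["\'])(.*?)\1',
    re.IGNORECASE | re.DOTALL,
)


def extract_links(html):
    """Return the href values of all <a> tags found by the regex."""
    return [match.group(2) for match in LINK_RE.finditer(html)]
```

Here `extract_links(SAMPLE)` yields the two link targets and ignores the paragraph without a link.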

Python libraries

Beautiful Soup
mechanize
lxml
html5lib
scrapemark
pyquery
scrapy

...

amara
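The libraries above parse the page into tags instead of treating it as flat text. To keep the example free of third-party dependencies, the sketch below swaps in the standard library's `html.parser`, which is not one of the listed libraries; the names `LinkCollector` and `collect_links` are hypothetical.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> element that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the <a> currently open, if any
        self._text = []    # text chunks seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text while we are inside an <a> that has an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def collect_links(html):
    """Parse `html` and return the collected (href, text) pairs."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links
```

A parser copes with unquoted attributes and nested tags, which tend to trip up simple regular expressions.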

2. Practice

Scraping the web with amara

Retrieved from «https://es.wikieducator.org/index.php?title=LSWC_scraping_the_web/presentacion_lwsc_2011&oldid=6325»