This short post is a guide to scraping CSV data files. You may ask: why scrape at all if the data are already in files? Because downloading the files into one place and then sorting or merging them by hand can take quite a lot of time.
Python’s CSV library can do a lot of the work for you. Another handy tool is the ScraperWiki toolset and library. So even if you don’t have much programming experience, you can fork an existing scraper, adjust it to your situation, and have the data scraped and saved into an SQLite database on ScraperWiki for later download. You can also generate a view from the scraped data.
To begin with, I used a CSV scraping guide from ScraperWiki: here.
For test purposes I’ve published a simple CSV file on the web; here is the file.
Basic steps
Scrape the target web file into the variable ‘data’ using the URL of the file:
import csv
import scraperwiki
url = 'https://docs.google.com/spreadsheet/pub?key=0AmNIZgbwy5TmdENjMmZ2cm5VQXJJMWlQVENIek5Ta2c&output=csv'
data = scraperwiki.scrape(url)
However, in pure Python (outside ScraperWiki) we use the ‘urllib2’ library instead of ‘scraperwiki’:
import urllib2
import csv
url = 'https://docs.google.com/spreadsheet/pub?key=0AmNIZgbwy5TmdENjMmZ2cm5VQXJJMWlQVENIek5Ta2c&output=csv'
data = urllib2.urlopen(url)
Split the downloaded text into lines with the ‘splitlines’ string method (needed only in the ScraperWiki version, since scraperwiki.scrape returns the whole file as one string, while urllib2.urlopen returns a file-like object that is already iterable by line):
data = data.splitlines()
and feed them to a CSV reader object called ‘reader’:
reader = csv.DictReader(data)
After that, we loop over the records in the ‘reader’ object to print them and/or save them into the database:
for record in reader:
    print record
    # for ScraperWiki only:
    scraperwiki.sqlite.save(['Value'], record)
The whole script
import urllib2
import csv
url = 'https://docs.google.com/spreadsheet/pub?key=0AmNIZgbwy5TmdENjMmZ2cm5VQXJJMWlQVENIek5Ta2c&output=csv'
data = urllib2.urlopen(url)
reader = csv.DictReader(data)
for record in reader:
    print record
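The script above is written for Python 2, where urllib2 and an unadorned print statement still exist. As a rough sketch, the same logic in Python 3 would use urllib.request instead, and the response bytes would need decoding before the csv module can parse them. To keep the sketch self-contained and runnable, the parsing is factored into a function and demonstrated on a small made-up sample rather than a live fetch:

```python
import csv
import io
import urllib.request  # Python 3 replacement for urllib2


def parse_csv(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


# In practice the text would come from the web, e.g.:
#   text = urllib.request.urlopen(url).read().decode('utf-8')
# Here a small inline sample stands in for the downloaded file:
sample = "Name,Value\nalpha,1\nbeta,2\n"
for record in parse_csv(sample):
    print(record)
```

The sample data and the parse_csv helper are illustrative, not part of the original scraper; only the DictReader usage mirrors the script above.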
The script for ScraperWiki
import csv
import scraperwiki
url = 'https://docs.google.com/spreadsheet/pub?key=0AmNIZgbwy5TmdENjMmZ2cm5VQXJJMWlQVENIek5Ta2c&output=csv'
data = scraperwiki.scrape(url)
data = data.splitlines()
reader = csv.DictReader(data)
for record in reader:
#record['non-ansii-field'] = record['non-ansii-field'].decode("cp1252")
print record
#for scraperwiki only:
scraperwiki.sqlite.save(['Value'], record)
Click here to fork (copy), adjust, and run this scraper. To edit and run the scraper, just press the ‘Edit’ button on the ScraperWiki dashboard. You might need to create a ScraperWiki account to save the script there.
Storing data in database
When storing the data, whether in a MySQL database or in SQLite (ScraperWiki), you need to pay attention to the encoding of the data you scrape. For database storage the most suitable encoding is UTF-8 (mandatory in ScraperWiki’s SQLite). Non-ASCII characters may be mis-encoded when inserted into or retrieved from the database.
If the data source is from Western Europe or the Americas, it is often appropriate to decode from the ‘cp1252’ or ‘latin-1’ encoding. Do this by calling .decode('<encoding name>') on the field in question. For example, if the field ‘name’ is in ‘latin-1’ encoding, add this line inside the loop before storing into the database:
record['name'] = record['name'].decode('latin-1')
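To see why the decode step matters, here is a small illustration runnable under Python 3 (where decoding happens on bytes rather than on str, but the principle is identical). The sample name is made up; the byte 0xe9 is ‘é’ in both latin-1 and cp1252 but is not valid UTF-8 on its own:

```python
# A byte string as it might arrive from a latin-1 encoded source.
raw = b'Jos\xe9'

# Wrong: treating it as UTF-8 raises UnicodeDecodeError,
# because the lone 0xe9 byte is not a valid UTF-8 sequence.
try:
    raw.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Right: decode with the actual source encoding first,
# then re-encode as UTF-8 for storage.
name = raw.decode('latin-1')       # 'José'
utf8_bytes = name.encode('utf-8')  # b'Jos\xc3\xa9'
print(name)
```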
Read a reference on handling non-UTF-8 encodings in Python.
Dictionary reader for CSV library
csv.DictReader (used instead of csv.reader) turns each row of your CSV data into a dictionary, with the values in the first row of the file used as keys. This eliminates the need to name each field in the code.
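A short side-by-side comparison of the two readers, runnable under Python 3 (the city data is made up for illustration):

```python
import csv
import io

text = "city,population\nParis,2100000\nOslo,700000\n"

# csv.reader yields plain lists; you address columns by position,
# and the header comes through as just another row:
rows = list(csv.reader(io.StringIO(text)))
header = rows[0]   # ['city', 'population']

# csv.DictReader consumes the first row as keys,
# so you address fields by name:
records = list(csv.DictReader(io.StringIO(text)))
print(records[0]['city'])
```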
Additional tip:
If you want to manually merge all CSV files into one file on Windows, open the command line (Start->Run), change to the folder where the files are located with cd <path to folder>, and run: copy *.csv <name>.csv . Instead of <name>, put the name of the new combined CSV file.
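One caveat of the plain copy command is that it repeats each file’s header row in the output. A small Python sketch that concatenates matching files while writing the header only once might look like this (the pattern and output name are placeholders, and all files are assumed to share the same header):

```python
import csv
import glob


def merge_csv(pattern, out_path):
    """Concatenate all CSV files matching `pattern` into out_path,
    writing the header row only once."""
    header_written = False
    with open(out_path, 'w', newline='') as out:
        writer = csv.writer(out)
        for path in sorted(glob.glob(pattern)):
            with open(path, newline='') as f:
                reader = csv.reader(f)
                header = next(reader)       # skip per-file header
                if not header_written:
                    writer.writerow(header)
                    header_written = True
                for row in reader:
                    writer.writerow(row)


# Example (hypothetical file names):
# merge_csv('*.csv', 'combined.out')
```

Giving the output file a non-.csv extension (or writing it to another folder) keeps it from matching the input pattern on a second run.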
Source: http://extract-web-data.com/how-to-scrape-csv-data-files/