Note: This article is intended for advanced users of Import.io. If you are not familiar with HTML, then you will first want to start with an introduction to HTML followed by an introduction to XPath.
This article demonstrates how to tackle the following problems using XPath:
- Part I: Match and Capture
  - When tags have no identifying class values for the extractor to train on
- Part II: Row XPath
  - The extractor is unable to properly set the rows
- Part III: Navigating Out of Row XPath
  - Grabbing data outside of the Row XPath when Row XPath is manually set
- Part IV: Match On Column Index
  - When the column position changes from one URL to another and there’s no distinguishing class for the extractor to train on
We recommend installing XPath Helper, a Google Chrome extension designed to help you write and test XPaths live in the browser.
Part I: Match and Capture
When there are no identifying class values for the extractor to train on, we can grab the data by matching and capturing: we match the field name and then capture the data that follows it. This pattern is common in specification tables. We’ll use the following URL: http://shop.panasonic.com/cameras-and-camcorders/cameras/lumix-interchangeable-lens-ilc-cameras/DC-G9KBODY.html#start=1&cgid=cameras
Step 1 - Inspect the Page
We’ll inspect the feature, subcategory, and specification, which are the following highlighted fields:
Here, we want to target the feature “TYPE,” the subcategory “Image sensor size,” and finally the data matching “Image sensor data.”
Step 2 - Match the Text
Open up XPath Helper to write the XPath. Here we matched the feature “TYPE” and the subcategory “Image sensor size” by using the contains function:
//span[contains(.,"TYPE")]/following-sibling::div[1]//li[contains(.,"Image sensor size")]
Step 3 - Capture the Data
Now that we have matched the fields we need, we can capture the data by navigating to the data’s tag, in this case <span>, and then indexing to the second <span> inside the matched <li>:
//span[contains(.,"TYPE")]/following-sibling::div[1]//li[contains(.,"Image sensor size")]/span[2]
Step 4 - Set Manual XPath in the Extractor
Copy the XPath from Step 3 and navigate to the extractor. Select the column settings for “Image Sensor Size,” click on “Use manual XPath,” and paste the XPath from Step 3 in the textbox that appears below the column bar:
//span[contains(.,"TYPE")]/following-sibling::div[1]//li[contains(.,"Image sensor size")]/span[2]
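The match-then-capture logic above can be sketched in Python. This is only an illustration: the HTML fragment and its values are made up, and the helper names text_of and match_and_capture are hypothetical. The standard library's ElementTree does not support contains() or following-sibling, so each XPath step is reproduced in plain Python instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup: the real page's structure and values may differ.
SAMPLE = """
<div>
  <span>TYPE</span>
  <div>
    <ul>
      <li><span>Image sensor size</span><span>sensor-size-value</span></li>
      <li><span>Lens mount</span><span>lens-mount-value</span></li>
    </ul>
  </div>
  <span>OTHER</span>
  <div>
    <ul>
      <li><span>Image sensor size</span><span>other-value</span></li>
    </ul>
  </div>
</div>
"""


def text_of(element):
    """Return all text inside an element, like the string value of '.' in XPath."""
    return "".join(element.itertext())


def match_and_capture(root, feature, subcategory):
    """Match the feature and subcategory labels, then capture the second <span>."""
    for parent in root.iter():
        children = list(parent)
        for i, child in enumerate(children):
            # Match: span[contains(., feature)]
            if child.tag != "span" or feature not in text_of(child):
                continue
            # following-sibling::div[1]: the first <div> after the matched <span>
            later_divs = [c for c in children[i + 1:] if c.tag == "div"]
            if not later_divs:
                continue
            # //li[contains(., subcategory)]/span[2]
            for li in later_divs[0].iter("li"):
                spans = li.findall("span")
                if subcategory in text_of(li) and len(spans) >= 2:
                    return spans[1].text
    return None


root = ET.fromstring(SAMPLE.strip())
result = match_and_capture(root, "TYPE", "Image sensor size")
print(result)
```

Because the match is anchored on the “TYPE” label first, the “Image sensor size” row under a different feature is ignored, which is the point of matching before capturing.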
Part II: Row XPath
When an extractor is unable to properly set the row, Row XPath allows us to manually define how to set and separate the records we need.
For this tutorial, we’re going to use the following URL: http://www.lightingproducts.philips.com/our-brands/lightolier-usa/corepro-led-downlight.html#!f=%2b%40Category%3aDownlighting%2b%40SubCategory%3aGeneral+Purpose+Downlighting
Step 1 - Find the Row XPath
Inspect the structure of the webpage to find the XPath of each record.
We can see that “Product Name,” “Downlight Types,” “Lumens,” “Aperture Size,” and “Product Description” are all under a <table> tag with a class attribute of “Tableau,” and each product is under a <tr> tag with a class attribute of “Product.” Here, the <tr> tag will be our target.
Step 2 - Write the Row XPath
Open up XPath Helper to help write the Row XPath of each record.
Here we grabbed the <table> tag with the class attribute “Tableau,” followed by the <tbody> tag, followed by the <tr> tag with the class attribute “Product”:
//table[@class="Tableau"]/tbody/tr[@class="Product"]
Step 3 - Set the Row XPath in the Extractor
Feed the URL into the extractor. For this example, we turned off CSS to view the hidden data that we need.
To set the Row XPath, select “Rows,” set this to “Multiple Rows,” and click on “Row XPath.” Paste the following XPath we wrote into the textbox:
//table[@class="Tableau"]/tbody/tr[@class="Product"]
Step 4 - Point-and-Click
After setting the Row XPath, the point-and-click process becomes a breeze. Here you’ll see the extractor detected all of the products after selecting just a single item.
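To see what the Row XPath selects, here is a minimal Python sketch run against a made-up table. Only the class names “Tableau,” “Product,” and “CelluleTitre” come from this article; the markup and values are assumptions. It uses the standard library's ElementTree, which supports only a subset of XPath and requires a leading “.” on the path.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup and values; only the class names come from the article.
SAMPLE = """
<html><body>
  <h1>Family placeholder</h1>
  <table class="Tableau">
    <tbody>
      <tr>
        <td class="CelluleTitre">Catalog Number</td>
        <td class="CelluleTitre">Downlight Types</td>
        <td class="CelluleTitre">Lumens</td>
      </tr>
      <tr class="Product"><td>P-001</td><td>Type A</td><td>1000</td></tr>
      <tr class="Product"><td>P-002</td><td>Type B</td><td>1500</td></tr>
    </tbody>
  </table>
</body></html>
"""

# Same Row XPath as above, with the leading "." that ElementTree requires.
ROW_XPATH = ".//table[@class='Tableau']/tbody/tr[@class='Product']"

root = ET.fromstring(SAMPLE.strip())
rows = root.findall(ROW_XPATH)

# Each matched <tr> becomes one record; column paths are relative to the row.
records = [[td.text for td in row.findall("td")] for row in rows]
for record in records:
    print(record)
```

The header row has no “Product” class, so it is excluded and only the two product rows become records.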
Part III: Navigating Out of Row XPath
If data that we need is outside of the table where we defined our Row XPath, we can navigate out of Row XPath to grab that data. We’ll use the same example from Part II to demonstrate how to grab the Family name (CorePro LED Downlight).
**Part III requires going through Part II first**
Step 1 - Find the XPath of the Data
Inspect the data we’re targeting and open up XPath Helper to write the XPath:
//h1
Step 2 - Set Manual XPath of the Column in the Extractor and Navigate Out
Copy the XPath we wrote in Step 1 and navigate back to the extractor. Select the column settings for “Family,” click on “Use manual XPath,” and paste the XPath from Step 1 in the textbox that appears below the column bar:
//h1
From here, navigate out of the Row XPath by prefixing the XPath with one “..” step for each level we need to climb above the row. The Manual XPath should look something like this:
../../../../..//h1
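The idea of climbing out of the row can be sketched in Python as follows. The markup is made up, the helper climb is hypothetical, and the number of levels depends on how deeply the table is nested on the real page. ElementTree cannot follow “..” above its starting element, so this sketch builds a child-to-parent map to stand in for the “..” steps.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup; the real page nests the table more deeply.
SAMPLE = """
<html><body>
  <h1>Family placeholder</h1>
  <table class="Tableau">
    <tbody>
      <tr class="Product"><td>P-001</td></tr>
      <tr class="Product"><td>P-002</td></tr>
    </tbody>
  </table>
</body></html>
"""

root = ET.fromstring(SAMPLE.strip())

# ElementTree elements do not know their parent, so record it explicitly.
parent_of = {child: parent for parent in root.iter() for child in parent}


def climb(element, levels):
    """Walk up `levels` parents, like repeating '..' that many times."""
    for _ in range(levels):
        element = parent_of[element]
    return element


rows = root.findall(".//table[@class='Tableau']/tbody/tr[@class='Product']")

# In this sample, tr -> tbody -> table -> body is three steps: ../../..//h1
families = [climb(row, 3).find(".//h1").text for row in rows]
print(families)
```

Every row gets the same family name, because each one climbs to the same ancestor before searching for the <h1>.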
Part IV: Match On Column Index
When columns show up in no particular order in a table or have no distinguishing classes, setting a Manual XPath for these columns ensures we grab the right data. For instance, in this example “Lumens” is the third column of the table, but in a different URL it may be the fourth column. We’ll use the same example from Part II and Part III.
**Part IV requires going through Part II first**
Step 1 - Match the Column Text
First, we’ll write the XPath to match the column we want with the contains function. We’ll use “Lumens” for this example.
//table[@class="Tableau"]/tbody/tr/td[@class="CelluleTitre" and contains(.,"Lumens")]
Step 2 - Find the Column Index Using the Count Function
To find the index of “Lumens,” we’ll grab all of the siblings that precede it:
//table[@class="Tableau"]/tbody/tr/td[@class="CelluleTitre" and contains(.,"Lumens")]/preceding-sibling::*
Using the count function, we can count those preceding siblings, which gives the number of columns before “Lumens” (here, Catalog Number and Downlight Types):
count(//table[@class="Tableau"]/tbody/tr/td[@class="CelluleTitre" and contains(.,"Lumens")]/preceding-sibling::*)
We then add one to the count to get the position of “Lumens” itself:
count(//table[@class="Tableau"]/tbody/tr/td[@class="CelluleTitre" and contains(.,"Lumens")]/preceding-sibling::*)+1
Step 3 - Use the Position Function to Output the Data
Now that we have the index of the column, we can grab the tag of the data to get an output. The data in this example is in a <td> tag, so we’ll use the position function with our XPath from Step 2. It should look something like this:
td[position()=count(//table[@class="Tableau"]/tbody/tr/td[@class="CelluleTitre" and contains(.,"Lumens")]/preceding-sibling::*)+1]
Navigate to the extractor, select the column settings for “Lumens,” click on “Use manual XPath,” and paste the XPath we wrote in the textbox that appears below the column bar:
td[position()=count(//table[@class="Tableau"]/tbody/tr/td[@class="CelluleTitre" and contains(.,"Lumens")]/preceding-sibling::*)+1]
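The count-plus-one logic can be sketched in Python on two made-up pages where “Lumens” sits in different positions. The helpers build_page, column_index, and column_values are hypothetical, and all values are placeholders. ElementTree has no count() function, so the preceding siblings are counted in Python, and the result is then used in a positional predicate such as td[3].

```python
import xml.etree.ElementTree as ET


def build_page(headers, product_rows):
    """Build a small hypothetical page using the class names from the article."""
    head = "".join(f'<td class="CelluleTitre">{h}</td>' for h in headers)
    body = "".join(
        '<tr class="Product">' + "".join(f"<td>{v}</td>" for v in row) + "</tr>"
        for row in product_rows
    )
    table = f'<table class="Tableau"><tbody><tr>{head}</tr>{body}</tbody></table>'
    return ET.fromstring(f"<html><body>{table}</body></html>")


def column_index(root, label):
    """Return the 1-based position of the header cell containing `label`."""
    for tr in root.findall(".//table[@class='Tableau']/tbody/tr"):
        # `preceding` plays the role of count(.../preceding-sibling::*)
        for preceding, cell in enumerate(tr):
            text = "".join(cell.itertext())
            if cell.get("class") == "CelluleTitre" and label in text:
                return preceding + 1
    return None


def column_values(root, label):
    """Read the cell at the matched column position from every product row."""
    index = column_index(root, label)
    if index is None:
        return []
    rows = root.findall(".//table[@class='Tableau']/tbody/tr[@class='Product']")
    # td[index] is ElementTree's form of td[position()=index]
    return [row.find(f"td[{index}]").text for row in rows]


page_a = build_page(
    ["Catalog Number", "Downlight Types", "Lumens"],
    [["P-001", "Type A", "1000"]],
)
page_b = build_page(
    ["Catalog Number", "Downlight Types", "Aperture Size", "Lumens"],
    [["P-002", "Type B", "size-value", "1500"]],
)

print(column_index(page_a, "Lumens"), column_values(page_a, "Lumens"))
print(column_index(page_b, "Lumens"), column_values(page_b, "Lumens"))
```

The same lookup returns the Lumens value on both pages even though the column moved from third to fourth, which is what the Manual XPath achieves in the extractor.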