Cleaning Data with OpenRefine đ Project Guide
Welcome to the VALA Tech Camp 2019 OpenRefine project guide! Congratulations on getting through the teaching part of the workshop. Now itâs time to get stuck into some dodgy data đ
You can download a copy of the workshop slides here. (right-click to save)
Please feel free to ask Alissa or Alexis for help. If you have a question we canât answer (and weâre not experts, so itâs possible) or you get stuck later on, youâre welcome to tweet or email Alissa and sheâll get back to you đŚ
twitter: @lissertations |
email: hello@lissertations.net
Download your data âŹď¸
You have two choices of project data: a MARC file (converted into TSV) and a regular spreadsheet / CSV file. Feel free to play around with either or both! Right-click on these links to save them to your computer.
MARC/TSV file
DOAJ/CSV file
To get these files into OpenRefine:
- Right-click on these links to save the files to your desktop.
- In OpenRefine, make sure youâve selected âCreate Projectâ and âGet data from this computerâ. Click âBrowseâ to locate the file, then click âOpenâ, then âNextâ.
- You shouldnât need to change anything on the next screenâensure OpenRefine is parsing your data as âCSV / TSV / separator-based filesâ. Give your project a name if you like, then click âCreate Projectâ.
If you donât want to try your hand at specific exercises, feel free to have a play around with the different features of OpenRefine. Remember, today we learned how to:
- Import and export data in and out of OpenRefine
- Facet, filter, cluster and edit data
- Transform data using GREL (General Refine Expression Language)
- Reconcile data against an external source (VIAF)
Some of these exercises are the same ones that were demonstrated today. If we can do it, you can do it too đ
MARC data as TSV (marc_data.tsv)
Focused demonstration of how OpenRefine can be used to enhance MARC data
This dataset comprises around fifty random MARC files that Alissa pulled from the NLAâs catalogue, possibly from her giant to-read pile. đ
Ordinarily you would need to use a program like MarcEdit to convert a binary MARC file into a tab-separated values (TSV) file. OpenRefine doesnât deal well with MARC files, so we need to convert it into a format that looks more like a spreadsheet. Terry Reese, the creator fo MarcEdit, has written a workflow on how you can do this.
Many libraries use both MarcEdit and OpenRefine to clean and improve their data. The two programs complement each other, but some things (such as clustering) are better or easier in one program than the otherâitâs a bit of trial and error to determine what might work best for you.
Things to try
Facet by tag
Because of the layout of a MARC file within OpenRefine, itâs often easiest to look at all the instances of a given field at once, eg. all the 650 fields. Open up a text facet (Facet > Text facet) and select the field you want.
Caution! Even if you have faceted by tag before working with the Content column, selecting anything in the âEdit columnâ menu will indeed edit the entire column, not just the faceted data. We may have learned this the hard way đ
Often having two facets can be a great way to spot inconsistencies in your data. For example, put a text facet on the Tags column, then another on the Indicators column. This will show you the indicator combinations for each instance of that particular field. If any shouldnât be there, simply click on the indicator values on the left to pop open a text box, where you can edit them.
Bulk edit headings
Letâs suppose some of these headings are a little out of date, and we want to replace them.
- Put a text facet on the Tags column, and select the 650 field.
- Put another text facet on the Content column. Donât select anything, but have a look at whatâs in the Content facet. You can sort by ânameâ or by âcountâ.
- To edit a heading, hover over it within the Content facet. An âeditâ button will appear. Click on it. Make whatever edits you desire, then click Apply. Bam! Your headings have been magically changed! đŽ
Insert a subfield
Youâll see a lot of these records are AACR2-era records, and are missing some of the more modern fields. Here weâll try inserting a $q into 020 fields that donât have one, where the ISBN and the qualifying information are both in the $a field.
- Put a text facet on the Tags column, and select the 020 field.
- In the Tags column, go to Edit cells > Transform and run the following script:
value.replace(' (','$q(')
- Admire your new subfields! Yay! đ
Fix dodgy UTF-8 encoding
Iâm not sure if there is any in this dataset, but you may find that moving data between various programs and systems can result in special characters going a bit off, and âĂŠââ becomes âĂŠâ, among other things.
To rectify, simply run this transformation on the offending column:
value.reinterpret("utf-8")
Get subfields out of your name headings before clustering
As mentioned, the excellent Cluster and edit and Reconciliation features only work with plain text (that is, text without MARC subfields). This workflow is quite convoluted, and you donât have to finish it today! This is part of a real workflow that Alissa used at work.
To strip the MARC subfields from the 100 field:
- Put a text facet on the Tag column, and select 100 from the list on the left
- If there are $e subfields in here, youâll need to split them out into their own column, to be reattached later. On the Content column, go to Edit cells > Split multi-valued cells, and use â$eâ as the separator. Do not delete column, do not guess cell type. This will create a new âContent 2â column.
- Run the following transformation on the âContentâ column:
value.replace('$a','').replace('$d',' ').replace('$q',' ').replace('$c',' ').split('$e')[0]
- Go to Edit cells > Cluster and edit, and cluster away!
- Adding the subfields back in is slightly convoluted, and there are a couple of things to watch out for:
- Go to Edit cells > Transform, and paste in the following GREL script
'$a' + value
(this adds the $a to the front of the cell)
- Repeat, but with this GREL script
value.replace(' (', '$q(').replace(' 1', '$d1')
(this replaces the $d and $q fields. But not everything needs to be a $q!)
- Letâs look at the $q fields. Run a Text filter and put â$qâ in the box. You can see the only two $q fields that ought to be $c fields are â(Zine creatorâ), so here you can either edit those two manually, or run a script. Letâs run a script >
value.replace('$q(Zine','$c(Zine')
- One of the fields has two $e subfields. Here itâs easier to just manually edit the field. Itâs often a judgment call in OpenRefine as to whether itâs more efficient to write a GREL script, or just edit a couple of things manually. đ
- In the Content 2 column, go to Facet > Customized facets > Facet by blank. This will return a Boolean, or âtrue or falseâ value. (Either âfalse! it is not blank!â or âtrue! there is a value in itâ)
- Now run a transform on the Content column to concatenate the two Content columns, or bring them together into one column:
value + "$e" + cells['Content 2'].value
- Delete Content 2 column as itâs no longer needed: Edit column > Remove this column
- Pat yourself on the back! That was really finicky but you did it, well done đ
DOAJ data as DSV (doaj_data.csv)
Broader demonstration of the features and capabilities of OpenRefine for typical spreadsheets
This dataset (and most of the exercises) are pinched from Library Carpentry. Itâs a slightly messy bibliographic dataset pulled from the Directory of Open Access Journals (DOAJ). OpenRefine is happiest with data in this kind of spreadsheet format, so you can do lots of cool things with this.
Things to try
- Make sure you remove all Facets and Filters
- On the Date column use the dropdown menu to select Edit cells > Common transforms > To date
- Note how the values are now displayed in green and follow a standard convention for their display format (ISO8601) - this indicates they are now stored as date data types in OpenRefine. We can now carry out functions that are specific to Dates
- On the Date column dropdown select Edit column > Add column based on this column. Using this function you can create a new column, while preserving the old column
- In the âNew column nameâ type âFormatted Dateâ
- In the âExpressionâ box type the GREL expression
value.toString("dd MMMM yyyy")
Reconciling publisher names with VIAF IDs
Here weâll use the VIAF Reconciliation service written by Jeff Chiu, via a public service he runs at http://refine.codefork.com/.
- In the Publisher column use the dropdown menu to choose âReconcile > Start Reconcilingâ
- Weâll need to add the VIAF service in manually, as it doesnât come included. Click âAdd Standard ServiceâŚâ and in the dialogue that appears enter âhttp://refine.codefork.com/reconcile/viafâ
- You should now see a heading in the list on the left hand side of the Reconciliation dialogue called âVIAF Reconciliation Serviceâ
- Click on this to choose to use this reconciliation service
- In the middle box in the reconciliation dialogue you may get asked what type of âentityâ you want to reconcile to - that is, what type of thing are you looking for. The list will vary depending on what reconciliation service you are using.
- In this case choose âCorporate Nameâ (it seems like the VIAF Reconciliation Service is slightly intelligent about this and will only offer options that are relevant)
- In the box on the righthand side of the reconciliation dialogue you can choose if other columns are used to help the reconciliation service make a match - however it is sometimes hard to tell what use (if any) the reconciliation service makes of these additional columns
- At the bottom of the reconciliation dialogue there is the option to âAuto-match candidates with high confidenceâ. This can be a time saver, but in this case you are going to uncheck it, so you can see the results before a match is made
- Now click âStart Reconcilingâ
Reconciliation is an operation that can take a few minutes if you have many values to look up. However, in this case there are only 6 publishers to check, so it should work quite quickly.
Once the reconciliation has completed two Facets should be created automatically:
- Publisher: Judgement
- Publisher: best candidateâs score
These are two of several specific reconciliation facets and actions that you can get from the âReconcileâ menu (from the column drop down menu).
- Close the âPublisher: best candidateâs scoreâ facet, but leave the âPublisher: Judgementâ facet open
If you look at the Publisher column, you should see some cells have found one or more matches - the potential matches are show in a list in each cell. Next to each potential match there is a âtickâ and a âdouble tickâ. To accept a reconciliation match you can use the âtickâ options in cells. The âtickâ accepts the match for the single cell, the âdouble tickâ accepts the match for all identical cells.
- Create a text facet on the Publisher column
- Choose âInternational Union of Crystallographyâ
In the Publisher column you should be able to see the various potential matches. Clicking on a match will take you to the VIAF page for that entity.
- Click a âdouble tickâ in one of the Publisher column cells for the option âInternational Union of Crystallographyâ
- This will accept this as a match for all cells - you should see the other options all disappear
- Check the âPublisher: Judgementâ facet. This should now show that 858 items are âmatchedâ (if this does not update, try refreshing the facets)
We could do these one by one, but if we are confident with matches, there is an option to accept all:
- Remove all filters/facets from the project so all rows display
- In the Publisher column use the dropdown menu to choose âReconcile > Actions > Match each cell to its best candidateâ
There are two things that reconciliation can do for you. Firstly it gets a standard form of the name or label for the entity. Secondly it gets an ID for the entity - in this case a VIAF id. This is hidden in the default view, but can be extracted:
- In the Publisher column use the dropdown menu to choose âEdit column > Add column based on this columnâŚâ
- Give the column the name âVIAF IDâ
- In the GREL expression box type
cell.recon.match.id
- This will create a new column that contains the VIAF ID for the matched entity.
Further reading đ